Description
Problem 1 (40 points)
In this problem, you will implement the Online Perceptron algorithm with an online-to-batch conversion, and
evaluate it on a classification task with a few different data representations.
Restaurant review data set
Download the review data set reviews_tr.csv (training data) and reviews_te.csv (test data) from
Courseworks. This data set is comprised of reviews of restaurants in Pittsburgh; the label indicates whether
or not the reviewer-assigned rating is at least four (on a five-point scale). The data are in CSV format (where
the first line is the header); the first column is the label (label; 0 or 1), and the second column is the review
text (text). The text has been processed to remove non-alphanumeric symbols and capitalization.
Data representations
In this problem, you will experiment with the following four different data representations.
1. Unigram representation (i.e., term frequency).
In this representation, there is a feature for every word t, and the feature value associated with a word
t in a document d is
tf(t; d) := number of times word t appears in document d.
2. Term frequency-inverse document frequency (tf-idf).
This is like the unigram representation, except the feature associated with a word t in a document d
from a collection of documents D (e.g., training data) is
tf(t; d) × log10(idf(t; D)),
where tf(t; d) is as defined above, and
idf(t; D) := |D|
number of documents in D that contain word t
.
This representation puts more emphasis on rare words and less emphasis on common words. (There are
many variants of tf-idf that are unfortunately all referred to by the same name.)
Note: When you apply this representation to a new document (e.g., a document in the test set), you
should still use the idf defined with respect to D. This, however, becomes problematic if a word t appears
in a new document but did not appear in any document in D: in this case, idf(t; D) = |D|/0 = ∞. It
is not obvious what should be done in these cases. For this homework assignment, simply ignore words
t that do not appear in any document in D.
3. Bigram representation.
In addition to the unigram features, there is a feature for every pair of words (t1, t2) (called a bigram),
and the feature value associated with a bigram (t1, t2) in a given document d is
tf((t1, t2); d) := number of times bigram (t1, t2) appears consecutively in document d.
In the sequence of words “a rose is a rose”, the bigrams that appear are: (a,rose), which appears twice;
(rose, is); and (is, a).
4. One more data representation of your own choosing (e.g., trigrams, skip-grams). It should be non-trivially
different from the representations above (e.g., not just bigrams plus a few extra features).
In all four of these representations, include an “intercept” feature whose value is always equal to one (as in
affine expansion).
2
Online Perceptron with online-to-batch conversion
Implement the Online Perceptron algorithm with the following online-to-batch conversion process (similar to
one suggested in lecture):
1. Run Online Perceptron to make two passes through the training data. Before each pass, randomly
shuffle the order of the training examples. Note that a total of 2n + 1 linear classifiers wˆ1, . . . , wˆ2n+1
are created during the run of the algorithm, where n is the number of training examples.
2. Return the linear classifier wˆfinal given by the simple average of the final n + 1 linear classifiers:
wˆfinal :=
1
n + 1
2
Xn+1
i=n+1
wˆi
.
Of course, you should not use (or even look at the source code for) any existing implementation of Perceptron
or Online Perceptron; that would defeat the purpose of this assignment. You can, however, use existing
library functions for basic operations needed in Online Perceptron (e.g., vector inner products), and also for
loading the CSV files and creating the desired data representations. (You are responsible for determining
whether or not an existing library function correctly implements the desired data representation.) As always,
provide proper citations for any software packages you use.
Recall that as Online Perceptron is making its pass through the training examples, it only updates its weight
vector in rounds in which it makes a prediction error. So, for example, if wˆ1 does not make a mistake in
classifying the first training example, then wˆ2 = wˆ1. Use this fact to keep the memory usage of your code
relatively modest.
You will be using this learning algorithm with the restaurant review data represented in four ways described
above. The total number of features with each representation can be very large. However, because each
review has only a relatively small numbers of words, it should still be possible to compactly represent the
data. Use this fact to keep the running time of your code close to linear in the size of the training data.
Do the following with each of the four data representations above. Run your code on the training data to
learn a linear classifier wˆfinal. Compute the training error rate of wˆfinal on this training data, and compute
the test error rate of wˆfinal on the test data.
Post-hoc analysis
Now consider the classifier based on the unigram representation. To get a sense as to its behavior, determine
the 10 words that have the highest (i.e., most positive) weights; also determine the 10 words that have the
lowest (i.e., most negative) weights.
Pick two training examples that this classifier (based on unigram representation) incorrectly classifies. For
each of these examples x = (x1, . . . , xd), identify the 10 words with the highest (i.e., most positive) wixi
value, and also the 10 words with the lowest (i.e., most negative) wixi value. Attempt to explain why the
classifier incorrectly classifies these examples.
What to submit in your write-up:
(a) A precise specification of your fourth chosen data representation. Briefly explain why you chose this
representation. Provide proper citations of any sources you used to come up with this representation.
(b) Training error rates of the linear classifiers with all four data representations.
(c) Test error rates of the linear classifiers with all four data representations.
(d) The lists of words with highest/lowest weights, described above. (Sort each list alphabetically.)
(e) The text of two misclassified examples, with associated lists of words, and explanations for why the
examples were misclassified.
Please submit your source code on Courseworks.
3
Problem 2 (30 points)
In this problem, you’ll analyze an algorithm that finds a large margin linear separator whenever one exists.
Recall that the Perceptron algorithm quickly finds a linear separator for a data set S whenever there exists a
large margin linear separator for S. But the linear separator returned by Perceptron is not necessarily one
that achieves a large margin on S.
An alternative is the following algorithm, which we call Margin Perceptron. This algorithm, like Perceptron,
takes as input a collection S of labeled examples from R
d × {−1, +1}).
• Begin with wˆ1 := 0 ∈ R
d
.
• For t = 1, 2, . . .:
– If there is a labeled example in S (call it (xt, yt)) such that ytwˆ
>
t xt < 1, then set wˆt+1 := wˆt+ηytxt.
– Else, return wˆt.
Above, there are two differences relative to the original Perceptron algorithm. The first is the criteria under
which an example in S is used to perform an update: ytwˆ
>
t xt < 1 (rather than ≤ 0). The second is the
update itself: wˆt+1 := ˆwt + ηytxt. Here, η > 0 is a parameter of the algorithm, which we’ll assume is set as
η :=
1
L2
where L := max
(x,y)∈S
kxk2.
If Margin Perceptron terminates, it returns wˆt satisfying
ywˆ
>
t x ≥ 1 for all (x, y) ∈ S.
But this fact alone does not mean that wˆt achieves a large margin on S. After all, if w is any linear separator
satisfying yw>x > 0 for all (x, y) ∈ S, we can construct another w˜ satisfying yw˜
>x ≥ 1 for all (x, y) ∈ S,
simply by letting w˜ be a particular scaling of w.
(a) Assume that there exists a vector w? ∈ R
d
such that
min
(x,y)∈S
yw>
? x = 1.
Mimic the proof of the Perceptron convergence theorem to prove that Margin Perceptron halts after at
most 3kw?k
2
2L
2
loop iterations. Clearly explain every step of the proof, especially where it differs from
that of the original Perceptron convergence theorem.
(b) Under the same assumption as in Part (a), prove that Margin Perceptron returns wˆt satisfying
kwˆtk2 ≤ 3kw?k2.
Hint: Use the claim you proved in Part (a) (and, possibly, part of its proof).
(c) Very briefly explain why this means that the margin achieved by wˆt is (almost) as large as the margin
achieved by w?.
(d) (Optional.) For any given > 0, explain how to modify Margin Perceptron so that, under the same
conditions as in the previous parts, it returns wˆt satisfying
kwˆtk2 ≤ (2 + )kw?k2.
4
Problem 3 (30 points)
In this problem, you’ll use Lagrange duality to derive the “dual” forms of ridge regression and a variant of
the soft-margin SVM problem.
The ridge regression optimization problem can be written in a form that is reminiscent of the soft-margin
SVM problem (for C > 0):
min
w∈Rd,ξ∈Rn
1
2
kwk
2
2 +
C
2
Xn
i=1
ξ
2
i
s.t. ξi = yi − x
>
i w for all i = 1, . . . , n.
Above, ξ = (ξ1, . . . , ξn) ∈ R
n has a similar role as the slack variables in soft-margin SVM, but note that they
are not required to be non-negative here.
The Lagrangian function associated with this optimization problem is
L(w, ξ, α) = 1
2
kwk
2
2 +
C
2
Xn
i=1
ξ
2
i +
Xn
i=1
αi(yi − x
>
i w − ξi).
Here, α = (α1, . . . , αn) ∈ R
n are the Lagrange multipliers, which are unconstrained.
(a) Fix α ∈ R
n. Derive expressions for the w and ξ that, together, minimize L(w, ξ, α). The expressions
should be given in terms of α (as well as the data and C). Briefly justify each step of your derivations.
(b) The dual objective function g : R
n → R is equal to the Lagrangian function evaluated at wα, ξα, α,
where wα and ξα are the values of w and ξ from Part (a) that minimize L:
g(α) = L(wα, ξα, α) = min
w∈Rd,ξ∈Rn
L(w, ξ, α).
Derive an expression for the α that maximizes g(α). The expression should be given in terms of the
data and C. Moreover, the expression should only involve the xi
’s through the matrix K ∈ R
n×n whose
(i, j)-th entry is x
>
i xj . Briefly justify each step of your derivation.
(c) (Optional.) A variant of the soft-margin SVM problem that is similar to ridge regression is the following
optimization problem (for C > 0):
min
w∈Rd,ξ∈Rn
1
2
kwk
2
2 +
C
2
Xn
i=1
ξ
2
i
s.t. yix
>
i w ≥ 1 − ξi for all i = 1, . . . , n.
This is almost the same as soft-margin SVM, except the slack variables are squared in the objective.
The Lagrangian is
L(w, ξ, α) = 1
2
kwk
2
2 +
C
2
Xn
i=1
ξ
2
i +
Xn
i=1
αi(1 − yix
>
i w − ξi).
However, since the constraints are inequality constraints, the associated Lagrange multipliers α ∈ R
n
are constrained to be non-negative.
Derive expressions for the w and ξ that, together, minimize L(w, ξ, α), and then derive an expression
for the dual objective g(α) = minw∈Rd,ξ∈Rn L(w, ξ, α). Explain why the dual objective only depends on
the xi
’s through the matrix K ∈ R
n×n whose (i, j)-th entry is x
>
i xj .
5