Q1-

You are given the following training dataset, made of (feature, label) pairs: $D_{\text{train}} = \{([0\ {-2}], -1), ([0\ 2], 1), ([1\ 0], -1)\}$. Without setting up and solving any learning or optimization problem, answer the following questions:

1. Provide a plot of the two-dimensional feature space on which you have marked the three features that you have been provided.

2. On the same plot, draw the decision boundary that maximizes the margin for each of the features.

3. Provide a weight vector that defines the decision boundary that you have drawn before.

4. Provide the geometric margin for each of the three features.

5. Does the linear predictor defined by the weight vector above perform a perfect classification (meaning without making any classification error) on the features of $D_{\text{train}}$?
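As a way to check an answer to item 4, the geometric margin of a labeled point can be sketched as below. This is a minimal sketch, not a solution: the function name `geometric_margin`, the optional bias `b` (assumed to be 0 unless stated otherwise), and the vector `w_example` are all assumptions introduced for illustration, and `w_example` is an arbitrary choice rather than the max-margin weight vector.

```python
import math


def geometric_margin(w, x, y, b=0.0):
    """Signed distance of x from the hyperplane w.x + b = 0.

    The value is positive when the sign of w.x + b agrees with the label y.
    The bias b is an assumption; it defaults to 0 (boundary through the origin).
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return y * score / norm


# Training set from the question, as (feature, label) pairs.
d_train = [((0, -2), -1), ((0, 2), 1), ((1, 0), -1)]

# Arbitrary illustrative weight vector (hypothetical, NOT the requested answer).
w_example = (0.0, 1.0)

margins = [geometric_margin(w_example, x, y) for x, y in d_train]
print(margins)
```

A margin of zero means the point lies on the boundary, and a negative margin means the point is misclassified by that weight vector.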

Q2-

Address the following questions:

1. What is the motivation for using the hinge loss, as opposed to the 0-1 loss, in a binary classification problem?

2. Why would you use the logistic loss rather than the hinge loss in a classification problem?

3. In a regression problem, explain how you would expect your learning outcome to change when you use the absolute deviation loss, as opposed to the squared loss.
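For reference, the five losses named above can be written as short functions. This is a sketch of the standard definitions under stated assumptions: the classification losses take the margin (label times score) as input, the regression losses take the residual (prediction minus target), and a margin of exactly zero is counted as an error by the 0-1 loss. The function names are chosen here for illustration.

```python
import math


def zero_one_loss(margin):
    """1 if the prediction is wrong (margin <= 0 by the convention assumed here)."""
    return 1.0 if margin <= 0 else 0.0


def hinge_loss(margin):
    """max(1 - margin, 0): zero once the margin reaches 1, linear below that."""
    return max(1.0 - margin, 0.0)


def logistic_loss(margin):
    """log(1 + exp(-margin)): smooth and strictly positive for every margin."""
    return math.log(1.0 + math.exp(-margin))


def squared_loss(residual):
    """residual^2: grows quadratically with the size of the error."""
    return residual ** 2


def absolute_deviation_loss(residual):
    """|residual|: grows linearly with the size of the error."""
    return abs(residual)
```

Evaluating these functions over a range of margins or residuals, and comparing their slopes, is one way to start reasoning about the questions above.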

Q3-

You are running a gradient descent optimization to learn the parameters of your model, and after a few iterations you notice that the values of the parameters being estimated are oscillating.

1. Explain what is happening.

2. What would you suggest doing to complete your optimization procedure?

3. Explain in what situation it makes sense to switch from gradient descent optimization to a stochastic gradient descent optimization.
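To experiment with the setting described in this question, a plain gradient descent loop can be sketched as follows. The objective $f(w) = w^2$, the function name `gradient_descent`, and the parameters `step_size` and `num_iters` are hypothetical choices made for illustration and are not part of the question.

```python
def gradient_descent(grad, w0, step_size, num_iters):
    """Run plain gradient descent and return the list of iterates.

    grad: function returning the gradient at a scalar parameter w.
    w0: initial parameter value.
    """
    history = [w0]
    w = w0
    for _ in range(num_iters):
        w = w - step_size * grad(w)
        history.append(w)
    return history


# Hypothetical toy objective f(w) = w^2, whose gradient is 2w.
history = gradient_descent(lambda w: 2.0 * w, w0=1.0, step_size=0.1, num_iters=5)
print(history)
```

Printing the iterates for several values of `step_size` is one way to observe how the parameter trajectory behaves.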

Q4-

Consider the following hypothesis class $F = \{x \mapsto w_1 w_2 x^2 : [w_1\ w_2] \in \mathbb{R}^2\}$, where $x \in \mathbb{R}$ is the input datum.

1. Provide a hypothesis class that is less expressive than F.

2. Provide a hypothesis class that is more expressive than F.

3. Provide a hypothesis class that is disjoint from F.

Q5-

Consider the squared loss $\text{Loss}(x, y, w) = (y - \max\{w \cdot \phi(x), 0\})^2$.

1. Draw the computational graph of the function Loss(x, y, w).

2. Number the internal nodes 1, 2, ..., and next to every node $i$ in the graph, indicate the forward value with $f_i$ and the backward value with $g_i$. Also, on the edges of the graph, indicate the corresponding derivatives, using the forward values as appropriate. Provide the expression of the $g_i$'s as a function of other backward and edge values, as appropriate.

3. Assume that $w = [1\ 2]$, $\phi(x) = [-1\ 1]$, and $y = 3$. Compute all the forward values, effectively performing a forward pass.

4. Using the forward values computed previously, compute all the backward values, effectively performing a backward pass. In particular, also compute the quantity $\partial \text{Loss} / \partial w$.
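A hand-derived backward pass can be sanity-checked against a finite-difference approximation of the gradient. The sketch below is one possible check, not the requested derivation: the helper `numerical_gradient`, the step `eps`, and the example inputs are assumptions, and the inputs are deliberately different from the ones in item 3. The approximation is unreliable at the kink of the max, where $w \cdot \phi(x) = 0$.

```python
def loss(w, phi, y):
    """Squared loss (y - max{w . phi, 0})^2 from the question."""
    score = sum(wi * pi for wi, pi in zip(w, phi))
    return (y - max(score, 0.0)) ** 2


def numerical_gradient(f, w, eps=1e-6):
    """Central-difference approximation of the gradient of f at w."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((f(w_plus) - f(w_minus)) / (2.0 * eps))
    return grad


# Hypothetical inputs, chosen so that w . phi > 0 (away from the kink).
w = [0.5, 1.0]
phi = [2.0, 1.0]
y = 1.0

approx = numerical_gradient(lambda v: loss(v, phi, y), w)
print(approx)
```

If the backward values computed by hand are correct, the gradient with respect to $w$ should agree with the output of `numerical_gradient` up to a small numerical error.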