Homework 4


Implementation of SVM via Gradient Descent (30 points)

Here, you will implement the soft margin SVM using different gradient descent methods as described in Section 12.3.4 of the textbook. To recap, given a dataset of \(n\) samples \(\mathcal{D} = \left\{\left(\bm{x}^{(i)}, y^{(i)}\right)\right\}_{i=1}^n\), where every \(d\)-dimensional feature vector \(\bm{x}^{(i)} \in \mathbb{R}^d\) is associated with a label \(y^{(i)} \in \{-1,1\}\), to estimate the parameters \(\bm{\theta} = (\bm{w}, b)\) of the soft margin SVM, we can minimize the loss function: \[\begin{aligned} f(\bm{w},b; \mathcal{D}) &= \frac{1}{2} \|\bm{w}\|_2^2 + C \sum_{(\bm{x}^{(i)}, y^{(i)}) \in \mathcal{D}} \max\left\{0, 1 - y^{(i)}( \bm{w}\cdot \bm{x}^{(i)} + b )\right\} \\ &= \frac{1}{2} \|\bm{w}\|_2^2 + C \sum_{(\bm{x}^{(i)}, y^{(i)}) \in \mathcal{D}} L(\bm{x}^{(i)}, y^{(i)}; \bm{\theta}) \end{aligned}\]

In order to minimize the function, we first obtain the gradient with respect to \(\bm{\theta}\). The partial derivative with respect to \(w_j\), the \(j\)-th entry in the vector \(\bm{w}\), is:

\[\begin{aligned} \label{eq:grad_w} \partial_{w_j} f(\bm{w},b; \mathcal{D}) = \frac{\partial f(\bm{w},b; \mathcal{D})}{\partial w_j} = w_j + C \sum_{(\bm{x}^{(i)}, y^{(i)}) \in \mathcal{D}} \frac{\partial L(\bm{x}^{(i)}, y^{(i)}; \bm{\theta})}{\partial w_j} \end{aligned}\] where \[\begin{aligned} \frac{\partial L(\bm{x}^{(i)}, y^{(i)}; \bm{\theta})}{\partial w_j} = \left\{\begin{array}{cl} 0 & \text{if}\ y^{(i)}\left(\bm{w} \cdot \bm{x}^{(i)} + b \right) \ge 1 \\ -y^{(i)}x_j^{(i)} & \text{otherwise.} \end{array}\right. \end{aligned}\]

and the partial derivative with respect to \(b\) is: \[\begin{aligned} \label{eq:grad_b} \partial_b f(\bm{w},b;\mathcal{D}) = \frac{\partial f(\bm{w},b;\mathcal{D})}{\partial b} = C \sum_{(\bm{x}^{(i)}, y^{(i)}) \in \mathcal{D}} \frac{\partial L(\bm{x}^{(i)}, y^{(i)}; \bm{\theta})}{\partial b} \end{aligned}\] where \[\begin{aligned} \frac{\partial L(\bm{x}^{(i)}, y^{(i)}; \bm{\theta})}{\partial b} = \left\{\begin{array}{cl} 0 & \text{if}\ y^{(i)}\left(\bm{w} \cdot \bm{x}^{(i)} + b \right) \ge 1 \\ -y^{(i)} & \text{otherwise.} \end{array}\right. \end{aligned}\] Since the direction of the gradient is the direction of steepest ascent of the loss function, gradient descent proceeds by iteratively taking small steps along the direction opposite to the direction of gradient. The general framework of gradient descent is given in Algorithm [alg:svm].

Parameters: learning rate \(\eta\), batch size \(\beta\).


Randomly shuffle the training data and set \(k \gets 0\). Then, for \(t = 1, 2, \ldots\) until the convergence criterion is met:

  1. \(B \gets \left\{\left(x^{(i)}, y^{(i)}\right) : \beta k+1 \leq i \leq \min\{\beta(k+1), n\}\right\}\)

  2. \(w_j^{(t)} \gets w_j^{(t-1)} - \eta \cdot \partial_{w_j} f(\bm{w}^{(t-1)},b^{(t-1)}; B)\) for each \(j = 1, \ldots, d\)

  3. \(b^{(t)} \gets b^{(t-1)} - \eta \cdot \partial_{b} f(\bm{w}^{(t-1)}, b^{(t-1)}; B)\)

  4. \(k \gets (k + 1) \bmod \lceil n/\beta \rceil\)

Task: Implement the SVM algorithm using the following gradient descent variants.

For all the variants use \({C = 100}\), \(\bm{w}^{(0)} = \bm{0}\), \(b^{(0)} = 0\). For all other parameters, use the values specified in the description of the variant.

Note: update the parameters \(\bm{w}\) and \(b\) on iteration \(t\) using the values computed on iteration \(t-1\). Do not update using values computed in the current iteration!
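
As a point of reference, here is a minimal NumPy sketch of the loss, the hinge subgradients from equations [eq:grad_w] and [eq:grad_b], and a batch-variant update loop; the function and variable names are our own, and the stopping rule is the percentage-change criterion defined below for BGD. It is a sketch, not a reference implementation.

```python
import numpy as np

def svm_loss(w, b, X, y, C):
    # f(w, b; D) = 0.5 * ||w||^2 + C * sum of hinge losses
    margins = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * margins.sum()

def svm_gradients(w, b, X, y, C):
    # Samples with y * (w . x + b) >= 1 contribute zero to the hinge term
    violating = y * (X @ w + b) < 1
    grad_w = w - C * (X[violating].T @ y[violating])
    grad_b = -C * y[violating].sum()
    return grad_w, grad_b

def batch_gd(X, y, C=100.0, eta=3e-7, eps=0.25, max_iters=10000):
    # Batch GD (beta = n): the whole dataset is used in every iteration;
    # stop once the percentage change in loss drops below eps
    w, b = np.zeros(X.shape[1]), 0.0
    losses = [svm_loss(w, b, X, y, C)]
    for _ in range(max_iters):
        gw, gb = svm_gradients(w, b, X, y, C)  # gradients from iteration t-1
        w, b = w - eta * gw, b - eta * gb      # applied to obtain iteration t
        losses.append(svm_loss(w, b, X, y, C))
        if abs(losses[-2] - losses[-1]) / losses[-2] * 100.0 < eps:
            break
    return w, b, losses
```

Note how the updates for iteration \(t\) use only quantities from iteration \(t-1\), per the note above.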

  1. Batch Gradient Descent (BGD): When \(\beta = n\), in every iteration the algorithm uses the entire dataset to compute the gradient and update the parameters.

    As a convergence criterion for batch gradient descent we will use \(\Delta_{\% loss}^{(t)} < \varepsilon\), where \[\begin{aligned} \Delta_{\% loss }^{(t)} = \frac{|f(\bm{w}^{(t-1)}, b^{(t-1)}; \mathcal{D}) - f(\bm{w}^{(t)}, b^{(t)}; \mathcal{D})|}{f(\bm{w}^{(t-1)}, b^{(t-1)}; \mathcal{D})}\times100 \label{eq:stop} \end{aligned}\]

    Set \(\eta = 3\cdot10^{-7}\), \(\varepsilon = 0.25\).

  2. Stochastic Gradient Descent (SGD): When \(\beta = 1\), in every iteration the algorithm uses one training sample at a time to compute the gradient and update the parameters.

    As a convergence criterion for stochastic gradient descent we will use \(\Delta_{loss}^{(t)} < \varepsilon\), where \[\begin{aligned} \Delta_{loss}^{(t)} = \tfrac{1}{2}\Delta_{loss}^{(t-1)} + \tfrac{1}{2}\Delta_{\% loss}^{(t)}, \label{eq:svm-loss} \end{aligned}\] \(t\) is the iteration number, \(\Delta_{\% loss}^{(t)}\) is the same as above (equation [eq:stop]), and \(\Delta_{loss}^{(0)} = 0\).

    Use \(\eta = 0.0001\), \(\varepsilon = 0.001\).

  3. Mini-Batch Gradient Descent (MBGD): In every iteration the algorithm uses mini-batches of \(\beta\) samples to compute the gradient and update the parameters.

    As a convergence criterion for mini-batch gradient descent we will use \(\Delta_{loss}^{(t)} < \varepsilon\), where \(\Delta_{loss}^{(t)}\) is the same as above (equation [eq:svm-loss]) and \(\Delta_{loss}^{(0)} = 0\).

    Use \(\eta = 10^{-5}\), \(\varepsilon = 0.01\) and \(\beta = 20\).
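
The smoothed criterion used by SGD and MBGD can be maintained incrementally. A small helper (names are our own) following the definition in equation [eq:svm-loss]:

```python
def smoothed_delta(prev_delta, loss_prev, loss_curr):
    # Delta_loss(t) = 0.5 * Delta_loss(t-1) + 0.5 * Delta_%loss(t),
    # where Delta_%loss(t) is the percentage change in the loss (eq. [eq:stop])
    pct_change = abs(loss_prev - loss_curr) / loss_prev * 100.0
    return 0.5 * prev_delta + 0.5 * pct_change
```

Initialize the running value to \(\Delta_{loss}^{(0)} = 0\) and stop once it falls below \(\varepsilon\); note that \(f\) is still evaluated on the full dataset \(\mathcal{D}\) even though each update uses only the batch \(B\).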

Task: Run your implementation on the data set in svm/data. The data set contains the following files:

  1. features.txt: Each line contains the features (comma-separated values) of a single sample. It has 6414 samples (rows) and 122 features (columns).

  2. target.txt: Each line contains the target variable (\(y = -1\) or \(1\)) for the corresponding row in features.txt.

Task: Plot the value of the loss function \(f(\bm{w}^{(t)},b^{(t)}; \mathcal{D})\) vs. the iteration number \(t\) starting from \(t=0\). Report the total time (wall clock time, as opposed to the number of iterations) each of the gradient descent variants takes to converge. What do you infer from the plots and the time for convergence?

The figure should show the curves for all three variants on the same plot.

As a sanity check, Batch GD should converge in 10-300 iterations, SGD in 500-3000 iterations, and Mini-Batch GD somewhere in between. However, the number of iterations may vary greatly due to randomness. If your implementation consistently takes much longer, it may have a bug.

What to submit

  1. Plot of \(f(\bm{w}^{(t)},b^{(t)}; \mathcal{D})\) vs. the number of updates (\(t\)). Total time taken for convergence by each of the gradient descent variants. Interpretation of plot and convergence times. [part (a)]

  2. Submit the code to Gradescope. [part (a)]

Data Streams I (40 points)

You are an astronomer at the Space Telescope Science Institute in Baltimore, Maryland, in charge of the petabytes of imaging data they recently obtained. According to the news report linked in the previous sentence, “...The amount of imaging data is equivalent to two billion selfies, or 30,000 times the total text content of Wikipedia. The catalog data is 15 times the volume of the Library of Congress.”

This data stream has images of everything out there in the universe, ranging from stars, galaxies, and asteroids to all kinds of awesome exploding/moving objects. Your task is to determine the approximate frequencies of occurrences of different (unique) items in this data stream.

We now introduce our notation for this problem. Let \(S = \langle a_1, a_2, \ldots, a_t \rangle\) be the given data stream of length \(t\). Let us denote the items in this data stream as being from the set \(\{1, 2, \ldots, n\}\). For any \(1\leq i\leq n\), we denote \(F[i]\) to be the number of times \(i\) has appeared in \(S\). Our goal is then to have good approximations of the values \(F[i]\) for all \(1\leq i\leq n\) at all times.

The naïve way to do this is to just keep the counts for each item \(1\leq i\leq n\) separately. However, this will require \(\mathcal{O}(n)\) space which, in our application, is clearly infeasible. We shall see that it is possible to approximate these counts using a much smaller amount of space. To do so, we consider the algorithm explained below.

Algorithm. The algorithm has two parameters \(\delta\) and \(\varepsilon >0\), and \(\left\lceil \log\frac{1}{\delta}\right\rceil\) independent hash functions \[h_j:\{1,2,\ldots, n\} \rightarrow \{1,2, \ldots, \left\lceil \frac{e}{\varepsilon} \right\rceil\}.\] Note that in this problem, \(\log\) denotes the natural logarithm. For each bucket \(b\) of each hash function \(j\), the algorithm has a counter \(c_{j, b}\) that is initialized to zero.

As each element \(i\) arrives in the data stream, it is hashed by each of the hash functions, and the count for the \(j\)-th hash function \(c_{j, h_j(i)}\) is incremented by \(1\).

For any \(1\leq i\leq n\), we define \(\widetilde{F}[i] = \min_{j} \{c_{j,h_j(i)}\}\) as our estimate of \(F[i]\).
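
A minimal Python sketch of this algorithm (class and variable names are our own; the \((a, b)\) hash parameterization matches the hash function given at the end of this section, with buckets indexed from 0):

```python
import math
import numpy as np

class CountMinSketch:
    def __init__(self, delta, eps, hash_params, p=123457):
        # ceil(log(1/delta)) hash functions, each with ceil(e/eps) buckets
        self.n_buckets = math.ceil(math.e / eps)
        self.n_hashes = math.ceil(math.log(1 / delta))
        self.params = hash_params[:self.n_hashes]  # one (a, b) pair per hash
        self.p = p
        self.counts = np.zeros((self.n_hashes, self.n_buckets), dtype=np.int64)

    def _hash(self, a, b, x):
        # ((a * (x mod p) + b) mod p) mod n_buckets, as in the pseudocode below
        return ((a * (x % self.p) + b) % self.p) % self.n_buckets

    def add(self, x):
        # Increment the counter c_{j, h_j(x)} for every hash function j
        for j, (a, b) in enumerate(self.params):
            self.counts[j, self._hash(a, b, x)] += 1

    def estimate(self, x):
        # F~[x] = min_j c_{j, h_j(x)}
        return min(self.counts[j, self._hash(a, b, x)]
                   for j, (a, b) in enumerate(self.params))
```

With \(\delta = e^{-5}\) and \(\varepsilon = e \times 10^{-4}\), this builds 5 rows of roughly \(10^4\) counters each; the \((a, b)\) pairs here are placeholders for the values in hash_params.txt.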

Task. The goal is to show that \(\widetilde{F}[i]\) as defined above provides a good estimate of \(F[i]\).

(a) [3 Points]

What is the memory usage of this algorithm (in Big-\(\mathcal{O}\) notation)? Give a one or two line justification for the value you provide.

(b) [4 Points]

Justify that for any \(1\leq i\leq n\): \[\widetilde{F}[i]\geq F[i].\]

(c) [11 Points]

Prove that for any \(1\leq i\leq n\) and \(1\leq j\leq \lceil \log(\frac{1}{\delta})\rceil\): \[\mathbb{E}\left[c_{j,h_j(i)}\right] \leq F[i] + \frac{\varepsilon}{e} t,\] where, as mentioned, \(t\) is the length of the stream.

(d) [11 Points]

Prove that: \[\mathbb{P}\left[\widetilde{F}[i] \leq F[i] + \varepsilon t\right] \geq 1-\delta.\] Hint: Use Markov inequality and the independence of hash functions.

Based on the proofs in parts (b) and (d), it can be inferred that \(\widetilde{F}[i]\) is a good approximation of \(F[i]\) for any item \(i\) such that \(F[i]\) is not very small (compared to \(t\)). In many applications (e.g., when the values \(F[i]\) have a heavy-tail distribution), we are indeed only interested in approximating the frequencies for items which are not too infrequent. We next consider one such application.

(e) [11 Points]

Warning.

This implementation question requires substantial computation time; Python implementations have been reported to take between 15 minutes and 1 hour. Therefore, we advise you to start early.

Dataset.

The dataset in streams/data contains the following files:

  1. words_stream.txt: Each line of this file is a number, corresponding to the ID of a word in the stream.

  2. counts.txt: Each line is a pair of numbers separated by a tab. The first number is the ID of a word and the second number is its associated exact frequency count in the stream.

  3. words_stream_tiny.txt and counts_tiny.txt: smaller versions of the dataset above that you can use for debugging your implementation.

  4. hash_params.txt: Each line is a pair of numbers separated by a tab, corresponding to parameters \(a\) and \(b\) which you may use to define your own hash functions (see explanation below).

Instructions.

Implement the aforementioned algorithm and run it on the dataset with parameters \(\delta = e^{-5}\), \(\varepsilon = e\times 10^{-4}\). (Note: with this choice of \(\delta\) you will be using 5 hash functions - the 5 pairs \((a,b)\) that you’ll need for the hash functions are in hash_params.txt). Then for each distinct word \(i\) in the dataset, compute the relative error \(E_r[i] = \frac{\widetilde{F}[i] - F[i]}{F[i]}\) and plot these values as a function of the exact word frequency \(\frac{F[i]}{t}\). You do not have to implement the algorithm in Spark.

The plot should use a logarithmic scale for both the \(x\) and \(y\) axes, and there should be ticks to allow reading the powers of 10 (e.g., \(10^{-1}\), \(10^0\), \(10^1\), etc.). The plot should have a title, as well as \(x\) and \(y\) axis labels. The exact frequencies \(F[i]\) should be read from the counts file. Note that words of low frequency can have a very large relative error. That is not a bug in your implementation, but just a consequence of the bound we proved in part (d).

Answer the following question by reading values from your plot: what is an approximate condition on a word's frequency in the document for it to have a relative error below \(1 = 10^0\)?

Hash functions.

You may use the following hash function (see example pseudocode), with \(p = 123457\), \(a\) and \(b\) values provided in the hash params file and n_buckets (which is equivalent to \(\left\lceil \frac{e}{\varepsilon} \right\rceil\)) chosen according to the specification of the algorithm. In the provided file, each line gives you \(a\), \(b\) values to create one hash function.

# Returns hash(x) for the hash function given by parameters a, b, p and n_buckets
def hash_fun(a, b, p, n_buckets, x):
    y = x % p
    hash_val = (a * y + b) % p
    return hash_val % n_buckets

Note: This hash function implementation produces outputs of value from \(0\) to \((\texttt{n\_buckets}-1)\), which is different from our specification in the Algorithm part. You can either keep the range as \(\{0, ..., \texttt{n\_buckets}-1\}\), or add 1 to the hash result so the value range becomes \(\{1, ..., \texttt{n\_buckets}\}\), as long as you stay consistent within your implementation.

What to submit

  1. Expression for the memory usage of the algorithm and justification. [part (a)]

  2. Proofs for parts (b)-(d).

  3. Log-log plot of the relative error as a function of the frequency. An approximate condition on a word frequency to have a relative error below 1. [part (e)]

  4. Submit the code to Gradescope. [part (e)]

Data Streams II (30 points)

We are in the same setup as the previous part, working with the stream \(S = \langle a_1, a_2, \ldots, a_t\rangle\) consisting of items from the set \(\{1, 2, \ldots, n\}\); the frequency of element \(i\) is again denoted by \(F(i)\). Impressed by the quality of your estimator for the frequency of occurrences of different objects in the telescope’s data stream, you have now been tasked with estimating more sophisticated summary statistics from this stream. We’ll now explore how to estimate the sum of squared frequencies of all the items in the data stream. That is, we wish to estimate \(M = \sum_{i = 1}^n F(i)^2\). Here’s a proposed algorithm.

Algorithm.

Fix a uniformly random hash function \(h: \{1, 2, \ldots, n\} \rightarrow \{-1, +1\}\) before observing the stream and initialize a counter \(Z = 0\). As each element \(a_i\) of the stream arrives, update \(Z \gets Z + h(a_i)\). At the end of the stream, output \(X = Z^2\) as the estimate of \(M\).

Note that since we fix \(h\) before we receive the data stream, an element \(j\) is treated consistently every time it shows up: \(Z\) is either incremented every time \(j\) shows up or is decremented every time \(j\) shows up. In the end, element \(j\) contributes \(h(j)\cdot F(j)\) to the final value of \(Z\).
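
To make the setup concrete, here is a small simulation sketch, assuming (as the note above implies) that \(h\) assigns each item a random sign fixed in advance, that \(Z\) accumulates \(h(a_i)\) over one pass of the stream, and that the output is \(X = Z^2\); the function name is our own:

```python
import random

def ams_estimate(stream, n, seed=0):
    # Fix h: {1, ..., n} -> {-1, +1} before reading the stream
    rng = random.Random(seed)
    h = {i: rng.choice((-1, 1)) for i in range(1, n + 1)}
    Z = 0
    for a in stream:  # one pass: each arrival of item a adds h(a) to Z
        Z += h[a]
    return Z * Z      # the estimate X = Z^2 of M = sum_i F(i)^2
```

Averaging this estimate over many independent draws of \(h\) (different seeds) illustrates the unbiasedness you are asked to prove in part (a).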

(a) [12 Points]

Prove that \[\mathbb{E}_h [X] = M\]

(b) [18 Points]

Prove that \[\operatorname{Var}(X) \leq 4M^2\] Note that this bound is not at all tight. A tighter bound can be easily obtained by a more careful derivation similar to the one required for proving the requested bound.

Thus, part (a) shows that the estimator designed in the given algorithm is unbiased, while part (b) gives a bound on its variance.

What to submit

  1. Proof that \(\mathbb{E}_h [X] = M\). [part (a)]

  2. Proof that \(\operatorname{Var}(X) \leq 4M^2\). [part (b)]