Assignment Submission All students should submit their
assignments electronically via Gradescope. No handwritten work will be
accepted. Math formulas must be typeset using LaTeX or
word-processing software that supports mathematical symbols (e.g.,
Google Docs, Microsoft Word). Simply sign up on Gradescope and use the
course code V84RPB. Please use your UW NetID if possible.
For the non-coding component of the homework, you should upload a PDF
rather than submitting as images. We will use Gradescope for the
submission of code as well. Please make sure to tag each part correctly
on Gradescope so it is easier for us to grade. There will be a small
point deduction for each mistagged page. For each question that
includes code, put all the code for that question into a single file
and upload it. Only files in text format (e.g., .txt, .py, .java) will be
accepted. There will be no credit for coding questions without
accepted. There will be no credit for coding questions without
submitted code on Gradescope, or for submitting it after the
deadline, so please remember to submit your code.
Coding You may use any programming language and
standard libraries, such as NumPy and PySpark, but you may not use
specialized packages and, in particular, machine learning libraries
(e.g. sklearn, TensorFlow), unless stated otherwise. Ask on the
discussion board whether specific libraries are allowed if you are
unsure.
Late Day Policy All students will be given two
no-questions-asked late periods. Only one late period can be used
per homework, and late periods cannot be used for project deliverables. A late
period lasts 48 hours from the original deadline (so if an assignment is due on
Thursday at 11:59 pm, the late period extends to Saturday at 11:59 pm
Pacific Time).
Academic Integrity We take academic
integrity extremely seriously. We strongly encourage students to
form study groups. Students may discuss and work on homework problems in
groups. However, each student must write down the solutions and the code
independently. In addition, each student should write down the set of
people with whom they interacted.
Discussion Group (People with whom you discussed ideas used in your
answers):
On-line or hardcopy documents used as part of your answers:
I acknowledge and accept the Academic Integrity clause.
(Signed)
Here, you will implement the soft margin SVM using different gradient descent methods, as described in Section 12.3.4 of the textbook. Our goal for this problem is to investigate the convergence of different gradient descent methods on a sample dataset and to think about the characteristics of these methods that lead to their different performance.
To recap, given a dataset of \(n\) samples \(\mathcal{D} = \left\{\left(\boldsymbol{x}^{(i)}, y^{(i)}\right)\right\}_{i=1}^n\), where every \(d\)-dimensional feature vector \(\boldsymbol{x}^{(i)} \in \mathbb{R}^d\) is associated with a label \(y^{(i)} \in \{-1,1\}\), to estimate the parameters \(\boldsymbol{\theta} = (\boldsymbol{w}, b)\) of the soft margin SVM, we can minimize the loss function: \[\begin{aligned} f(\boldsymbol{w},b; \mathcal{D}) &= \frac{1}{2} \|\boldsymbol{w}\|_2^2 + C \sum_{(\boldsymbol{x}^{(i)}, y^{(i)}) \in \mathcal{D}} \max\left\{0, 1 - y^{(i)}( \boldsymbol{w}\cdot \boldsymbol{x}^{(i)} + b )\right\} \\ &= \frac{1}{2} \|\boldsymbol{w}\|_2^2 + C \sum_{(\boldsymbol{x}^{(i)}, y^{(i)}) \in \mathcal{D}} L(\boldsymbol{x}^{(i)}, y^{(i)}; \boldsymbol{\theta}) \end{aligned}\]
In order to minimize the function, we first obtain the gradient with respect to \(\boldsymbol{\theta}\). The partial derivative with respect to \(w_j\), the \(j\)-th entry in the vector \(\boldsymbol{w}\), is:
\[\begin{aligned} \label{eq:grad_w} \partial_{w_j} f(\boldsymbol{w},b; \mathcal{D}) = \frac{\partial f(\boldsymbol{w},b; \mathcal{D})}{\partial w_j} = w_j + C \sum_{(\boldsymbol{x}^{(i)}, y^{(i)}) \in \mathcal{D}} \frac{\partial L(\boldsymbol{x}^{(i)}, y^{(i)}; \boldsymbol{\theta})}{\partial w_j} \end{aligned}\] where \[\begin{aligned} \frac{\partial L(\boldsymbol{x}^{(i)}, y^{(i)}; \boldsymbol{\theta})}{\partial w_j} = \left\{\begin{array}{cl} 0 & \text{if}\ y^{(i)}\left(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b \right) \ge 1 \\ -y^{(i)}x_j^{(i)} & \text{otherwise.} \end{array}\right. \end{aligned}\]
and the partial derivative with respect to \(b\) is: \[\begin{aligned} \label{eq:grad_b} \partial_b f(\boldsymbol{w},b;\mathcal{D}) = \frac{\partial f(\boldsymbol{w},b;\mathcal{D})}{\partial b} = C \sum_{(\boldsymbol{x}^{(i)}, y^{(i)}) \in \mathcal{D}} \frac{\partial L(\boldsymbol{x}^{(i)}, y^{(i)}; \boldsymbol{\theta})}{\partial b} \end{aligned}\] where \[\begin{aligned} \frac{\partial L(\boldsymbol{x}^{(i)}, y^{(i)}; \boldsymbol{\theta})}{\partial b} = \left\{\begin{array}{cl} 0 & \text{if}\ y^{(i)}\left(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + b \right) \ge 1 \\ -y^{(i)} & \text{otherwise.} \end{array}\right. \end{aligned}\] Since the direction of the gradient is the direction of steepest ascent of the loss function, gradient descent proceeds by iteratively taking small steps in the direction opposite to the gradient. The general framework of gradient descent is given in Algorithm [alg:svm].
Parameters: learning rate \(\eta\), batch size \(\beta\).
Randomly shuffle the training data; \(k \gets 0\).
Repeat for \(t = 1, 2, \ldots\) until convergence:
  \(B \gets \left\{\left(x^{(i)}, y^{(i)}\right) : \beta k+1 \leq i \leq \min\{\beta(k+1), n\}\right\}\)
  for each \(j\): \(w_j^{(t)} \gets w_j^{(t-1)} - \eta \cdot \partial_{w_j} f(\boldsymbol{w}^{(t-1)},b^{(t-1)}; B)\)
  \(b^{(t)} \gets b^{(t-1)} - \eta \cdot \partial_{b} f(\boldsymbol{w}^{(t-1)}, b^{(t-1)}; B)\)
  \(k \gets (k + 1) \mod \lceil n/\beta \rceil\)
  If the convergence criterion is met, break.
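As a concrete illustration of the objective, the gradients above, and a single update of Algorithm [alg:svm], here is a minimal NumPy sketch (this is not the required implementation; X is a matrix whose rows are a batch of feature vectors \(\boldsymbol{x}^{(i)}\) and y the corresponding labels):

import numpy as np

def svm_loss(w, b, X, y, C):
    # f(w, b; D) = 0.5 * ||w||^2 + C * sum of hinge losses
    margins = y * (X @ w + b)
    return 0.5 * np.dot(w, w) + C * np.maximum(0.0, 1.0 - margins).sum()

def svm_gradients(w, b, X, y, C):
    # Gradient vector from equation [eq:grad_w] (all components j) and the scalar
    # from equation [eq:grad_b], summed over the given batch.
    margins = y * (X @ w + b)
    violated = margins < 1.0                   # samples whose hinge term is active
    grad_w = w - C * (X[violated] * y[violated][:, None]).sum(axis=0)
    grad_b = -C * y[violated].sum()
    return grad_w, grad_b

def gd_step(w, b, X_batch, y_batch, C, eta):
    # One iteration of Algorithm [alg:svm]: both updates use only values from iteration t-1.
    grad_w, grad_b = svm_gradients(w, b, X_batch, y_batch, C)
    return w - eta * grad_w, b - eta * grad_b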
Task: Implement the SVM algorithm using the following gradient descent variants.
For all the variants use \({C = 100}\), \(\boldsymbol{w}^{(0)} = \boldsymbol{0}\), \(b^{(0)} = 0\). For all other parameters, use the values specified in the description of the variant.
Note: update the parameters \(\boldsymbol{w}\) and \(b\) on iteration \(t\) using the values computed on iteration \(t-1\). Do not update using values computed in the current iteration!
Batch Gradient Descent (BGD): When \(\beta = n\), in every iteration the algorithm uses the entire dataset to compute the gradient and update the parameters.
As a convergence criterion for batch gradient descent we will use \(\Delta_{\% loss}^{(t)} < \varepsilon\), where \[\begin{aligned} \Delta_{\% loss }^{(t)} = \frac{|f(\boldsymbol{w}^{(t-1)}, b^{(t-1)}; \mathcal{D}) - f(\boldsymbol{w}^{(t)}, b^{(t)}; \mathcal{D})|}{f(\boldsymbol{w}^{(t-1)}, b^{(t-1)}; \mathcal{D})}\times100 \label{eq:stop} \end{aligned}\]
Set \(\eta = 3\cdot10^{-7}\), \(\varepsilon = 0.25\).
Stochastic Gradient Descent (SGD): When \(\beta = 1\), in every iteration the algorithm uses one training sample at a time to compute the gradient and update the parameters.
As a convergence criterion for stochastic gradient descent we will use \(\Delta_{loss}^{(t)} < \varepsilon\), where \[\begin{aligned} \Delta_{loss}^{(t)} = \tfrac{1}{2}\Delta_{loss}^{(t-1)} + \tfrac{1}{2}\Delta_{\% loss}^{(t)}, \label{eq:svm-loss} \end{aligned}\] \(t\) is the iteration number, \(\Delta_{\% loss}^{(t)}\) is the same as above (equation [eq:stop]), and \(\Delta_{loss}^{(0)} = 0\).
Use \(\eta = 0.0001\), \(\varepsilon = 0.001\).
Mini-Batch Gradient Descent (MBGD): In every iteration the algorithm uses mini-batches of \(\beta\) samples to compute the gradient and update the parameters.
As a convergence criterion for mini-batch gradient descent we will use \(\Delta_{loss}^{(t)} < \varepsilon\), where \(\Delta_{loss}^{(t)}\) is the same as above (equation [eq:svm-loss]) and \(\Delta_{loss}^{(0)} = 0\).
Use \(\eta = 10^{-5}\), \(\varepsilon = 0.01\) and \(\beta = 20\).
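To illustrate the two stopping rules, here is a small sketch (not part of the required interface; loss_prev and loss_curr stand for \(f(\boldsymbol{w},b;\mathcal{D})\) evaluated on the full dataset at iterations \(t-1\) and \(t\)):

def pct_loss_change(loss_prev, loss_curr):
    # Delta_%loss^(t) from equation [eq:stop], in percent
    return abs(loss_prev - loss_curr) / loss_prev * 100.0

# BGD converges when pct_loss_change(loss_prev, loss_curr) < eps.
# SGD and MBGD use the smoothed criterion of equation [eq:svm-loss]:
# delta_loss starts at 0 and, after each iteration,
#     delta_loss = 0.5 * delta_loss + 0.5 * pct_loss_change(loss_prev, loss_curr)
# and the loop stops once delta_loss < eps.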
Task: Run your implementation on the data set in svm/data. The data set contains the following files:
features.txt: Each line contains the features (comma-separated values) of a single sample. It has 6414 samples (rows) and 122 features (columns).
target.txt: Each line contains the target variable (y = -1 or 1) for the corresponding row in features.txt.
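One minimal way to load these files is sketched below (assuming NumPy and the file layout described above; adjust the paths to your setup):

import numpy as np

X = np.loadtxt("svm/data/features.txt", delimiter=",")   # shape (6414, 122)
y = np.loadtxt("svm/data/target.txt")                    # shape (6414,), entries in {-1, 1}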
Task: Plot the value of the loss function \(f(\boldsymbol{w}^{(t)},b^{(t)}; \mathcal{D})\) vs. the iteration number \(t\), starting from \(t=0\). Label the plot axes. The diagram should have the graphs of all three variants on the same plot. Report the total time (wall-clock time, as opposed to the number of iterations) each of the gradient descent variants takes to converge. What do you infer from the plots and the convergence times? Explain in 4-6 sentences.
Sanity Check 1: The value of the loss function at iteration number \(t = 0\) must be around 641,400.
Sanity Check 2: Batch GD should converge within 10-300 iterations, SGD within 500-3000 iterations, and Mini-Batch GD somewhere in between. However, the number of iterations may vary greatly due to randomness. If your implementation consistently takes longer, there might be a bug.
Sanity Check 3: The expected total run time for all 3 methods is around 5-15 minutes but might vary depending on the implementation.
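For the timing and the combined plot requested above, a sketch along these lines is enough; run_bgd, run_sgd, and run_mbgd are hypothetical placeholders for your own training routines, each returning the list of loss values \(f(\boldsymbol{w}^{(t)},b^{(t)};\mathcal{D})\) for \(t = 0, 1, \ldots\):

import time
import matplotlib.pyplot as plt

histories = {}
for name, run in [("BGD", run_bgd), ("SGD", run_sgd), ("MBGD", run_mbgd)]:
    start = time.time()
    histories[name] = run()                      # loss values per iteration, starting at t = 0
    print(f"{name}: {time.time() - start:.1f} s to converge")

for name, losses in histories.items():
    plt.plot(range(len(losses)), losses, label=name)
plt.xlabel("Iteration t")
plt.ylabel("Loss f(w, b; D)")
plt.legend()
plt.show()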
Plot of \(f(\boldsymbol{w}^{(t)},b^{(t)}; \mathcal{D})\) vs. the number of updates (\(t\)); all 3 graphs should be in the same plot.
Total time taken for convergence by each of the gradient descent variants.
Interpretation of the plot and convergence times.
Submit the code to Gradescope.
In this question, you will implement the 2\(^\text{nd}\)-order random walk of node2vec, which embeds nodes with similar network neighborhoods close together in the feature space. A detailed description of the problem setup and background is provided in the template code on Colab at https://colab.research.google.com/drive/1x9TbF8w88vFc5-GaaHE-sdVa7ncMapAj.
Task. Implement a biased random walk to generate item sequences for Facebook pages, and create positive and negative training samples from the sequences. Train a word2vec (skip-gram) model to learn embeddings for graph nodes, and evaluate these embeddings qualitatively by examining the k-nearest neighbors of selected Facebook page queries.
Dataset The dataset for this question is a page-page graph of verified Facebook sites. Nodes represent official Facebook pages while the links are mutual likes between sites. Node features are extracted from the site descriptions that the page owners created to summarize the purpose of the site. This graph was collected through the Facebook Graph API in November 2017 and restricted to pages from 4 categories which are defined by Facebook. These categories are: politicians, governmental organizations, television shows and companies. The task related to this dataset is multi-class node classification for the 4 site categories. The dataset is available at http://snap.stanford.edu/data/facebook_large.zip
We provide a networkx graph constructed from the full dataset. Your task is to construct two smaller subgraphs, each the maximal subgraph in which every node has degree at least \(k\) (the \(k\)-core), which will keep the runtime of the random walks and skip-gram training in a feasible range:
Small Graph with \(k = 15\)
Tiny Graph with \(k = 30\)
Hint 1: See documentation for networkx.k_core
Hint 2: Tiny Graph should have \(804\) nodes and \(24266\) edges.
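A possible construction with networkx is sketched below (here `graph` stands for the provided full networkx graph; note that self-loops, if present, must be removed first because k_core does not allow them):

import networkx as nx

g = graph.copy()
g.remove_edges_from(nx.selfloop_edges(g))   # k_core requires a graph without self-loops
small_graph = nx.k_core(g, k=15)            # maximal subgraph with all degrees >= 15
tiny_graph = nx.k_core(g, k=30)             # expected: 804 nodes, 24266 edges
print(small_graph.number_of_nodes(), small_graph.number_of_edges())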
Report the number of nodes and edges in the Small Graph.
Implement the Biased Random Walk of node2vec, which trades off between local and global neighborhood views of a node \(u\) via breadth-first search (BFS) and depth-first search (DFS) respectively. node2vec contains two parameters to interpolate between BFS and DFS:
Return Parameter (p): controls the likelihood of returning to the previous node, i.e., of immediately revisiting a node in the walk
In-Out Parameter (q): controls the ratio of moving outward (DFS) vs. inward (BFS). A high q value biases the walk toward visiting local nodes, and a low q value toward distant nodes.
Recall the un-normalized random walk transition probabilities from Slide 43 in the lecture notes:
to return to previous node = \(1/p\)
to visit a local node = \(1\)
to move forward = \(1/q\)
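For intuition, here is a sketch of how these weights could drive the choice of the next node inside next_step, using numpy.random.choice as suggested in the hint further below (variable names are illustrative; the provided skeleton defines the actual interface):

import numpy as np

def next_step(graph, previous, current, p, q):
    # Un-normalized 2nd-order transition weights for each neighbor of the current node.
    neighbors = list(graph.neighbors(current))
    weights = []
    for nbr in neighbors:
        if nbr == previous:                      # return to the previous node
            weights.append(1.0 / p)
        elif previous is not None and graph.has_edge(nbr, previous):
            weights.append(1.0)                  # "local" node: also a neighbor of the previous node
        else:
            weights.append(1.0 / q)              # move outward, away from the previous node
    probs = np.array(weights) / np.sum(weights)  # normalize before sampling
    return np.random.choice(neighbors, p=probs)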
Instructions. Modify the provided code skeleton to implement and return a random walk sequence with \(p=1\), \(q=1\), \(iters=5\), and \(steps=10\) on the Small Graph, where \(iters\) is the number of iterations of random walks over all graph nodes and \(steps\) is the number of steps taken in a single random walk starting from a specific node.
Please note that each random walk produced by the provided template method random_walk(), which starts from every possible node, will be a list of \(\text{length}=steps\), including the starting node. This corresponds to walking \(steps-1\) steps from the starting node. The final random walk sequence returned will be a concatenation of \(num\_nodes\) random walks over \(iters=5\) iterations. Note that the Tiny Graph is provided only as a sanity check to help you debug your code. Submit the code snippet of your random walk (a single Colab cell) as a png or jpeg file.
Hint: For consistency, the choice of function to carry out random selection is important. In next_step, please use numpy.random.choice().
Sanity Check: On Tiny Graph, the sequence of random walks that is returned by the provided code skeleton method random_walk() should have \(\text{length}=4020\).
We will now utilize the Small Graph random walk sequence to create positive and negative samples. We will learn embeddings for our Facebook pages by using these samples to train a skip-gram model with negative sampling (SGNS). In simple terms, a skip-gram model tries to predict the context words around a given center word. A more detailed overview of skip-gram modeling can be found at this blog post and the Stanford CS 224D: Deep Learning for NLP Lecture Notes, though it is not required for this question. You will be provided template code to create positive and negative training examples from the random walk sequence, initialize the dataset and keras model, and plot the training loss. You will also be provided hyperparameters to train the model.
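For intuition only (the provided generate_examples() in the template already does this for you), Keras ships a helper that turns one walk, treated as a sentence of node IDs, into such (target, context) pairs with negative samples; here walk is a single random-walk sequence, num_nodes the number of nodes in the Small Graph, and the window size and negative-sampling rate are illustrative rather than the required hyperparameters:

from tensorflow.keras.preprocessing.sequence import skipgrams

pairs, labels = skipgrams(walk, vocabulary_size=num_nodes + 1,
                          window_size=5, negative_samples=4.0)
# pairs[i] = (target, context); labels[i] = 1 for a true co-occurrence, 0 for a negative sample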
Report:
The training loss plot of your skip-gram model
The percentage drop in training loss after \(10\) epochs
Sanity Check: the output of generate_examples() in the template code on the Small Graph should be:
Targets shape: (2498691,)
Contexts shape: (2498691,)
Labels shape: (2498691,)
Weights shape: (2498691,)
Hint: The total runtime of skip-gram training should be under 10 minutes on Small Graph, and under 2 minutes on Tiny Graph.
Analyze the embeddings learnt with the random walk and skip-gram model by computing the cosine similarity of the given Facebook page queries with all other embeddings learnt on the Small Graph. You will be required to write this code within the provided skeleton. For those unfamiliar with keras, it is recommended to examine the template code provided in create_model() alongside the visualization generated by keras.utils.plot_model() to observe where the item embeddings are available.
Report the 5-nearest-neighbors for each of the following four nodes: "Glee", "United States Air Force", "NASA's Curiosity Mars Rover", and "Barack Obama".
Hint 1: Don't forget to L2-normalize your embeddings with NumPy or TensorFlow! This essentially reduces cosine similarity to a maximum inner product search (MIPS).
Hint 2: see documentation for tf.math.top_k
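Putting the two hints together, a minimal sketch of the lookup could look like this (assuming `embeddings` is the item-embedding matrix pulled out of the trained keras model and `query_idx` is the row index of a query page; both names are illustrative):

import tensorflow as tf

norm_emb = tf.math.l2_normalize(embeddings, axis=1)   # unit-length rows: cosine similarity becomes a dot product
scores = tf.linalg.matvec(norm_emb, norm_emb[query_idx])
values, indices = tf.math.top_k(scores, k=6)          # k=6 because the top hit is the query itself
# indices[1:] are the 5 nearest neighbors; map them back to page names for the report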
Sanity Check 1: On Tiny Graph, the L2 norm of the embeddings is \(||embeddings||_2 = 44.6592\).
Sanity Check 2: On Tiny Graph, the 5-nearest neighbors for the query "Brooklyn Nine Nine" are "You The Jury", "MasterChef", "Houdini & Doyle", "Famous", and "The Exorcist FOX".
Upload your code to Gradescope as an .ipynb file. The code is required to ensure that each person has written their own solution. We may choose to run your code to validate your solution, but you do not have to worry about implementation details.
Report the number of nodes and edges after constructing the Small Graph.
Submit the code snippet of your random walk sequence generation (single Colab cell provided in the template code).
Include the plot of the skip-gram training loss and the percentage drop in training loss after \(10\) epochs, with embeddings learnt on the Small Graph.
Include in your writeup the top 5 most similar Facebook pages for each of the queries in part (d), with embeddings learnt on the Small Graph.
In this question, we are going to follow an algorithm for determining the approximate frequencies of the unique items in a data stream. We will specifically investigate how we can get a feasible approximation that uses less space than the naive solution but is still a good estimate of the actual frequencies. We will also experiment with a real stream dataset to empirically investigate our claims.
You are an astronomer at the Space Telescope Science Institute in Baltimore, Maryland, in charge of the petabytes of imaging data they recently obtained. According to the news report linked in the previous sentence, “...The amount of imaging data is equivalent to two billion selfies, or 30,000 times the total text content of Wikipedia. The catalog data is 15 times the volume of the Library of Congress.”
This data stream has images of everything out there in the universe, ranging from stars, galaxies, asteroids, to all kinds of awesome exploding/moving objects. Your task is to determine the approximate frequencies of occurrences of different (unique) items in this data stream.
We now introduce our notation for this problem. Let \(S = \langle a_1, a_2, \ldots, a_t \rangle\) be the given data stream of length \(t\). Let us denote the items in this data stream as being from the set \(\{1, 2, \ldots, n\}\). For any \(1\leq i\leq n\), we denote \(F[i]\) to be the number of times \(i\) has appeared in \(S\). Our goal is then to have good approximations of the values \(F[i]\) for all \(1\leq i\leq n\) at all times.
The naïve way to do this is to just keep the counts for each item \(1\leq i\leq n\) separately. However, this will require \(\mathcal{O}(n)\) space which, in our application, is clearly infeasible. We shall see that it is possible to approximate these counts using a much smaller amount of space. To do so, we consider the algorithm explained below.
Algorithm. The algorithm has two parameters \(\delta\) and \(\varepsilon >0\), and \(\left\lceil \log\frac{1}{\delta}\right\rceil\) independent hash functions \[h_j:\{1,2,\ldots, n\} \rightarrow \{1,2, \ldots, \left\lceil \frac{e}{\varepsilon} \right\rceil\}.\] Note that in this problem, \(\log\) denotes the natural logarithm. For each bucket \(b\) of each hash function \(j\), the algorithm has a counter \(c_{j, b}\) that is initialized to zero.
As each element \(i\) arrives in the data stream, it is hashed by each of the hash functions, and the count for the \(j\)-th hash function \(c_{j, h_j(i)}\) is incremented by \(1\).
Note: You can assume that the hash functions are independent and totally random (see: https://courses.csail.mit.edu/6.851/spring12/scribe/lec10.pdf).
For any \(1\leq i\leq n\), we define \(\widetilde{F}[i] = \min_{j} \{c_{j,h_j(i)}\}\) as our estimate of \(F[i]\).
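The counter array and the two operations described above can be sketched compactly as follows (hash_funcs is a placeholder for the \(\left\lceil \log\frac{1}{\delta}\right\rceil\) hash functions, each mapping an item to a bucket index):

import numpy as np

class CountMinSketch:
    def __init__(self, hash_funcs, n_buckets):
        self.hash_funcs = hash_funcs                    # one h_j per row of counters
        self.counters = np.zeros((len(hash_funcs), n_buckets), dtype=np.int64)

    def update(self, item):
        # As item i arrives, increment c_{j, h_j(i)} for every hash function j.
        for j, h in enumerate(self.hash_funcs):
            self.counters[j, h(item)] += 1

    def estimate(self, item):
        # F~[i] = min_j c_{j, h_j(i)}
        return min(self.counters[j, h(item)] for j, h in enumerate(self.hash_funcs))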
Task. The goal is to show that \(\widetilde{F}[i]\) as defined above provides a good estimate of \(F[i]\).
What is the memory usage of this algorithm (in Big-\(\mathcal{O}\) notation)? Give a one or two line justification for the value you provide.
Justify that for any \(1\leq i\leq n\): \[\widetilde{F}[i]\geq F[i].\]
Hint: You can show that the inequality holds for \(\widetilde{F}[i]\) by showing that it holds for the minimizing \(j\), or that it holds for all \(j\), where \(c_{j, h_j(i)}\) denotes the count for the \(j\)-th hash function.
Prove that for any \(1\leq i\leq n\) and \(1\leq j\leq \lceil \log(\frac{1}{\delta})\rceil\): \[\mathbb{E}\left[c_{j,h_j(i)}\right] \leq F[i] + \frac{\varepsilon}{e} t,\] where, as mentioned, \(t\) is the length of the stream.
Prove that: \[\mathbb{P}\left[\widetilde{F}[i] \leq F[i] + \varepsilon t\right] \geq 1-\delta.\] Hint 1: Use Markov's inequality and the independence of the hash functions.
Hint 2: Use the fact that \(\widetilde{F}[i] = \min_{j} \{c_{j,h_j(i)}\}\) and thus \(\widetilde{F}[i] \leq c_{j,h_j(i)}\) for all \(j\).
Based on the proofs in parts (b) and (d), it can be inferred that \(\widetilde{F}[i]\) is a good approximation of \(F[i]\) for any item \(i\) such that \(F[i]\) is not very small (compared to \(t\)). In many applications (e.g., when the values \(F[i]\) have a heavy-tail distribution), we are indeed only interested in approximating the frequencies for items which are not too infrequent. We next consider one such application.
This implementation question requires substantial computation time: a Python implementation is reported to take between 15 minutes and 1 hour. Therefore, we advise you to start early.
The dataset in streams/data contains the following files:
words_stream.txt: Each line of this file is a number, corresponding to the ID of a word in the stream.
counts.txt: Each line is a pair of numbers separated by a tab. The first number is the ID of a word and the second number is its exact frequency count in the stream.
words_stream_tiny.txt and counts_tiny.txt are smaller versions of the dataset above that you can use for debugging your implementation.
hash_params.txt: Each line is a pair of numbers separated by a tab, corresponding to parameters \(a\) and \(b\) which you may use to define your own hash functions (see explanation below).
Implement the aforementioned algorithm and run it on the dataset with parameters \(\delta = e^{-5}\), \(\varepsilon = e\times 10^{-4}\). (Note: with this choice of \(\delta\) you will be using 5 hash functions; the 5 pairs \((a,b)\) that you’ll need for the hash functions are in hash_params.txt.) Then, for each distinct word \(i\) in the dataset, compute the relative error \(E_r[i] = \frac{\widetilde{F}[i] - F[i]}{F[i]}\) and plot these values as a function of the exact word frequency \(\frac{F[i]}{t}\). You do not have to implement the algorithm in Spark.
The plot should be a scatter plot and should use a logarithmic scale for both the \(x\) and the \(y\) axes, and there should be ticks to allow reading the powers of 10 (e.g. \(10^{-1}\), \(10^0\), \(10^1\), etc.). The plot should have a title, as well as \(x\) and \(y\) axis labels. The exact frequencies \(F[i]\) should be read from the counts file. Note that words of low frequency can have a very large relative error. That is not a bug in your implementation, but just a consequence of the bound we proved above.
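A minimal matplotlib sketch of such a plot (freqs and rel_errors are placeholders for the arrays of exact frequencies \(\frac{F[i]}{t}\) and relative errors \(E_r[i]\) computed by your implementation):

import matplotlib.pyplot as plt

plt.scatter(freqs, rel_errors, s=4)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Exact word frequency F[i] / t")
plt.ylabel("Relative error E_r[i]")
plt.title("Count-min estimate: relative error vs. word frequency")
plt.show()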
Answer the following question by reading values from your plot: What is an approximate condition on a word frequency in the document to have a relative error below \(1 = 10^0\)?
You may use the following hash function (see the example implementation below), with \(p = 123457\), the \(a\) and \(b\) values provided in the hash params file, and n_buckets (which is equivalent to \(\left\lceil \frac{e}{\varepsilon} \right\rceil\)) chosen according to the specification of the algorithm. In the provided file, each line gives you the \(a\), \(b\) values to create one hash function.
# Returns hash(x) for the hash function given by parameters a, b, p, and n_buckets
def hash_fun(a, b, p, n_buckets, x):
    y = x % p
    hash_val = (a * y + b) % p
    return hash_val % n_buckets
Note: This hash function implementation produces outputs of value from \(0\) to \((\texttt{n\_buckets}-1)\), which is different from our specification in the Algorithm part. You can either keep the range as \(\{0, ..., \texttt{n\_buckets}-1\}\), or add 1 to the hash result so the value range becomes \(\{1, ..., \texttt{n\_buckets}\}\), as long as you stay consistent within your implementation.
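For instance, the five hash functions could be built from the parameter file as follows (a sketch; the path and the integer parsing are assumptions based on the file description above):

import math

p = 123457
eps = math.e * 1e-4
n_buckets = math.ceil(math.e / eps)                     # ceil(e / eps) = 10000

hash_funcs = []
with open("streams/data/hash_params.txt") as f:
    for line in f:
        a, b = (int(v) for v in line.split())
        # default arguments freeze the current a, b for this hash function
        hash_funcs.append(lambda x, a=a, b=b: hash_fun(a, b, p, n_buckets, x))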
Sanity Check 1: On the tiny dataset, the actual word frequencies should be roughly in the range \(10^{-7}\) to \(10^{-2}\) and the corresponding relative errors should be roughly in the range \(10^{-5}\) to \(10^{5}\).
Sanity Check 2: On the tiny dataset, words with frequency roughly in the range \(10^{-5}\) to \(10^{-4}\) have a relative error below \(10^{-1}\).
Expression for the memory usage of the algorithm and justification. [part (a)]
Proofs for parts (b)-(d).
Log-log plot of the relative error as a function of the frequency. An approximate condition on a word frequency to have a relative error below 1. [part (e)]
Submit the code to Gradescope. [part (e)]