Project 3: Part 2

Overview

Due Fri, Feb 23 at 11:30pm

In parts 2 and 3, you will use the data structures and algorithms you implemented so far to build a basic search engine similar to Google Search or Microsoft Bing.

Note that we have NOT covered the algorithm for doing this in class: part of the challenge is figuring out how to implement new algorithms based mainly on a written description. If you're running into issues or have questions, we strongly encourage you to come to office hours.

General overview

High-level context

A search engine is, in a nutshell, an algorithm that accepts a query from the user and computes how relevant a given document is to that query.

In our case, a query is whatever the user types into our search engine, and each individual web page is a single document.

Industrial strength search engines work by combining hundreds of different algorithms for computing relevance, but we will implement just two: Term Frequency and Inverse Document Frequency (TF-IDF) with Cosine Similarity ranking, and (in part 3) PageRank.

Detailed descriptions of these algorithms are included later in this spec.

Part 2a: Getting webpages

Our search engine won't be capable of searching the entire internet: we need to download a copy of every single webpage in existence to do that, which is probably not feasible.

Instead, we'll have our search engine work against a much smaller subset of the internet that we've downloaded for you.

First, start by downloading this data zip file. This zip file contains the web pages that your search engine will be searching against.

Extract the zip file, and add it to your project3 folder. (So, the data folder should live on the same level as your src and experimentdata folders.

Now, take a look inside the data folder. This folder should contain 7 different folders – each one of these folders represents a different "snapshot" or "fraction" of the internet.

When setting up your search engine, you must pick one of these "snapshots": your engine will search for only the documents listed in that folder. These snapshots vary in size. The smaller ones (such as 'gutenberg') folders are meant to help with testing, the larger ones (such as 'wikipedia') are meant to help with stress-testing.

You can find more info about each snapshots in the data/README.md file.

Part 2b: Run the search engine

Task: make sure you can run the search engine.

We will next make sure you can run the search engine so you can test your results as you go.

First, open the Main.java file, which is located within the search package. The only thing you really need to care in this file is the first DATA_FOLDER_NAME constant at the top of the class.

This constant must be the name of one of the folders in the data folder. You should change this constant when testing your code, but for now, it's ok to leave it alone.

Next, run Main.java (which is located inside the search package). It may take a few seconds to a minute for the search engine to load the files, depending on which data folder you're looking at.

Once it's ready, visit the link printed to the console in your web browser – it should be something like http://localhost:8080.

Hopefully, you should see a working website! Make sure everything works by running a few queries. Since we haven't implemented the actual search algorithm yet, the results will be in no particular order, but you should see some results appear nearly instantaneously.

Expectations on loading and querying speed

While most of the data folders we give you should load in only a few seconds, the "wikipedia" and "wikipedia-with-spam" datasets may both take over a minute. The search engine, however, should be a little faster if you try running it again on the same data folder – it will cache some extra information within that folder to help speed up future runs.

That said, once everything is loaded, the search engine itself should be very fast, even with the Wikipedia datasets. If it isn't, that may be a sign that one of your data structures or algorithms is implement inefficiently somehow. If you ever find that it takes longer then a quarter second to half a second for any query to complete, you should treat that as a red flag.

You may also see some "malformed link exception" warnings on the first run. This is normal: those links genuinely are malformed. (Our parser insists that all links follow the full URL specification, and isn't as lenient about parsing links as web browsers typically are). You can ignore these warnings.

Dealing with out-of-memory errors or slow loading times

For some of the larger datasets ("wikipedia" and "wikipedia-with-spam"), you may run into out-of-memory issues or very slow loading times – specifically, you may see Java throw an java.lang.OutOfMemoryError.

This error can occur because our search engine will attempt to load all of the web pages in the dataset into memory -- if we want to load many data pages, we may end up needing to allocate more memory then Java permits by default.

To fix this, do the following:

In the top bar in Eclipse, click Run then Run Configurations.
On the left pane, under Java Applications, select the application that's giving you the error. Here, it's the Main application.
One you have selected the right application, click the Arguments tab on the right.
Under VM Arguments, add the text "-Xmx5G" (without quotation marks). This will allow Eclipse to grow the heap size up to 5 GB of memory.
Click Run. This will save the setting and run your program again.

What is 'localhost'? What is a port?

Part 2c: Implement code to generate a TF-IDF vector

Task: Finish implementing TfIdfAnalyzer's constructor (and all associated helper methods).

The first ranking metric we will write is Term Frequency and Inverse Document Frequency (TF-IDF) ranking. We will start by implementing a method that produces a TF-IDF vector for each document we were given. Each TF-IDF vector contains data on how "important" a given word was to that document.

Core algorithm

To compute the TF-IDF vector for a single document (a webpage), we take that document and compute the TF-IDF score for each unique word. For each unique word, or term, we:

Compute TF: Term Frequency

The term frequency (TF) is a measure of how frequently a term appears in a document. We compute it using this formula:

$$\text{TF}(\text{term}, \text{document}) = \frac {\text{Number of times the term appears in doc}} {\text{Total number of words in a doc}}$$

Notice the more the term appears in the document, the higher the TF score. This hopefully makes intuitive sense: if the user's search query is "donuts" and if the word "donuts" appears many times in the webpage, then the webpage is probably all about donuts and is something the user will want to read.
IDF: Inverse Document Frequency

The inverse document frequency (IDF) measures how frequently a term appears in all documents using this formula. (Note: $\ln(...)$ means "log base $e$".)

$$\text{IDF}(\text{term}) = \begin{cases} 0 & \text{if term doesn't appear in any doc} \\ \ln\left(\frac {\text{Total number of docs}} {\text{Number of docs containing the term}} \right) & \text{Otherwise} \end{cases} $$

Note that when computing the TF for a particular term, we'll obtain a different TF per each document. In contrast, when computing IDF, we do so by looking at every single document, rather then any single one.

Also notice that if a word appears in many different documents, the IDF score is smaller. This is because the more frequently a word appears, the less important it tends to be. For example, words like "the" or "of" appear in nearly every document – so, we don't really want it to be a distinguishing feature.

We take the natural log because it seems to work well in practice: statistically, only a small handful of words will be common across a majority of documents and we want to penalize those words more.

(If you'd like to learn more about the underlying theoretical justification for IDF's formula, see this research paper.)
Combined TF-IDF score

Finally, we compute the final TF-IDF relevance score for the term by multiplying the two above numbers together:

$$\text{Relevance}(\text{term}, \text{document}) = \text{TF}(\text{term}, \text{document}) \times \text{IDF}(\text{term})$$

We store each term and its corresponding TF-IDF score for a particular document inside a IDictionary<String, Double>. This dictionary is the TF-IDF vector.

The TfIdfAnalyzer.computeAllDocumentTfIdf(...) method is responsible for computing the TF-IDF vector for every document: it should return a IDictionary<URI, IDictionary<String, Double>>.

We will store the resulting dictionary in a field for later use.

Example

Let's try running through an example of computing TF-IDF in action.

Suppose our search engine contained the following three documents (a.k.a. three webpages):

Document A: "the mouse played with the cat"
Document B: "the quick brown fox jumped over the lazy dog"
Document C: "dog 1 and dog 2 ate the hot dog"

We compute the TF-IDF vector for each document like so:

$$ \newcommand{\Relevance}[2]{\text{Relevance}(\text{#1},\, \text{Doc}_{#2})} \newcommand{\f}[2]{\frac{#1}{#2}} \newcommand{\logp}[1]{\ln\left(#1\right)} \newcommand{\expr}[4]{\f{#1}{#2} \times \logp{\f{#3}{#4}}} $$

Document A: "the mouse played with the cat"

$$ \begin{align*} \Relevance{the}{A} &= \expr{2}{6}{3}{3} = 0.0 \\ \Relevance{mouse}{A} &= \expr{1}{6}{3}{1} = 0.183102\\ \Relevance{played}{A} &= \expr{1}{6}{3}{1} = 0.183102\\ \Relevance{with}{A} &= \expr{1}{6}{3}{1} = 0.183102\\ \Relevance{cat}{A} &= \expr{1}{6}{3}{1} = 0.183102\\ \end{align*} $$

So, the final dictionary will look like {"the"=0.0, "mouse"=0.183102, "played"=0.183102, "with"=0.183102, "cat"=0.183102}. We'll omit the final dictionary for the other two examples to save space.
Document B: "the quick brown fox jumped over the lazy dog"
$$ \begin{align*} \Relevance{the}{B} &= \expr{2}{9}{3}{3} = 0.0 \\ \Relevance{quick}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{brown}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{fox}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{jumped}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{over}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{lazy}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{dog}{B} &= \expr{1}{9}{3}{2} = 0.045052\\ \end{align*} $$
Document C: "dog 1 and dog 2 ate the hot dog"

$$\begin{align*} \Relevance{dog}{C} &= \expr{3}{9}{3}{2} = 0.135155 \\ \Relevance{1}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{and}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{2}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{ate}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{the}{C} &= \expr{1}{9}{3}{3} = 0.0 \\ \Relevance{hot}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \end{align*} $$

Ultimately, we'll obtain a unique TF-IDF vector for each document. We'll use these vectors in the next subpart.

Other notes and hints

Computing the TF and IDF scores for a single document can be somewhat expensive: not only do we need to loop through the entire document word-by-word to compute the TF, we need to loop through every single document in our search engine to compute the IDF.

To help make this more efficient, you should precompute as much information as possible. For more details on how to do so, see the comments in TfIdfAnalyzer's constructor.
You will find that you need to loop over your lists and dictionaries fairly frequently using their iterator. This can be a little inconvenient: having to manually call .iterator() and write a while loop each time is clunky.

To help clean up your code, try using for-each loops instead. That is, instead of doing:
```
Iterator<KVPair<String, Double>> iter = someDict.iterator();
while (iter.hasNext()) {
    KVPair<String, Double> pair = iter.next();
    // ...
}
```
...you should instead do:
```
for (KVPair<String, Double> pair : someDict) {
    // ...
}
```
These two code snippets are exactly equivalent, but the second version looks much cleaner (and requires typing fewer characters!).
When iterating over some data structure, do NOT attempt to simultaneously modify the same data structure!

Just reading from a data structure we're iterating over is completely ok, but writing to the data structure can end up breaking your code. For example:
```
for (KVPair<String, Double> pair : someDict) {
    // Dangerous!
    someDict.remove(pair.getKey());
    someDict.put(someNewKey(), 0.0);

    // Not as dangerous, since we're not adding a new key-value pair, 
    // but should still be avoided
    someDict.put(pair.getKey(), pair.getValue() + 1);

    // Completely safe -- nothing wrong with doing this
    someDict.get(pair.getKey());
    someDict.containsKey(someRandomKey());
}
```
The reason why you should never modify something you're iterating over like this is because we designed our iterators under the assumption that the client will never modify the data while an iterator is "live". If you break that precondition, you will get unpredictable results.

(Java's default data structures are a little more robust and will attempt to raise a ConcurrentModificationException if you try doing this: for the sake of simplicity, we didn't implement this functionality in this class.)

What you should instead do is modify your algorithms so they "build up" new data structures. Loop over your old data structure, and add the key-value pairs you need to a different, initially empty, data structure.

Part 2d: Implement cosine similarity

Task: Finish implementing TfIdfAnalyzer.computeRelevance(...).

Next, we will implement the logic to actually determine how relevant an algorithm is to a query.

Core logic

To compare a query against a document, we will need to perform three steps:

Find the TF-IDF vector for the document. This should be an easy, $O(1)$ lookup since we already computed the TF-IDF vector for that document in part 2b.
Compute the TF-IDF vector for the query. You should be able to reuse the bulk of the code you wrote in the previous parts to do this (though you may have to do some mild refactoring).

Note: technically, we are treating the query as if it were a new document. However, you should not recompute the IDF values: just use the ones you computed earlier.

Compute the cosine similarity between the document vector and the query vector.

We will do this by comparing the TF-IDF vector for your query against all the document TF-IDF vectors and compute how similar your query vector is to each document by exploiting properties of the cosine function and the dot product.

However, since not everybody may have taken math classes about this kind of material, we will instead just provide you with the pseudocode for this material instead:

double computeRelevance(query, pageUri):
    documentVector = look up document TF-IDF vector using pageUri
    queryVector = compute query TF-IDF vector

    numerator = 0.0
    foreach word in query:
        docWordScore = if   documentVector.containsKey(word)
                       then documentVector.get(word) 
                       else 0.0
        queryWordScore = queryVector.get(word)
        numerator += docWordScore * queryWordScore
    
    denominator = norm(documentVector) * norm(queryVector)

    if   denominator != 0
    then return numerator / denominator
    else return 0.0

double norm(vector):
    output = 0.0
    foreach pair in vector:
        score = pair.getValue()
        output += score * score
    return sqrt(output)

Make the pseudocode more efficient.

However, the pseudocode we gave you is not particularly efficient: you should be able to combine several loops and precompute some pieces of information needed by this method.

When implementing this method, you should add at least one other field to your TfIdfAnalyzer class containing information you can precompute. Modify your constructor to precompute and store this information.

Why is this called 'cosine' similarity? Why does this work?

Ok, here's what's actually happening.

The document and query vectors we're computing are in reality vectors of length $n$, where $n$ is the number of unique words across all the documents and the current query. Each "index" of the vector corresponds to a specific word and stores the TF-IDF score for that term.

We represent this vector using a dictionary because our vector is likely to be sparse: if we have many documents, an arbitrary vector is likely to contain zero scores for most of the $n$ words.

On a higher level, what we've basically done is figured out how to convert our (text) document and query into mathematical vectors. We can now apply loads of interesting mathematical properties to manipulate this representation of our document in query.

In particular, we're interested in trying to measure how "similar" two vectors are. We'll do so by attempting to find the "angle" between our query and document vectors: the smaller the angle, the more similar the two are.

To do this, we can take advantage of the following identity:

$$ \textbf{a} \cdot \textbf{b} = ||\textbf{a}|| \, ||\textbf{b}|| \cos(\theta) $$

...where $\textbf{a}$ and $\textbf{b}$ are vectors, the $\cdot$ operation is the dot product, and $||\textbf{a}||$ is the magnitude of the vector (we take the Euclidean norm here).

We then rearrange and compute the following:

$$ \cos(\theta) = \frac{\textbf{a} \cdot \textbf{b}}{||\textbf{a}|| \, ||\textbf{b}||} $$

We then use the computed value for $\cos(\theta)$ as our similarity score.

We don't bother taking the arccosine of this quantity because it'll just perform the same transformation on all of our scores, which is a waste of time. We also get to preserve the property that similar vectors will have a higher score, and that all scores will fall between 0.0 and 1.0 (since every score in our vector is positive).

We must, however, handle one edge case: where one or both of the norms are equal to zero. This happens when the user query uses completely unique words that were never before seen in any document, for example. In this case, the output of the cosine similarity metric is undefined, since we're dividing by zero. However, if the query contains no previously seen words, we're not going to be able to classify anything anyways, so we effectively give up and just return 0.0 as a placeholder value.

Part 2e: Re-run your search engine

Task: Try running your search engine again.

After completing the previous parts, try running your search engine again on the different data folders.

As before, it may take some time for your search engine to finish loading all the pages and finish setting up your TF-IDF analyzer. However, once everything is set up, your search engine should be able to answer queries nearly instantaneously.

But unlike before, your engine should be doing much better this time! You should be getting back mostly reasonable results. The one exception should be if you select the "wikipedia-with-spam" data folder. If you try picking this folder, you may see links to a few spam websites interspersed with the wikipedia links we care about.

More specifically, we've inserted two spam webpages – one related to computer science, and another related to Seattle. Try running searches related to those two things! (E.g. try searching "computer science", "seattle", "seattle times", etc). Our fake spam pages won't necessarily be the very first result, but they should consistently appear near the top.

You should think about why that is. What are these webpages doing to artifically inflate their ranking results?

You'll implement an algorithm to deal with this problem in part 3 of this project.

CSE 373, Winter 2018: Project 3: Part 2