CSE 373, Spring 2018: Project 3: Part 2a

Overview

Due Fri, May 18 at 11:59pm

In parts 2a and 2b, you will use the data structures and algorithms you implemented so far to build a basic search engine similar to Google Search or Microsoft Bing.

Note that we have NOT covered the algorithm for doing this in class: part of the challenge is figuring out how to implement new algorithms based mainly on a written description. If you're running into issues or have questions, we strongly encourage you to come to office hours.

You will use these files from prior assignments:

  • main.java.datastructures.concrete.dictionaries.ChainedHashDictionary.java
  • main.java.datastructures.concrete.dictionaries.ArrayDictionary.java
  • main.java.datastructures.concrete.ArrayHeap.java
  • main.java.datastructures.concrete.ChainedHashSet.java
  • main.java.datastructures.concrete.DoubleLinkedList.java
  • main.java.misc.Searcher.java

You will be modifying the following file:

  • main.java.search.analyzers.TfIdfAnalyzer.java

Additionally, here are a few more files that you might want to review while completing the assignment (note that this is just a starting point, not necessarily an exhaustive list):

  • main.java.search.models.Webpage.java
  • main.java.search.Main.java

General overview

High-level context

A search engine is, in a nutshell, an algorithm that accepts a query from the user and computes how relevant a given document is to that query.

In our case, a query is whatever the user types into our search engine, and each individual web page is a single document.

Industrial strength search engines work by combining hundreds of different algorithms for computing relevance, but we will implement just two: Term Frequency and Inverse Document Frequency (TF-IDF) with Cosine Similarity ranking, and (in part 2b) PageRank.

Detailed descriptions of these algorithms are included later in this spec.

Part 2a i: Getting webpages

Our search engine won't be capable of searching the entire internet: we need to download a copy of every single webpage in existence to do that, which is probably not feasible.

Instead, we'll have our search engine work against a much smaller subset of the internet that we've downloaded for you.

First, start by downloading this data zip file. This zip file contains the web pages that your search engine will be searching against.

Extract the zip file, and add it to your project3 folder. (So, the data folder should live on the same level as your src and experimentdata folders.

Now, take a look inside the data folder. This folder should contain 7 different folders – each one of these folders represents a different "snapshot" or "fraction" of the internet.

When setting up your search engine, you must pick one of these "snapshots": your engine will search for only the documents listed in that folder. These snapshots vary in size. The smaller ones (such as 'gutenberg') folders are meant to help with testing, the larger ones (such as 'wikipedia') are meant to help with stress-testing.

Part 2a ii: Run the search engine

Task: make sure you can run the search engine.

We will next make sure you can run the search engine so you can test your results as you go.

First, open the Main.java file, which is located within the search package. The only thing you really need to care in this file is the first DATA_FOLDER_NAME constant at the top of the class.

This constant must be the name of one of the folders in the data folder. You should change this constant when testing your code, but for now, it's ok to leave it alone.

Next, run Main.java (which is located inside the search package). It may take a few seconds to a minute for the search engine to load the files, depending on which data folder you're looking at.

Once it's ready, visit the link printed to the console in your web browser – it should be something like http://localhost:8080.

Hopefully, you should see a working website! Make sure everything works by running a few queries. Since we haven't implemented the actual search algorithm yet, the results will be in no particular order, but you should see some results appear nearly instantaneously.

Expectations on loading and querying speed

While most of the data folders we give you should load in only a few seconds, the "wikipedia" and "wikipedia-with-spam" datasets may both take over a minute. The search engine, however, should be a little faster if you try running it again on the same data folder – it will cache some extra information within that folder to help speed up future runs.

That said, once everything is loaded, the search engine itself should be very fast, even with the Wikipedia datasets. If it isn't, that may be a sign that one of your data structures or algorithms is implement inefficiently somehow. If you ever find that it takes longer then a quarter second to half a second for any query to complete, you should treat that as a red flag.

You may also see some "malformed link exception" warnings on the first run. This is normal: those links genuinely are malformed. (Our parser insists that all links follow the full URL specification, and isn't as lenient about parsing links as web browsers typically are). You can ignore these warnings.

Dealing with out-of-memory errors or slow loading times

For some of the larger datasets ("wikipedia" and "wikipedia-with-spam"), you may run into out-of-memory issues or very slow loading times – specifically, you may see Java throw an java.lang.OutOfMemoryError.

This error can occur because our search engine will attempt to load all of the web pages in the dataset into memory -- if we want to load many data pages, we may end up needing to allocate more memory then Java permits by default.

To fix this, do the following:

  1. In the top bar in Eclipse, click Run then Run Configurations.
  2. On the left pane, under Java Applications, select the application that's giving you the error. Here, it's the Main application.
  3. One you have selected the right application, click the Arguments tab on the right.
  4. Under VM Arguments, add the text "-Xmx5G" (without quotation marks). This will allow Eclipse to grow the heap size up to 5 GB of memory.
  5. Click Run. This will save the setting and run your program again.

What is 'localhost'? What is a port?

The URL you're asked to visit ("http://localhost:8080") may look a little different then what you're used to.

First, notice that we're using the word "localhost" instead of the name of some website. This is because we're running your search engine locally on your computer: the word "localhost" is a shorthand for "this computer".

Next, we add a colon and some numbers to the end. Those numbers are a port. In short, every program that wants to communicate over the network (games, your email client, chat clients...) must select a unique port number. Then, when somebody wants to send data to a particular program, they need to include that port number with the data. That way, when the operating recieves the data packet, it knows which program to send that packet to.

Websites, by convention, will use specific ports (port 80 for HTTP websites and port 443 for HTTPS), but using those ports typically requires special permission from the operating system.

(This is why we can normally omit the port number when typing web addresses: our web browsers are smart enough to automatically use one of these two default port numbers.)

This can be a hassle, so we configured our search engine to use an arbitrary port number (8080) that doesn't require permissions from your operating system to use and that (hopefully) no other program is currently using.

Part 2a iii: Implement code to generate a TF-IDF vector

Task: Finish implementing TfIdfAnalyzer's constructor (and all associated helper methods).

The first ranking metric we will write is Term Frequency and Inverse Document Frequency (TF-IDF) ranking. We will start by implementing a method that produces a TF-IDF vector for each document we were given. Each TF-IDF vector contains data on how "important" a given word was to that document.

At a high level...

Before we begin with describing the exact algorithm we will be implementing, we want to figure out at a high level what we're planning on doing. If you just skip over this portion and start implementing the code, it will be hard to understand the full context for what you are implementing.

First, we need to understand what the TF-IDF vector really represents. Let's start with the TF portion of the TF-IDF vector. The term frequency simply represents how often different words show up on a given page. We will implement this in the context of a full page - simply put, the term frequency for a given word just amounts to the fraction of the page it is. If a web page consisted only of the words "correct horse battery staple", then the term frequency for any of of those words would be 1/4 or 0.25. If instead the page contained just the words "horse horse cow horse" we would want to know that horse is more "important" to the page; this would correspond to a term frequency of 0.25 for "cow" and 0.75 for "horse."

The inverse document frequency, on the other hand, is not used to describe just a single page; instead, it is used in the context of describing the words across all pages. The reason behind doing this is to identify how likely a word is to actually be important - a word like "the" is likely to appear in pretty much every (english) webpage, whereas a word like "ganglioneuralgia" will only really come up in a very small number of fancy website about ice cream induced brain freezes (and webpages about weird medical terms). The intuition behind the inverse document frequency is to account for this; we want to basically ignore the word "the" most of the time, but care about "rarer" words much more, so we just count how many pages each word shows up in to see how "rare" it is. This number is used in the formula described in the next section so that words that show up pretty much everywhere end up having an IDF score of about ln(1), which comes out to about 0, and words that are a bit more rare will be assigned the natural log of a much larger value - which will ultimately make them have a higher overall relevance to the search.

So how do we actually use these? We want to combine the two factors factors, and multiplying them together is a pretty reasonable choice; that way, words with a higher frequency showing up more due to the TF being higher, and words that are likely less important to the search will be (because they show up in most every page) basically eliminated from the search (as the IDF for that word will be very low). In part 2d, we will be using a user's query (using approximately the same process to generate a new TF-IDF vector for it) to identify which documents are the 'most relevant' to it.

Core algorithm

To compute the TF-IDF vector for a single document (a webpage), we take that document and compute the TF-IDF score for each unique word. For each unique word, or term, we:

  1. Compute TF: Term Frequency

    The term frequency (TF) is a measure of how frequently a term appears in a document. We compute it using this formula:

    $$\text{TF}(\text{term}, \text{document}) = \frac {\text{Number of times the term appears in doc}} {\text{Total number of words in a doc}}$$

    Notice the more the term appears in the document, the higher the TF score. This hopefully makes intuitive sense: if the user's search query is "donuts" and if the word "donuts" appears many times in the webpage, then the webpage is probably all about donuts and is something the user will want to read.

  2. IDF: Inverse Document Frequency

    The inverse document frequency (IDF) measures how frequently a term appears in all documents using this formula. (Note: \(\ln(...)\) means "log base \(e\)".)

    $$\text{IDF}(\text{term}) = \begin{cases} 0 & \text{if term doesn't appear in any doc} \\ \ln\left(\frac {\text{Total number of docs}} {\text{Number of docs containing the term}} \right) & \text{Otherwise} \end{cases} $$

    Note that when computing the TF for a particular term, we'll obtain a different TF per each document. In contrast, when computing IDF, we do so by looking at every single document, rather then any single one.

    Also notice that if a word appears in many different documents, the IDF score is smaller. This is because the more frequently a word appears, the less important it tends to be. For example, words like "the" or "of" appear in nearly every document – so, we don't really want it to be a distinguishing feature.

    We take the natural log because it seems to work well in practice: statistically, only a small handful of words will be common across a majority of documents and we want to penalize those words more.

    (If you'd like to learn more about the underlying theoretical justification for IDF's formula, see this research paper.)

  3. Combined TF-IDF score

    Finally, we compute the final TF-IDF relevance score for the term by multiplying the two above numbers together:

    $$\text{Relevance}(\text{term}, \text{document}) = \text{TF}(\text{term}, \text{document}) \times \text{IDF}(\text{term})$$

We store each term and its corresponding TF-IDF score for a particular document inside a IDictionary<String, Double>. This dictionary is the TF-IDF vector.

The TfIdfAnalyzer.computeAllDocumentTfIdf(...) method is responsible for computing the TF-IDF vector for every document: it should return a IDictionary<URI, IDictionary<String, Double>>.

We will store the resulting dictionary in a field for later use.

Example

Let's try running through an example of computing TF-IDF in action.

Suppose our search engine contained the following three documents (a.k.a. three webpages):

  • Document A: "the mouse played with the cat"

  • Document B: "the quick brown fox jumped over the lazy dog"

  • Document C: "dog 1 and dog 2 ate the hot dog"

We compute the TF-IDF vector for each document like so:

$$ \newcommand{\Relevance}[2]{\text{Relevance}(\text{#1},\, \text{Doc}_{#2})} \newcommand{\f}[2]{\frac{#1}{#2}} \newcommand{\logp}[1]{\ln\left(#1\right)} \newcommand{\expr}[4]{\f{#1}{#2} \times \logp{\f{#3}{#4}}} $$
  • Document A

    Document A: "the mouse played with the cat"

    $$ \begin{align*} \Relevance{the}{A} &= \expr{2}{6}{3}{3} = 0.0 \\ \Relevance{mouse}{A} &= \expr{1}{6}{3}{1} = 0.183102\\ \Relevance{played}{A} &= \expr{1}{6}{3}{1} = 0.183102\\ \Relevance{with}{A} &= \expr{1}{6}{3}{1} = 0.183102\\ \Relevance{cat}{A} &= \expr{1}{6}{3}{1} = 0.183102\\ \end{align*} $$

    So, the final dictionary will look like {"the"=0.0, "mouse"=0.183102, "played"=0.183102, "with"=0.183102, "cat"=0.183102}. We'll omit the final dictionary for the other two examples to save space.

  • Document B

    Document B: "the quick brown fox jumped over the lazy dog"

    $$ \begin{align*} \Relevance{the}{B} &= \expr{2}{9}{3}{3} = 0.0 \\ \Relevance{quick}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{brown}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{fox}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{jumped}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{over}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{lazy}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{dog}{B} &= \expr{1}{9}{3}{2} = 0.045052\\ \end{align*} $$

  • Document C

    Document C: "dog 1 and dog 2 ate the hot dog"

    $$\begin{align*} \Relevance{dog}{C} &= \expr{3}{9}{3}{2} = 0.135155 \\ \Relevance{1}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{and}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{2}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{ate}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{the}{C} &= \expr{1}{9}{3}{3} = 0.0 \\ \Relevance{hot}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \end{align*} $$

Ultimately, we'll obtain a unique TF-IDF vector for each document. We'll use these vectors in the next subpart.

Other notes and hints

  • Precomputing

    Computing the TF and IDF scores for a single document can be somewhat expensive: not only do we need to loop through the entire document word-by-word to compute the TF, we need to loop through every single document in our search engine to compute the IDF.

    To help make this more efficient, you should precompute as much information as possible. For more details on how to do so, see the comments in TfIdfAnalyzer's constructor.

  • for each

    You will find that you need to loop over your lists and dictionaries fairly frequently using their iterator. This can be a little inconvenient: having to manually call .iterator() and write a while loop each time is clunky.

    To help clean up your code, try using for-each loops instead. That is, instead of doing:

    Iterator<KVPair<String, Double>> iter = someDict.iterator();
    while (iter.hasNext()) {
        KVPair<String, Double> pair = iter.next();
        // ...
    }
    

    ...you should instead do:

    for (KVPair<String, Double> pair : someDict) {
        // ...
    }
    

    These two code snippets are exactly equivalent, but the second version looks much cleaner (and requires typing fewer characters!).

  • Concurrent Modification

    When iterating over some data structure, do NOT attempt to simultaneously modify the same data structure!

    Just reading from a data structure we're iterating over is completely ok, but writing to the data structure can end up breaking your code. For example:

    for (KVPair<String, Double> pair : someDict) {
        // Dangerous!
        someDict.remove(pair.getKey());
        someDict.put(someNewKey(), 0.0);
    
        // Not as dangerous, since we're not adding a new key-value pair, 
        // but should still be avoided
        someDict.put(pair.getKey(), pair.getValue() + 1);
    
        // Completely safe -- nothing wrong with doing this
        someDict.get(pair.getKey());
        someDict.containsKey(someRandomKey());
    }
    

    The reason why you should never modify something you're iterating over like this is because we designed our iterators under the assumption that the client will never modify the data while an iterator is "live". If you break that precondition, you will get unpredictable results.

    (Java's default data structures are a little more robust and will attempt to raise a ConcurrentModificationException if you try doing this: for the sake of simplicity, we didn't implement this functionality in this class.)

    What you should instead do is modify your algorithms so they "build up" new data structures. Loop over your old data structure, and add the key-value pairs you need to a different, initially empty, data structure.

  • Floating Point Errors

    You should seek to minimize the number of computations that take place; if you have to add a lot of small values, you are likely to end up having errors due to the nature of how floating point numbers are stored. To make sure the values you get are similar to the expected results (and likely more important to you, pass the tests), you should try to factor out division, multiplication, and other operations as much as possible; for example, if you have something like the following in your solution

    double sum = 0.0;
    for (int i = 0; i < list.size(); i++) {
        sum += ((double) list.get(i)) / list.size();
    }
    

    you will likely end up accumulating a lot of inaccuracy in your solution due to floating point arithmetic; instead, you could refactor that code to the following

    double sum = 0.0;
    for (int i = 0; i < list.size(); i++) {
        sum += list.get(i);
    }
    sum /= list.size();
    

    This reduces the error by both reducing the number of times that you divide to find the overall value, and by reducing the number of times that you add small values that might be harder to accurately represent as a double.

Part 2a iv: Implement cosine similarity

Task: Finish implementing TfIdfAnalyzer.computeRelevance(...).

After finishing the last part, we can compute a TF-IDF vector for each document that encodes the importance of each word to that document. Now, we want to compute the TF-IDF vectors for each document and the search query, and then compare them to determine how relevant each document is to the query. To do this, we will use cosine similarity, a method of quantifying how close two vectors are.

Core logic

To compare a query against a document, we will need to perform three steps:

  1. Find the TF-IDF vector for the document. This should be an easy, \(O(1)\) lookup since we already computed the TF-IDF vector for that document in part 2b.

  2. Compute the TF-IDF vector for the query. You should be able to reuse the bulk of the code you wrote in the previous parts to do this (though you may have to do some mild refactoring).

    Note: technically, we are treating the query as if it were a new document. However, you should not recompute the IDF values: just use the ones you computed earlier.

  3. Compute the cosine similarity between the document vector and the query vector.

    Here's the pseudocode for doing this

    double computeRelevance(query, pageUri):
        documentVector = look up document TF-IDF vector using pageUri
        queryVector = compute query TF-IDF vector
    
        numerator = 0.0
        foreach unique word in query:
            docWordScore = if   documentVector.containsKey(word)
                           then documentVector.get(word) 
                           else 0.0
            queryWordScore = queryVector.get(word)
            numerator += docWordScore * queryWordScore
        
        denominator = norm(documentVector) * norm(queryVector)
    
        if   denominator != 0
        then return numerator / denominator
        else return 0.0
    
    double norm(vector):
        output = 0.0
        foreach pair in vector:
            score = pair.getValue()
            output += score * score
        return sqrt(output)
  4. Make the pseudocode more efficient.

    However, the pseudocode we gave you is not particularly efficient: you should be able to combine several loops and precompute some pieces of information needed by this method.

    When implementing this method, you should add at least one other field to your TfIdfAnalyzer class containing information you can precompute. Modify your constructor to precompute and store this information.

Why is this called 'cosine' similarity? Why does this work?

Ok, here's what's actually happening.

The document and query vectors we're computing are in reality vectors of length \(n\), where \(n\) is the number of unique words across all the documents and the current query. Each "index" of the vector corresponds to a specific word and stores the TF-IDF score for that term.

We represent this vector using a dictionary because our vector is likely to be sparse: if we have many documents, an arbitrary vector is likely to contain zero scores for most of the \(n\) words.

On a higher level, what we've basically done is figured out how to convert our (text) document and query into mathematical vectors. We can now apply loads of interesting mathematical properties to manipulate this representation of our document in query.

In particular, we're interested in trying to measure how "similar" two vectors are. We'll do so by attempting to find the "angle" between our query and document vectors: the smaller the angle, the more similar the two are.

To do this, we can take advantage of the following identity:

$$ \textbf{a} \cdot \textbf{b} = ||\textbf{a}|| \, ||\textbf{b}|| \cos(\theta) $$

...where \(\textbf{a}\) and \(\textbf{b}\) are vectors, the \(\cdot\) operation is the dot product, and \(||\textbf{a}||\) is the magnitude of the vector (we take the Euclidean norm here).

We then rearrange and compute the following:

$$ \cos(\theta) = \frac{\textbf{a} \cdot \textbf{b}}{||\textbf{a}|| \, ||\textbf{b}||} $$

We then use the computed value for \(\cos(\theta)\) as our similarity score.

We don't bother taking the arccosine of this quantity because it'll just perform the same transformation on all of our scores, which is a waste of time. We also get to preserve the property that similar vectors will have a higher score, and that all scores will fall between 0.0 and 1.0 (since every score in our vector is positive).

We must, however, handle one edge case: where one or both of the norms are equal to zero. This happens when the user query uses completely unique words that were never before seen in any document, for example. In this case, the output of the cosine similarity metric is undefined, since we're dividing by zero. However, if the query contains no previously seen words, we're not going to be able to classify anything anyways, so we effectively give up and just return 0.0 as a placeholder value.

Part 2a v: Re-run your search engine

Task: Try running your search engine again.

After completing the previous parts, try running your search engine again on the different data folders.

As before, it may take some time for your search engine to finish loading all the pages and finish setting up your TF-IDF analyzer. However, once everything is set up, your search engine should be able to answer queries nearly instantaneously.

But unlike before, your engine should be doing much better this time! You should be getting back mostly reasonable results. The one exception should be if you select the "wikipedia-with-spam" data folder. If you try picking this folder, you may see links to a few spam websites interspersed with the wikipedia links we care about.

More specifically, we've inserted two spam webpages – one related to computer science, and another related to Seattle. Try running searches related to those two things! (E.g. try searching "computer science", "seattle", "seattle times", etc). Our fake spam pages won't necessarily be the very first result, but they should consistently appear near the top.

You should think about why that is. What are these webpages doing to artifically inflate their ranking results?

You'll implement an algorithm to deal with this problem in part 2b of this project.