Due Fri, Feb 23 at 11:30pm
In parts 2 and 3, you will use the data structures and algorithms you implemented so far to build a basic search engine similar to Google Search or Microsoft Bing.
Note that we have NOT covered the algorithm for doing this in class: part of the challenge is figuring out how to implement new algorithms based mainly on a written description. If you're running into issues or have questions, we strongly encourage you to come to office hours.
A search engine is, in a nutshell, an algorithm that accepts a query from the user and computes how relevant a given document is to that query.
In our case, a query is whatever the user types into our search engine, and each individual web page is a single document.
Industrial strength search engines work by combining hundreds of different algorithms for computing relevance, but we will implement just two: Term Frequency and Inverse Document Frequency (TF-IDF) with Cosine Similarity ranking, and (in part 3) PageRank.
Detailed descriptions of these algorithms are included later in this spec.
Our search engine won't be capable of searching the entire internet: we need to download a copy of every single webpage in existence to do that, which is probably not feasible.
Instead, we'll have our search engine work against a much smaller subset of the internet that we've downloaded for you.
First, start by downloading this data zip file. This zip file contains the web pages that your search engine will be searching against.
Extract the zip file, and add it to your project3
folder. (So, the
data
folder should live on the same level as your src
and experimentdata
folders.
Now, take a look inside the data
folder. This folder should contain
7 different folders – each one of these folders represents a different
"snapshot" or "fraction" of the internet.
When setting up your search engine, you must pick one of these "snapshots": your engine will search for only the documents listed in that folder. These snapshots vary in size. The smaller ones (such as 'gutenberg') folders are meant to help with testing, the larger ones (such as 'wikipedia') are meant to help with stress-testing.
You can find more info about each snapshots in the
data/README.md
file.
Task: make sure you can run the search engine.
We will next make sure you can run the search engine so you can test your results as you go.
First, open the Main.java
file, which is located within
the search
package. The only thing you really need to care
in this file is the first DATA_FOLDER_NAME
constant at
the top of the class.
This constant must be the name of one of the folders in the
data
folder. You should change this constant when
testing your code, but for now, it's ok to leave it alone.
Next, run Main.java
(which is located inside the search
package). It may take a few seconds to a minute for the search engine to load the files,
depending on which data folder you're looking at.
Once it's ready, visit the link printed to the console in your web browser – it should be something like http://localhost:8080.
Hopefully, you should see a working website! Make sure everything works by running a few queries. Since we haven't implemented the actual search algorithm yet, the results will be in no particular order, but you should see some results appear nearly instantaneously.
While most of the data folders we give you should load in only a few seconds, the "wikipedia" and "wikipedia-with-spam" datasets may both take over a minute. The search engine, however, should be a little faster if you try running it again on the same data folder – it will cache some extra information within that folder to help speed up future runs.
That said, once everything is loaded, the search engine itself should be very fast, even with the Wikipedia datasets. If it isn't, that may be a sign that one of your data structures or algorithms is implement inefficiently somehow. If you ever find that it takes longer then a quarter second to half a second for any query to complete, you should treat that as a red flag.
You may also see some "malformed link exception" warnings on the first run. This is normal: those links genuinely are malformed. (Our parser insists that all links follow the full URL specification, and isn't as lenient about parsing links as web browsers typically are). You can ignore these warnings.
For some of the larger datasets ("wikipedia" and "wikipedia-with-spam"), you may
run into out-of-memory issues or very slow loading times – specifically, you
may see Java throw an java.lang.OutOfMemoryError
.
This error can occur because our search engine will attempt to load all of the web pages in the dataset into memory -- if we want to load many data pages, we may end up needing to allocate more memory then Java permits by default.
To fix this, do the following:
Main
application.-Xmx5G
" (without quotation
marks). This will allow Eclipse to grow the heap size up to 5 GB of memory.Task: Finish implementing TfIdfAnalyzer's
constructor (and all
associated helper methods).
The first ranking metric we will write is Term Frequency and Inverse Document Frequency (TF-IDF) ranking. We will start by implementing a method that produces a TF-IDF vector for each document we were given. Each TF-IDF vector contains data on how "important" a given word was to that document.
To compute the TF-IDF vector for a single document (a webpage), we take that document and compute the TF-IDF score for each unique word. For each unique word, or term, we:
Compute TF: Term Frequency
The term frequency (TF) is a measure of how frequently a term appears in a document. We compute it using this formula:
$$\text{TF}(\text{term}, \text{document}) = \frac {\text{Number of times the term appears in doc}} {\text{Total number of words in a doc}}$$
Notice the more the term appears in the document, the higher the TF score. This hopefully makes intuitive sense: if the user's search query is "donuts" and if the word "donuts" appears many times in the webpage, then the webpage is probably all about donuts and is something the user will want to read.
IDF: Inverse Document Frequency
The inverse document frequency (IDF) measures how frequently a term appears in all documents using this formula. (Note: \(\ln(...)\) means "log base \(e\)".)
$$\text{IDF}(\text{term}) = \begin{cases} 0 & \text{if term doesn't appear in any doc} \\ \ln\left(\frac {\text{Total number of docs}} {\text{Number of docs containing the term}} \right) & \text{Otherwise} \end{cases} $$
Note that when computing the TF for a particular term, we'll obtain a different TF per each document. In contrast, when computing IDF, we do so by looking at every single document, rather then any single one.
Also notice that if a word appears in many different documents, the IDF score is smaller. This is because the more frequently a word appears, the less important it tends to be. For example, words like "the" or "of" appear in nearly every document – so, we don't really want it to be a distinguishing feature.
We take the natural log because it seems to work well in practice: statistically, only a small handful of words will be common across a majority of documents and we want to penalize those words more.
(If you'd like to learn more about the underlying theoretical justification for IDF's formula, see this research paper.)
Combined TF-IDF score
Finally, we compute the final TF-IDF relevance score for the term by multiplying the two above numbers together:
$$\text{Relevance}(\text{term}, \text{document}) = \text{TF}(\text{term}, \text{document}) \times \text{IDF}(\text{term})$$
We store each term and its corresponding TF-IDF score for a particular
document inside a IDictionary<String, Double>
. This
dictionary is the TF-IDF vector.
The TfIdfAnalyzer.computeAllDocumentTfIdf(...)
method
is responsible for computing the TF-IDF vector for every
document: it should return a IDictionary<URI, IDictionary<String, Double>>
.
We will store the resulting dictionary in a field for later use.
Let's try running through an example of computing TF-IDF in action.
Suppose our search engine contained the following three documents (a.k.a. three webpages):
Document A: "the mouse played with the cat"
Document B: "the quick brown fox jumped over the lazy dog"
Document C: "dog 1 and dog 2 ate the hot dog"
We compute the TF-IDF vector for each document like so:
Document A: "the mouse played with the cat"
$$ \begin{align*} \Relevance{the}{A} &= \expr{2}{6}{3}{3} = 0.0 \\ \Relevance{mouse}{A} &= \expr{1}{6}{3}{1} = 0.183102\\ \Relevance{played}{A} &= \expr{1}{6}{3}{1} = 0.183102\\ \Relevance{with}{A} &= \expr{1}{6}{3}{1} = 0.183102\\ \Relevance{cat}{A} &= \expr{1}{6}{3}{1} = 0.183102\\ \end{align*} $$
So, the final dictionary will look like
{"the"=0.0, "mouse"=0.183102, "played"=0.183102,
"with"=0.183102, "cat"=0.183102}
. We'll omit the
final dictionary for the other two examples to save space.
Document B: "the quick brown fox jumped over the lazy dog"
$$ \begin{align*} \Relevance{the}{B} &= \expr{2}{9}{3}{3} = 0.0 \\ \Relevance{quick}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{brown}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{fox}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{jumped}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{over}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{lazy}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{dog}{B} &= \expr{1}{9}{3}{2} = 0.045052\\ \end{align*} $$
Document C: "dog 1 and dog 2 ate the hot dog"
$$\begin{align*} \Relevance{dog}{C} &= \expr{3}{9}{3}{2} = 0.135155 \\ \Relevance{1}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{and}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{2}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{ate}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \Relevance{the}{C} &= \expr{1}{9}{3}{3} = 0.0 \\ \Relevance{hot}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\ \end{align*} $$
Ultimately, we'll obtain a unique TF-IDF vector for each document. We'll use these vectors in the next subpart.
Computing the TF and IDF scores for a single document can be somewhat expensive: not only do we need to loop through the entire document word-by-word to compute the TF, we need to loop through every single document in our search engine to compute the IDF.
To help make this more efficient, you should precompute as
much information as possible. For more details on how to do so, see the
comments in TfIdfAnalyzer
's constructor.
You will find that you need to loop over your lists and dictionaries
fairly frequently using their iterator. This can be a little inconvenient:
having to manually call .iterator()
and write a while loop
each time is clunky.
To help clean up your code, try using for-each loops instead. That is, instead of doing:
Iterator<KVPair<String, Double>> iter = someDict.iterator();
while (iter.hasNext()) {
KVPair<String, Double> pair = iter.next();
// ...
}
...you should instead do:
for (KVPair<String, Double> pair : someDict) {
// ...
}
These two code snippets are exactly equivalent, but the second version looks much cleaner (and requires typing fewer characters!).
When iterating over some data structure, do NOT attempt to simultaneously modify the same data structure!
Just reading from a data structure we're iterating over is completely ok, but writing to the data structure can end up breaking your code. For example:
for (KVPair<String, Double> pair : someDict) {
// Dangerous!
someDict.remove(pair.getKey());
someDict.put(someNewKey(), 0.0);
// Not as dangerous, since we're not adding a new key-value pair,
// but should still be avoided
someDict.put(pair.getKey(), pair.getValue() + 1);
// Completely safe -- nothing wrong with doing this
someDict.get(pair.getKey());
someDict.containsKey(someRandomKey());
}
The reason why you should never modify something you're iterating over like this is because we designed our iterators under the assumption that the client will never modify the data while an iterator is "live". If you break that precondition, you will get unpredictable results.
(Java's default data structures are a little more robust and will attempt to
raise a ConcurrentModificationException
if you try doing this: for the
sake of simplicity, we didn't implement this functionality in this class.)
What you should instead do is modify your algorithms so they "build up" new data structures. Loop over your old data structure, and add the key-value pairs you need to a different, initially empty, data structure.
Task: Finish implementing TfIdfAnalyzer.computeRelevance(...)
.
Next, we will implement the logic to actually determine how relevant an algorithm is to a query.
To compare a query against a document, we will need to perform three steps:
Find the TF-IDF vector for the document. This should be an easy, \(O(1)\) lookup since we already computed the TF-IDF vector for that document in part 2b.
Compute the TF-IDF vector for the query. You should be able to reuse the bulk of the code you wrote in the previous parts to do this (though you may have to do some mild refactoring).
Note: technically, we are treating the query as if it were a new document. However, you should not recompute the IDF values: just use the ones you computed earlier.
Compute the cosine similarity between the document vector and the query vector.
We will do this by comparing the TF-IDF vector for your query against all the document TF-IDF vectors and compute how similar your query vector is to each document by exploiting properties of the cosine function and the dot product.
However, since not everybody may have taken math classes about this kind of material, we will instead just provide you with the pseudocode for this material instead:
double computeRelevance(query, pageUri):
documentVector = look up document TF-IDF vector using pageUri
queryVector = compute query TF-IDF vector
numerator = 0.0
foreach word in query:
docWordScore = if documentVector.containsKey(word)
then documentVector.get(word)
else 0.0
queryWordScore = queryVector.get(word)
numerator += docWordScore * queryWordScore
denominator = norm(documentVector) * norm(queryVector)
if denominator != 0
then return numerator / denominator
else return 0.0
double norm(vector):
output = 0.0
foreach pair in vector:
score = pair.getValue()
output += score * score
return sqrt(output)
Make the pseudocode more efficient.
However, the pseudocode we gave you is not particularly efficient: you should be able to combine several loops and precompute some pieces of information needed by this method.
When implementing this method, you should add at least one other field
to your TfIdfAnalyzer
class containing information you can
precompute. Modify your constructor to precompute and store this information.
Task: Try running your search engine again.
After completing the previous parts, try running your search engine again on the different data folders.
As before, it may take some time for your search engine to finish loading all the pages and finish setting up your TF-IDF analyzer. However, once everything is set up, your search engine should be able to answer queries nearly instantaneously.
But unlike before, your engine should be doing much better this time! You should be getting back mostly reasonable results. The one exception should be if you select the "wikipedia-with-spam" data folder. If you try picking this folder, you may see links to a few spam websites interspersed with the wikipedia links we care about.
More specifically, we've inserted two spam webpages – one related to computer science, and another related to Seattle. Try running searches related to those two things! (E.g. try searching "computer science", "seattle", "seattle times", etc). Our fake spam pages won't necessarily be the very first result, but they should consistently appear near the top.
You should think about why that is. What are these webpages doing to artifically inflate their ranking results?
You'll implement an algorithm to deal with this problem in part 3 of this project.