Overview
Due Wednesday, May 22 at 11:59pm.
In parts 2a and 2b, you will use the data structures and algorithms
you implemented so far to build a basic search engine similar to
Google Search or Microsoft Bing.
Note that we have NOT covered the algorithm for
doing this in class: part of the challenge is figuring out how to
implement new algorithms based mainly on a written description. If you're
running into issues or have questions, we strongly encourage you to
come to office hours.
You will use these files from prior assignments:
src/main/java/datastructures/concrete/dictionaries/ChainedHashDictionary.java
src/main/java/datastructures/concrete/dictionaries/ArrayDictionary.java
src/main/java/datastructures/concrete/ArrayHeap.java
src/main/java/datastructures/concrete/ChainedHashSet.java
src/main/java/datastructures/concrete/DoubleLinkedList.java
src/main/java/misc/Sorter.java
You will be modifying the following file:
src/main/java/search/analyzers/TfIdfAnalyzer.java
Additionally, here are a few more files that you might want to review
while completing the assignment (note that this is just a starting point,
not necessarily an exhaustive list):
src/test/java/search/TestTfIdfAnalyzer.java
src/main/java/search/models/Webpage.java
src/main/java/search/Main.java
High-level context
A search engine is, in a nutshell, an algorithm that accepts a
query from the user and computes how relevant a
given document is to that query.
In our case, a query is whatever the user types into our search
engine, and each individual web page is a single document.
Industrial strength search engines work by combining hundreds of
different algorithms for computing relevance, but we will implement
just two: Term Frequency and Inverse Document Frequency (TF-IDF) with
Cosine Similarity ranking, and (in part 2b) PageRank.
Detailed descriptions of these algorithms are included later in
this spec.
Part 2a i: Getting webpages
Our search engine won't be capable of searching the
entire internet: to do that, we would need to download a copy of every
single webpage in existence, which is not feasible.
Instead, we'll have our search engine work against a much smaller
subset of the internet that we've downloaded for you.
Download this data.zip file and move it to the root directory of your HW5 repository. (This is a 175MB zip file, so the download may take a while.)
Extract the zip file in place, in the root folder of the repository (extracting may also take a while).
After the zip file is extracted, you should see a data
folder in the same directory as the experimentdata folder.
Git ignores the contents of the data folder, so you will not
see those files in the Git staging area.
Take a look inside the data
folder. This folder should contain
8 different folders—each one of these folders represents a different
"snapshot" or "fraction" of the internet.
When setting up your search engine, you must pick one of
these "snapshots": your engine will search only the documents listed
in that folder. These snapshots vary in size. The smaller ones
(such as 'gutenberg') are meant to help with testing;
the larger ones (such as 'wikipedia') are meant to help with stress-testing.
Part 2a ii: Run the search engine
Task: make sure you can run the search engine.
We will next make sure you can run the search engine
so you can test your results as you go.
First, open the Main.java
file, which is located within the search
package. The only thing you really need to care about
in this file is the DATA_FOLDER_NAME
constant at the top of the class.
This constant must be the name of one of the folders in the
data folder. You should change this constant when
testing your code, but for now, it's ok to leave it alone.
Next, run Main.java
(which is located inside the search
package). It may take a few seconds to a minute for the search engine to load the files,
depending on which data folder you're looking at.
Once it's ready, visit the link
printed to the console in your web browser—it should be
something like http://localhost:8080.
Hopefully, you should see a working website! Make sure everything
works by running a few queries. Since we haven't implemented the
actual search algorithm yet, the results will be in no particular order,
but you should see some results appear nearly instantaneously.
Expectations on loading and querying speed
While most of the data folders we give you should load in only a few seconds,
the "wikipedia" and "wikipedia-with-spam" datasets may both take over a minute.
The search engine, however, should be a little faster if you try running it again
on the same data folder—it will cache some extra information within that
folder to help speed up future runs.
That said, once everything is loaded, the search engine itself should be
very fast, even with the Wikipedia datasets. If it isn't, that may be a sign
that one of your data structures or algorithms is implemented inefficiently somehow.
If you ever find that a query takes longer than a quarter second to half a second
to complete, you should treat that as a red flag.
You may also see some "malformed link exception" warnings on the first run.
This is normal: those links genuinely are malformed. (Our parser insists
that all links follow the full URL specification, and isn't as lenient about
parsing links as web browsers typically are). You can ignore these warnings.
For some of the larger datasets ("wikipedia" and "wikipedia-with-spam"), you may
run into out-of-memory issues or very slow loading times—specifically, you
may see Java throw a java.lang.OutOfMemoryError.
This error can occur because our search engine will attempt to load all of
the web pages in the dataset into memory—if we want to load many data pages,
we may end up needing to allocate more memory than Java permits by default.
To fix this in IntelliJ, do the following:
- With your homework 5 project open, open the dropdown menu at the top right of the
IDE, next to the run and debug buttons.
- Click Edit Configurations..., then select Application >
Main in the left-side menu. (The option for Main may not appear if you
have not tried to run
Main.java
yet.)
- Enter -Xmx5G
in the VM options text box. This will allow IntelliJ to
grow the heap size up to 5 GB of memory.
- Click OK to save the setting.
The URL you're asked to visit ("http://localhost:8080")
may look a little different than what you're used to.
First, notice that we're using the word "localhost" instead
of the name of some website. This is because we're running
your search engine locally on your computer: the word
"localhost" is a shorthand for "this computer".
Next, we add a colon and some numbers to the end. Those
numbers are a port. In short, every program that
wants to communicate over the network (games, your email
client, chat clients...) must select a unique
port number. Then, when somebody wants to send data to a
particular program, they need to include that port number
with the data. That way, when the operating system receives the
data packet, it knows which program to send that packet to.
Websites, by convention, will use specific ports (port 80
for HTTP websites and port 443 for HTTPS), but using those
ports typically requires special permission from the operating
system.
(This is why we can normally omit the port number when
typing web addresses: our web browsers are smart enough to
automatically use one of these two default port numbers.)
This can be a hassle, so we configured our search engine
to use an arbitrary port number (8080) that doesn't require
permissions from your operating system to use and that (hopefully)
no other program is currently using.
Part 2a iii: Implement code to generate a TF-IDF vector
Task: Finish implementing TfIdfAnalyzer's
constructor (and all
associated helper methods).
The first ranking metric we will write is Term Frequency and
Inverse Document Frequency (TF-IDF) ranking. We will start by
implementing a method that produces a TF-IDF vector
for each document we were given. Each TF-IDF vector contains data on
how "important" a given word was to that document.
At a high level...
Before describing the exact algorithm we will be implementing, we want
to figure out at a high level what we're planning to do. If you just skip over this portion
and start implementing the code, it will be hard to understand the full context for what you are
implementing.
First, we need to understand what the TF-IDF vector really represents. Let's start with the
TF portion of the TF-IDF vector. The term frequency simply represents how often different words
show up on a given page. We will compute this over the full page: simply put, the
term frequency for a given word is the fraction of the page's words that are that word.
If a web page consisted only of the words "correct horse battery staple", then the term
frequency for each of those words would be 1/4, or 0.25. If instead the page contained
just the words "horse horse cow horse", we would want to know that "horse" is more
"important" to the page; this would correspond to a term frequency of 0.25 for "cow"
and 0.75 for "horse."
The inverse document frequency, on the other hand, is not used to describe just a single page;
instead, it describes each word in the context of all pages. The reason
for doing this is to identify how likely a word is to actually be important: a word like "the"
is likely to appear in pretty much every (English) webpage, whereas a word like "ganglioneuralgia"
will only really come up in a very small number of fancy websites about ice-cream-induced brain
freezes (and webpages about weird medical terms). The intuition behind the inverse document frequency
is to account for this: we want to basically ignore the word "the" most of the time, but care about
"rarer" words much more, so we count how many pages each word shows up in to see how "rare" it is.
This count is used in the formula described in the next section so that words that show up pretty
much everywhere end up with an IDF score of about ln(1), which is 0, while rarer words are assigned
the natural log of a much larger value, which ultimately gives them a higher overall relevance
to the search.
So how do we actually use these? We want to combine the two factors, and multiplying
them together is a pretty reasonable choice: words that appear more often on a page get a boost
from their higher TF, while words that are likely less important to the search (because they show
up in nearly every page) are essentially eliminated, since their IDF will be very low.
In part 2a iv, we will generate a TF-IDF vector for the user's query (using approximately
the same process) and use it to identify which documents are the 'most relevant' to that query.
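To make the term-frequency half of this concrete, here is a small sketch of how the counting might look, using the dictionary interfaces from earlier assignments. This is purely illustrative: the method name and exact helper structure here are made up, so follow the stubs and comments in TfIdfAnalyzer.java for what your code should actually look like.
// Illustrative sketch only -- not a required method name or signature.
// Given the words of one page, returns a map from each unique word to the
// fraction of the page made up by that word.
private IDictionary<String, Double> computeTfScores(IList<String> words) {
    // Count how many times each word appears on the page.
    IDictionary<String, Double> counts = new ChainedHashDictionary<>();
    for (String word : words) {
        counts.put(word, counts.getOrDefault(word, 0.0) + 1.0);
    }
    // Divide each count by the total number of words, building up a new
    // dictionary instead of modifying 'counts' while iterating over it.
    IDictionary<String, Double> tfScores = new ChainedHashDictionary<>();
    for (KVPair<String, Double> pair : counts) {
        tfScores.put(pair.getKey(), pair.getValue() / words.size());
    }
    return tfScores;
}
For the page "horse horse cow horse", this sketch would produce {"horse"=0.75, "cow"=0.25}, matching the numbers above.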
Core algorithm
To compute the TF-IDF vector for a single document (a webpage),
we take that document and compute the TF-IDF score for each
unique word. For each unique word, or term, we compute the following:
-
The term frequency (TF) is a measure of how frequently a term appears
in a document. We compute it using this formula:
$$\text{TF}(\text{term}, \text{document}) = \frac
{\text{Number of times the term appears in the document}}
{\text{Total number of words in the document}}$$
Notice the more the term appears in the document, the higher
the TF score. This hopefully makes intuitive sense: if the user's
search query is "donuts" and if the word "donuts" appears many
times in the webpage, then the webpage is probably all about
donuts and is something the user will want to read.
-
The inverse document frequency (IDF) measures how widespread a
term is across all documents, using this
formula. (Note: \(\ln(...)\) means "log base \(e\)".)
$$\text{IDF}(\text{term}) = \begin{cases}
0 & \text{if term doesn't appear in any doc} \\
\ln\left(\frac
{\text{Total number of docs}}
{\text{Number of docs containing the term}}
\right)
& \text{Otherwise}
\end{cases}
$$
Note that when computing the TF for a particular term, we'll
obtain a different TF for each document. In contrast, when computing
the IDF, we look at every single document rather than any single one,
so each term has just one IDF score.
Also notice that if a word appears in many different documents,
the IDF score is smaller. This is because the more frequently a word
appears, the less important it tends to be. For example, words like
"the" or "of" appear in nearly every document—so, we don't
really want it to be a distinguishing feature.
We take the natural log because it seems to work well in practice:
statistically, only a small handful of words will be common across
a majority of documents and we want to penalize those words more.
(If you'd like to learn more about the underlying theoretical
justification for IDF's formula, see
this research paper.)
-
Finally, we compute the final TF-IDF relevance score for the
term by multiplying the two above numbers together:
$$\text{Relevance}(\text{term}, \text{document}) =
\text{TF}(\text{term}, \text{document}) \times \text{IDF}(\text{term})$$
We store each term and its corresponding TF-IDF score for a particular
document inside an IDictionary<String, Double>. This
dictionary is the TF-IDF vector.
The TfIdfAnalyzer.computeAllDocumentTfIdf(...)
method is responsible for computing the TF-IDF vector for every
document: it should return an IDictionary<URI, IDictionary<String, Double>>.
We will store the resulting dictionary in a field for later use.
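As a rough sketch (not the required structure), the method might be shaped like the following. It assumes the set of pages is handed to you as an ISet<Webpage>, that Webpage exposes getUri() and getWords(), and that the IDF scores were already computed once in the constructor and stored in a field named idfScores; check the stubs and comments in TfIdfAnalyzer.java for the exact names and signatures your starter code expects.
// Sketch only: helper and field names here are illustrative.
private IDictionary<URI, IDictionary<String, Double>> computeAllDocumentTfIdf(ISet<Webpage> pages) {
    IDictionary<URI, IDictionary<String, Double>> result = new ChainedHashDictionary<>();
    for (Webpage page : pages) {
        // TF scores for this one document (see the earlier sketch).
        IDictionary<String, Double> tfScores = computeTfScores(page.getWords());

        // Multiply each TF score by the word's IDF score, which was
        // precomputed over all documents and stored in this.idfScores.
        IDictionary<String, Double> tfIdfVector = new ChainedHashDictionary<>();
        for (KVPair<String, Double> pair : tfScores) {
            double idf = this.idfScores.getOrDefault(pair.getKey(), 0.0);
            tfIdfVector.put(pair.getKey(), pair.getValue() * idf);
        }
        result.put(page.getUri(), tfIdfVector);
    }
    return result;
}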
Example
Let's try running through an example of computing TF-IDF in action.
Suppose our search engine contained the following three documents (a.k.a. three
webpages):
Document A: "the mouse played with the cat"
Document B: "the quick brown fox jumped over the lazy dog"
Document C: "dog 1 and dog 2 ate the hot dog"
We compute the TF-IDF vector for each document like so:
$$
\newcommand{\Relevance}[2]{\text{Relevance}(\text{#1},\, \text{Doc}_{#2})}
\newcommand{\f}[2]{\frac{#1}{#2}}
\newcommand{\logp}[1]{\ln\left(#1\right)}
\newcommand{\expr}[4]{\f{#1}{#2} \times \logp{\f{#3}{#4}}}
$$
-
Document A: "the mouse played with the cat"
$$
\begin{align*}
\Relevance{the}{A} &= \expr{2}{6}{3}{3} = 0.0 \\
\Relevance{mouse}{A} &= \expr{1}{6}{3}{1} = 0.183102\\
\Relevance{played}{A} &= \expr{1}{6}{3}{1} = 0.183102\\
\Relevance{with}{A} &= \expr{1}{6}{3}{1} = 0.183102\\
\Relevance{cat}{A} &= \expr{1}{6}{3}{1} = 0.183102\\
\end{align*}
$$
So, the final dictionary will look like
{"the"=0.0, "mouse"=0.183102, "played"=0.183102,
"with"=0.183102, "cat"=0.183102}. We'll omit the
final dictionaries for the other two examples to save space.
-
Document B: "the quick brown fox jumped over the lazy dog"
$$
\begin{align*}
\Relevance{the}{B} &= \expr{2}{9}{3}{3} = 0.0 \\
\Relevance{quick}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\
\Relevance{brown}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\
\Relevance{fox}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\
\Relevance{jumped}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\
\Relevance{over}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\
\Relevance{lazy}{B} &= \expr{1}{9}{3}{1} = 0.122068 \\
\Relevance{dog}{B} &= \expr{1}{9}{3}{2} = 0.045052\\
\end{align*}
$$
-
Document C: "dog 1 and dog 2 ate the hot dog"
$$\begin{align*}
\Relevance{dog}{C} &= \expr{3}{9}{3}{2} = 0.135155 \\
\Relevance{1}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\
\Relevance{and}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\
\Relevance{2}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\
\Relevance{ate}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\
\Relevance{the}{C} &= \expr{1}{9}{3}{3} = 0.0 \\
\Relevance{hot}{C} &= \expr{1}{9}{3}{1} = 0.122068 \\
\end{align*}
$$
Ultimately, we'll obtain a unique TF-IDF vector for each
document. We'll use these vectors in the next subpart.
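If you want to double-check any of these numbers yourself, the arithmetic is easy to reproduce directly. For example, the "dog" entry for Document C:
// "dog" appears 3 times out of the 9 words in Document C,
// and appears in 2 of the 3 documents overall.
double tf = 3.0 / 9.0;
double idf = Math.log(3.0 / 2.0);   // Math.log is the natural log
System.out.println(tf * idf);       // prints approximately 0.135155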
Other notes and hints
-
Computing the TF and IDF scores for a single document can be somewhat
expensive: not only do we need to loop through the entire document
word-by-word to compute the TF, we need to loop through every single
document in our search engine to compute the IDF.
To help make this more efficient, you should precompute as
much information as possible. For more details on how to do so, see the
comments in TfIdfAnalyzer
's constructor.
-
You will find that you need to loop over your lists and dictionaries
fairly frequently using their iterator. This can be a little inconvenient:
having to manually call .iterator()
and write a while loop
each time is clunky.
To help clean up your code, try using for-each loops instead. That is,
instead of doing:
Iterator<KVPair<String, Double>> iter = someDict.iterator();
while (iter.hasNext()) {
    KVPair<String, Double> pair = iter.next();
    // ...
}
...you should instead do:
for (KVPair<String, Double> pair : someDict) {
    // ...
}
These two code snippets are exactly equivalent, but the second version looks
much cleaner (and requires typing fewer characters!).
-
When iterating over some data structure, do NOT attempt
to simultaneously modify the same data structure!
Just reading from a data structure we're iterating over is
completely ok, but writing to the data structure can end up
breaking your code. For example:
for (KVPair<String, Double> pair : someDict) {
    // Dangerous!
    someDict.remove(pair.getKey());
    someDict.put(someNewKey(), 0.0);

    // Not as dangerous, since we're not adding a new key-value pair,
    // but should still be avoided
    someDict.put(pair.getKey(), pair.getValue() + 1);

    // Completely safe -- nothing wrong with doing this
    someDict.get(pair.getKey());
    someDict.containsKey(someRandomKey());
}
The reason you should never modify something you're iterating over like this
is that we designed our iterators under the assumption that the client will never
modify the data while an iterator is "live". If you break that precondition, you
will get unpredictable results.
(Java's default data structures are a little more robust and will attempt to
raise a ConcurrentModificationException
if you try doing this: for the
sake of simplicity, we didn't implement this functionality in this class.)
What you should instead do is modify your algorithms so they "build up" new
data structures. Loop over your old data structure, and add the key-value pairs
you need to a different, initially empty, data structure.
-
When using dictionaries to continually count things or store single keys mapped to multiple values, there is a common pattern that appears. Consider the code below that counts the number of occurrences of any character in a String.
String s = ...
... code to compute the String s above ...
IDictionary<Character, Integer> counts = new ArrayDictionary<>();
for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    if (!counts.containsKey(c)) {
        counts.put(c, 1);
    } else {
        counts.put(c, counts.get(c) + 1);
    }
}
Instead of writing an if else and potentially computing both containsKey and get, we can use getOrDefault here to clean up our code. Below is a code snippet that has equivalent behavior to the code snippet above.
String s = ...
... code to compute the String s above ...
IDictionary<Character, Integer> counts = new ArrayDictionary<>();
for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    counts.put(c, counts.getOrDefault(c, 0) + 1);
}
If you implemented getOrDefault correctly for your ArrayDictionary and ChainedHashDictionary, using getOrDefault in these types of situations should not only help clean up the code's logic, but should also help efficiency.
-
You should seek to minimize the number of floating-point computations that take
place. If you continually apply arithmetic operations with floating point
numbers, you are likely to accumulate small errors due to the way floating point
numbers are stored; these errors become particularly apparent when working with
small numbers. To make sure your values are similar to the expected results,
you should try to factor out division, multiplication, and other operations as
much as possible; for example, if you have something like the following in your
solution:
double sum = 0.0;
for (int i = 0; i < list.size(); i++) {
    sum += ((double) list.get(i)) / list.size();
}
you divide each element in the list by the list's size. Instead, you could refactor that
code to the following:
double sum = 0.0;
for (int i = 0; i < list.size(); i++) {
    sum += list.get(i);
}
sum /= list.size();
This reduces the error by both reducing the number of divisions used to produce
the final value and increasing the values of the numbers being summed
(since very small numbers are harder to accurately represent as floating
point numbers).
To account for floating-point error when comparing numbers during testing, we use
a version of the assertEquals method that accepts a delta parameter; this delta
value specifies the maximum difference between two numbers that should be
considered "equal."
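For example, a JUnit assertion of this form (a generic illustration, not a line taken from TestTfIdfAnalyzer) treats two doubles as equal as long as they differ by less than the delta:
// Assumes a static import of org.junit.Assert.assertEquals (JUnit 4).
double expected = 0.183102;                   // the "mouse" score from the example above
double actual = (1.0 / 6.0) * Math.log(3.0);  // TF * IDF computed directly
assertEquals(expected, actual, 0.00001);      // passes: the difference is far below the delta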
Part 2a iv: Implement cosine similarity
Task: Finish implementing TfIdfAnalyzer.computeRelevance(...).
After finishing the last part, we can compute a TF-IDF vector for each document that encodes the importance of each word to that document. Now, we want to compute a TF-IDF vector for the search query as well, and then compare it against each document's vector to determine how relevant each document is to the query. To do this, we will use cosine similarity, a method of quantifying how close two vectors are.
Core logic
To compare a query against a document, we will need to perform
four steps:
Find the TF-IDF vector for the document. This should
be an easy, \(O(1)\) lookup, since we already computed the
TF-IDF vector for every document in part 2a iii.
Compute the TF-IDF vector for the query. You should
be able to reuse the bulk of the code you wrote in the previous parts
to do this (though you may have to do some mild refactoring).
Note: technically, we are treating the query as if it were
a new document. However, you should not recompute the IDF values:
just use the ones you computed earlier.
Compute the cosine similarity between the document
vector and the query vector. You'll find pseudocode below.
Make the pseudocode more efficient.
This pseudocode is not particularly efficient:
you should be able to combine several loops and precompute some pieces of
information needed by this method.
When implementing this method, you should add at least one other field
to your TfIdfAnalyzer
class containing information you can
precompute. Modify your constructor to precompute and store this information.
Here's the pseudocode for cosine similarity
double computeRelevance(query, pageUri):
    documentVector = look up document TF-IDF vector using pageUri
    queryVector = compute query TF-IDF vector

    numerator = 0.0
    foreach unique word in query:
        docWordScore = if documentVector.containsKey(word)
                       then documentVector.get(word)
                       else 0.0
        queryWordScore = queryVector.get(word)
        numerator += docWordScore * queryWordScore

    denominator = norm(documentVector) * norm(queryVector)
    if denominator != 0
        then return numerator / denominator
        else return 0.0

double norm(vector):
    output = 0.0
    foreach pair in vector:
        score = pair.getValue()
        output += score * score
    return sqrt(output)
Ok, here's what's actually happening.
The document and query vectors we're computing are in
reality vectors of length \(n\), where \(n\) is the number
of unique words across all the documents and the current
query. Each "index" of the vector corresponds to a specific
word and stores the TF-IDF score for that term.
We represent this vector using a dictionary because our
vector is likely to be sparse: if we have many
documents, an arbitrary vector is likely to contain
zero scores for most of the \(n\) words.
On a higher level, what we've basically done is figure
out how to convert our (text) document and query into
mathematical vectors. We can now apply all sorts of interesting
mathematical machinery to this representation of
our document and query.
In particular, we're interested in trying to measure
how "similar" two vectors are. We'll do so by attempting
to find the "angle" between our query and document vectors:
the smaller the angle, the more similar the two are.
To do this, we can take advantage of the following
identity:
$$
\textbf{a} \cdot \textbf{b} = ||\textbf{a}|| \, ||\textbf{b}|| \cos(\theta)
$$
...where \(\textbf{a}\) and \(\textbf{b}\) are vectors,
the \(\cdot\) operation is the dot product, and
\(||\textbf{a}||\) is the magnitude of the vector (we take
the Euclidean norm here).
We then rearrange and compute the following:
$$
\cos(\theta) = \frac{\textbf{a} \cdot \textbf{b}}{||\textbf{a}|| \, ||\textbf{b}||}
$$
We then use the computed value for \(\cos(\theta)\) as our
similarity score.
We don't bother taking the arccosine of this quantity
because it'll just perform the same transformation on all
of our scores, which is a waste of time. We also
get to preserve the property that similar vectors
will have a higher score, and that all scores will fall
between 0.0 and 1.0 (since every score in our vectors is
non-negative).
We must, however, handle one edge case: where one or both
of the norms are equal to zero. This happens when the user
query uses completely unique words that were never before seen
in any document, for example. In this case, the output of the
cosine similarity metric is undefined, since we're dividing by
zero. However, if the query contains no previously seen words,
we're not going to be able to classify anything anyway, so we
effectively give up and just return 0.0 as a placeholder value.
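Putting this together, a rough Java sketch of the method might look like the following. The field and helper names here (documentTfIdfVectors, normOfDocumentVectors, computeQueryTfIdf) are illustrative rather than required, and you should double-check the exact method signature against the stub in TfIdfAnalyzer.java. The key idea is that each document's norm depends only on that document, so it can be computed once in the constructor and stored in a field instead of being recomputed on every query.
// Sketch only -- adapt the names to your own fields and helpers.
public Double computeRelevance(IList<String> query, URI pageUri) {
    IDictionary<String, Double> documentVector = this.documentTfIdfVectors.get(pageUri);
    IDictionary<String, Double> queryVector = this.computeQueryTfIdf(query);

    // Compute the dot product and the query norm in a single pass over the query vector.
    double numerator = 0.0;
    double queryNormSquared = 0.0;
    for (KVPair<String, Double> pair : queryVector) {
        double queryWordScore = pair.getValue();
        double docWordScore = documentVector.getOrDefault(pair.getKey(), 0.0);
        numerator += docWordScore * queryWordScore;
        queryNormSquared += queryWordScore * queryWordScore;
    }

    // The document norms were precomputed in the constructor (one entry per URI),
    // so looking one up here is cheap.
    double denominator = this.normOfDocumentVectors.get(pageUri) * Math.sqrt(queryNormSquared);
    return denominator != 0 ? numerator / denominator : 0.0;
}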
Part 2a v: Re-run your search engine
Task: Try running your search engine again.
After completing the previous parts, try running your search engine again
on the different data folders.
As before, it may take some time for your search engine to finish loading
all the pages and finish setting up your TF-IDF analyzer. However, once
everything is set up, your search engine should be able to answer queries nearly
instantaneously.
But unlike before, your engine should be doing much better this
time! You should be getting back mostly reasonable results.
The one exception should be if you select the "wikipedia-with-spam" data
folder. If you try picking this folder, you may see links to a few spam
websites interspersed with the wikipedia links we care about.
More specifically, we've inserted two spam webpages—one related
to computer science, and another related to Seattle. Try running searches
related to those two things! (E.g. try searching "computer science", "seattle",
"seattle times", etc). Our fake spam pages won't necessarily be the very first
result, but they should consistently appear near the top.
You should think about why that is. What are these webpages
doing to artificially inflate their ranking results?
You'll implement an algorithm to deal with this problem in
part 2b of this project.