Due Friday March 1 at 11:59pm.
In parts 2a and 2b, you will use the data structures and algorithms you implemented so far to build a basic search engine similar to Google Search or Microsoft Bing.
Note that we have NOT covered the algorithm for doing this in class: part of the challenge is figuring out how to implement new algorithms based mainly on a written description. If you're running into issues or have questions, we strongly encourage you to come to office hours.
You will use these files from prior assignments:
src/main/java/datastructures/concrete/dictionaries/ChainedHashDictionary.java
src/main/java/datastructures/concrete/dictionaries/ArrayDictionary.java
src/main/java/datastructures/concrete/ArrayHeap.java
src/main/java/datastructures/concrete/ChainedHashSet.java
src/main/java/datastructures/concrete/DoubleLinkedList.java
src/main/java/misc/Sorter.java
You will be modifying the following file:
src/main/java/search/analyzers/TfIdfAnalyzer.java
Additionally, here are a few more files that you might want to review while completing the assignment (note that this is just a starting point, not necessarily an exhaustive list):
src/main/java/search/models/Webpage.java
src/main/java/search/Main.java
A search engine is, in a nutshell, an algorithm that accepts a query from the user and computes how relevant a given document is to that query.
In our case, a query is whatever the user types into our search engine, and each individual web page is a single document.
Industrial strength search engines work by combining hundreds of different algorithms for computing relevance, but we will implement just two: Term Frequency and Inverse Document Frequency (TF-IDF) with Cosine Similarity ranking, and (in part 2b) PageRank.
Detailed descriptions of these algorithms are included later in this spec.
Our search engine won't be capable of searching the entire internet: doing that would require downloading a copy of every single webpage in existence, which is probably not feasible.
Instead, we'll have our search engine work against a much smaller subset of the internet that we've downloaded for you.
Download this data.zip file and move it to the root directory of your HW5 repository. (This is a 360MB zip file, so the download may take a while.)
Extract the zip file in the same folder, the root folder of the repository (extracting may also take a while).
After the zip file is extracted, you should see a data folder in the same directory as the experimentdata folder. (Git ignores the contents of the data folder, so you will not see those files in the Git staging area.)
The data folder should contain 7 different folders; each one of these folders represents a different "snapshot" or "fraction" of the internet.
When setting up your search engine, you must pick one of these "snapshots": your engine will search only the documents listed in that folder. These snapshots vary in size. The smaller ones (such as 'gutenberg') are meant to help with testing; the larger ones (such as 'wikipedia') are meant to help with stress-testing.
Task: Make sure you can run the search engine.
We will next make sure you can run the search engine so you can test your results as you go.
First, open the Main.java file, which is located within the search package. The only thing you really need to care about in this file is the DATA_FOLDER_NAME constant at the top of the class. This constant must be the name of one of the folders in the data folder. You should change this constant when testing your code, but for now, it's ok to leave it alone.
Next, run Main.java (which is located inside the search package). It may take a few seconds to a minute for the search engine to load the files, depending on which data folder you're looking at.
Once it's ready, visit the link printed to the console in your web browser – it should be something like http://localhost:8080.
Hopefully, you should see a working website! Make sure everything works by running a few queries. Since we haven't implemented the actual search algorithm yet, the results will be in no particular order, but you should see some results appear nearly instantaneously.
While most of the data folders we give you should load in only a few seconds, the "wikipedia" and "wikipedia-with-spam" datasets may both take over a minute. The search engine, however, should be a little faster if you try running it again on the same data folder – it will cache some extra information within that folder to help speed up future runs.
That said, once everything is loaded, the search engine itself should be very fast, even with the Wikipedia datasets. If it isn't, that may be a sign that one of your data structures or algorithms is implemented inefficiently somehow. If you ever find that any query takes longer than a quarter second to half a second to complete, you should treat that as a red flag.
You may also see some "malformed link exception" warnings on the first run. This is normal: those links genuinely are malformed. (Our parser insists that all links follow the full URL specification, and isn't as lenient about parsing links as web browsers typically are). You can ignore these warnings.
Task: Finish implementing TfIdfAnalyzer's constructor (and all associated helper methods).
The first ranking metric we will write is Term Frequency and Inverse Document Frequency (TF-IDF) ranking. We will start by implementing a method that produces a TF-IDF vector for each document we were given. Each TF-IDF vector contains data on how "important" a given word was to that document.
Before describing the exact algorithm we will be implementing, let's figure out at a high level what we're planning to do. If you just skip over this portion and start implementing the code, it will be hard to understand the full context for what you are implementing.
First, we need to understand what the TF-IDF vector really represents. Let's start with the TF portion of the TF-IDF vector. The term frequency simply represents how often different words show up on a given page. We will implement this in the context of a full page: simply put, the term frequency for a given word is just the fraction of the page that word makes up. If a web page consisted only of the words "correct horse battery staple", then the term frequency for any of those words would be 1/4, or 0.25. If instead the page contained just the words "horse horse cow horse", we would want to know that "horse" is more "important" to the page; this corresponds to a term frequency of 0.25 for "cow" and 0.75 for "horse".
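As a quick illustration of just the term-frequency piece (this is not the required implementation; it uses a plain java.util.HashMap rather than the assignment's IDictionary), the following sketch reproduces the "horse horse cow horse" numbers:

import java.util.HashMap;
import java.util.Map;

public class TermFrequencyDemo {
    // The term frequency of a word is the fraction of the page that word makes up.
    static Map<String, Double> termFrequencies(String[] words) {
        Map<String, Double> tf = new HashMap<>();
        for (String word : words) {
            // Each occurrence contributes 1/(total number of words) to that word's score.
            tf.merge(word, 1.0 / words.length, Double::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        // Prints something like {cow=0.25, horse=0.75} (HashMap ordering may vary).
        System.out.println(termFrequencies("horse horse cow horse".split(" ")));
    }
}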
The inverse document frequency, on the other hand, is not used to describe just a single page; instead, it describes a word in the context of all pages. The reason for doing this is to identify how likely a word is to actually be important: a word like "the" is likely to appear in pretty much every (English) webpage, whereas a word like "ganglioneuralgia" will only really come up on a very small number of fancy websites about ice-cream-induced brain freezes (and webpages about weird medical terms). The intuition behind the inverse document frequency is to account for this. We want to basically ignore the word "the" most of the time, but care about "rarer" words much more, so we simply count how many pages each word shows up in to see how "rare" it is. This count is used in the formula described in the next section so that words that show up pretty much everywhere end up with an IDF score of ln(1), which is 0, while words that are a bit more rare are assigned the natural log of a much larger value, which ultimately gives them a higher overall relevance to the search.
So how do we actually use these? We want to combine the two factors, and multiplying them together is a pretty reasonable choice: words that show up often on a page contribute more, since their TF is higher, while words that show up on nearly every page (and are therefore likely less important to the search) are basically eliminated, since their IDF is close to zero. In part 2d, we will use a user's query (run through approximately the same process to generate a new TF-IDF vector) to identify which documents are the 'most relevant' to it.
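To make this concrete, here is one standard way of writing these formulas; it is consistent with the description above, but double-check the exact formulas you are expected to use against the rest of this spec:
\[
\text{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of words in } d},
\qquad
\text{IDF}(t) = \ln\!\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right),
\]
\[
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t).
\]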
To compute the TF-IDF vector for a single document (a webpage), we take that document and compute the TF-IDF score for each unique word. For each unique word, or term, we compute its term frequency within the document, compute its inverse document frequency across all documents, and multiply the two together to get that term's TF-IDF score.
We store each term and its corresponding TF-IDF score for a particular document inside an IDictionary<String, Double>. This dictionary is the TF-IDF vector.
The TfIdfAnalyzer.computeAllDocumentTfIdf(...) method is responsible for computing the TF-IDF vector for every document: it should return an IDictionary<URI, IDictionary<String, Double>>. We will store the resulting dictionary in a field for later use.
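For reference, a rough sketch of how this nested dictionary might look inside TfIdfAnalyzer (the field name here is illustrative, not required, and this is a fragment rather than complete code):

// Maps each document's URI to that document's TF-IDF vector.
private IDictionary<URI, IDictionary<String, Double>> documentTfIdfVectors;

// Later, the score of a word in a particular document can be looked up like this,
// treating words that never appear in the document as having a score of 0.0:
IDictionary<String, Double> vector = this.documentTfIdfVectors.get(pageUri);
double score = vector.containsKey(word) ? vector.get(word) : 0.0;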
Let's try running through an example of computing TF-IDF in action.
Suppose our search engine contained the following three documents (a.k.a. three webpages):
Document A: "the mouse played with the cat"
Document B: "the quick brown fox jumped over the lazy dog"
Document C: "dog 1 and dog 2 ate the hot dog"
We compute the TF-IDF vector for each document by following the process described above.
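The resulting numbers are easiest to see by just computing them. Below is a minimal, self-contained Java sketch (for illustration only: it uses plain java.util collections rather than the assignment's IDictionary, and assumes the TF and IDF formulas shown earlier) that prints a TF-IDF vector for each of the three documents:

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfExample {
    public static void main(String[] args) {
        // The three example documents, split into words.
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("A", Arrays.asList("the mouse played with the cat".split(" ")));
        docs.put("B", Arrays.asList("the quick brown fox jumped over the lazy dog".split(" ")));
        docs.put("C", Arrays.asList("dog 1 and dog 2 ate the hot dog".split(" ")));

        // IDF(word) = ln(total number of documents / number of documents containing the word).
        Map<String, Double> docsContaining = new HashMap<>();
        for (List<String> words : docs.values()) {
            for (String word : new HashSet<>(words)) {
                docsContaining.merge(word, 1.0, Double::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Double> entry : docsContaining.entrySet()) {
            idf.put(entry.getKey(), Math.log(docs.size() / entry.getValue()));
        }

        // TF-IDF(word, doc) = TF(word, doc) * IDF(word).
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            List<String> words = doc.getValue();
            Map<String, Double> vector = new HashMap<>();
            for (String word : words) {
                vector.merge(word, 1.0 / words.size(), Double::sum);   // accumulate TF
            }
            vector.replaceAll((word, tf) -> tf * idf.get(word));       // multiply by IDF
            System.out.println("Document " + doc.getKey() + ": " + vector);
        }
    }
}

Under the formula assumed here, a word like "the" appears in all three documents, so its IDF is ln(3/3) = 0 and it contributes nothing to any vector, while "dog" appears in only two of the three documents, so its IDF is ln(3/2), roughly 0.41.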
Ultimately, we'll obtain a unique TF-IDF vector for each document. We'll use these vectors in the next subpart.
Task: Finish implementing TfIdfAnalyzer.computeRelevance(...).
After finishing the last part, we can compute a TF-IDF vector for each document that encodes the importance of each word to that document. Now, we want to compute a TF-IDF vector for the search query as well, and then compare it against each document's vector to determine how relevant that document is to the query. To do this, we will use cosine similarity, a method of quantifying how close two vectors are.
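For reference, the cosine similarity between two vectors \(a\) and \(b\) is the standard quantity
\[
\text{similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}},
\]
which is what the pseudocode further below computes (returning 0.0 when the denominator is zero), treating any word missing from a vector as having a score of 0.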
To compare a query against a document, we will need to perform four steps:
Find the TF-IDF vector for the document. This should be an easy, \(O(1)\) lookup since we already computed the TF-IDF vector for that document in part 2b.
Compute the TF-IDF vector for the query. You should be able to reuse the bulk of the code you wrote in the previous parts to do this (though you may have to do some mild refactoring).
Note: technically, we are treating the query as if it were a new document. However, you should not recompute the IDF values: just use the ones you computed earlier.
Compute the cosine similarity between the document vector and the query vector. You'll find pseudocode below.
Make the pseudocode more efficient.
This pseudocode is not particularly efficient: you should be able to combine several loops and precompute some pieces of information needed by this method.
When implementing this method, you should add at least one other field to your TfIdfAnalyzer class containing information you can precompute. Modify your constructor to precompute and store this information.
Here's the pseudocode for cosine similarity:

double computeRelevance(query, pageUri):
    documentVector = look up document TF-IDF vector using pageUri
    queryVector = compute query TF-IDF vector

    numerator = 0.0
    foreach unique word in query:
        docWordScore = if documentVector.containsKey(word)
                       then documentVector.get(word)
                       else 0.0
        queryWordScore = queryVector.get(word)
        numerator += docWordScore * queryWordScore

    denominator = norm(documentVector) * norm(queryVector)

    if denominator != 0
    then return numerator / denominator
    else return 0.0

double norm(vector):
    output = 0.0
    foreach pair in vector:
        score = pair.getValue()
        output += score * score
    return sqrt(output)
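As a hint for the precomputation mentioned above, one natural piece of information to store is the norm of every document's TF-IDF vector, so that computeRelevance(...) does not recompute it for every query. A rough sketch of the idea, assuming that iterating over an IDictionary yields key-value pairs with a getValue() method as in the pseudocode (the field and method names here are illustrative, not required):

// Illustrative field: the Euclidean norm of each document's TF-IDF vector,
// filled in by the constructor right after the TF-IDF vectors themselves.
private IDictionary<URI, Double> documentTfIdfVectorNorms;

// Illustrative helper mirroring the norm(vector) pseudocode above.
// KVPair here stands for whatever pair type your IDictionary iterator returns.
private double norm(IDictionary<String, Double> vector) {
    double output = 0.0;
    for (KVPair<String, Double> pair : vector) {
        double score = pair.getValue();
        output += score * score;
    }
    return Math.sqrt(output);
}

With such a field in place, the denominator in computeRelevance(...) only needs one dictionary lookup for the document plus a single norm computation for the (small) query vector.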
Task: Try running your search engine again.
After completing the previous parts, try running your search engine again on the different data folders.
As before, it may take some time for your search engine to finish loading all the pages and finish setting up your TF-IDF analyzer. However, once everything is set up, your search engine should be able to answer queries nearly instantaneously.
But unlike before, your engine should be doing much better this time! You should be getting back mostly reasonable results. The one exception is the "wikipedia-with-spam" data folder: if you pick this folder, you may see links to a few spam websites interspersed with the wikipedia links we care about.
More specifically, we've inserted two spam webpages – one related to computer science, and another related to Seattle. Try running searches related to those two things! (E.g. try searching "computer science", "seattle", "seattle times", etc). Our fake spam pages won't necessarily be the very first result, but they should consistently appear near the top.
You should think about why that is. What are these webpages doing to artificially inflate their ranking results?
You'll implement an algorithm to deal with this problem in part 2b of this project.