Useful CSE 163 Resources¶
Learning objective: Implement specialized data types with Python classes for tf-idf information retrieval.
-
search_engine.py
is the file representing theSearchEngine
class as well as a command-line interface for using theSearchEngine
. -
document.py
is the file representing a single document in theSearchEngine
. -
hw4_test.py
is the file for you to put your own tests. The Run button executes this program. -
server.py
andindex.html
provide a web app for you to run your completed search engine. You do not need to look at or modify these files ever. -
cse163_utils.py
is a helper file that has code to help you test your code.
Warning
This assessment is very information-dense. Plan ahead by taking notes of what you need to do and where you will need to do it. It will probably help to get a global view of the entire assessment by reading all of the components before starting any coding—we expect students to spend at least 40 minutes reading, synthesizing, and planning before starting any coding.
Context¶
A search engine is an algorithm that takes a query and retrieves the most relevant documents for that query. In order to identify the most relevant documents, our search engine will use term frequency–inverse document frequency (tf–idf), a text information statistic for determining the relevance of a term to each document from a corpus consisting of many documents.
The tf-idf statistic consists of two components: term frequency and inverse document frequency. Term frequency computes the number of times that a term appears in a document (such as a single Wikipedia page). If we were to use only the term frequency in determining the relevance of a term to each document, then our search result might not be helpful since most documents contain many common words such as “the” or “a”. In order to down-weight these common terms, the document frequency computes the number of times that a term appears across the corpus of all documents. The tf-idf statistic takes a term and a document and returns the term frequency divided by the document frequency.
In this assessment, we’ll implement two classes: a Document
class to represent individual web pages and a SearchEngine
class that aggregates the corpus of all Document
objects. The SearchEngine
builds on the Document
, so be sure to fully complete the Document
class before moving onto the SearchEngine
.
Document
¶
The Document
class defined in document.py
represents the data in a single web page and includes methods to compute term frequency. (But not document frequency since that would require access to all of the documents in the corpus.)
Task: Write an initializer __init__
that takes a path to a document and initializes the document data. Assume that the file exists, but that it could be empty. In order to implement term_frequency
later, we’ll need to precompute the term frequency for each term in the document in the initializer by constructing a dictionary that maps each str
term to its float
term frequency in the given document.
Consider the term frequencies for this short document containing 4 total words.
the cutest cutest dog
-
'the'
appears 1 time out of 4 total words, so the is 0.25. -
'cutest'
appears 2 times out of 4 total words, so the is 0.5. -
'dog'
appears 1 time out of 4 total words, so the is 0.25.
When constructing this dictionary, normalize all terms by lowercasing text and ignoring punctuation so that 'corgi'
, 'CoRgi'
, and 'corgi!!'
are considered the same. We have provided a function called normalize_token
in cse163_utils.py
that you can import to normalize a single token from a string. For example, the following code cells show how to use the function.
token = 'CoRgi!!'
token = normalize_token(token)
print(token) # corgi
token = '<div>Hi!</div>'
token = normalize_token(token)
print(token) # divhidiv
Info
If you’re familiar with HTML, you might have noticed that a lot of the files in our provided small_wiki
corpus contain HTML code, including tags that look like <div>
. Don’t worry about handling the html. Sticking with what we provide above for normalization is enough.
Task: Write a method term_frequency
that returns the term frequency of a given term by looking it up in the precomputed dictionary. Remember to normalize the term. If a term does not appear in a given document, it should have a term frequency of 0.
Task: Write a method get_path
that returns the path of the file that this document represents.
Task: Write a method get_words
that returns a list
of the unique, normalized words in this document.
Task: Write a method __repr__
that returns a string representation of this document in the format "Document({doc_path}, words: {num_words})"
where doc_path
is the path to the document given in the Document
initializer and where num_words
is the total number of words in the document. __repr__
is a special “magic method” that will be called if you try to print
an instance of your class.
Below is an example of expected behavior. Notice how the doc_path
is the same as the one passed in to the initializer.
d = Document("/course/small_wiki/William McKinley - Wikipedia.html")
print(d)
# should print out
# Document(/course/small_wiki/William McKinley - Wikipedia.html, words: 34599)
Warning
Be careful when writing assert_equals
tests for __repr__
. You need to call the __repr__
function explicitly, like below. If you don’t, you will get an AssertionError
that is really quite unhelpful.
doc = Document("/files/dogs.txt")
# option 1
assert_equals("Document(/files/dogs.txt, words: 4)", doc.__repr__())
# option 2
assert_equals("Document(/files/dogs.txt, words: 4)", repr(doc))
Task: Write at least 3 assert_equals
tests for Document
in hw4_test.py
using your own test corpus.
Hint
Create new files in your workspace for each additional test case. When specifying file names, use absolute paths such as /home/song.txt
.
Warning
Bugs in the Document
implementation will really hurt when implementing SearchEngine
. Make sure to fully test your Document
before moving on.
SearchEngine
¶
The SearchEngine
class defined in search_engine.py
represents a corpus of Document
objects and includes methods to compute the tf–idf statistic between each document and a given query.
Task: Write an initializer that takes a str
path to a directory such as /course/small_wiki/
and constructs an inverted index associating each term in the corpus to the list of documents that contain the term. Assume the string represents a valid directory, and that the directory contains only valid files. Do not recreate any behavior that is already done in the Document
class—call the Document.get_words
method! Create at most one Document
object for each file.
In order to implement search
later, it will be necessary to find all documents which contain the given term. The initializer for SearchEngine
should precompute the inverted index, a dictionary associating each str
term to the list
of Document
objects that include the term. Consider this demonstration corpus of 3 str
documents and the inverted index for the corpus.
doc1 = Document("/home/files/corgis.txt") # File contents: 'I love corgis'
doc2 = Document("/home/files/puppies.txt") # File contents: 'I love puppies'
doc3 = Document("/home/files/dogs.txt") # File contents: 'I love dogs'
inverted_index = {
'i': [doc1, doc2, doc3],
'love': [doc1, doc2, doc3],
'corgis': [doc1],
'puppies': [doc2],
'dogs': [doc3]
}
The inverted index will help in implementing the search
method below by providing a way to answer the question, “Which documents contain the term, 'corgis'
?”
To iterate over all the files in a directory, call os.listdir
to list all the file names and join the path with os.path.join
.
Example:
import os
directory = '/course/small_wiki/'
for filename in os.listdir(directory):
print(os.path.join(directory, filename))
Task: Write a method _calculate_idf
that takes a str
term and returns the inverse document frequency of that term. If the term is not in the corpus, return 0. Otherwise, if the term is in the corpus, compute the inverse document frequency as follows.
Call math.log
to compute the natural logarithm of a given number.
Info
_calculate_idf
is a private method, which should not be called by the client. Because it’s defined in the SearchEngine
class, you should use it to help you implement the search
function. We will expect that you have this private method with the behavior described so that we can test your code.
Task: Write a method __repr__
that returns a string representation of this SearchEngine
in the format "SearchEngine({path}, size: {num_docs})"
where path
is the path to the directory given in the initializer and where num_docs
is the total number of unique documents in the corpus.
Below are examples of expected behavior. Notice how the path
that’s printed out by __repr__
is always the same as the one passed into the constructor.
se = SearchEngine("/course/small_wiki")
print(se) # SearchEngine(/course/small_wiki, size: 70)
se2 = SearchEngine("/course/small_wiki/")
print(se2) # SearchEngine(/course/small_wiki/, size: 70)
Task: Write a method search
that takes a str
query that contains one or more terms. The search
method returns a list
of document paths sorted in descending order by tf–idf statistic. Normalize the terms before processing. If there are no matching documents, return an empty list.
Subgoal: Start by implementing the search
method for queries that contain only a single term. To generate a ranked list of documents, first collect all the documents that contain the given term. Then, compute the tf–idf statistic for each document.
Store these (Document, float)
pairs as a list of tuples. Finally, return the sorted
list of documents in descending order according to the tf–idf statistic.
Info
For this assessment, you do not need to worry about tiebreaker cases where two documents have the same tf-idf. Consequently, your assert_equals
tests should not have ties. If you encounter a tiebreaker situation it might be an indicator that your search algorithm is incorrectly computing scores for each document!
Subgoal: Start writing your first assert_equals
test for search
with a single term in hw4_test.py
using your own test corpus.
Subgoal: Extend the search
method to queries that contain multiple terms. The output of a multi-term query are all the documents that match at least one term in the query. The tf–idf statistic for multi-term queries is the sum of the single-term tf–idf statistics.
Finding the relevant documents for a multi-word query is a bit more challenging. Instead of looking at a single entry in the dictionary, we must look at all Document
objects that contain at least one word in the query.
Task: Write at least 3 assert_equals
tests for SearchEngine
in hw4_test.py
using your own test corpus.
SearchEngine.search
¶
Let’s walk through a full example of the search
process. Say we have a corpus named 'other_files'
containing 3 documents with the following contents:
-
/home/other_files/doc1
—Dogs are the greatest pets.
-
/home/other_files/doc2
—Cats seem pretty okay
-
/home/other_files/doc3
—I love dogs!
This corpus would have the following inverted index
{
"dogs": [doc1, doc3],
"are": [doc1],
"the": [doc1],
"greatest": [doc1],
"pets": [doc1],
"cats": [doc2],
"seem": [doc2],
"pretty": [doc2],
"okay": [doc2],
"i": [doc3],
"love": [doc3],
}
Searching this corpus for the query 'love dogs'
returns a list in the order ['/home/other_files/doc3', '/home/other_files/doc1']
.
-
Find all matching documents with at least one query word.
doc3
contains the word'love'
while bothdoc1
anddoc3
contain the word'dogs'
. Bothdoc1
anddoc3
contain at least one word in the query. -
Compute the tf–idf statistic for each matching document. For each matching document, the tf–idf statistic for a multi-word query
'love dogs'
is the sum of the tf–idf statistics for'love'
and'dogs'
individually.-
since
'love'
doesn’t appear indoc1
. -
.
-
-
Associate each matching document with its tf–idf statistic in a list of tuples to sort by descending tf–idf statistic. Return the matching document paths in descending tf–idf order
['/home/other_files/doc3', '/home/other_files/doc1']
.
Info
Sorting by tf-idf with a list of tuples is just how we decided to do it. If you find that another method of storing and sorting makes more sense to you, like storing a dictionary mapping documents to their respective scores, feel free to try it out!
Warning
You might find that field variables are helpful as you implement your SearchEngine; think carefully about if a variable needs to be used outside of one function and hence wether or not it needs to be stored with self
.
Quality¶
Assessment submissions should pass these checks: flake8
and code quality guidelines. The code quality guidelines are very thorough. For this assessment, the most relevant rules can be found in these sections (new sections bolded):
-
-
Private Fields
Submission¶
Submit your work by pressing the Mark button. Submit as often as you want until the deadline for the initial submission. Note that we will only grade your most recent submission. You can view your past submissions using the “Submissions” button.
Please make sure you are familiar with the resources and policies outlined in the syllabus and the take-home assessments page.
THA 4 - Search
Initial Submission by Thursday 07/28 at 11:59 pm.
Submit on Ed