In this homework, you’ll implement a basic search engine by defining your own Python classes. A search engine is an algorithm that takes a query and retrieves the most relevant documents for that query. To identify the most relevant documents, our search engine will use term frequency–inverse document frequency (tf–idf), an information retrieval statistic for measuring the relevance of a term to each document in a corpus consisting of many documents.
The tf–idf statistic is the product of two values: term frequency and inverse document frequency. Term frequency counts the number of times that a term appears in a document (such as a single Wikipedia page). If we were to use only term frequency to determine the relevance of a term to each document, then our search results might not be helpful since most documents contain many common words such as “the” or “a”. To downweight these common terms, the document frequency counts the number of documents in the corpus that contain the term; taking its inverse assigns smaller weights to terms that appear in many documents. We’ve provided some example corpuses in the doggos directory and the small_wiki directory.
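As a quick, concrete illustration (a sketch with made-up numbers, using the log-based definition of inverse document frequency introduced later in this assignment): suppose a term appears once in a 5-word document, and in 2 of a corpus’s 3 documents.
import math

tf = 1 / 5             # the term appears 1 time among the document's 5 words
idf = math.log(3 / 2)  # 3 documents in total, 2 of them contain the term
print(tf * idf)        # tf–idf ≈ 0.081: a common term earns a small weight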
!pip install -qq nb_mypy pytest ipytest
%reload_ext nb_mypy
%nb_mypy mypy-options --strict
import os
import math
import pytest
import ipytest
ipytest.autoconfig()
Throughout this assignment, use the following function to normalize a token.
import re
def normalize(token: str, pattern: re.Pattern[str] = re.compile(r"\W+")) -> str:
    """
    Returns all the characters in the token lowercased and without matches to the given pattern.

    >>> normalize("CoRgi")
    'corgi'
    >>> normalize("corgi!!")
    'corgi'
    >>> normalize("corgi")
    'corgi'
    >>> normalize("")
    ''
    """
    return pattern.sub("", token.lower())
Task: Document
Write and test a Document class in the code cell below that can be used to represent the text in a web page and includes methods to compute term frequency (but not document frequency, since that would require access to all the documents in the corpus).
The Document class should include:
An initializer that takes a str path and initializes the document data based on the text in the specified file. Assume that the file exists, but that it could be empty. In order to implement term_frequency later, we’ll need to precompute and save the term frequency for each term in the document in the initializer as a field by constructing a dictionary that maps each str term to its float term frequency. Term frequency is defined as the count of the given term divided by the total number of words in the text. (See the sketch after this list.)
Consider the term frequencies for this document containing 4 total words: “the cutest cutest dog”.
“the” appears 1 time out of 4 total words, so its term frequency is 0.25.
“cutest” appears 2 times out of 4 total words, so its term frequency is 0.5.
“dog” appears 1 time out of 4 total words, so its term frequency is 0.25.
When constructing this dictionary, call normalize to convert the input token to lowercase and ignore non-letter characters so that “corgi”, “CoRgi”, and “corgi!!” are all considered the same string “corgi” to the search algorithm. The empty string is valid and should be counted as well.
A method term_frequency that takes a given str term and returns its term frequency by looking it up in the precomputed dictionary. Remember to normalize the term before looking it up to find the corresponding match. If the term does not occur, return 0.
A method get_path that returns the str path of the file that this document represents.
A method get_words that returns a set of the unique, normalized words in this document.
A method __repr__ that returns a string representation of this document in the format Document('{path}') (with literal single quotes in the output) where {path} is the path to the document from the initializer. The __repr__ method is called when Jupyter Notebook needs to display a Document as output, so we should be able to copy the string contents into a new code cell and immediately run it to create a new Document.
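Here is a minimal sketch of the initializer’s precomputation and the term_frequency lookup, assuming whitespace tokenization and the normalize function from earlier. The field names (_path, _term_frequencies) are our own choices, not mandated by the spec, and __repr__ is left to you.
class Document:
    """Sketch: represents one file's text and its precomputed term frequencies."""

    def __init__(self, path: str) -> None:
        self._path = path
        with open(path, encoding="utf-8") as file:
            # Normalize every whitespace-separated token so that "CoRgi" and
            # "corgi!!" both count toward "corgi".
            words = [normalize(token) for token in file.read().split()]
        counts: dict[str, int] = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        # Term frequency: count of each term over the total number of words.
        # (An empty file yields an empty dictionary, so there is no division.)
        self._term_frequencies = {word: count / len(words)
                                  for word, count in counts.items()}

    def term_frequency(self, term: str) -> float:
        # Normalize the lookup term to match the normalized keys; 0 if absent.
        return self._term_frequencies.get(normalize(term), 0)

    def get_path(self) -> str:
        return self._path

    def get_words(self) -> set[str]:
        return set(self._term_frequencies)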
To test your Document, create your own testing corpus by creating a New Folder and adding your own text files to it. Then, for each of the 4 methods (excluding the initializer) in the Document class, write a testing function that contains at least 3 pytest-style assertions based on your own testing corpus, with expected values written as expressions like 1 / 5 that show your thinking. As always, your test cases should check for edge cases. Documentation strings are optional for testing functions.
Be sure to exhaustively test your Document class before moving on: bugs in Document will make implementing the following SearchEngine task much more difficult.
class Document:
    ...

doc1 = Document("doggos/doc1.txt")
euro = Document("small_wiki/Euro - Wikipedia.html")
...

def test_term_frequency() -> None:
    assert doc1.term_frequency("dogs") == pytest.approx(1 / 5)
    assert euro.term_frequency("Euro") == pytest.approx(245 / 28376)
    ...

def test_get_words() -> None:
    assert doc1.get_words() == set("dogs are the greatest pets".split())
    assert set(w for w in euro.get_words() if len(w) == 1) == set([
        *"0123456789acefghijklmnopqrstuvxyz".lower()  # All one-letter words in Euro
    ])
    ...

...

ipytest.run("-q");
Task: SearchEngine
Write and test a SearchEngine class in the code cell below that represents a corpus of Document objects and includes methods to compute the tf–idf statistic between a given query and every document in the corpus. The SearchEngine begins by constructing an inverted index that associates each term in the corpus with the list of Document objects that contain the term.
To iterate over all the files in a directory, call os.listdir to list all the file names and join the directory to the filename with os.path.join. The example below will print only the .txt files in the doggos directory.
path = "doggos"
extension = ".txt"
for filename in os.listdir(path):
if filename.endswith(extension):
print(os.path.join(path, filename))The SearchEngine class should include:
An initializer that takes a str path to a directory such as "small_wiki" and a str file extension and constructs an inverted index from the files in the specified directory matching the given extension. By default, the extension should be ".txt". Assume the string represents a valid directory, and that the directory contains only valid files. Do not recreate any behavior that is already done in the Document class: call the get_words() method! Create at most one Document per file. (A minimal sketch appears after this list.)
A method _calculate_idf that takes a str term and returns the inverse document frequency of that term. If the term is not in the corpus, return 0. Inverse document frequency is defined by calling math.log on the total number of documents in the corpus divided by the number of documents containing the given term.
A method __repr__ that returns a string representation of this search engine in the format SearchEngine('{path}') (with literal single quotes in the output) where {path} is the path to the directory specified in the initializer.
A method search that takes a str query consisting of one or more terms and returns a list of relevant document paths that match at least one of the normalized terms, sorted by descending tf–idf statistic: the product of the term frequency and inverse document frequency. If there are no matching documents, return an empty list.
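Here is a minimal sketch of the inverted index construction and _calculate_idf, assuming the Document sketch above; the field names _path, _num_documents, and _inverted_index are our own choices:
class SearchEngine:
    """Sketch: a corpus of Documents indexed by the terms they contain."""

    def __init__(self, path: str, extension: str = ".txt") -> None:
        self._path = path
        self._num_documents = 0
        # Map each term to the list of Documents whose get_words() contains it.
        self._inverted_index: dict[str, list[Document]] = {}
        for filename in os.listdir(path):
            if filename.endswith(extension):
                document = Document(os.path.join(path, filename))  # one per file
                self._num_documents += 1
                for word in document.get_words():
                    self._inverted_index.setdefault(word, []).append(document)

    def _calculate_idf(self, term: str) -> float:
        if term not in self._inverted_index:
            return 0  # the term appears nowhere in the corpus
        # log of (total documents / documents containing the term).
        return math.log(self._num_documents / len(self._inverted_index[term]))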
To test your SearchEngine, for each of the 3 methods (excluding the initializer) in the SearchEngine class, write a testing function that contains at least 3 pytest-style assertions based on your own testing corpus, with expected values written as expressions like 1 / 5 that show your thinking. Test cases for SearchEngine.__repr__ may use the given corpuses doggos and small_wiki in addition to your corpus. Documentation strings are optional for testing functions.
The type checker can be too picky about the sorted key parameter: a type error here can be safely ignored if the code works as intended. You can add the comment # type: ignore[arg-type] to the end of the problematic line to ignore this type error.
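For example, this is the kind of call that may need the comment; the variable names here are illustrative only:
matches = {"doggos/doc3.txt": 0.501, "doggos/doc1.txt": 0.081}
# dict.get returns an Optional value, which strict mypy rejects as a sort key
# even though every key is present here.
results = sorted(matches, key=matches.get, reverse=True)  # type: ignore[arg-type]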
class SearchEngine:
    ...

doggos = SearchEngine("doggos")
small_wiki = SearchEngine("small_wiki", ".html")
...

def test_search() -> None:
    assert doggos.search("love") == ["doggos/doc3.txt"]
    assert doggos.search("dogs") == ["doggos/doc3.txt", "doggos/doc1.txt"]
    assert doggos.search("cats") == ["doggos/doc2.txt"]
    assert doggos.search("love dogs") == ["doggos/doc3.txt", "doggos/doc1.txt"]
    assert small_wiki.search("data")[:10] == [
        "small_wiki/Internet privacy - Wikipedia.html",
        "small_wiki/Machine learning - Wikipedia.html",
        "small_wiki/Bloomberg L.P. - Wikipedia.html",
        "small_wiki/Waze - Wikipedia.html",
        "small_wiki/Digital object identifier - Wikipedia.html",
        "small_wiki/Chief financial officer - Wikipedia.html",
        "small_wiki/UNCF - Wikipedia.html",
        "small_wiki/Jackson 5 Christmas Album - Wikipedia.html",
        "small_wiki/KING-FM - Wikipedia.html",
        "small_wiki/The News-Times - Wikipedia.html",
    ]
    ...

...

ipytest.run("-q");
We recommend the following iterative software development approach to implement the search method.
Write code to handle queries that contain only a single term by collecting all the documents that contain the given term, computing the tf–idf statistic for each document, and returning the list of document paths sorted by descending tf–idf statistic.
Write tests to ensure that your program works on single-term queries.
Write code to handle queries that contain more than one term by returning all the documents that match any of the terms in the query sorted by descending tf–idf statistic. The tf–idf statistic for a document that matches more than one term is defined as the sum of its constituent single-term tf–idf statistics (see the sketch after these steps).
Write tests to ensure that your program works on multi-term queries.
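A minimal sketch of this multi-term approach, building on the SearchEngine fields sketched earlier (the scores dictionary is our own device):
# Inside the SearchEngine class sketched above:
def search(self, query: str) -> list[str]:
    # Sum each matching document's tf–idf contributions across query terms.
    scores: dict[str, float] = {}
    for term in query.split():
        term = normalize(term)
        idf = self._calculate_idf(term)
        for document in self._inverted_index.get(term, []):
            path = document.get_path()
            scores[path] = scores.get(path, 0) + document.term_frequency(term) * idf
    # Descending tf–idf; an empty scores dict yields an empty list.
    return sorted(scores, key=scores.get, reverse=True)  # type: ignore[arg-type]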
Here’s a walkthrough of the search function from beginning to end. Say we have a corpus in a directory called "doggos" containing 3 documents with the following contents:
doggos/doc1.txt with the text Dogs are the greatest pets.
doggos/doc2.txt with the text Cats seem pretty okay
doggos/doc3.txt with the text I love dogs!
The initializer should construct the following inverted index.
{"dogs": [doc1, doc3],
"are": [doc1],
"the": [doc1],
"greatest": [doc1],
"pets": [doc1],
"cats": [doc2],
"seem": [doc2],
"pretty": [doc2],
"okay": [doc2],
"i": [doc3],
"love": [doc3]}Searching this corpus for the multi-term query "love dogs" should return a list ["doggos/doc3.txt", "doggos/doc1.txt"] by:
Finding all documents that match at least one query term. The word "love" is found in doc3 while the word "dogs" is found in doc1 and doc3.
Computing the tf–idf statistic for each matching document. For each matching document, the tf–idf statistic for the multi-word query "love dogs" is the sum of the tf–idf statistics for "love" and "dogs" individually.
For doc1, the sum is 0 + 0.081 = 0.081. The tf–idf statistic for "love" is 0 because the term is not in doc1.
For doc3, the sum is 0.366 + 0.135 = 0.501.
Returning the matching document paths sorted by descending tf–idf statistic.
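As a quick sanity check of these numbers (a sketch that just redoes the arithmetic above by hand):
import math

# doc1 ("Dogs are the greatest pets.") has 5 words; doc3 ("I love dogs!") has 3.
# "love" appears in 1 of 3 documents; "dogs" appears in 2 of 3.
idf_love = math.log(3 / 1)  # ≈ 1.099
idf_dogs = math.log(3 / 2)  # ≈ 0.405
print(0 * idf_love + 1 / 5 * idf_dogs)      # doc1: 0 + 0.081 ≈ 0.081
print(1 / 3 * idf_love + 1 / 3 * idf_dogs)  # doc3: 0.366 + 0.135 ≈ 0.501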
After completing your SearchEngine, run the following cell to search our small Wikipedia corpus for the query “data”. Try some other search queries too!
SearchEngine("small_wiki", ".html").search("data")