The SearchEngine Class (single word queries)
The SearchEngine
class will manage a collection of Documents
, and handle computing the relevance of a Documents
to a given term. To do this, the SearchEngine
will compute the TF-IDF
of a given term for each Document
which contains the term.
Part of the development stragety for implementing search_engine.py
is building the solution up in parts. In Part 1, you will implement a search_engine.py
that supports only single-word queries. In Part 2, you will extend this to support multi-word queries.
Expectations
- You should create and implement a python file called
search_engine.py
- For this part of the assignment, you may import and use the
math
, os
, and re
modules as well as the document
module you wrote for Part 0, but you may not use any other imports.
- The first line of your file should be a comment with your uwnetid
- Your field(s) in this class should be private.
Important
We highly recommend reading the entire search_engine.py
(single word query) spec, and understanding the algorithm for ranking documents before implementing. It will be challenging to implement the necessary data structures without understanding the context for how they will be used in the search
function.
Constructing a SearchEngine
The SearchEngine
constructor should take a directory name (string), and construct an inverse index representing all documents in the given directory. You may assume that the given string represents a valid directory, and that the directory contains only valid files.
The os.listdir()
function will return a list of all the files and directories in a given directory.
import os
for file in os.listdir('directory_name'):
print(file)
# Output
# file1
# file2
# file3
You will use something very similar in your
SearchEngine
constructor. Each file that is returned by the
os.listdir()
is represented as only the file name. However, when you pass the file name into the
Document
constructor, it will try to open the file based on it's name. Since the file is in a different directory than the
Document
it will not be able to open the file. To avoid this, we must pass the
path of the file to the
Document
constructor. Since the directory containing the file will be in the same directory as the
Document
class, we can pass the
directory_name + '/' + file_name
to the
Document
constructor. The
directory_name
is just the string passed to the
SearchEngine
constructor.
When you implement the search
function it will be necessary to find all documents which contain the given search term. It is inefficient to iterate over each document in the SearchEngine
and check if it contains a given word. Instead, the constructor should precompute an inverse index. An inverse index is just a fancy term for a dictionary which maps from a term (string) to a list of Documents which contain that term. The visualization below descirbes what the inverse index should look like for a few example documents.
When you are constructing the inverse index, you should not manually iterate over the words in each Document
that is constructed in the SearchEngine
class. Instead you should use the methods of the Document
class to help you construct the inverse index.
Calculating the IDF
Write a function _calculate_idf
which takes a term as a string argument and returns the score for that term over all documents managed by the SearchEngine
. The IDF represents the inverse document frequency, and is the inverse of the proportion of documents in the SearchEngine
that contain the given term. The IDF should be calculated as follows:
\(
\begin{equation*}
IDF(t) = \begin{cases}
0 & t \text{ does not appear in the Search Engine}\\
ln(\frac{\text{total number of documents in Search Engine}}{\text{number of documents in Search Engine containing term } t}) & \text{otherwise}
\end{cases}
\end{equation*}\)
Where \( ln \) is the natural logarithm. This can be computed using the log
function in the math
module.
Note: _calculate_idf
is a private function, which will not be used directly by the client. However, it will be used by the search
function. We will expect that you have this private function with the behavior described so that we can test your code. As mentioned in lecture, it's generally a bad practice for clients to look at private functions, but we are making an exception for our tests so we can better test your code.
The Search Function
Write a function search
which takes a string as an argument and returns a list of document names. The documents should be sorted in descending order of TF-IDF. If no document in your SearchEngine
contains the term, the function should return None
. The search function should be case-insensitive, and ignore any punctuation in the given term.
The TF-IDF for term t on document D is defined as
\(
TFIDF(t, D) = TF(t, D) \cdot IDF(t)
\)
The Algorithm
To generate a ranked list of documents
- Compute the TF-IDF score for each document that contains the given term. You will probably want to store these scores in a list of tuples each containing the name of the document and its score.
- Sort the documents in descending order of TF-IDF.
- Return the document names as a list.