The SearchEngine class will manage a collection of Documents and handle computing the relevance of a Document to a given term. To do this, the SearchEngine will compute the TF-IDF of a given term for each Document that contains the term.
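For reference (this is the standard TF-IDF definition, and it matches the worked example later in this spec), the TF-IDF of a term \( t \) in a document \( d \) is the product of the term frequency and the inverse document frequency:

\[ \text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t) \]

where \( \text{tf}(t, d) \) is the number of times \( t \) appears in \( d \) divided by the total number of words in \( d \), and \( \text{idf}(t) \) is defined below.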
Part of the development strategy for implementing search_engine.py is building the solution up in parts. In Part 1, you will implement a search_engine.py that supports only single-word queries. In Part 2, you will extend this to support multi-word queries.

Your search_engine.py may use the math, os, and re modules, as well as the document module you wrote for Part 0, but you may not use any other imports. We highly recommend reading the entire search_engine.py (single-word query) spec and understanding the algorithm for ranking documents before implementing. It will be challenging to implement the necessary data structures without understanding the context for how they will be used in the search function.
The SearchEngine initializer should take a directory name (string) and construct an inverted index representing all documents in the given directory. You may assume that the given string represents a valid directory and that the directory contains only valid files.
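As a rough illustration, the initializer might look something like the sketch below, which builds an inverted index mapping each term to the Documents that contain it. The Document method used here (get_words, returning the unique normalized words in a document) is an assumption about your Part 0 interface; adapt it to whatever your document module actually provides.

```python
import os

from document import Document


class SearchEngine:
    def __init__(self, directory):
        """Index every file in directory for single-word search."""
        self._documents = []   # every Document in the corpus
        self._index = {}       # term -> list of Documents containing it
        for file_name in os.listdir(directory):
            doc = Document(os.path.join(directory, file_name))
            self._documents.append(doc)
            # get_words() is an assumed Part 0 method returning the
            # unique, normalized (lowercase, punctuation-free) words.
            for word in doc.get_words():
                if word not in self._index:
                    self._index[word] = []
                self._index[word].append(doc)
```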
Write a function _calculate_idf which takes a term as a string argument and returns the IDF score for that term over all documents managed by the SearchEngine. The IDF represents the inverse document frequency, and is the inverse of the proportion of documents in the SearchEngine that contain the given term. The IDF should be calculated as follows:

\[ \text{idf}(t) = \ln\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right) \]

where \( \ln \) is the natural logarithm. This can be computed using the log function in the math module.
Note: _calculate_idf is a private function, which will not be used directly by the client. Because it's defined in the SearchEngine class, you should use it to help you implement the search function. We will expect that you have this private function with the behavior described so that we can test your code. As mentioned in lecture, it's generally a bad practice for clients to look at private functions, but we are making an exception for our tests so we can better test your code.
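A minimal sketch of _calculate_idf as a method of SearchEngine, assuming the self._index and self._documents attributes from the initializer sketch above:

```python
import math


def _calculate_idf(self, term):
    """Return ln(total documents / documents containing term)."""
    # What to return for a term that appears in no document is not
    # pinned down by the formula; 0 is one reasonable convention.
    if term not in self._index:
        return 0
    return math.log(len(self._documents) / len(self._index[term]))
```

For the example corpus described under The Algorithm below, _calculate_idf('dogs') would return math.log(3 / 2), since two of the three documents contain the term.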
Write a function search which takes a string as an argument and returns a list of document names. The documents should be sorted in descending order of TF-IDF. If no document in your SearchEngine contains the term, the function should return None. The search function should be case-insensitive, and ignore any punctuation in the given term.
The Algorithm
To generate a ranked list of documents, it will help to understand a full example of what the search algorithm should return for an example corpus. Suppose we had the following three files in a directory test-files (shown as file name / content pairs). Notice these documents are slightly different than in the last part.
test-files/doc1.txt: Dogs are the greatest pets.
test-files/doc2.txt: Cats seem pretty okay.
test-files/doc3.txt: I love dogs!
Then, if we were to search for the term 'dogs', the search algorithm would follow these steps (a code sketch implementing them appears after this list):

1. Find all of the documents that contain the term 'dogs'. In this case, the relevant files are test-files/doc1.txt and test-files/doc3.txt.
2. Compute the TF-IDF of the term 'dogs' in test-files/doc1.txt, which in this example is 0.081. This is because the IDF of the term 'dogs' in the corpus is math.log(3/2) and the TF of the term 'dogs' in test-files/doc1.txt is 1/5.
3. Compute the TF-IDF of the term 'dogs' in test-files/doc3.txt, which in this example is 0.135. This is because the IDF of the term 'dogs' in the corpus is math.log(3/2) and the TF of the term 'dogs' in test-files/doc3.txt is 1/3.
4. Build a list of (document name, TF-IDF) pairs: [('test-files/doc1.txt', 0.081), ('test-files/doc3.txt', 0.135)].
5. Sort the list in descending order of TF-IDF: [('test-files/doc3.txt', 0.135), ('test-files/doc1.txt', 0.081)].
6. Return just the document names: ['test-files/doc3.txt', 'test-files/doc1.txt']
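Here is a sketch of how these steps might translate into code, assuming the attributes from the earlier sketches plus two assumed Document methods from Part 0 (term_frequency for the TF of a word, and get_path for the document's name):

```python
import re


class SearchEngine:
    # ... __init__ and _calculate_idf as sketched earlier ...

    def search(self, query):
        """Return document names sorted by descending TF-IDF,
        or None if no document contains the query term."""
        # Normalize the query the same way document words were
        # normalized: lowercase, punctuation removed.
        term = re.sub(r'[^\w\s]', '', query.lower())
        if term not in self._index:
            return None
        idf = self._calculate_idf(term)
        # Steps 2-4: score each document containing the term.
        scored = [(doc.get_path(), doc.term_frequency(term) * idf)
                  for doc in self._index[term]]
        # Step 5: sort by score, descending.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        # Step 6: keep only the document names.
        return [name for name, _ in scored]
```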
Thinking forward to Part 3, you can now start thinking about testing your SearchEngine class to gain confidence your solution is correct. This is a good idea to start now so you can identify any bugs here, rather than trying to find them when implementing the multi-word queries in SearchEngine.
At this point, your SearchEngine is complete enough to run a query with a single term in it. It is hard to write a test against the entire wikipedia dataset (since you don't know what the expected answer is), so making a test case for the example shown above would be a great starting point!
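For example, a test along these lines (using plain assert statements; adapt to whatever testing helpers your course provides) would check the worked example above, assuming the three test-files documents exist on disk:

```python
from search_engine import SearchEngine


def test_search_single_word():
    engine = SearchEngine('test-files')
    # The worked example: doc3 (TF-IDF 0.135) ranks above doc1 (0.081).
    assert engine.search('dogs') == ['test-files/doc3.txt',
                                     'test-files/doc1.txt']
    # A term no document contains should return None.
    assert engine.search('zebra') is None


if __name__ == '__main__':
    test_search_single_word()
    print('All tests passed!')
```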