CSE 163, Winter 2020: Homework 4: Part 1

The SearchEngine Class (single word queries)

The SearchEngine class will manage a collection of Documents, and handle computing the relevance of a Documents to a given term. To do this, the SearchEngine will compute the TF-IDF of a given term for each Document which contains the term.

Part of the development stragety for implementing search_engine.py is building the solution up in parts. In Part 1, you will implement a search_engine.py that supports only single-word queries. In Part 2, you will extend this to support multi-word queries.

Expectations

  • You should create and implement a python file called search_engine.py
  • For this part of the assignment, you may import and use the math, os, and re modules as well as the document module you wrote for Part 0, but you may not use any other imports.
  • Your field(s) in this class should be private.

Important

We highly recommend reading the entire search_engine.py (single word query) spec, and understanding the algorithm for ranking documents before implementing. It will be challenging to implement the necessary data structures without understanding the context for how they will be used in the search function.

Constructing a SearchEngine

The SearchEngine initializer should take a directory name (string), and construct an inverse index representing all documents in the given directory. You may assume that the given string represents a valid directory, and that the directory contains only valid files.

Iterating over a Directory

The os.listdir() function will return a list of all the files and directories in a given directory.

import os
for file_name in os.listdir(directory_name):
    print(file_name)

    # Output
    # file1.txt
    # file2.txt
    # file3.txt

You will use something very similar in your SearchEngine initializer. Each file that is returned by the os.listdir() is represented as only the file name. However, when you pass the file name into the Document initializer, it will try to open the file based on its name. Since the file is in a different directory, the Document will not be able to open the file. To avoid this, we must pass the path of the file to the Document initializer. Since the directory containing the file will be in the same directory as the Document class, we can pass the directory_name + '/' + file_name to the Document initializer. The directory_name is just the string passed to the SearchEngine initializer.

For Mac Users

Macs sometimes create hidden files named .DS_Store that can cause your code to break since they aren't valid text files. One option is to make sure to delete this file. Another option is you are allowed to put a special condition in your code to ignore any files that start with a '.'. You may assume no valid file starts with a '.'. For example, you could write something like:

if not file_name.startswith('.')

Inverse Index

When you implement the search function it will be necessary to find all documents which contain the given search term. It is inefficient to iterate over each document in the SearchEngine and check if it contains a given word. Instead, the initializer for SearchEngine should pre-compute an inverse index. An inverse index is just a fancy term for a dictionary which maps from a term (string) to a list of Documents which contain that term. The visualization below describes what the inverse index should look like for a few example documents.

inverse index

When you are constructing the inverse index in SearchEngine, you should avoid re-implementing any behavior that the Document can provide for you.

Calculating the IDF

Write a function _calculate_idf which takes a term as a string argument and returns the score for that term over all documents managed by the SearchEngine. The IDF represents the inverse document frequency, and is the inverse of the proportion of documents in the SearchEngine that contain the given term. The IDF should be calculated as follows:

\( \begin{equation*} IDF(t) = \begin{cases} 0 & t \text{ does not appear in the Search Engine}\\ ln\left(\frac{\text{total number of documents in Search Engine}}{\text{number of documents in Search Engine containing term } t}\right) & \text{otherwise} \end{cases} \end{equation*}\)

Where \( ln \) is the natural logarithm. This can be computed using the log function in the math module.

Note: _calculate_idf is a private function, which will not be used directly by the client. Because it's defined in the SearchEngine class, you should use it to help you implement the search function. We will expect that you have this private function with the behavior described so that we can test your code. As mentioned in lecture, it's generally a bad practice for clients to look at private functions, but we are making an exception for our tests so we can better test your code.

The Search Function

Write a function search which takes a string as an argument and returns a list of document names. The documents should be sorted in descending order of TF-IDF. If no document in your SearchEngine contains the term, the function should return None. The search function should be case-insensitive, and ignore any punctuation in the given term.

Computing the TF-IDF

The TF-IDF for term t on document D is defined as

\( TFIDF(t, D) = TF(t, D) \cdot IDF(t) \)

The Algorithm

To generate a ranked list of documents

  • Compute the TF-IDF score for each document that contains the given term. You will probably want to store these scores in a list of tuples each containing the name of the document and its score.
  • Sort the documents in descending order of TF-IDF.
  • Return the document names as a list.

Example

It will help to understand a full-example of what the search algorithm should return for an example corpus. Consider we had the following three files in a directory test-files (shown as file name / content pairs). Notice these documents are slightly different than the last part.

  • test-files/doc1.txt: Dogs are the greatest pets.
  • test-files/doc2.txt: Cats seem pretty okay.
  • test-files/doc3.txt: I love dogs!

Then, if we were to search for the term 'dogs', the search algorithm would follow the following steps:

  1. Find all the documents that contain the word 'dogs'. In this case, the relevant files are test-files/doc1.txt and test-files/doc3.txt.
  2. Compute the TF-IDF score for each relevant document:
    • Compute TF-IDF for 'dogs' in test-files/doc1.txt, which in this example is 0.081. This is because the IDF of the term 'dogs' in the corpus is math.log(3/2) and the TF of the term 'dogs' in test-files/doc1.txt is 1/5.
    • Compute TF-IDF for 'dogs' in test-files/doc3.txt, which in this example is 0.135. This is because the IDF of the term 'dogs' in the corpus is math.log(3/2) and the TF of the term 'dogs' in test-files/doc3.txt is 1/3.
  3. We recommend storing this in a list of tuples so we can sort it in the next step. In this example, this list would look like [('test-files/doc1.txt', 0.081), ('test-files/doc3.txt', 0.135)].
  4. Then we sort the tuples so they are sorted by TF-IDF in descending order. In this example, this sorted list would look like [('test-files/doc3.txt', 0.135), ('test-files/doc1.txt', 0.081)].
  5. Then we can return the result ['test-files/doc3.txt', 'test-files/doc1.txt'].

Testing

Thinking forward to Part 3, you can now start thinking about testing your SearchEngine class to gain confidence your solution is correct. This is a good idea to start now so you can identify any bugs here, rather than trying to find them when implementing the multi-word queries in SearchEngine.

At this point, your SearchEngine is complete enough to run a query with a single term in it. It is hard to make a test on the entire wikipedia dataset (since you don't know what the expected answer is), so making a test-case for the example shown above would be a great starting point!