CSE 163, Spring 2019: Homework 4: Part 1

The SearchEngine Class (single word queries)

The SearchEngine class will manage a collection of Documents, and handle computing the relevance of a Documents to a given term. To do this, the SearchEngine will compute the TF-IDF of a given term for each Document which contains the term.

Part of the development stragety for implementing search_engine.py is building the solution up in parts. In Part 1, you will implement a search_engine.py that supports only single-word queries. In Part 2, you will extend this to support multi-word queries.

Expectations

  • You should create and implement a python file called search_engine.py
  • For this part of the assignment, you may import and use the math, os, and re modules as well as the document module you wrote for Part 0, but you may not use any other imports.
  • The first line of your file should be a comment with your uwnetid
  • Your field(s) in this class should be private.

Important

We highly recommend reading the entire search_engine.py (single word query) spec, and understanding the algorithm for ranking documents before implementing. It will be challenging to implement the necessary data structures without understanding the context for how they will be used in the search function.

Constructing a SearchEngine

The SearchEngine constructor should take a directory name (string), and construct an inverse index representing all documents in the given directory. You may assume that the given string represents a valid directory, and that the directory contains only valid files.

Iterating over a Directory

The os.listdir() function will return a list of all the files and directories in a given directory.

import os
for file in os.listdir('directory_name'):
    print(file)

    # Output
    # file1
    # file2
    # file3
    
You will use something very similar in your SearchEngine constructor. Each file that is returned by the os.listdir() is represented as only the file name. However, when you pass the file name into the Document constructor, it will try to open the file based on it's name. Since the file is in a different directory than the Document it will not be able to open the file. To avoid this, we must pass the path of the file to the Document constructor. Since the directory containing the file will be in the same directory as the Document class, we can pass the directory_name + '/' + file_name to the Document constructor. The directory_name is just the string passed to the SearchEngine constructor.

Inverse Index

When you implement the search function it will be necessary to find all documents which contain the given search term. It is inefficient to iterate over each document in the SearchEngine and check if it contains a given word. Instead, the constructor should precompute an inverse index. An inverse index is just a fancy term for a dictionary which maps from a term (string) to a list of Documents which contain that term. The visualization below descirbes what the inverse index should look like for a few example documents.

inverse index

When you are constructing the inverse index, you should not manually iterate over the words in each Document that is constructed in the SearchEngine class. Instead you should use the methods of the Document class to help you construct the inverse index.

Calculating the IDF

Write a function _calculate_idf which takes a term as a string argument and returns the score for that term over all documents managed by the SearchEngine. The IDF represents the inverse document frequency, and is the inverse of the proportion of documents in the SearchEngine that contain the given term. The IDF should be calculated as follows:

\( \begin{equation*} IDF(t) = \begin{cases} 0 & t \text{ does not appear in the Search Engine}\\ ln(\frac{\text{total number of documents in Search Engine}}{\text{number of documents in Search Engine containing term } t}) & \text{otherwise} \end{cases} \end{equation*}\)

Where \( ln \) is the natural logarithm. This can be computed using the log function in the math module.

Note: _calculate_idf is a private function, which will not be used directly by the client. However, it will be used by the search function. We will expect that you have this private function with the behavior described so that we can test your code. As mentioned in lecture, it's generally a bad practice for clients to look at private functions, but we are making an exception for our tests so we can better test your code.

The Search Function

Write a function search which takes a string as an argument and returns a list of document names. The documents should be sorted in descending order of TF-IDF. If no document in your SearchEngine contains the term, the function should return None. The search function should be case-insensitive, and ignore any punctuation in the given term.

Computing the TF-IDF

The TF-IDF for term t on document D is defined as

\( TFIDF(t, D) = TF(t, D) \cdot IDF(t) \)

The Algorithm

To generate a ranked list of documents

  • Compute the TF-IDF score for each document that contains the given term. You will probably want to store these scores in a list of tuples each containing the name of the document and its score.
  • Sort the documents in descending order of TF-IDF.
  • Return the document names as a list.