CSE 163, Spring 2019: Homework 4: Python Classes and Search Engines

Submission

This assignment and its reflection are due by Thursday, May 2 at 11:59 pm.

You should submit your finished document.py, search_engine.py, and hw4_test.py on Gradescope as a zip file (described below) and the reflection on Google Forms

Overview

At a high-level, a search-engine is an algorithm that accepts a query from the user, and computes how relevant a given document is to that query. The search engine can then rank documents according to their relevance.

In this assignment you will implement your own search engine by defining two classes, document.pyand search_engine.py. The Document class represents a single file in the search engine, and the SearchEngine class handles the functionality of querying the collection of stored Documents. The SearchEngine will use the TF-IDF (term frequency - inverse document frequency) algorithm to compute the relevance of a document to a given term.

The purpose of this assignment is to introduce you to writing cohesive python classes. This will include public methods, private methods, and data structures stored as fields. Additionally, you will work with common indexing algorithms, and consider efficient ways to store data that will be frequently accessed by the TF-IDF algorithm.

Learning Objectives

After this homework, students will be able to:

  • Write a complete Python class from scratch.
    • This will include writing two classes which interact with each other.
  • Implement basic Python classes to represent Documents and SearchEngines.
  • Identify appropriate data structures to store/access term indexes efficiently.
  • Understand fundamental document and term ranking algorithms to implement a basic search engine.
    • Specifically the TF-IDF algorithm which computes term-frequency and inverse document frequency.

Expectations

Here are some baseline expectations we expect you to meet:

Files

You should download the starter code hw4.zip and open it as the project in Visual Studio Code. The files included are:

  • main.py: A python file which will allow a user to run a new search engine at the command line. More information about this file is included later in the specification.
  • wikipedia: This directory contains a number of wikipedia files, and your SearchEngine can be used to query and rank the files.
  • hw4_test.py: The file you should put your tests for document.py and search_engine.py in.
  • cse163_utils.py: A file where we will store utility functions for helping you write tests.

The Algorithm

To implement the document ranking algorithm you will use the TF-IDF (term frequency- inverse document frequency) algorithm. This is just a fancy way of weighting documents based on their relevance to a search term. TF-IDF is one of the most popular term-weighting schemes - 83% of text-based recommender systems use TF-IDF.

The TF-IDF algorithm consists of two main computations. The following information describes these computations at a high-level, and the exact equations will be included later in the spec.

Term Frequency

The term frequency is computed on a document level, and it represents how often a search term appears in a specific document.

Inverse Document Frequency

Because filler words such as β€œthe” are so common, term frequency tends to incorrectly weight documents that use the word β€œthe” often. The inverse document frequency is a measure of how often the word appears in the entire set of documents. We then use this value to diminish the weight fo terms that occur very frequently.

Optional: TF-IDF

You are only required to understand the information about the algorithm covered in the spec. For those interested, more information about TF-IDF can be found here.

Table of Contents

  • Part 0: The Document Class

  • Part 1: The SearchEngine Class (single-word queries)

  • Part 2: The SearchEngine Class (multi-word queries)

  • Part 3: Testing and Running the SearchEngine

  • Part 4a: Submit Assignment and Part 4b: Complete Reflection. On Gradescope, you should submit a zip of your hw4 directory.

    Note: We are asking you to turn in your assignment in a different format this time. Because there are so many files and you might want to create your own directories containing text for testing, we are asking you to use zip to turn in the assignment. You should zip your hw4 folder and submit that to Gradescope (instructions below). Your zip archive should have the following structure when we unzip it (this should require no extra work if you just zip the hw4 folder). You can always test this by moving your zip archive to another directory, and unzipping it there to see if the archive had the correct structure.

    hw4
    β”œβ”€β”€ cse163_utils.py
    β”œβ”€β”€ document.py
    β”œβ”€β”€ hw4_test.py
    β”œβ”€β”€ main.py
    β”œβ”€β”€ search_engine.py
    β”œβ”€β”€ test_dir1
    β”‚Β Β  └── ...
    └── ...

    Where test_dir1 is one of the directory of test documents for your tests. You don't have to use that name, but you should incldue those directories of documents for testing.

    UPDATE: You do NOT need to include the wikipedia dataset in your submission. To accomplish this, move the wikipedia folder outside the hw4 folder before you zip it. You will probably want to move it back after you finish the zip so you can continue to develop your code.

How to zip

zip is a program to compress files into an archive for easily sharing files. Most modern operating systems have the ability to zip folders built in that you can use in the Finder/File Explorer. For both Mac and Windows, navigate to the folder that contains your hw4 directory in your Finder/File Explorer.

  • Mac: Right-click the hw4 folder and select Compress "hw4". This should create a file called hw4.zip that you can submit.

    You can view the contents of the archive by double-clicking on hw4.zip and it will unzip the archive; note that if you do this in the same directory as your hw4 directory, it will create a new directory called hw4 2 as to not overwrite your directory.

  • Windows: Right-click the hw4 folder and select Send to and then Compressed (zipped) folder. This should create a file called hw4.zip that you can submit.

    You can view the contents of the archive by double-clicking on hw4.zip and it will show you a preview of the archive (it should start by containing a hw4 directory).

Evaluation

Your submission will be evaluated on the following dimensions

  • Your solution correctly implements the described behaviors. You will not have access to tests when you turn in your assignment, All behavior we test is completely described by the problem specification or shown in an example.
  • The first line of search_engine.py and document.py is a comment with your uwnetid.
  • Your code meets our style requirements:
    • All code files submitted pass flake8
    • Every function written is commented, in your own words, using a doc-string format that describes its behavior, parameters, returns, and highlights any special cases.
    • There is a comment at the top of each code file you write with your name, section, and a brief description of what that program does.
    • Any expectations on this page or the sub-pages for the assignment are met as well as all requirements for each of the problems are met.
    • The fields of all your classes should be private (by having the names start with a _. You are allowed to access the fields directly in your tests since they are something you wrote personally and the tests don't need to behave like a true client should.