This assignment and its reflection are due by Thursday, Feb 6 at 11:59 pm.
You should submit your finished
document.py
,
search_engine.py
,
and hw4_test.py
on Ed and the reflection on Google Forms
DO NOT submit the wikipedia
dataset.
At a high-level, a search-engine is an algorithm that accepts a query from the user, and computes how relevant a given document is to that query. The search engine can then rank documents according to their relevance.
In this assignment you will implement your own search engine by defining two classes, Document
and SearchEngine
. The Document
class represents a single file in the search engine, and the SearchEngine
class handles the functionality of querying the collection of stored Document
s. The SearchEngine
will use the TF-IDF (term frequency - inverse document frequency) algorithm to compute the relevance of a document to a given term.
The purpose of this assignment is to introduce you to writing cohesive python classes. This will include public methods, private methods, and data structures stored as fields. Additionally, you will work with common indexing algorithms, and consider efficient ways to store data that will be frequently accessed by the TF-IDF algorithm.
You will probably find that this is the most challenging assignment of the whole quarter. This is intended since it is the last assignment that you will be doing without working on your project in parallel. The remaining assignments won't be as complex as this one so it's okay to feel a bit lost the first time you read over this spec. There are a lot of instructions and it gives guidance on how to develop this assignment in parts to this complexity.
After this homework, students will be able to:
Here are some baseline expectations we expect you to meet:
Follow the course collaboration policies
The files included are:
main.py
: A python file which will allow a user to run a new search engine at the command line. More information about this file is included later in the specification.wikipedia
: This directory contains a number of wikipedia files, and your SearchEngine
can be used to query and rank the files.small-wiki
: This directory contains a smaller number of wikipedia files, and your SearchEngine
can be used to query and rank the files.hw4_test.py
: The file you should put your tests for document.py
and search_engine.py
in.cse163_utils.py
: A file where we will store utility functions for helping you write tests.If you are working locally, you should download the starter code hw4.zip and open it as the project in Visual Studio Code.
If you are developing on Ed, you don't need to download anything. An important thing to note, is that on Ed the wikipedia
directory is stored at the path /course/wikipedia
.
To implement the document ranking algorithm you will use the TF-IDF (term frequency- inverse document frequency) algorithm. This is just a fancy way of weighting documents based on their relevance to a search term. TF-IDF is one of the most popular term-weighting schemes - 83% of text-based recommender systems use TF-IDF.
The TF-IDF algorithm consists of two main computations. The following information describes these computations at a high-level, and the exact equations will be included later in the spec.
Term Frequency
The term frequency is computed on a document level, and it represents how often a search term appears in a specific document.
Inverse Document Frequency
Because filler words such as βtheβ are so common, term frequency tends to incorrectly weight documents that use the word βtheβ often. The inverse document frequency is a measure of how often the word appears in the entire set of documents. We then use this value to diminish the weight of terms that occur very frequently.
Note: The instructions for running the main program and testing are not until Part 3, but you might want to read that earlier!
Part 4a: Submit Assignment and Part 4b: Complete Reflection. On Ed, you should submit the files listed at the top of the spec and any test directories you need.
Your submission will be evaluated on the following dimensions
flake8
_
. You are allowed to access the fields directly in your tests since they are something you wrote personally and the tests don't need to behave like a true client should.