CSE 163, Spring 2019: Homework 4: Part 0

The Document Class

The Document class represents a single file in the SearchEngine, and will include functionality to compute the term frequency of a term in the document.

Expectations

  • You should create and implement a Python file called document.py.
  • For this part of the assignment, you may import and use the re module (the use of re will be described later in the spec), but you may not use any other imports.
  • The first line of your file should be a comment with your uwnetid.
  • Your field(s) in this class should be private.

Constructing a Document

The constructor for the Document class should take a single file name as an argument. A new Document should be created to manage the terms in the original file. If the given file is empty, the Document that is constructed should represent an empty file. You may assume the given string represents a valid file.

Precomputing

Computing a term frequency on demand is relatively inefficient, because we must iterate over the entire document to calculate the frequency of a single word. To make this process more efficient, we can pre-compute the term frequencies of a document. The constructor should construct a dictionary that maps each term (a string) to its frequency in the document (a float).

When you are constructing your dictionary of terms, all terms should be case insensitive and ignore punctuation. This means that “corgi”, “CoRgi”, and “corgi!!” are all considered the same term. You can use the following regular expression to remove punctuation from a string:

token = re.sub(r'\W+', '', token)

This removes every character in token that is not a letter, digit, or underscore, which includes all punctuation. You must import the re module to use this regular expression.
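As a minimal sketch of this normalization, the snippet below applies the regular expression from the paragraph above and then lowercases the result. The helper name normalize is hypothetical and not part of the required interface, and using str.lower() for case insensitivity is an assumption; the spec only requires that terms be compared without regard to case.

```python
import re


def normalize(token):
    """Return token without non-word characters, in lowercase.

    Hypothetical helper for illustration; the name is not required
    by the assignment.
    """
    # Remove every run of characters that is not a letter, digit, or underscore
    token = re.sub(r'\W+', '', token)
    # Assumption: lowercasing is how case insensitivity is achieved
    return token.lower()


# All three spellings collapse to the same term
print(normalize("corgi"), normalize("CoRgi"), normalize("corgi!!"))
```

Note that a token made up entirely of punctuation becomes the empty string under this approach.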

Computing Term Frequency

The term frequency is defined as:

\( TF(t) = \frac{\text{number of occurrences of term } t \text{ in the document}}{\text{number of words in the document}} \)
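The formula above can be sketched as a two-step computation: count each term, then divide each count by the total number of words. This is only an illustration under the assumption that the words have already been normalized and collected into a list; the function name compute_frequencies and the sample words are hypothetical.

```python
def compute_frequencies(words):
    """Map each term in words to (occurrences of term) / (number of words).

    Hypothetical helper; assumes words is a list of already-normalized terms.
    """
    counts = {}
    for word in words:
        # Count how many times each term occurs
        counts[word] = counts.get(word, 0) + 1

    total = len(words)
    # An empty list produces an empty dictionary, so no division by zero occurs
    return {word: count / total for word, count in counts.items()}


sample = ["dog", "cat", "dog", "dog", "bird"]
print(compute_frequencies(sample))  # {'dog': 0.6, 'cat': 0.2, 'bird': 0.2}
```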

For example, if we had the document

the cutest cutest dog

then the frequencies for the document would be

"the"    : 0.25 # 1 / 4
"cutest" : 0.5  # 2 / 4
"dog"    : 0.25 # 1 / 4

Term Frequency

Write a function called term_frequency that takes a term as a parameter and returns the frequency of the given term in the document. This function should not do any actual computation, but instead use the pre-computed dictionary. If the given term does not exist in the document, the function should return 0.

The term_frequency function should be case-insensitive and should remove punctuation from the given word, using the same regular expression as the constructor.
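A lookup of this kind might look like the sketch below. It is written as a standalone function over a plain dictionary rather than as a method, so the name lookup_frequency and the frequencies parameter are hypothetical; lowercasing with str.lower() is likewise an assumption about how case insensitivity is handled.

```python
import re


def lookup_frequency(frequencies, term):
    """Return the pre-computed frequency of term, or 0 if it is absent.

    Hypothetical standalone version; frequencies is assumed to be a
    dictionary from normalized terms to floats.
    """
    # Normalize the query the same way the stored terms were normalized
    term = re.sub(r'\W+', '', term).lower()
    # dict.get returns the default (0) when the term is not a key
    return frequencies.get(term, 0)


frequencies = {"the": 0.25, "cutest": 0.5, "dog": 0.25}
print(lookup_frequency(frequencies, "Cutest!!"))  # 0.5
print(lookup_frequency(frequencies, "corgi"))     # 0
```

Because the work is a single dictionary lookup, no pass over the document is needed at query time.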

Getting the Words

Write a function called get_words that returns a list of all the words in this document. If a word appears more than once in the document, it should appear in the list only once.

Note: You should not iterate over the file to compute the words list.
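One way to satisfy this note, assuming the pre-computed dictionary is keyed by term, is to build the list from the dictionary's keys. Dictionary keys are unique, so duplicates are excluded without reading the file again. The function name unique_words and its parameter are hypothetical.

```python
def unique_words(frequencies):
    """Return the distinct terms stored in a term-frequency dictionary.

    Hypothetical helper; frequencies is assumed to map terms to floats.
    """
    # Converting a dictionary to a list yields its keys, each exactly once
    return list(frequencies)


frequencies = {"the": 0.25, "cutest": 0.5, "dog": 0.25}
print(unique_words(frequencies))  # ['the', 'cutest', 'dog']
```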