The Document class represents a single file in the SearchEngine, and will include functionality to compute the term frequency of a term in the document.
Implement the class in document.py. You may import the re module (the use of re will be described later in the spec), but you may not use any other imports.

The initializer for the Document class should take a single file name as an argument. A new Document should be created to manage the terms in the original file. If the given file is empty, the Document that is constructed should represent an empty file. You may assume the given string represents a valid file.
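As a rough illustration of the initializer, here is a minimal sketch. It rests on several assumptions that this section does not confirm: that terms are whitespace-separated tokens, that the term frequency of a word is its count divided by the total number of words in the file, and that the pattern \W+ is an acceptable way to strip punctuation. The attribute names _strip and _frequencies are hypothetical.

```python
import re


class Document:
    def __init__(self, path):
        # Assumption: this pattern stands in for the regular expression
        # the spec describes elsewhere; the real one may differ.
        self._strip = re.compile(r'\W+')

        counts = {}
        total = 0
        with open(path) as f:
            for line in f:
                for token in line.split():
                    word = self._strip.sub('', token.lower())
                    # Assumption: tokens that are pure punctuation are skipped.
                    if word:
                        counts[word] = counts.get(word, 0) + 1
                        total += 1

        # For an empty file, counts is empty, so no division happens
        # and the dictionary is simply empty.
        self._frequencies = {word: count / total
                             for word, count in counts.items()}
```

Computing the dictionary once in the initializer means later lookups never need to touch the file again.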
Write a function called term_frequency that takes a term as a parameter and returns the term-frequency of the given term in the document. This function should not do any actual computation, but instead use the pre-computed dictionary. If the given term does not exist in the document, the function should return 0. Using the example in the blue box above, doc.term_frequency('dog') should return 0.25.
The term_frequency function should be case-insensitive and should remove punctuation from the given word, using the regular expressions defined in the initializer.
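The lookup behavior described above can be sketched as a plain function over a pre-computed dictionary. This is only an illustration: in your class it would be a method that reads the dictionary stored by the initializer. The names STRIP_PATTERN and normalize, the \W+ pattern, and the sample dictionary are all assumptions made for the example.

```python
import re

# Hypothetical pattern; the regular expression given in the spec may differ.
STRIP_PATTERN = re.compile(r'\W+')


def normalize(term):
    # Lowercase the term and strip non-word characters.
    return STRIP_PATTERN.sub('', term.lower())


def term_frequency(frequencies, term):
    # No counting happens here: the value is looked up in the
    # pre-computed dictionary, with 0 for terms that are absent.
    return frequencies.get(normalize(term), 0)


# Made-up dictionary for illustration only.
frequencies = {'the': 0.5, 'cat': 0.25, 'dog': 0.25}
print(term_frequency(frequencies, 'Dog!'))  # 0.25
print(term_frequency(frequencies, 'bird'))  # 0
```

Normalizing the query the same way the file's words were normalized is what makes 'Dog!' and 'dog' resolve to the same entry.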
Write a function get_words which returns a list of all the words in this document. If there are duplicate words in the document, they should only appear in the list once.
Note: You should not re-read the file in this method to compute the words list.
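One possible way to satisfy both requirements, assuming the term frequencies are already stored in a dictionary keyed by word, is to return the dictionary's keys. The sketch below uses a standalone function and a made-up dictionary for illustration; in your class this would be a method reading the stored dictionary.

```python
def get_words(frequencies):
    # Dictionary keys are unique, so each word appears exactly once,
    # and nothing is re-read from the file.
    return list(frequencies)


# Made-up dictionary for illustration only.
frequencies = {'the': 0.5, 'cat': 0.25, 'dog': 0.25}
words = get_words(frequencies)
```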
Thinking forward to Part 3, you can now start testing your Document class to gain confidence that your solution is correct. It is a good idea to start now so you can identify any bugs here, rather than trying to find them when implementing the SearchEngine.
At this point, your Document is fully completed. This means you should be able to compute some term-frequencies for small files you create and test any behaviors of this class described here.
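A small-file test along these lines might look like the sketch below. The helper name check_document is hypothetical, and the expected values assume that term frequency is a word's count divided by the total number of words and that punctuation and case are ignored. Adjust them to match the definitions in the spec. The test script may use imports that document.py itself may not.

```python
import os
import tempfile


def check_document(document_cls):
    """Run a few small checks against a Document-like class."""
    with tempfile.TemporaryDirectory() as folder:
        # A four-word file, so every frequency is an exact fraction.
        path = os.path.join(folder, 'small.txt')
        with open(path, 'w') as f:
            f.write('The cat, the dog!\n')

        doc = document_cls(path)
        assert doc.term_frequency('dog') == 0.25
        assert doc.term_frequency('THE') == 0.5   # case-insensitive
        assert doc.term_frequency('bird') == 0    # term not in the file
        assert sorted(doc.get_words()) == ['cat', 'dog', 'the']

        # An empty file should produce an empty document.
        empty_path = os.path.join(folder, 'empty.txt')
        open(empty_path, 'w').close()
        empty = document_cls(empty_path)
        assert empty.get_words() == []
        assert empty.term_frequency('dog') == 0
```

Calling check_document(Document) after importing your class runs all of the checks; an AssertionError points at the first behavior that does not match.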