An inverted index
is a mapping of words to their location in a set of documents. Most
modern search engines utilize some form of an inverted index to process
user-submitted queries. In its most basic form, an inverted index is a
simple hash table which maps words in the documents to some sort of
document identifier. For example, if given the following 2 documents:
Doc1:
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
Doc2:
Buffalo are mammals.
we could construct the following inverted file index:
Buffalo -> Doc1, Doc2
buffalo -> Doc1
buffalo. -> Doc1
are -> Doc2
mammals. -> Doc2
For this project, you will work individually to construct a simple
inverted file index to hash words to a document ID. You will turn
in your Mapper
(s) and Reducer
(s), and a short
writeup with your findings.
Note that we are not specifying what corpus you should index,
to give you the chance to pick a corpus which is interesting to you. Note
also that the input formats between various corpora differ; it is not
unreasonable to expect that your Mapper
s and
Reducer
s cannot not be used as-is between shakespeare
and the web crawl.
Load a test corpus into HDFS; this corpus can consist of any documents you
desire. We suggest creating at least one test corpus not larger
than a few hundred words, as a small test corpus might help you debug your
Mapper
and Reducer
classes if necessary. The
final Mapper
s and Reducer
s which you turn in are
not required to be compatible with the test corpus from part 0.
Please describe your test corpus (including HDFS path) in the write-up specified below.
Some words are so common that their presence in an inverted index is "noise" -- they can obfuscate the more interesting properties of that document. For example, the words "the", "a", "and", "of", "in", and "for" occur in almost every English document. For the first part of this project, run a word count over an interesting corpus (eg, Shakespeare or the web crawl data) to identify any such words.
In the required write-up, please explain how you determined a word was "noisy".
For this portion of the assignment, you will design a MapReduce-based algorithm to calculate the inverted index over the corpus you used in Part 1. Your final inverted index should not contain the words identified in part 1 of this project. How you choose to remove those words is up to you; one possibility is to create multiple MapReduce passes, but there are many possible ways to do this. Also, the format of your MapReduce output (ie, the inverted index) must be simple enough to be machine-parseable; it is not impossible to imagine your index being one of many data structures used in a search engine's indexing pipeline. Lastly, your submitted indexer should be able to run successfully on the corpus specified in your writeup. "Successfully" means it should run to completion without errors or exceptions, and generate the correct word->DocID mapping.
You will be required to turn in all relevant Mapper
and Reducer
Java files, in addition to any supporting
code or utilities. Your write-up should specify what corpus your
Mapper
(s) and Reducer
(s) are designed for.
In addition to the requirements detailed above, you will need to implement at least one extension of your choice. This extension may be as simple or as complex as you'd like; we primarily want to give you a chance to play with the MapReduce system. We have provided some sample extensions below, but don't let our suggestions limit your creativity!
Please describe what extension(s) you chose to do in the final write-up.
Please use the turnin
utility
(described
here) to turn in all your code plus a .txt file with the answers
the below questions:
Mapper
s and Reducer
s
designed for?
The project name in the turnin
system is
"project1
". The project is due Thursday, Jan 18th at 6 PM.
Please include your source files, your jar file, as well as the dfs location of your
input (and output) files. Tar up a directory containing these, and run turnin.
We have provided a list of extensions below; however, feel free to do extensions not in this list.
<B>Bolded
text</B>
" should parse as "Bolded
" and
"text
", not "<B>Bolded
"
and "text</B>
"