University of Washington Computer Science & Engineering
 CSE 490h: Problem-solving on large-scale clusters: theory and applications
Summary:
In this project, you will get an introduction to some of the concepts involved in indexing and ranking web documents. You will use MapReduce to perform a task called "offline query processing": given a query and a repository, determine which document(s) best match that query.

Details:
In a traditional search engine, doing this in real time is important. To achieve this, an inverted index is used to quickly tell which documents match a query, and that index contains just enough information to score the documents. You generally have the same information about every document: you know which documents contain a word, but you wouldn't know, for example, what the adjacent words in the document were.

Processing queries offline lets you "index" whatever information you want about each individual document. You can think of it as selectively indexing just the documents that match the query, and then using that temporary index to score them. In theory, you should be able to do a much better job than a traditional real-time index/query system.

Once you have a repository, there are two main steps in ranking. First, you need to figure out which documents you want to score. This is called document selection. It generally means "which documents contain all the words in my query," but don't forget about OR queries, etc. Once you have a set of documents to score, you need to look at the features of each document and come up with a score representing how well that document matches the query. This is called scoring.
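As a rough sketch of document selection (these helper names are made up for illustration, not part of the starter code), an AND query requires every term to appear in the document, while an OR query requires at least one:

```python
# Hypothetical sketch of document selection; not part of the starter code.

def matches_and(doc_terms, query_terms):
    # AND semantics: every query term must appear in the document.
    return set(query_terms).issubset(doc_terms)

def matches_or(doc_terms, query_terms):
    # OR semantics: at least one query term must appear in the document.
    return any(term in doc_terms for term in query_terms)

doc = set("the quick brown fox jumps over the lazy dog".split())
print(matches_and(doc, ["quick", "dog"]))  # True
print(matches_and(doc, ["quick", "cat"]))  # False
print(matches_or(doc, ["quick", "cat"]))   # True
```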

How to get started:
We are going to give you some starter code. This code will
1) Read in the repository and query
2) For each document, determine if it matches the query
3) If so, emit a term vector for that document. A term vector is a mapping of terms to the number of times that term occurred.
4) Score documents.
5) Sort the documents by score.

Let's talk a little bit more about each step.
1) Read in the repository. A repository is just a collection of documents, in this case web pages found on the internet. You will want to set up your MapReduce to use the repository as one input and a text file containing your query as another.
2) For each document, determine if it matches the query. In a simple query language, one in which all terms are ANDed together, checking whether a document matches is very easy. Simply create a hash table that maps query terms to bits, start all the bits at 0, and as you process the document, check whether the hash table contains each term. If it does, flip that bit to 1 and move on. If, after processing the whole document, all the values in your hash table are 1, you have a match.
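In Python, that bit-table check might look like the following (a sketch; the function name is made up):

```python
def matches_query(document_terms, query_terms):
    # One bit per query term, all starting at 0.
    bits = {term: 0 for term in query_terms}
    for term in document_terms:
        if term in bits:
            bits[term] = 1  # flip the bit the first time the term appears
    # The document matches only if every bit ended up at 1.
    return all(bits.values())

print(matches_query(["large", "scale", "clusters"], ["large", "clusters"]))  # True
print(matches_query(["large", "scale"], ["large", "clusters"]))              # False
```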
3) If so, emit a term vector for that document. Generating the term vector can be done at the same time. Create a second hash table that maps terms to ints. While considering each term, look it up in this table: if it doesn't exist, insert it with a value of 1; if it does, increment the value by 1. When you are done, you know how many times each word occurred in this document. When you emit this vector, the key should be the document (URL) this feature is relevant to, and the value should be the feature itself (the term vector).
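Building the term vector is a straightforward counting loop; a minimal sketch (the emitted key/value pairing follows the description above):

```python
def term_vector(document_terms):
    # Map each term to the number of times it occurs in the document.
    counts = {}
    for term in document_terms:
        counts[term] = counts.get(term, 0) + 1
    return counts

# In the map phase you would emit (url, term_vector(terms)) so that all
# features for a document arrive at the same reducer.
print(term_vector(["to", "be", "or", "not", "to", "be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```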
4) Score documents. In this most simple example, we say a document is a good match if the query terms occur a lot. This is actually not a good method, but it makes an easy baseline.
5) When you are done, you will have a mapping of documents to their scores. Run another quick MapReduce to sort these by score and look at the first few entries.

Mandatory Extensions:
1) Use anchor text as a feature. While processing page A, when you see a link to B with anchor text T, emit a feature <B, T> for page B. T can then be used while scoring page B even though you found it on page A.
2) Use one additional feature of your own invention.
3) Develop a scoring algorithm that uses 1 and 2.
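Extension 1 can be sketched as a map-side step that scans page A's HTML for links and emits one <B, T> pair per link. The regex below is a deliberate simplification that only handles double-quoted hrefs; real pages are messy enough that an HTML parser is the safer choice:

```python
import re

# Simplified link pattern: only matches <a href="..."> with double quotes.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

def emit_anchor_features(page_html):
    # Yield one <B, T> pair (target URL, anchor text) per link on the page.
    for match in LINK_RE.finditer(page_html):
        yield (match.group(1), match.group(2))

page_a = '<p>See <a href="http://b.example/">fuzzy kittens</a> for more.</p>'
print(list(emit_anchor_features(page_a)))
# [('http://b.example/', 'fuzzy kittens')]
```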

Optional Extensions:
1) Enhanced query language. Just using AND to combine terms can get kind of boring. You could implement a more robust query language that allows for things like OR, regular expressions, numeric ranges, or stems (look up the Porter stemming algorithm).
2) Score multiple queries at once. How much more work do you think it is to process 100 queries than 1? How does this change your Reduce?
3) Other link-based features. Can you think of other link-based features? Maybe something as simple as counting in/out links, or something as complicated as computing PageRank and combining it with extension 4.
4) Multiple passes. Can you do more in two MapReduce passes than in one?