CSE373—Data Structures and Algorithms for Nonmajors

Programming Assignment #3

due: Friday, 2/15/08, 10:00 PM

Introduction:

The problem of reverse lookup comes up frequently when working with data that maps to other data.  One example is a phonebook, which is a map from names to phone numbers.  If you want to identify the name corresponding to a phone number, you are out of luck unless you search through potentially every entry in the phonebook.  Often, though, you don't want to compute just one reverse lookup; you want to do many.  An alternative is to precompute all the reverse lookups into a data structure and query that data structure later.

The data structure that works really well for this query is the HashMap.  The phonebook itself is an example of a TreeMap that maps names to phone numbers (in alphabetical order).  Order doesn't matter for the reverse lookup, so we can get optimal runtime by mapping phone numbers to names with a HashMap (O(n) to build, O(1) expected per query).  It's really cool that no matter how large the phonebook gets, a reverse lookup is still O(1).
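As a minimal sketch of the idea (the names and numbers here are made up for illustration), one O(n) pass over the forward map builds the reverse map, and every later query is a constant-time get:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class ReversePhonebook {
    // One O(n) pass over the forward map builds the reverse map;
    // each later lookup is then O(1) expected time.
    static Map<String, String> reverse(Map<String, String> forward) {
        Map<String, String> rev = new HashMap<String, String>();
        for (Map.Entry<String, String> e : forward.entrySet()) {
            rev.put(e.getValue(), e.getKey());
        }
        return rev;
    }

    public static void main(String[] args) {
        // The forward "phonebook": a TreeMap keeps names in alphabetical order
        Map<String, String> phonebook = new TreeMap<String, String>();
        phonebook.put("Alice", "555-0101");
        phonebook.put("Bob", "555-0102");

        Map<String, String> byNumber = reverse(phonebook);
        System.out.println(byNumber.get("555-0102")); // prints Bob
    }
}
```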

For this assignment, we will look at links on webpages.  A webpage consists of HTML and likely contains several links to other pages embedded in <a href="http://..."> tags.  But it's hard to find out which webpages point to a given page without, again, searching all possible webpages.  One application of finding out who links to a webpage (or how many pages do) is as part of the process of ranking the importance of webpages.  The full idea is that a page is important if it is linked to by important webpages.  We'll simplify that notion for this assignment and just count links to webpages.  (To get an idea of how advanced this can become, think about how you'd need a policy for how often to crawl the web for updates, and how easy it would be for spammers to engineer sets of webpages with artificially high rank, meaning you'd need to detect their trickery.)

The first part of the assignment is to count the number of times a webpage is referenced by another webpage.  We'll consider four different types of links: links that are internal to a user's pages (from me to me), internal links that point to a user's home (from me to my home), links from other sites that point to one of a user's pages (from not me to me), and links from other sites that point to a user's home (from not me to my home).  I've already done the work of crawling the CSE webpages and extracting the link information.  The processLinks method reads that data and sets up the toURL and fromURL variables; you need to fill in the rest.  There's a method getName, which I wrote, that figures out whether a URL belongs to a user and whether it's their homepage.

The idea is that if there is a link from p to q, then we'll have a HashMap<String,LinkData> that maps each toURL to its counts (LinkData being a class with four fields, one per count), and we'd do something like

    LinkData ld = hashmap.get(q);
    if (ld == null) {
        hashmap.put(q, ld = new LinkData());
    }
    ld.numLinks++;

Later we could query the hashmap to get the counts.
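To show how the pieces could fit together, here is a hedged sketch of the counting step.  The class name LinkCounter, the method recordLink, the LinkData field names, and the two booleans (standing in for what getName would tell you about the from and to URLs) are all illustrative; your starter code's names may differ:

```java
import java.util.HashMap;
import java.util.Map;

public class LinkCounter {
    // Illustrative stand-in for the assignment's LinkData:
    // one counter field per link type.
    static class LinkData {
        int meToMe, meToMyHome, notMeToMe, notMeToMyHome;
    }

    // Maps each toURL to its four counts
    final Map<String, LinkData> counts = new HashMap<String, LinkData>();

    // Record one link to toURL.  sameUser and toHome stand in for
    // what the assignment's getName method would determine.
    void recordLink(String toURL, boolean sameUser, boolean toHome) {
        LinkData ld = counts.get(toURL);
        if (ld == null) {
            counts.put(toURL, ld = new LinkData());
        }
        if (sameUser) {
            if (toHome) ld.meToMyHome++; else ld.meToMe++;
        } else {
            if (toHome) ld.notMeToMyHome++; else ld.notMeToMe++;
        }
    }
}
```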

The next step is to take this data and compute ranks.  We just made a map of Strings to data containing Integers, but to find ranks, it would be much nicer to have a map of Integers to Strings.  This is another reverse lookup problem.  We'll use a TreeMap here because it gives us the ordering.  (The descendingKeySet method would be handy, but it only exists as of Java 1.6; instead, do new TreeMap(Collections.reverseOrder()) and then later just use keySet.)  For each HashMap entry, you find the appropriate count field in the value and use that as the TreeMap key, with the String that was the HashMap key serving as the value.  When you iterate over the first N keys of this reverse-ordered TreeMap, you can extract the top N ranked usernames and their respective link counts.  Because the set of users is much smaller than the set of webpages, we'll leave it up to you whether to compute the TreeMaps on every query or to precompute them as well.  The advantage of precomputing the TreeMaps is that if you get multiple requests for the top 10, top 5, top 50, etc., you can just read the data structure again and again because it has already been built.  We'll return the top N ranked usernames and counts as a List of Strings; it's your choice whether to use ArrayLists or LinkedLists.
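As a sketch of this second reverse lookup (the method name topRanked and the "username count" output format are assumptions here, not the assignment's actual signatures), the reverse-ordered TreeMap approach might look like:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RankSketch {
    // Reverse-lookup the counts: count -> username, largest count first.
    // Note: with a plain TreeMap, entries with equal counts collide and
    // only one survives; the handout leaves tie handling undefined.
    static List<String> topRanked(Map<String, Integer> counts, int n) {
        TreeMap<Integer, String> byCount =
            new TreeMap<Integer, String>(Collections.reverseOrder());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            byCount.put(e.getValue(), e.getKey());
        }
        List<String> result = new ArrayList<String>();
        for (Map.Entry<Integer, String> e : byCount.entrySet()) {
            if (result.size() == n) break;   // stop after the top n
            result.add(e.getValue() + " " + e.getKey());
        }
        return result;  // fewer than n entries if the map is small
    }
}
```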

To make testing easier, included is a small file with fake data that is easily readable and analyzable by hand.

WebLinks.java has a main method that should give you useful output when run (once you've implemented the necessary code).

Programming (70 points):

You should use good style, whitespace, naming conventions, etc., and comment your code.

You may not add any other imports or change the signatures of any existing methods.  You may add other private static methods and fields.  We should still be able to test your code by creating a Scanner, calling processLinks, and then calling various methods.  You are not required to use LinkData and hashmap, and you may change them.

You may modify main, but take advantage of the comments showing the expected output that I provided.

1) Finish implementing processLinks to preprocess the links into counts of their reverse lookups.  Use the getName method to extract the username from the URL and see if it’s their home folder.  You should never store the links as a whole (they’re really large) and you should only read the data once.

2) Implement the four numLinksFrom… methods.  There will be some repetition here.

3) Implement the four topRanked… methods.  There will be some repetition, but you can extract reading the top N from the TreeMap into a common helper method.  If N is bigger than the number of entries, just stop at the number of entries.  If there are ties, the exact ordering of the ties is undefined; return the first N of whoever is lucky.

Turn in your work electronically from the “assignments” link on the class webpage.