CSE373—Data Structures and Algorithms
for Nonmajors
Programming Assignment #3
due: Friday, 2/15/08, 10:00 PM
Introduction:
The problem of reverse lookup is one frequently encountered when working on problems where data maps to data. One example is a phonebook, which is a map of names to phone numbers. If you want to identify the name corresponding to a phone number, you are out of luck unless you search through potentially all entries in the phonebook. Moreover, you often don't want to compute just one reverse lookup – you want to do many.
An alternative is to precompute all reverse lookups into a data
structure and query the data structure later.
The data structure that works really well for this query is the HashMap. The phonebook itself is an example of a TreeMap that maps names to phone numbers (alphabetical order). Order doesn't matter for the reverse lookup, so we can get optimal runtime by mapping phone numbers to names with a HashMap (O(n) to build, O(1) per query). It's really cool that no matter how large the phonebook gets, it is still O(1) to query the reverse lookup.
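To make this concrete, here is a minimal sketch of precomputing a reverse phonebook (assuming java.util.HashMap and java.util.Map are imported; the names and numbers are made up for illustration):

Map<String, String> phonebook = new HashMap<String, String>();
phonebook.put("Alice", "555-1234");
phonebook.put("Bob", "555-9876");

// Precompute every reverse lookup once: O(n) to build.
Map<String, String> reverse = new HashMap<String, String>();
for (Map.Entry<String, String> entry : phonebook.entrySet()) {
    reverse.put(entry.getValue(), entry.getKey());
}

// Each query is now O(1), no matter how big the phonebook gets.
System.out.println(reverse.get("555-9876")); // prints Bob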
For this assignment, we will look at links on webpages. A webpage consists of HTML and likely contains several links to other pages embedded in <a href="http://..."> tags. It's easy to see which pages a webpage links to, but hard to find out which webpages point to it without, again, searching all possible webpages.
One application of finding out who links to a webpage (or how many pages do) is ranking the importance of webpages. The full idea is that a page is important if it is linked to by important webpages. We'll simplify the notion for this assignment and just count links to webpages. (To get an idea of how advanced this can become, think about what policy you'd need for how often to crawl the web for updates, and think how easy it would be for spammers to engineer sets of webpages with artificially high rank, meaning you'd need to detect their trickery.)
The first part of the assignment is to count the number of times a webpage is referenced by another webpage. We'll consider four different types of links: links that are internal to a user's pages (from me to me), internal links that point to a user's home (from me to my home), links from other sites that point to one of a user's pages (from not me to me), and links from other sites that point to a user's home (from not me to my home). I've already done the work of crawling the CSE webpages and extracting the link information. The processLinks method reads that data and gets the toURL and fromURL variables established; you need to fill in the rest. There's a method getName that I wrote that figures out whether a URL belongs to a user and whether it's their homepage.
The idea is that if there is a link from p to q, then we'll have a HashMap<String,LinkData> that maps the toURLs to the different counts (LinkData being a class with four fields, one per count), and we'd do something like

LinkData ld = hashmap.get(q);
if (ld == null) hashmap.put(q, ld = new LinkData());
ld.numLinks++;
Later we could query the hashmap to get the
counts.
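For reference, here is a minimal sketch of what such a class could look like; the actual LinkData in the starter code may differ, and these field names are my own illustration:

public class LinkData {
    public int fromMeToMe;        // internal links to a user's pages
    public int fromMeToMyHome;    // internal links to a user's home
    public int fromOtherToMe;     // links from other sites to a user's pages
    public int fromOtherToMyHome; // links from other sites to a user's home
}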
The next step is to take this data and get ranks. We just made a map of Strings to data containing Integers. But to find out ranks, it would be much nicer to have a map of Integers to Strings. This is another reverse lookup problem. We'll use a TreeMap here because it will give us the ordering (see the descendingKeySet method). (Oops, descendingKeySet is new in Java 1.6; instead, do new TreeMap(Collections.reverseOrder()) and then later just use keySet.)
For each HashMap entry, you find the appropriate number field in the value and use that as the TreeMap key, with the String that was the HashMap key serving as the value. When you iterate over the first N items of the keySet (which is in descending order thanks to the reversed comparator), you can extract the top N ranked usernames and their respective link counts. Because the set of users is much smaller than the set of webpages, we'll leave it up to you whether to compute the TreeMaps on every query or to precompute them as well. The advantage of precomputing the TreeMaps is that if you get multiple requests for the top 10, top 5, top 50, etc., you can just keep reading the data structure again and again because it has already been built. We'll return the top N ranked usernames and counts as a List of Strings – your choice whether to use ArrayLists or LinkedLists.
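Concretely, building one of the rank TreeMaps from the hashmap might look like the following sketch (assuming the java.util imports and the hypothetical fromMeToMe field from the LinkData sketch above; the starter code's names may differ):

TreeMap<Integer, String> ranks =
    new TreeMap<Integer, String>(Collections.reverseOrder());
for (Map.Entry<String, LinkData> entry : hashmap.entrySet()) {
    // Key on the count; the old HashMap key becomes the value.
    // Equal counts collide here, which is acceptable since the
    // handling of ties is undefined (see part 3 below).
    ranks.put(entry.getValue().fromMeToMe, entry.getKey());
}
// ranks.keySet() now iterates from the highest count downward.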
To make testing easier, a small file with fake data is included that is easily readable and analyzable by hand. WebLinks.java has a main method that should give you useful output when run (once you've implemented the necessary code).
Programming (70 points):
You should
use good style, whitespace, naming conventions, etc., and comment your code.
You may not add any other imports or change the signatures for any existing methods. You may add other private static methods and fields. We should still be able to test your code by creating a Scanner, calling processLinks, and then calling the various methods. You are not required to use LinkData and hashmap and may change them.
You may
modify main, but take advantage of the comments showing the expected output
that I provided.
1) Finish implementing processLinks to preprocess
the links into counts of their reverse lookups.
Use the getName method to extract the username from the URL and see if
it’s their home folder. You should never
store the links as a whole (they’re really large) and you should only read the
data once.
2) Implement the four numLinksFrom…
methods. There will be some repetition
here.
3) Implement the four topRanked… methods. There will be some repetition, but you can extract pulling the top N out of the TreeMap into a common helper method (a sketch of one possible helper follows below). If N is bigger than the number of entries, then just stop at the number of entries. If there are ties, the exact ordering of the ties is undefined; return the first N of whomever is lucky.
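For part 3, one possible shape for the shared helper is sketched here; the method name and output format are illustrative, not required:

private static List<String> topN(TreeMap<Integer, String> ranks, int n) {
    List<String> result = new ArrayList<String>();
    for (Integer count : ranks.keySet()) {
        if (result.size() == n) break; // collected the top N
        result.add(ranks.get(count) + " " + count);
    }
    // If the map had fewer than N entries, the loop simply ran out,
    // so we stop at the number of entries as required.
    return result;
}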
Turn in
your work electronically from the “assignments” link on the class webpage.