Project #1: Crawling the Web


Due Date: Thursday, January 25, in class.

Caution: Be sure to read the entire assignment from start to finish before you start working on it; there are numerous cases where a bad design decision made early on could lead to serious problems at later stages.

Objective: Build a cool music search site - one that offers comprehensive music information. See Gigabeat for a possible role model.

Groups & Collaboration: You are encouraged (though not required) to work in groups on this project. E-mail your TA by Tuesday, January 16, with information about who will be working in which group. (One e-mail per group is requested.) Groups help brainstorm ideas more effectively, reduce workload, and increase efficiency. However, no group may have more than 3 members. Every group member is responsible for knowing, and being able to explain, what the other group members are doing or have done. Discussions among different groups should be limited. Directly copying another group's work (architectures, diagrams, or code) is considered cheating, a very serious offense.

Policies

In this project you're going to construct a web crawler that automatically traverses the web, extracting and storing information about MP3 files. For now, do not concern yourself with fancy user interfaces and/or storing data in databases - you will be asked to add some of that functionality as part of follow-on projects.

The first step in building a successful web crawler is to learn the proper etiquette. Virtually all content providers frown on misuse of their resources by poorly written crawlers that try to download big chunks of their sites in the space of only a few seconds. A large number of hits coming from the same source may even look like a coordinated security attack! So the foremost principle you should keep in mind throughout this project is how to be a well-behaved net citizen: make sure that your (automated) crawlers never hit any one site too frequently and that they respect the access restrictions sites declare (see robots.txt below).

To help you and your crawler learn good manners, we will provide a web server (details to follow) on which you can do limited experiments until you are confident that your code works unobtrusively. Only then should you proceed to hitting real web servers, sure that they will not come after you later. Correctness and safety come first; efficiency only after that!

In addition, content providers might wish to keep automated spiders out of certain parts of their sites. To this end, the robots.txt protocol was created: site owners can specify which parts of the site are accessible to which web robots. A good place to start reading about this is the Web Robots Pages, where you will find links to the robots.txt standard, a list of active robots, and a FAQ. For a quick example, the CSE department has its own robots.txt file.
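
To give you a feel for the format, here is a small, made-up example of a robots.txt file (the paths and robot name are purely illustrative):

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/

    User-agent: BadBot
    Disallow: /

The first record asks all robots to stay out of the two listed directories; the second shuts the robot calling itself BadBot out of the whole site.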

We recommend that you write an API layer on top of HTTP which ensures that you never violate connection-frequency limits or any other robots.txt rules. Failure to observe these principles could have serious consequences.
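
One possible shape for such a layer is sketched below. This is only a starting point under our own assumptions: the class name, the 30-second per-host delay, and leaving the robots.txt check as a TODO are illustrative choices, not part of the assignment.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a "polite" fetch layer: every page request goes through fetch(),
    // which enforces a minimum delay between two hits to the same host.
    public class PoliteFetcher {
        private static final long MIN_DELAY_MS = 30 * 1000;   // illustrative politeness interval
        private final Map<String, Long> lastHit = new HashMap<String, Long>();

        public synchronized InputStream fetch(URL url) throws Exception {
            String host = url.getHost();
            // TODO: consult (and cache) the host's robots.txt here and refuse
            // to fetch URLs that it disallows.
            Long last = lastHit.get(host);
            if (last != null) {
                long wait = last.longValue() + MIN_DELAY_MS - System.currentTimeMillis();
                if (wait > 0) {
                    Thread.sleep(wait);                        // back off before re-hitting this host
                }
            }
            lastHit.put(host, System.currentTimeMillis());
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent", "course-project-crawler");  // say who you are
            return conn.getInputStream();
        }
    }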

Crawling the Web

Once you have read up on how to write a well-behaved robot, you can start thinking about how you are going to search the web. Where will you start? How do you extract the links from a web page? Once you have them, how will you decide which links to follow? How will you extract MP3 information from a web page once you have found it?

Since the dominant language on the web today is HTML (and, to a much smaller extent, XML), you will need to be able to parse HTML files. HTML parsers exist for practically every language, and Java has (at least) one built in: see javax.swing.text.html.parser.DocumentParser for details. This parser lets you register callbacks that are invoked when a tag is opened or closed; you can then build up a tree representation of the document, or process only certain tags, as necessary. Of course, you are welcome to use any HTML parser you wish (provided you give proper credit to its authors) or build one yourself.
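
For instance, here is a rough sketch (one of many possible ways) of pulling href links out of a page with the built-in Swing parser, using javax.swing.text.html.parser.ParserDelegator as the front end to that parser:

    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.List;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    // Collects the href attribute of every <a> tag the parser encounters.
    public class LinkExtractor extends HTMLEditorKit.ParserCallback {
        private final List<String> links = new ArrayList<String>();

        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.A) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        }

        public List<String> getLinks() {
            return links;
        }

        // Parse the document and return all links found in it.
        public static List<String> extractLinks(Reader pageReader) throws Exception {
            LinkExtractor callback = new LinkExtractor();
            new ParserDelegator().parse(pageReader, callback, true);
            return callback.getLinks();
        }
    }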

Having parsed a given HTML document, you should be able to extract from it links to other (HTML) web pages. At this point you must choose which link(s) to follow first - that is, whether to perform a breadth-first or depth-first search. You could also use an existing search engine such as AltaVista or HotBot to find pages likely to contain MP3 files and start your crawl from there.
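
Either way, the skeleton of the crawl is a frontier of URLs still to visit plus a set of URLs already seen. A bare-bones breadth-first sketch follows; fetchAndExtractLinks() is a hypothetical placeholder for whatever fetching and parsing code you build in the previous steps:

    import java.net.URL;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    // Breadth-first crawl skeleton: the queue is the "frontier"; the set keeps
    // us from visiting (and re-requesting) the same page twice.
    public class BfsCrawler {
        public void crawl(URL seed, int maxPages) throws Exception {
            Queue<URL> frontier = new LinkedList<URL>();
            Set<String> seen = new HashSet<String>();
            frontier.add(seed);
            seen.add(seed.toString());

            int visited = 0;
            while (!frontier.isEmpty() && visited < maxPages) {
                URL page = frontier.remove();
                List<URL> outLinks = fetchAndExtractLinks(page);   // politeness + parsing live here
                visited++;
                for (URL link : outLinks) {
                    if (seen.add(link.toString())) {               // add() returns false if already seen
                        frontier.add(link);                        // for depth-first, use a stack instead
                    }
                }
            }
        }

        // Placeholder: fetch the page politely, parse it, return the links it contains.
        protected List<URL> fetchAndExtractLinks(URL page) throws Exception {
            throw new UnsupportedOperationException("plug in your fetcher and parser here");
        }
    }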

Site Features and Extracting Content from the Web

In addition to the links it provides, each web page you parse could also contain relevant information you might wish to store.

One kind of music site is typified by mp3.com. It lists plenty of information about each artist: a history of the artist, pictures, links to CDs for sale, etc. You can browse the collection of artists and tracks by genre or by artist, find other similar songs, see the list of top requested songs, and generally wander around the site browsing its rich content. On the other hand, if you're searching for one particular popular song, you're not likely to find it on mp3.com.

The other kind of site is exemplified by Gigabeat.com. Instead of a broad, browsing-friendly interface, Gigabeat gives you a single search box. Type in the name of your favorite artist or song, and you'll probably find a link to the (undoubtedly pirated) MP3 file. Gigabeat trades off information quality for completeness: they probably have an automated crawler that looks for MP3 files on the web and extracts artist and song-title information - just enough data for people to search on.

Your own creation can draw on any of these features, or you can invent your own. If you're looking for music, what kind of site would you like to visit? What kind of information do you want to see? We're giving you the opportunity to design and build exactly the kind of music resource you wish existed on the web! A useful idea is to try to build a:

Site-independent MP3 crawler. The idea here is to aggregate content from thousands of different sites, extracting Gigabeat-style snippets of information and MP3 links and using heuristic methods to identify promising data. One way to approach this problem would be to look for pages that contain links to MP3 files and then heuristically classify each such link by assigning an artist and song description to it. How would you go about doing this? You could start by looking at the text surrounding the link: consider all the words in the link text and in the text immediately around it, then look up these words and phrases in a list of bands, such as the Ultimate Band List, and see which of them are valid band names. There is much room for experimenting here. But do not worry about the absolute accuracy of the information - this is a hard problem, and solving it completely is not our primary goal at this point.
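
A very rough sketch of that last idea is shown below; it assumes you have already loaded a set of known band names (from the Ultimate Band List or elsewhere) into memory, and the class and method names are hypothetical:

    import java.util.Set;

    // Heuristic labeling of an MP3 link: scan the anchor text and nearby words
    // for a phrase that matches a known band name.
    public class Mp3LinkClassifier {
        private final Set<String> knownBands;   // band names, lower-cased

        public Mp3LinkClassifier(Set<String> knownBands) {
            this.knownBands = knownBands;
        }

        // Returns a guessed artist for the link, or null if nothing matched.
        public String guessArtist(String anchorText, String surroundingText) {
            String candidateText = (anchorText + " " + surroundingText).toLowerCase();
            String[] words = candidateText.split("\\W+");
            // Try phrases up to three words long, preferring the longest match.
            for (int len = 3; len >= 1; len--) {
                for (int i = 0; i + len <= words.length; i++) {
                    StringBuilder phrase = new StringBuilder(words[i]);
                    for (int j = 1; j < len; j++) {
                        phrase.append(' ').append(words[i + j]);
                    }
                    if (knownBands.contains(phrase.toString())) {
                        return phrase.toString();
                    }
                }
            }
            return null;
        }
    }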

The site-independent MP3 crawler approach should be able to get a much more comprehensive and interesting collection of songs than any potential alternative site-specific approach. One disadvantage is that it is more likely to result in noisy (i.e. incorrect) data.

Note: Our grading system will take into account the difficulty of what you attempted, so do not be afraid to experiment with something new. However, we suggest that you check with us first if you have especially bold ideas you wish to pursue - they may not be possible to bring to completion in only a few weeks' time.

Scaling up the Crawler

Let us now imagine that you already have a functional crawler. To build up a large collection of data quickly, a good idea is to "parallelize" your crawler so that it visits multiple sites simultaneously. You can do this by using threads.

Caveat: With multiple threads running at the same time, it is much easier to forget the politeness policies and start sending too many requests (counting the total from all your threads) to a given site within a short time interval. Be careful not to fall victim to this mistake!
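
One simple way to structure the parallel crawler is a fixed number of worker threads all pulling URLs from a single shared frontier. The sketch below is only an illustration; it assumes the per-host rate limiting lives inside processPage() (for example, behind a polite fetch layer like the one sketched earlier), since the limit applies to the total traffic from all threads:

    import java.net.URL;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of a parallel crawler: several worker threads share one frontier.
    // Per-host politeness must still be enforced inside processPage(), because
    // the rate limit applies to the *total* of all threads, not to each one.
    public class ParallelCrawler {
        private final BlockingQueue<URL> frontier = new LinkedBlockingQueue<URL>();

        public void start(int numThreads) {
            for (int i = 0; i < numThreads; i++) {
                Thread worker = new Thread(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                URL next = frontier.take();     // blocks until work is available
                                processPage(next);              // fetch politely, parse, enqueue new links
                            }
                        } catch (InterruptedException e) {
                            // shut this worker down
                        }
                    }
                });
                worker.start();
            }
        }

        public void addUrl(URL url) {
            frontier.add(url);
        }

        // Placeholder for the single-threaded crawling logic developed earlier.
        protected void processPage(URL page) {
            // ...
        }
    }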

Project Specifics

We provide a working crawler, which you are expected to extend. You may choose to write your own "core" crawler (in either Java or C++), but be aware that there is plenty of challenge in this project already, so you might be taking on too much work.

Lucene is an open-source, cross-platform, high-performance text search engine written in Java. Here is a comprehensive description of how to download, install, and run it. You will want to do some experiments to familiarize yourself with how it works and what it gives you. As shipped, Lucene searches files on your local disk. You should add network support (i.e., sending and receiving data over HTTP - this is really easy in Java) so you can start crawling the world outside. Then you will need to modify the existing code to look for MP3's (be careful: this may be more complicated than it looks at first). Make sure that you experiment on our web server (details to be announced soon) - that is what we are putting it up for - and only crawl elsewhere on the web once you are certain you're doing the right thing!
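
The network-support piece really is small in Java. As an illustration only (no politeness checks are shown here - route real requests through your polite fetch layer), a page can be pulled down over HTTP like this:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    // Minimal "network support": pull a page down over HTTP as a String,
    // which you can then hand to your HTML parser or to Lucene for indexing.
    public class PageDownloader {
        public static String download(String address) throws Exception {
            URLConnection conn = new URL(address).openConnection();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            StringBuilder page = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            in.close();
            return page.toString();
        }
    }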

Once you have implemented this functionality, run the crawler on real web data to obtain the metrics asked for in the next section. Then implement some performance improvement of your choice, as suggested in the Mercator paper.

What to Hand In

Hand in the URL of a top-level web page that lists your team name and contact information for each member. In addition, the web page should provide links to:

  1. A description of your crawler. How does it work? Discuss your method for ensuring the safety of target web sites, your web-search strategy, and your approach to content extraction, as well as the kinds of problems you faced and how you attacked them.
  2. The full source code of your crawler, along with instructions (in a README file) about building and running it.
  3. A set of metrics on your crawler's performance: number of links followed, number of pages visited, number of MP3's found, number of duplicate MP3's found, and total CPU time consumed. Compare these metrics for your core crawler architecture vs. the enhanced crawler architecture containing the improvements suggested in the Mercator paper. Also calculate two derived metrics - CPU time per MP3 found and number of pages visited per MP3 found. Organize and present your results in a table of the following format:

                                             | Core Crawler | Enhanced Crawler
      CPU time consumed                      |              |
      Number of links followed               |              |
      Number of pages visited                |              |
      Number of MP3's found                  |              |
      Number of pages per MP3 found          |              |
      CPU time consumed per MP3 found        |              |

  4. The output text file, containing the information collected by the crawler, organized around a Location/Title/Artist format or something similar (a sample of what we have in mind is shown below).
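
     For illustration only (the URL, title, and artist below are made up), an entry in that file might look like:

        Location: http://www.example.com/music/track01.mp3
        Title:    Some Song Title
        Artist:   Some Band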

Note: If you get stuck or can't complete every part of the assignment, do as much as you can. If you try an ambitious method for information extraction, we understand you may not have as much time for other parts and will take this into account. Partial credit (and extra credit!) will definitely be awarded. If a bug or software glitch gets you, let us know as soon as possible (we'll give credit for finding these, and even more credit for finding a solution or workaround) - but keep working on other parts of the assignment.

Additional Useful Pointers

Good luck!