Project #2: Adding Scalability to Crawlers

Due Dates:
Phase #1: Wednesday, February 7, at 5pm.
Phase #2: Wednesday, February 14, at 5pm.

Caution: Be sure to read the entire assignment from start to finish before you start working on it. This is especially critical, since the assignment has two stages and a bad choice early on could lead to bad results and/or having to rewrite substantial portions of your code later.

Objective: Enhance your existing web crawler by making it more scalable, i.e. able to efficiently store and retrieve data for a large number of MP3 files.

Constraints: On this and subsequent project where you will need to use SQL Server 2000 you will need to work at the Windows 2000 in the CS labs. The reasons are many, among them: since we will be storing the databases on the server, you will not be able to access them from anyplace else; since database connections come with the appropriate JDBC drivers, using even a slightly different platform would require the use of a different driver (which we do not support for this class).
Note: It is possible that some machines in the CS labs may not allow proper installation of the necessary JDBC driver. We have tried on many and have predominantly been successful, but be aware of that potential problem. System administrators have been alerted about the problem and have been looking at it. If and when more news comes, we will immediately inform you. Also, be aware that there may be contention with CSE444, who are using the same client-server infrastructure and, naturally, the same machines.
Bottom-line: Start early and do not wait for the deadlines to loom, since delays due to unavailable/misconfigured equipment are a possibility.
The alternative: if you're dying to work from elsewhere, you can do this at your own risk (not relying on support from us), but you should know that there will be difficulties too: creating and managing your own SQL Server database, configuring it appropriately with the necessary drivers, and finally making everything compatible to the setup here at the CS labs (since that's where we would be testing), etc.

Groups & Collaboration: The same groups from Project #1 will continue working together on this project. As always, each group member will be responsible to know and be able to explain what other group members are doing or have done. Discussions among different groups are permissible, although directly copying another group's work (architectures, diagrams, or code) is considered cheating, a very serious offense.

Getting Ready

Rather than managing an enormous file, containing information about MP3 files your crawler discovers, a more scalable approach would be to store this data in a (relational) database. For this project we will use SQL Server 2000. Details follow immediately.

We will create one database per group. Group members will be able to connect to their database through usernames and passwords which will be e-mailed to you shortly, along with other necessary information. Keep that e-mail for future reference.

Register your SQL server on your machine. (If you switch machines, first make sure the appropriate settings are present on your new machine.)

Create a DSN on your machine. (If you switch machines, first make sure the appropriate settings are present on your new machine.)

Once you have accomplished the previous crucial steps, you should play with your database. To save you the excitement of wondering how and where to start from, we give you a sample Java source file, doing a little bit of this. Experiment connecting to your database on the server and doing the usual database operations: create, select, update, etc. Online documentation will teach you any details.
Note: Here are in subsequent steps, obey the politeness principle - do not disturb live outside servers solely for testing purposes.

Phase #1 - Incorporating Programmatic Database Access to Crawlers

Your goal for this week-long part of the project is to:

crawl our test web server;
extract the links to all MP3 files found there;
do appropriate document parsing to determine the artists and titles of those tracks;
store the extracted information into your database.

To accomplish this, you need to solve at least three technical problems:

Parse the text surrounding the MP3 links, searching for clues of artist and title. You can look at how the text is structured - this will be a good start. However, in light of the more challenging Phase #2 of the project we encourage you to strive towards making your parser as generic as possible - other documents you find on the web are unlikely to have the exact same format, and you ideally want to be able to reuse your parser over a sizable number of crawled sites.
Design a database schema (i.e. tables and inter-relationships between their attributes), appropriate for the data you will be extracting from the parsing step. For this phase of the project, the database schema will be quite trivial, but this will not be the case if you want to make it robust - such that even if you were extracting data from web sites unseen before, you would still be able to store it in your database and not be inconsistent with the agreed format.
To give you a flavor of what problems you might face, consider the situation whereby some sites may not keep explicit information about genre, or about track releases, or even about album titles, let alone missing references about the artist or song-title. What would you do if any of these were missing? You cannot omit them simply because some will be primary keys in your tables. This is where the exciting part of the project will start! But this is Phase #2. Take a look at the Database Schema Design Helper for ideas on how to approach the problem.
Incorporate into your crawler the appropriate Java functionality that will allow you to search and modify attributes and tables in the database. The approach helps crawlers do away with the inherently non-scalable storage in files on a disk or even in tables in main memory.
Warning: It is crucial to have successfully completed all steps from the Getting Ready section before you proceed any further.
In principle, what you need to do is simple:
- writing to a file is replaced by enter records into the appropriate DB table;
- searching for a duplicate in a file is replaced by searching in a DB table;
- etc.
If you have diligently done all the preparation, this should indeed be simple, albeit technical and depending on the complexity of your crawlers from Project #1. For a start, look at the example of Java source code we provided - it does execute one embedded SQL statement in Java; others can be done similarly.

Note: If you finish early, you may proceed to the next phase of the project - there is substantially more challenge in completing it; starting early will give you more time to come up with ingenious solutions, which can make the difference between an average crawler and a great crawler!

Phase #2 - Scalable Crawling of Real MP3 Sites

In this second week-long part of the project, you should pick one of the two suggested paths and do your best. Note: They are not equal in complexity.
Option A is much easier, choosing and completing it would give you less credit (comparable to a grade in the range of 3.0 - 3.6 on this project).
Option B is an open-ended problem and the credit you can receive for it is higher (up to a grade of 4.0 on this project), but the difficulty also rises considerably.
It is very important that you properly assess the time you can afford put into this, as well as the strong and weak points of the members of your team. Completing Option A perfectly will be considered a better achievement that doing work on Option B, but not reaching any conclusive state. Still, we want to assure you that partial credit and extra credit will be awarded generously for your efforts and accomplishments.

Option A - Building MP3 Site Wrapper

Should you pick this option, your goal would be to build a "wrapper" of three known web sites (of your choice), providing links to MP3 files. You can therefore customize your database schema and optimize your document parsers to match the format of these web sites.
A wrapper in this case is just a restricted (or targetted) crawler for only these few sites. It will request documents only they contain and will parse the MP3 links and surrounding text as necessary in order to retrieve as much interesting and relevant data - from the viewpoint of a potential user of the crawler. You should at least extract track titles and artist names. Other useful information might include genre, track release, track size (in MB), other artist-related info, etc.

Option B - Through Heuristics Toward Higher Efficiency and Accuracy

Deciding to pursue this option, your crawlers are destined to crawl the web, rather than be restricted to servicing just a few MP3 sites. The rewards of this are numerous, including finding music that you actually want to listen to (which you are less likely to find on large MP3 donor sites). The price to doing that is accordingly high: you should think about solving a host of problems, which you will come across. You should consider ways to:

control the search. This includes not going too deeply into any one search path, that might lead to the undesirable effect of exploring only a narrow set of sources;
extract accurate identifying information for each MP3 link. This includes applying heuristics (among those studied in class, or others of your own) to separate out key attributes from other irrelevant information in the retrieved document;
deal with missing data. This might be observed as incomplete information, that some sites give about their MP3 files - mostly a missing key attribute.

Following are some ideas on how you might deal with some of these problems.

Look at anchor text as well as text surrounding the MP3 links;
Look at the actual MP3 filename - often it follows a common name scheme that includes the artist name as well as the song title;
Look at the information contained within the MP3 file itself (the ID3 tags, briefly discussed on the class mailing list) - the format is well documented online. Warning: You may use this idea only as a last resort, rather than for all MP3 files, since it involves downloading files and therefore is not scalable. A very important thing is to constantly be aware of the space constraints on our server - it only has about 100MB total per group. If you exceed this limit at any point, you should think about how to minimize your usage, otherwise problems will quickly swamp not only you, but everyone else using the server and clearly you would be at fault.

Be sure to think of others - they are likely to pay off. We cannot wait to experience the results together with you!

Database Schema Design Helper

Wondering where to start designing your database schema? Think about the following issues:
- What will the information I need to store look like?
- What robust format would allow me to avoid problems with missing fields (especially missing primary keys - something, which should never happen)?
- How many tables should you create?
- What attributes will each table have?
- What data type is appropriate for each attribute?
- Which attributes should have indexes on them?

For the second phase of the project, consider various database schema alternatives. If a group member has experience in databases, they will have studied normalization procedures and should be well-versed in the schema design issues. However, this is not absolutely necessary at this step: since the project is relatively small, it is not hard to come up with a reasonable schema design by following one's intuition. You should feel free to implement any design you feel best suits your needs. Here is one possible idea with a few thoughts attached - the kind you should be asking yourself when making your own table designs.
Let us consider the following two tables:
- band(name, location, genre)
- track(title, band, size-in-mb, source, url)
where name is the key attribute (presumably unique) for the band relation, while title and band may be the key attributes for the track relation. The band attribute in the track relation and the name attribute in the band relation provide the "joining" point between the two relations. It is always wise to have indexes on the keys - name, title and band; also it often helps tremendously to have indexes on secondary attributes, on which searches are frequent and therefore need to be fast. In the proposed schema, a natural candidate for this will be the genre attribute.
Is this schema robust? Does it have potential weaknesses?
Since bands often release multiple recordings (e.g., studio, live, etc.) of a given track, it seems likely that one needs to specify a title, a band, and a date in order to uniquely identify a specific track. Therefore the table might be augmented in order to resolve such a potential conflicts. Also, if there is other relevant data - lyrics, etc. that you choose to store, this will require yet another set of fields in your tables. The possibilities are many, the sky is the limit! If you want to be extra careful, consider adding another text field where any "ill-formed" data (which does not match the relational schema you have defined) could be stored. Or maybe you could just discard such extraneous information - it is a valid design choice.
How about additional tables? They could prove quite useful, especially in the case of temporary tables, e.g. containing URLs and/or robots.txt files of web sites already visited. This will not be a justified approach if you can keep those in memory, but with too many links you will have to either discard old ones, or store them in a database. The design decision is again up to you.
Common sense suggests that simple design is always best. Extra credit will be awarded for justified ambitious and prudent efforts!

What to Hand In

Hand in the URL of a top-level web page that lists the team name, team members and explains:

For Phase #1:

A concise description of your database schema and the rationale behind it.
A pointer to the table(s) in your database, where your crawler stored the MP3 links, extracted from the test server.
The source code of your crawler with incorporated database functionality.

For Phase #2:

Which option did you choose to work on?
(Option A only) Which three MP3 donor sites did you choose and why?
A complete description of your (latest version of the) database schema and an explanation of the rationale behind it.
A pointer to the table(s) in your database, where your crawler stored the extracted MP3 links.
The source code of your crawler.
(Option B only) A coherent discussion about how you solved the missing attribute problem, the extra attribute problem (how you dealt with attributes for which you had not allocated fields in the tables), heuristics used for identifying various attributes and what the accuracy was, the policy of your crawler as far as determining which links to be examined first, other notable features of your crawler, which you want to share with us.

Note: If you get stuck or can not complete every part of the assignment, do as much as you can. Partial credit and extra credit will definitely be awarded. If a bug or software glitch gets you, let us know as soon as possible (we will give credit for finding these, and even more credit for finding a solution or workaround) - but keep working on other parts of the assignment.

Additional Useful Pointers

Xerces XML parser documentation
SQL tutorials
JDBC-related tutorials:
Java.sun.com has good documentation on how to use the JDBC interface.

Good luck!

Valentin Razmov | valentin@cs.washington.edu