Project #3: Adding Web Interfaces to Crawled Data


Due Dates:
    Phase #1: Wednesday, February 28, at 5pm.
    Phase #2: Monday, March 5, at 5pm.

Caution: Be sure to read the entire assignment from start to finish before you start working on it. This is especially critical since the assignment has two phases, and a bad choice early on could lead to bad results and/or to having to rewrite substantial portions of your code later.

Objective: Develop web interfaces to your crawled data, making the process of querying for data user-friendly. In particular, your goal is to provide a convenient way to search for MP3 files and any pertinent information that was extracted into your database during a previous web crawl.

Constraint: Since you will need access to your SQL database and SQLServer, you will still need to work from the CS labs.
Note: If you have made prior arrangements and have been working on another platform, porting the deliverables to our configuration, you may continue to do so, but keep in mind that you will be on your own when confronting implementation problems related to your specific setup.

Groups & Collaboration: The same groups from Projects #1 and #2 will continue working together on this project. As always, each group member is responsible for knowing and being able to explain what the other group members are doing or have done. Discussions among different groups are permissible, but directly copying another group's work (architectures, diagrams, or code) is considered cheating, a very serious offense.

What to Do

One of the advantages of storing your MP3-related data in a relational database is the ability to efficiently process (SQL) queries over your web-crawled datasets. In this part of the assignment, you should create a front-end HTML page containing one or more HTML forms through which users can interact with that data.

More specifically:

  1. Users (only registered users in Phase #2) should be able to search your database by specifying:

    Feel free to augment the queries we have asked for by adding more interesting ones. Extra credit will be given for useful and thoughtful extensions!

  2. (in Phase #2) New users should be able to register on your site.
    The information requested on the registration form should include:

    If a user is already registered, they should automatically be identified using the cookie they were sent at the time of registration. This saves users the trouble of having to re-register every time a new request is submitted or a page is refreshed.
    Cookies are a light-weight mechanism for augmenting the stateless nature of the HTTP protocol, which as conceived keeps no "memory" from one interaction with a user to the next. Cookies are small pieces of data stored on the client's local hard drive, which avoids overburdening the server with registration information and helps services scale to a large number of clients. Two important properties of cookies are that they are given out once per server domain and that they carry an expiration date and time. When a user requests new data, the cookie is retrieved from the local machine (if available) and sent along with the request, providing the needed credentials to the server and avoiding the need to re-register. While cookies are no replacement for comprehensive security mechanisms (which are typically much more resource intensive), they provide a simple way to keep track of users in a more open environment, such as a site allowing access to freely available MP3 files. A sketch of how a servlet might set and read such a cookie follows below.
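
To make this concrete, here is a minimal sketch (in Java, assuming the servlet approach discussed in the next section) of how a servlet could set a registration cookie and recognize a returning user. The cookie name, the registerNewUser() helper, and the 30-day lifetime are illustrative assumptions, not requirements.

    import java.io.IOException;
    import javax.servlet.http.*;

    // Sketch only: recognize returning users via a registration cookie.
    // The cookie name and the registerNewUser() helper are hypothetical.
    public class RegistrationServlet extends HttpServlet {

        private static final String COOKIE_NAME = "mp3search-user";  // assumed name

        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String userId = findUserCookie(req);
            if (userId == null) {
                // First visit: register the user and send back a cookie so that
                // later requests are identified automatically.
                userId = registerNewUser(req);
                Cookie cookie = new Cookie(COOKIE_NAME, userId);
                cookie.setMaxAge(60 * 60 * 24 * 30);   // expires after ~30 days
                resp.addCookie(cookie);
            }
            resp.setContentType("text/html");
            resp.getWriter().println("<html><body>Welcome, user " + userId + "</body></html>");
        }

        // Returns the value of our cookie, or null if the client did not send one.
        private String findUserCookie(HttpServletRequest req) {
            Cookie[] cookies = req.getCookies();   // null if no cookies were sent
            if (cookies == null) return null;
            for (Cookie c : cookies) {
                if (COOKIE_NAME.equals(c.getName())) return c.getValue();
            }
            return null;
        }

        // Placeholder: a real implementation would read the registration form
        // fields, insert a new row into the users table, and return its key.
        private String registerNewUser(HttpServletRequest req) {
            return String.valueOf(System.currentTimeMillis());
        }
    }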

Accessing Databases from the Web

Traditionally, programmers have implemented database search via web forms that connect to a CGI script on the server; the script then queries the database and returns the results to the user as HTML. In class, we have discussed the performance penalties of this approach. A reasonable alternative is to write a Java servlet that connects to the database via JDBC. Some prefer to use PHP, JSP, Perl, or other scripting languages; these approaches may be faster to implement and get running, but they tend not to scale as well. While we encourage you to use Java servlets to connect to your database from a web form, you are free to explore other avenues. Make sure to explain clearly in your writeup what you did and why you chose that approach.
Here is a list of hints about making servlets work for you.
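
As a starting point, the sketch below shows the overall shape of such a servlet: it reads a form parameter, runs a parameterized SQL query over JDBC, and writes the results back as HTML. The table and column names (mp3_tracks, artist, title, url), the form field name, and the JDBC URL are assumptions; substitute the schema and fields you actually built in Project #2.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.sql.*;
    import javax.servlet.http.*;

    // Sketch only: a servlet that answers a search form by querying the
    // database over JDBC. Table, column, and form-field names are assumptions.
    public class SearchServlet extends HttpServlet {

        // Fill in the JDBC URL (and credentials) for your own SQLServer setup.
        private static final String DB_URL = "jdbc:...";

        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String artist = req.getParameter("artist");   // name of the form field
            resp.setContentType("text/html");
            PrintWriter out = resp.getWriter();
            out.println("<html><body><h2>Results</h2><ul>");
            try (Connection conn = DriverManager.getConnection(DB_URL);
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT title, url FROM mp3_tracks WHERE artist LIKE ?")) {
                ps.setString(1, "%" + artist + "%");       // parameterized query
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        out.println("<li><a href=\"" + rs.getString("url") + "\">"
                                + rs.getString("title") + "</a></li>");
                    }
                }
            } catch (SQLException e) {
                out.println("<li>Query failed: " + e.getMessage() + "</li>");
            }
            out.println("</ul></body></html>");
        }
    }

A PreparedStatement is used here so that the user's input is passed as a query parameter rather than spliced directly into the SQL string.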

Relevance of Returned Data

To keep users satisfied, the returned results need to be both relevant and accurate responses to their queries. Similar to what some search engines do, you should develop a ranking function that can differentiate good results from suspicious ones. For instance, you may wish to rank based on your level of confidence in the retrieved results. If you can claim with a high degree of certainty that a particular link leads to the track or artist the user asked for, then this is a good result. On the other hand, if the artist name was retrieved using heuristics that do not guarantee correctness of the data, then this result should not be returned as a first choice.
Another criterion you may wish to rank your results on is availability. The fact that you found a link to a particular track does not mean that the link is still live, nor does it imply that the song can be retrieved quickly enough. Measuring server latencies and checking for MP3 availability are very useful add-ons to a real, usable system; they do not require much additional work on your part, but they help maximize user satisfaction. A sketch of such a check appears below.
Note: You may need to augment your database schema to support these and other extra features.
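
If you do add an availability check, one simple approach (a sketch, assuming plain HTTP HEAD requests are acceptable for your candidate links; the URL in main() is hypothetical) is to time a HEAD request to each link and fold the measured latency into your ranking:

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Sketch only: probe a candidate MP3 link with a HEAD request and
    // report whether it answers, and how quickly.
    public class LinkProbe {

        /** Returns the response time in milliseconds, or -1 if the link appears dead. */
        public static long probe(String mp3Url) {
            long start = System.currentTimeMillis();
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(mp3Url).openConnection();
                conn.setRequestMethod("HEAD");   // don't download the whole file
                conn.setConnectTimeout(3000);    // give up on very slow servers
                conn.setReadTimeout(3000);
                int code = conn.getResponseCode();
                conn.disconnect();
                if (code >= 200 && code < 400) {
                    return System.currentTimeMillis() - start;
                }
            } catch (Exception e) {
                // fall through: treat any network error as "unavailable"
            }
            return -1;
        }

        public static void main(String[] args) {
            long latency = probe("http://example.com/some-track.mp3");  // hypothetical URL
            System.out.println(latency < 0 ? "link appears dead"
                                           : "alive, " + latency + " ms");
        }
    }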

Dynamic Web Pages & JavaScript

Since what the user sees will define their experience with your crawler, your interface needs to be well thought out and tested. You may want to take a look at a few professional sites and see how they have tried to solve this problem. Users will interact with your database by asking it questions, which will be translated into queries. The results the user is interested in will be generated dynamically, so in order to report them you need to be able to generate web pages on the fly (as opposed to serving static pages). In addition, it is useful to validate data in the HTML form itself, before a query is sent to the database, e.g., checking for incomplete entries or type errors (such as a phone number that does not follow the normal phone-number format).
JavaScript is a scripting language supported by virtually all modern browsers, which allows you to do just that: process the inputs to a form, as well as dynamically create and update web pages.

Web Site Features

Depending on which option you chose for Project #2, the interface features you can offer the user may differ significantly.

If you chose to build wrappers (Option A) around a few well-known MP3 donor sites (such as mp3.com), these sites may give you the opportunity to search for various interesting trivia about each artist, such as artist history, pictures, links to CDs for sale, lyrics, etc. Such sites would typically also allow you to browse by genre and locate the list of top hits in any category, among other features, many of which you might not be able to offer if you chose to build your own crawler (Option B).
The trade-off is that with popular MP3 donor sites you may not be able to locate any particular item you wished for, such as your favorite song if it is by a popular artist.

If you chose to build your own crawler with a variety of heuristics to enable it to find what you are actually looking for (Option B), your system would be similar to gigabeat.com. Instead of a broad, browsing-friendly interface, gigabeat gives you a single search box. Type in the name of your favorite artist or song, and you are likely to find a link to the (undoubtedly pirated) MP3 file. The trade-off, again, is between information quality and completeness: gigabeat-style sites barely give you more than song titles and artist names, but they do so across a much broader set of tracks and artists available on the web.

Your own creation can draw from any of these features, or a combination of them. It will, however, be constrained by what you chose to implement in previous projects. The good news is that we have teams working on both options, so in the end we will get to see what user experience each approach provides!

More on Usability

Pay attention to the user interface of your search system and think through its usability carefully. We recommend getting some friends, asking them to use your site, and watching (in person) as they do. Keep quiet and do not coach them; just watch what they do and where they run into problems. This will teach you a lot that no one else can tell you.
Note: Part of your grade will depend on the cleanliness and usability of your interface.

Extensions

For extra credit, think about ways to extend your crawler in some useful and creative direction. Since this will be the final piece you build for your web crawlers in this class, you might as well make them into what you really want them to be!
Which specific area of your crawler to try to improve is your choice. Depending on your team's goals (including which option you chose for Project #2) and your results so far, you may choose differently. We will be happy to discuss any suggestions you might have.

What to Hand In

Hand in the URL of a top-level web page that lists the team name, team members and:

Note: If you get stuck or cannot complete every part of the assignment, do as much as you can. Partial credit and extra credit will definitely be awarded. If a bug or software glitch gets in your way, let us know as soon as possible (we will give credit for finding these, and even more credit for finding a solution or workaround), but keep working on the other parts of the assignment.

Additional Useful Pointers

Good luck!


Valentin Razmov | valentin@cs.washington.edu