Project #3: Adding Web Interfaces to Crawled Data
Phase #1: Wednesday, February 28, at 5pm.
Phase #2: Monday, March 5, at 5pm.
Caution: Be sure to read the entire assignment from start to finish before
you start working on it. This is especially critical, since the assignment has two stages
and a bad choice early on could lead to bad results and/or having to rewrite substantial
portions of your code later.
Objective: Develop web interfaces to your crawled data, making the process of
querying for data user-friendly. In particular, your goal is to provide a way to
conveniently search for MP3 files and any pertinent information, which has been extracted
into your database during a previous web crawl.
Constraint: Since you will need access to your SQL database and SQL Server,
you must continue to work from the CS labs.
If you have made previous arrangements and have been working on another platform, porting
the deliverables to our configuration, you may continue doing that, but keep in mind that
you will be on your own when confronting implementation problems related to your platform.
Groups & Collaboration: The same groups from Project #1 and #2 will continue
working together on this project. As always, each group member is responsible for
knowing and being able to explain what other group members are doing or have done. Discussions
among different groups are permissible, although directly copying another group's work
(architectures, diagrams, or code) is considered cheating,
a very serious offense.
What to Do
One of the advantages of storing your MP3-related data in a relational database is
the ability to efficiently process (SQL) queries on web-crawled datasets. In this part
of the assignment, you should create a front-end HTML page containing one or more HTML
forms, allowing users to:
- (Phase #1) search your database and get results displayed in HTML;
- (Phase #2) register with your web site and only be able to search if they
have previously registered.
- Users (only registered users in Phase #2) should be able to search your
database by specifying:
- An artist name, and obtaining all tracks available for that artist;
- A track title, and obtaining all tracks matching the specified title;
- A keyword, and obtaining all tracks or artists or both (as specified by the
user) containing the specified keyword. Extensions to this are possible - if you
think it is going to increase the attractiveness of your interface, go for it;
- All artists, and obtaining a sorted list of the names of all artists,
for whom data is available in your database. Be creative as to how to make this
list convenient to browse - one idea might be to have a web page, containing an
index with letter ranges, e.g. A-C containing all artists whose first letter is
between A and C, etc.
- (Optional) An artist name, and obtaining the best three web sites to search
for this particular artist. Here "best" could be defined as sites yielding the
largest number of MP3 links for the given artist, or sites containing the most
authoritative and up-to-date information for that artist. Be creative here, but
make sure you explain in your writeup what exactly you did and why you decided so.
This option can lead to very interesting extensions worth extra points!
Feel free to augment the queries we have asked for by adding more interesting ones. Extra
credit will be given for useful and thoughtful extensions!
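The letter-range index suggested for the "All artists" query can be sketched in a few lines of Java. This is only one possible approach: the class name ArtistIndex, the bucket boundaries in RANGES, and the "#" bucket for names that do not start with a letter are all assumptions made for illustration, not part of the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ArtistIndex {
    // Hypothetical bucket boundaries; adjust them to the size of your data.
    static final String[] RANGES = {"A-C", "D-F", "G-I", "J-L", "M-O", "P-S", "T-V", "W-Z"};

    /** Returns the letter range for an artist name, or "#" if it does not start with A-Z. */
    static String rangeFor(String artist) {
        String name = artist.trim();
        if (name.isEmpty()) {
            return "#";
        }
        char first = Character.toUpperCase(name.charAt(0));
        for (String range : RANGES) {
            if (first >= range.charAt(0) && first <= range.charAt(2)) {
                return range;
            }
        }
        return "#";
    }

    /** Groups artist names by letter range; names inside each range are sorted ignoring case. */
    static Map<String, List<String>> build(List<String> artists) {
        Map<String, List<String>> index = new TreeMap<>();
        for (String artist : artists) {
            index.computeIfAbsent(rangeFor(artist), k -> new ArrayList<>()).add(artist.trim());
        }
        for (List<String> names : index.values()) {
            names.sort(String.CASE_INSENSITIVE_ORDER);
        }
        return index;
    }
}
```

Each key of the resulting map could become one link on an index page, with the names under it rendered as the page that link leads to.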
- (in Phase #2) New users should be able to register on your site.
The information requested on the registration form should include:
- first name;
- last name;
- e-mail address;
- phone number;
If a user is already registered, she should automatically be identified using the
cookie she was sent at the time of registration. This avoids the trouble of users
having to re-register every time a new request is submitted or a page is refreshed.
Cookies are a light-weight mechanism used to augment the stateless nature of the HTTP
protocol, which as conceived keeps no "memory" from one interaction with a user to the
next. Cookies are small pieces of data, stored on local hard-drives at the client, to
avoid overburdening the server with registration information and to ensure scalability
of services for a large number of clients. Two important properties of cookies are that
they are given out once per server domain and they have an expiration date and time.
When a user requests new data, a cookie is retrieved from the local machine
(if available) and sent along with the request, providing the needed credentials to
the server and avoiding the necessity to re-register. While cookies are no replacement
for comprehensive security mechanisms (which typically are much more resource intensive),
they provide a simple way to keep track of users in a more open environment, such as
a site allowing access to freely available MP3 files.
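To make the cookie round trip described above concrete, here is a minimal sketch in plain Java of the two header values involved: the Set-Cookie value the server sends at registration time, and the Cookie value the browser sends back with later requests. The class name CookieSketch, the cookie name "uid", and the domain used in the usage note are hypothetical. In a servlet you would normally not build these strings by hand; javax.servlet.http.Cookie together with HttpServletResponse.addCookie and HttpServletRequest.getCookies covers the same ground.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CookieSketch {
    /** Builds the value of a Set-Cookie response header for a newly registered user. */
    static String setCookieHeader(String name, String value, String domain, long maxAgeSeconds) {
        // Domain scopes the cookie to one server domain; Max-Age sets its expiration.
        return name + "=" + value + "; Domain=" + domain + "; Max-Age=" + maxAgeSeconds + "; Path=/";
    }

    /** Parses the value of a Cookie request header ("a=1; b=2") into a name-to-value map. */
    static Map<String, String> parseCookieHeader(String header) {
        Map<String, String> cookies = new LinkedHashMap<>();
        if (header == null || header.trim().isEmpty()) {
            return cookies;
        }
        for (String pair : header.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip malformed fragments that have no '='
            }
            String name = pair.substring(0, eq).trim();
            if (!name.isEmpty()) {
                cookies.put(name, pair.substring(eq + 1).trim());
            }
        }
        return cookies;
    }
}
```

For example, setCookieHeader("uid", "abc123", "example.edu", 2592000) describes a cookie that expires after 30 days, and a later lookup of "uid" in the parsed request header tells the server which registered user is asking.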
Accessing Databases from the Web
Traditionally, programmers have implemented database search via web forms that connect
to a CGI script on the server; the script queries the database and
returns the results as HTML, which the user sees. In class, we have discussed the
performance penalties of this approach. A reasonable alternative is to write a Java
servlet which connects to the database via JDBC. Some prefer to use PHP, JSP, Perl or
still other scripting languages; these approaches may be faster to implement and get to run,
but all of them tend not to scale so well. While we encourage you to use
Java servlets to accomplish the goal of connecting
to your database from a web form, you are free to explore other avenues. Make sure to
explain clearly in your writeup what you did and why you chose that approach.
Here is a list of hints about making servlets work for you.
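As a rough illustration of the JDBC side of the servlet approach, the sketch below runs a keyword search through a PreparedStatement. The table Tracks and its columns title, artist and url are a hypothetical schema, and the class and method names are made up for this example; substitute whatever your own crawler stored. Passing user input as a bound parameter, rather than concatenating it into the SQL text, keeps form input from being interpreted as SQL.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class TrackSearch {
    // Hypothetical schema: Tracks(title, artist, url). '!' is the LIKE escape character.
    static final String BY_KEYWORD_SQL =
        "SELECT title, artist, url FROM Tracks "
        + "WHERE title LIKE ? ESCAPE '!' OR artist LIKE ? ESCAPE '!' "
        + "ORDER BY artist, title";

    /** Turns a user keyword into a "contains" pattern, escaping LIKE wildcards. */
    static String likePattern(String keyword) {
        String escaped = keyword.trim()
            .replace("!", "!!")   // escape the escape character first
            .replace("%", "!%")
            .replace("_", "!_")
            .replace("[", "![");  // '[' is also a wildcard in SQL Server
        return "%" + escaped + "%";
    }

    /** Returns {title, artist, url} rows whose title or artist contains the keyword. */
    static List<String[]> searchByKeyword(Connection db, String keyword) throws SQLException {
        List<String[]> rows = new ArrayList<>();
        try (PreparedStatement stmt = db.prepareStatement(BY_KEYWORD_SQL)) {
            String pattern = likePattern(keyword);
            stmt.setString(1, pattern);
            stmt.setString(2, pattern);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    rows.add(new String[] {
                        rs.getString("title"), rs.getString("artist"), rs.getString("url")
                    });
                }
            }
        }
        return rows;
    }
}
```

A servlet's doGet method could call searchByKeyword with the value of the form's keyword field and write the returned rows out as an HTML table.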
Relevance of Returned Data
To keep users satisfied, the returned results need to be both relevant and accurate as a
response to their queries. Similar to what some search engines do, you should develop a
ranking function that can differentiate good results from suspicious ones.
For instance, you may wish to rank based upon level of confidence in the retrieved data.
If with a high degree of certainty you can claim that a particular link leads to the track
or artist which the user requested to see, then this is a good result. On the other hand,
if the artist name was retrieved using heuristics that did not guarantee correctness of the
data, then this is a result that should not be returned as a first choice.
Another criterion you may wish to rank your results on is availability. The fact that
you found a link to a particular track does not mean that the link is still
available, nor does it imply that it leads to a song you can retrieve quickly enough.
Measuring server latencies and checking for MP3 availability are very useful add-ons to a
real, usable system. They do not require much additional work on your part, but help
maximize user satisfaction.
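One way to combine the two criteria above (confidence in the extracted data, and availability of the link) into a single ranking function is sketched below. Everything specific here is an assumption for illustration: the Hit fields, the 0.7/0.3 weights, and the rule that unreachable links score zero. Your own function may weigh things quite differently.

```java
import java.util.ArrayList;
import java.util.List;

final class ResultRanker {
    /** One candidate result; the fields are hypothetical columns of your schema. */
    static final class Hit {
        final String url;
        final double confidence;   // 0.0 (heuristic guess) to 1.0 (certain match)
        final boolean reachable;   // did the last availability check succeed?
        final long latencyMillis;  // measured server response time

        Hit(String url, double confidence, boolean reachable, long latencyMillis) {
            this.url = url;
            this.confidence = confidence;
            this.reachable = reachable;
            this.latencyMillis = latencyMillis;
        }
    }

    /** Higher is better. Dead links score 0; the weights are arbitrary assumptions. */
    static double score(Hit hit) {
        if (!hit.reachable) {
            return 0.0;
        }
        double speed = 1.0 / (1.0 + hit.latencyMillis / 1000.0); // 1.0 for an instant reply
        return 0.7 * hit.confidence + 0.3 * speed;
    }

    /** Returns a copy of the hits ordered from best to worst score. */
    static List<Hit> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Double.compare(score(b), score(a)));
        return sorted;
    }
}
```

With this scoring, a confident match on a slow server still outranks a doubtful match on a fast one, and links that failed the availability check sink to the bottom.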
Note: You may need to augment your database schema to support these and other
Since what the user sees will define her experience with your crawler, your interface
needs to be well thought out and thoroughly tested. You may want to take a look at a few
professional sites and see how they tried to solve this problem.
Users will interact with your database by asking it questions, which will be translated
into queries. The results of the queries that the user is interested in will be dynamically
generated, therefore in order to report them you need to be able to generate web pages
on-the-fly (as opposed to static web pages) and display them. In addition, it would be
useful to validate data in the HTML form itself, before a query is sent out to the
database, e.g. checking for incomplete entries or type errors (such as a phone number not
following the normal phone number format). Client-side scripting is supported by
all modern browsers and allows you to do just that: process inputs to a form, as well
as dynamically create and update web pages.
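Checks made in the browser can be bypassed, so it may be worth repeating them on the server before anything reaches the database. The following is a minimal Java sketch of such a check for the registration fields. The class name is hypothetical, and both patterns are assumptions: the e-mail pattern only tests for a rough "name@host.domain" shape, and the phone pattern accepts only 10-digit North American numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class RegistrationValidator {
    // Assumed formats; loosen or tighten these to match what your site accepts.
    private static final Pattern EMAIL =
        Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");
    private static final Pattern PHONE =
        Pattern.compile("^\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}$");

    /** Returns one message per invalid field; an empty list means the form is acceptable. */
    static List<String> validate(String first, String last, String email, String phone) {
        List<String> errors = new ArrayList<>();
        if (isBlank(first)) {
            errors.add("first name is required");
        }
        if (isBlank(last)) {
            errors.add("last name is required");
        }
        if (isBlank(email) || !EMAIL.matcher(email.trim()).matches()) {
            errors.add("e-mail address is not valid");
        }
        if (isBlank(phone) || !PHONE.matcher(phone.trim()).matches()) {
            errors.add("phone number is not valid");
        }
        return errors;
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }
}
```

The returned messages can be written straight into the dynamically generated page, next to the form, so the user sees what to correct.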
Web Site Features
Depending on which option you chose to work on for Project #2, your choice of what
interface features to give the user may differ significantly.
If you chose to build wrappers (Option A) of a few known MP3 donor sites,
these may offer you the opportunity to search for various interesting trivia about each
artist, such as artist history, pictures, links to CDs for sale, lyrics, etc. Such sites
would typically allow you to browse by genre, locate the list of top hits in any category
as well as other features many of which you might not be able to ask for if you chose to
build your own crawler (Option B).
The trade-off is that with popular MP3 donor sites you may not be able to locate anything
particular you wished for, such as your favorite song if it is by a popular artist.
If you chose to build your own crawler with a variety of heuristics to enable it to find
what you are actually looking for (Option B), your site would be similar to gigabeat.
Instead of having a broad browsing-friendly interface, gigabeat gives you a single search
box. Type in the name of your favorite artist or song, and you are likely to find a link
to the (undoubtedly pirated) MP3 file. The trade-off again is between information quality
and completeness - gigabeat-style sites would barely give you song titles and artist names,
but will do so across a much broader set of tracks and artists available on the web.
Your own creation can draw from any of these features, or a combination of them. It will
however be constrained by what you chose to implement in previous projects. The good news
is, we have teams working on both options, so in the end we will see what user experience
this will provide!
More on Usability
Pay attention to the user interface of your search system. Think through usability
carefully. We recommend getting some friends, asking them to use your site and watching
(in person) as they do. Keep quiet and do not tell them anything; just watch what
they do and where they have problems. This will teach you a lot that no one else can tell you.
Note: Part of your grade will depend on the cleanliness and usability of your interface.
For extra credit, think about ideas on how to extend your crawler in some useful and
creative way. Since this will be the final piece you will build for your web crawlers in
this class, you might as well make them be what you really want them to be!
It is your personal choice which specific area of your crawler to try and improve.
Depending on the team goals (including which option you chose on Project #2) and your
results so far, you may choose differently. We will be happy to discuss with you any
suggestions you might have.
What to Hand In
Hand in the URL of a top-level web page that lists the team name, team members and:
- for Phase #1:
- Gives a description of your site design and the rationale behind it.
What features does your site support? What are the constraints that limit the
ability of the user to find what she may be looking for? Does it cover in-depth
and complete information about a small number of artists or does it allow the user
to search through a broad class of artists, but giving relatively little data on each?
- Discusses in detail any extra features you have designed and implemented.
- Links to your search form, the code which drives it and a brief explanation of
how it works.
- Provides a testing guide and a short list of sample queries that show off
important features of your front-end.
- for Phase #2:
- Gives an updated description of your site's design and the rationale
behind it. What new features does your site support? What are the constraints
that limit the ability of the user to experience what she may be expecting from
a site like yours?
- Discusses in detail any extra features you have designed and implemented.
- Links to your updated search form, the enhanced code which drives it and
provides a brief explanation of what is new.
- Provides a testing guide as well as an updated short list of sample queries
that show off important features of your front-end.
- Tells us about your group's experience from trying to find music
you wanted via your front-end. Was it successful? What were the advantages and
the downsides that you or other users of your site noticed?
- Briefly recounts what you would do differently in hindsight and/or what you
would have liked to do in addition to your latest functionality, if only you had more time.
Note: If you get stuck or cannot complete every part of the assignment, do as
much as you can. Partial credit and extra credit will definitely be awarded. If a bug or
software glitch gets you, let us know as soon as possible (we will give credit for
finding these, and even more credit for finding a solution or workaround) - but keep
working on other parts of the assignment.
- JDBC and related:
- Aaron Weiss,
- Stephen Spainhour & Robert Eckstein, Webmaster in a Nutshell,
2nd ed., O'Reilly Publ., June 1999.
- Danny Goodman, Dynamic HTML: The Definitive Reference, O'Reilly Publ., Aug 1998.
- Java servlets:
Valentin Razmov | email@example.com