Project #2: Adding Scalability to Crawlers
Due Dates:
Phase #1: Wednesday, February 7, at 5pm.
Phase #2: Wednesday, February 14, at 5pm.
Caution: Be sure to read the entire assignment from start to finish before
you start working on it. This is especially critical, since the assignment has two stages
and a bad choice early on could lead to bad results and/or having to rewrite substantial
portions of your code later.
Objective: Enhance your existing web crawler by making it more scalable, i.e.
able to efficiently store and retrieve data for a large number of MP3 files.
Constraints: On this and subsequent project where you will need to use SQL
Server 2000 you will need to work at the Windows 2000 in the CS labs. The reasons
are many, among them: since we will be storing the databases on the server, you will not be
able to access them from anyplace else; since database connections come with the appropriate
JDBC drivers, using even a slightly different platform would require the use of a different
driver (which we do not support for this class).
Note: It is possible that some machines in the CS labs may not allow proper
installation of the necessary JDBC driver. We have tried on many and have predominantly
been successful, but be aware of that potential problem. System administrators have been
alerted about the problem and have been looking at it. If and when more news comes,
we will immediately inform you. Also, be aware that there may be contention with CSE444,
who are using the same client-server infrastructure and, naturally, the same machines.
Bottom-line: Start early and do not wait for the deadlines to loom, since
delays due to unavailable/misconfigured equipment are a possibility.
The alternative: if you're dying to work from elsewhere,
you can do this at your own risk (not relying on support from us), but you should know
that there will be difficulties too: creating and managing your own SQL Server database,
configuring it appropriately with the necessary drivers, and finally making everything
compatible to the setup here at the CS labs (since that's where we would be testing), etc.
Groups & Collaboration: The same groups from Project #1 will continue
working together on this project. As always, each group member will be responsible to
know and be able to explain what other group members are doing or have done. Discussions
among different groups are permissible, although directly copying another group's work
(architectures, diagrams, or code) is considered cheating,
a very serious offense.
Rather than managing an enormous file, containing information about MP3 files your
crawler discovers, a more scalable approach would be to store this data in a (relational)
database. For this project we will use SQL Server 2000. Details follow immediately.
- We will create one database per group. Group members will be able to connect
to their database through usernames and passwords which will be e-mailed to you
shortly, along with other necessary information. Keep that e-mail for future
reference.
- Register your SQL server on your machine.
(If you switch machines, first make sure the appropriate settings are present on
your new machine.)
- Create a DSN on your machine. (If you switch machines,
first make sure the appropriate settings are present on your new machine.)
- Once you have accomplished the previous crucial steps, you should play with your
database. To save you the excitement of wondering how and where to start from, we give
you a sample Java source file,
doing a little bit of this. Experiment connecting to your database on the server and
doing the usual database operations: create, select, update, etc.
Online documentation will teach you any details.
Note: Here are in subsequent steps, obey the politeness principle - do not
disturb live outside servers solely for testing purposes.
Phase #1 - Incorporating Programmatic Database Access to Crawlers
Your goal for this week-long part of the project is to:
- crawl our test web server;
- extract the links to all MP3 files found there;
- do appropriate document parsing to determine the artists and titles of those tracks;
- store the extracted information into your database.
To accomplish this, you need to solve at least three technical problems:
- Parse the text surrounding the MP3 links, searching for clues of artist and title.
You can look at how the text is structured - this will be a good start.
However, in light of the more challenging Phase #2 of the project
we encourage you to strive towards making your parser as generic as possible -
other documents you find on the web are unlikely to have the exact same format,
and you ideally want to be able to reuse your parser over a sizable number of crawled
sites.
- Design a database schema (i.e. tables and inter-relationships between their
attributes), appropriate for the data you will be extracting from the parsing step.
For this phase of the project, the database schema will be quite trivial, but this will
not be the case if you want to make it robust - such that even if you were extracting
data from web sites unseen before, you would still be able to store it in your database
and not be inconsistent with the agreed format.
To give you a flavor of what problems you might face, consider the situation whereby
some sites may not keep explicit information about genre, or about track releases, or
even about album titles, let alone missing references about the artist or song-title.
What would you do if any of these were missing? You cannot omit them simply because
some will be primary keys in your tables. This is where the exciting part of the project
will start! But this is Phase #2.
Take a look at the Database Schema Design Helper
for ideas on how to approach the problem.
- Incorporate into your crawler the appropriate Java functionality that will allow
you to search and modify attributes and tables in the database. The approach helps
crawlers do away with the inherently non-scalable storage in files on a disk or even
in tables in main memory.
Warning: It is crucial to have successfully completed all steps from
the Getting Ready section before you proceed any further.
In principle, what you need to do is simple:
- writing to a file is replaced by enter records into the appropriate DB table;
- searching for a duplicate in a file is replaced by searching in a DB table;
- etc.
If you have diligently done all the preparation, this should indeed be simple, albeit
technical and depending on the complexity of your crawlers from Project #1. For a
start, look at the example of Java source code we provided - it does execute one
embedded SQL statement in Java; others can be done similarly.
Note: If you finish early, you may proceed to the next phase of the project - there
is substantially more challenge in completing it; starting early will give you more time
to come up with ingenious solutions, which can make the difference between an average
crawler and a great crawler!
In this second week-long part of the project, you should pick one of the two suggested
paths and do your best. Note: They are not equal in complexity.
Option A is much easier, choosing and completing it would give you less credit (comparable
to a grade in the range of 3.0 - 3.6 on this project).
Option B is an open-ended problem and the credit you can receive for it is higher
(up to a grade of 4.0 on this project), but the difficulty also rises considerably.
It is very important that you properly assess the time you can afford put into this, as
well as the strong and weak points of the members of your team. Completing Option A
perfectly will be considered a better achievement that doing work on Option B, but not
reaching any conclusive state.
Still, we want to assure you that partial credit and extra credit will be awarded
generously for your efforts and accomplishments.
Option A - Building MP3 Site Wrapper
Should you pick this option, your goal would be to build a "wrapper" of three known web
sites (of your choice), providing links to MP3 files. You can therefore customize your
database schema and optimize your document parsers to match the format of these web sites.
A wrapper in this case is just a restricted (or targetted) crawler for only these few sites.
It will request documents only they contain and will parse the MP3 links and surrounding
text as necessary in order to retrieve as much interesting and relevant data - from the
viewpoint of a potential user of the crawler. You should at least extract track titles and
artist names. Other useful information might include genre, track release, track size (in
MB), other artist-related info, etc.
Option B - Through Heuristics Toward Higher Efficiency and Accuracy
Deciding to pursue this option, your crawlers are destined to crawl the web, rather than
be restricted to servicing just a few MP3 sites. The rewards of this are numerous,
including finding music that you actually want to listen to (which you are less likely to
find on large MP3 donor sites).
The price to doing that is accordingly high: you should think about solving a host of
problems, which you will come across. You should consider ways to:
- control the search. This includes not going too deeply into any one search
path, that might lead to the undesirable effect of exploring only a narrow set of
sources;
- extract accurate identifying information for each MP3 link. This includes
applying heuristics (among those studied in class, or others of your own) to separate
out key attributes from other irrelevant information in the retrieved document;
- deal with missing data. This might be observed as incomplete information,
that some sites give about their MP3 files - mostly a missing key attribute.
Following are some ideas on how you might deal with some of these problems.
- Look at anchor text as well as text surrounding the MP3 links;
- Look at the actual MP3 filename - often it follows a common name scheme that
includes the artist name as well as the song title;
- Look at the information contained within the MP3 file itself (the ID3 tags, briefly
discussed on the class mailing list) - the format is well documented online.
Warning: You may use this idea only as a last resort, rather than
for all MP3 files, since it involves downloading files and therefore is not scalable.
A very important thing is to constantly be aware of the space constraints on our server
- it only has about 100MB total per group. If you exceed this limit at any point, you
should think about how to minimize your usage, otherwise problems will quickly swamp
not only you, but everyone else using the server and clearly you would be at fault.
Be sure to think of others - they are likely to pay off. We cannot wait to experience the
results together with you!
- Wondering where to start designing your database schema? Think about the following
issues:
- What will the information I need to store look like?
- What robust format would allow me to avoid problems with missing fields
(especially missing primary keys - something, which should never happen)?
- How many tables should you create?
- What attributes will each table have?
- What data type is appropriate for each attribute?
- Which attributes should have indexes on them?
- For the second phase of the project, consider various database schema alternatives.
If a group member has experience in databases, they will have studied normalization
procedures and should be well-versed in the schema design issues. However, this is not
absolutely necessary at this step: since the project is relatively small, it is not
hard to come up with a reasonable schema design by following one's intuition.
You should feel free to implement any design you feel best suits your needs.
Here is one possible idea with a few thoughts attached - the kind you should be
asking yourself when making your own table designs.
Let us consider the following two tables:
- band(name, location, genre)
- track(title, band, size-in-mb, source, url)
where name is the key attribute (presumably unique) for the band
relation, while title and band may be the key attributes for the
track relation. The band attribute in the track relation
and the name attribute in the band relation provide the "joining"
point between the two relations. It is always wise to have indexes on the keys -
name, title and band; also it often helps tremendously to
have indexes on secondary attributes, on which searches are frequent and therefore
need to be fast. In the proposed schema, a natural candidate for this will be the
genre attribute.
Is this schema robust? Does it have potential weaknesses?
Since bands often release multiple recordings (e.g., studio, live, etc.) of a given
track, it seems likely that one needs to specify a title, a band, and a date
in order to uniquely identify a specific track. Therefore the table might be augmented
in order to resolve such a potential conflicts. Also, if there is other relevant
data - lyrics, etc. that you choose to store, this will require yet another set of
fields in your tables. The possibilities are many, the sky is the limit!
If you want to be extra careful, consider adding another text field where any
"ill-formed" data (which does not match the relational schema you have defined) could
be stored. Or maybe you could just discard such extraneous information - it is a valid
design choice.
How about additional tables? They could prove quite useful, especially in the case of
temporary tables, e.g. containing URLs and/or robots.txt files of web sites
already visited. This will not be a justified approach if you can keep those in memory,
but with too many links you will have to either discard old ones, or store them in a
database. The design decision is again up to you.
Common sense suggests that simple design is always best. Extra credit will be awarded
for justified ambitious and prudent efforts!
What to Hand In
Hand in the URL of a top-level web page that lists the team name, team members and explains:
For Phase #1:
- A concise description of your database schema and the rationale behind it.
- A pointer to the table(s) in your database, where your crawler stored the MP3 links,
extracted from the test server.
- The source code of your crawler with incorporated database functionality.
For Phase #2:
- Which option did you choose to work on?
- (Option A only) Which three MP3 donor sites did you choose and why?
- A complete description of your (latest version of the) database schema and an
explanation of the rationale behind it.
- A pointer to the table(s) in your database, where your crawler stored the extracted
MP3 links.
- The source code of your crawler.
- (Option B only) A coherent discussion about how you solved the missing
attribute problem, the extra attribute problem (how you dealt with attributes for
which you had not allocated fields in the tables), heuristics used for identifying
various attributes and what the accuracy was, the policy of your crawler as far as
determining which links to be examined first, other notable features of your crawler,
which you want to share with us.
Note: If you get stuck or can not complete every part of the assignment, do as
much as you can. Partial credit and extra credit will definitely be awarded. If a bug or
software glitch gets you, let us know as soon as possible (we will give credit for
finding these, and even more credit for finding a solution or workaround) - but keep
working on other parts of the assignment.
Good luck!
Valentin Razmov | valentin@cs.washington.edu