Reports - due 6/6/05 10:30am sharp
Each group should hand in two hardcopies of the report and also
email Alan a .pdf or .doc version.
To ensure that people can focus on their classmates' presentations, all reports are due by 10:30am on 6/6 - note this deadline is 100% firm (please don't make us deduct points for late arrivals).
Your reports should be in 12pt font with a minimum of 1" margins on
all sides. The maximum allowed length is 8 pages, including all
figures and graphs, but not counting references (if you include them)
or the appendices described below.
For the hardcopies: stapled copies with double-sided printing if
possible (single-sided is fine), no fancy bindings please.
The first page should include the title, author names, and an abstract of fewer than 200 words. (See the papers on the readings page for example abstracts.)
Every report should include an appendix ("Attribution") which
describes the distribution of work and role of each person in the
project. A similar appendix ("Other Code") should describe any code that you downloaded
from the net or that was written by someone outside your group. These
appendices are not counted in the length restriction.
The tone of any paper is determined by the intended audience. You should write your paper for Dan and for your classmates - i.e. you can assume that every reader will have successfully completed CSE 454, and you may use terms like precision, recall, vector-space model, naive Bayes classifier, etc. without explaining how they work. Because of this shared context, the introduction of your paper may be brief and terse.
Your report should include a discussion of the following issues:
- A statement of the problem you tried to solve. (For most of you this will be a quick specification of a webcam search engine)
- A description of your architecture, focussing on aspects where you think your classmates might have done things differently.
- A discussion of mistakes you may have made (do you wish you had designed things differently?) and of things you did right. What have you learned from the project?
- A quantitative analysis of at least one design decision, or of the overall performance of the system or of part of your system. You should choose the aspect of the system on which you wish to focus - there are many, many choices. One simple idea would be to evaluate your mechanism for distinguishing between webcam and non-webcam pages. You could plot precision / recall curves for your ultimate solution. Even better, if you implemented several methods for doing this classification, you could compare their precision and recall. For example, you might do an experiment to see how well discrimination based solely on aspect ratio performs. You could compare this to a more complex heuristic based on several features - or against a naive Bayes classifier.
An alternative study would look at the effectiveness of different techniques for localization (determining the physical location of a webcam), or at the harvest rate of a focussed crawler - how does your yield of webcams change over time? Does this yield change if you use different heuristics to guide a focussed crawler? Etc. Etc. Choose (at least one) issue that interests you and look at it empirically. Think carefully about the question you are asking, what kind of data you need to answer the question, your interpretation of the data, and the best way to present the data (usually a graph) to convince the reader that your interpretation is correct. If you think the results are ambiguous or there are other questions in your mind after the experiment, discuss these as well.
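As one illustration of the webcam / non-webcam evaluation suggested above, here is a minimal sketch of how the points on a precision / recall curve might be computed. It assumes your classifier gives each page a numeric score and that you have hand-labeled a sample of pages. The function name, the scores, and the labels are all made up for illustration; they are not part of the project code or of Nutch.

```python
from typing import List, Tuple


def precision_recall_curve(
    scores: List[float], labels: List[bool]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) using each distinct score as a cutoff.

    A page is predicted to be a webcam page when its score >= threshold.
    """
    total_positive = sum(labels)  # number of pages that really are webcam pages
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        true_pos = sum(1 for p, y in zip(predicted, labels) if p and y)
        pred_pos = sum(predicted)
        precision = true_pos / pred_pos if pred_pos else 1.0
        recall = true_pos / total_positive if total_positive else 0.0
        points.append((threshold, precision, recall))
    return points


# Hypothetical hand-labeled sample: classifier score, and whether the page
# really is a webcam page. Replace with your own measurements.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False]

for t, p, r in precision_recall_curve(scores, labels):
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Running the same function on the scores from each method (aspect ratio only, a multi-feature heuristic, naive Bayes) over the same labeled sample gives curves that can be plotted on one graph for comparison.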
Turn in your code by taking the following steps:
- Tar and gzip your src directory plus any additional non-Nutch code.
Instructions on how to do this are in the ACM Unix tutorial.
- Put the archive in your project directory and email Alan, telling him
where it is.
Group presentations will be in the final exam slot Monday 6/6 10:30-12:30 in our normal room.