University of Washington Department of Computer Science & Engineering
 CSE454 Project Description

Meeting schedule: here

Projects are a very large component of 454. In the past we've had students build a complete crawler from scratch, but this year we have a more interesting objective. Students will work in groups of three on the following endeavor:

Webcam Search

Search is the rage. Everyone loves Google, and there are many other specialized sites: shopping search, travel search, vacation search, government search, blog search, clustering search, etc. What hasn't been searched yet?

There are thousands (millions?) of cameras connected to the WWW, but they aren't indexed in a way which makes them easy to use. In this project you will build a specialized search engine to find and index webcams. A user coming to your engine should be able to enter a location (e.g. city name) and get a list of active webcams offering current video snapshots of that location.

Your first job is self defense --- read this to learn how to survive this class.

On an unrelated note - you may download and use code that you find on the web (or share subroutines with another group), but you must explain (in an appendix of your report) what externally written code you have used and where you got it.

Finding Webcams

There are many ways you can find webcams for your index. You could use a base crawler such as Nutch. You could modify it to use focused crawling. You could start it with a seed queue of URLs generated by a Google query.
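
To make the focused-crawling idea concrete, here is a minimal sketch of a crawl frontier that visits the most promising links first. It is not Nutch code and does not fetch anything. The hint words, the scoring rule, and the names `score_link` and `Frontier` are all assumptions made up for illustration; a real crawler would tune or learn its scoring.

```python
import heapq

# Hypothetical hint words; tune these (or replace them with a classifier).
WEBCAM_HINTS = ("webcam", "netcam", "livecam", "camera")


def score_link(url, anchor_text=""):
    """Crude relevance score: how many hint words occur in the URL or anchor text."""
    text = (url + " " + anchor_text).lower()
    return sum(1 for hint in WEBCAM_HINTS if hint in text)


class Frontier:
    """Priority queue of URLs to crawl, highest score first, no duplicates."""

    def __init__(self, seeds=()):
        self._heap = []
        self._seen = set()
        self._counter = 0  # breaks ties so equal scores pop in insertion order
        for url in seeds:
            self.add(url)

    def add(self, url, anchor_text=""):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        entry = (-score_link(url, anchor_text), self._counter, url)
        heapq.heappush(self._heap, entry)
        self._counter += 1

    def pop(self):
        """Remove and return the most promising URL."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

The seed list passed to `Frontier` is where URLs from a search-engine query would go; links extracted from each fetched page would then be fed back in through `add`.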

Alternatively you could use Google to find them, or to find directory sites such as or

Popular products such as 802.11 base stations have default addresses and settings which allow them to be discovered. Do any of the leading webcam products have features like this which you can exploit to locate them?

Is there any way to find other sensors, e.g. feeds from ATMs?

Classifying Webcams

Classifying webcams is the most open-ended part of this project. You'll (probably) want to answer (at least) the following questions about each webcam you find and store the answers in your index.

  • Is the webcam active or dead?
  • What is the update frequency?
  • Is the camera inside or outside?
    Can you use day/night cycling times to determine this? What about color cast from tungsten or fluorescent lights?
  • Where is the camera located?
    There are many ways you might try to figure this out.
    • Domain name or IP address.
    • Language.
    • Keywords on the hosting page (information extraction techniques).
    • Time zone.
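      The hosting page may include a time stamp. Even if not, if you can identify (for outdoor cameras) the times of sunrise and sunset, this is a huge clue: on a known date, the length of the day constrains latitude, and the midpoint between sunrise and sunset (local solar noon, measured in UTC) constrains longitude. Day length barely depends on latitude near the equinoxes, so tracking a camera over several months should sharpen the latitude estimate.
    • Correlate the weather observed onscreen with weather forecasts or satellite photos (or with images visible in other webcams) for candidate cities.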
  • What direction is the webcam pointed?
    Here you could possibly use shadow and shadow movement. Or illumination at sunset and sunrise.
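
As one illustration of the sunrise/sunset idea in the list above, the sketch below turns observed sunrise and sunset times (in UTC) into a rough position. It rests on several simplifying assumptions: a common cosine approximation for solar declination, the standard sunrise equation cos(H) = -tan(lat) tan(decl), no correction for atmospheric refraction or the equation of time, and sunrise/sunset times that you have somehow already extracted from the image stream. The function names are hypothetical. Expect errors of several degrees on real data.

```python
import math


def solar_declination_deg(day_of_year):
    """Approximate solar declination in degrees for a day of the year (1-365)."""
    return -23.44 * math.cos(2.0 * math.pi * (day_of_year + 10) / 365.0)


def estimate_location(sunrise_utc, sunset_utc, day_of_year, min_declination_deg=2.0):
    """Estimate (latitude, longitude) in degrees from sunrise and sunset.

    Times are hours in UTC (13.5 means 13:30). Latitude is None when the
    declination is near zero (around an equinox), because day length is then
    about 12 hours everywhere and says little about latitude.
    """
    # Modulo handles a sunset that falls on the next UTC day.
    day_length = (sunset_utc - sunrise_utc) % 24.0
    solar_noon = (sunrise_utc + day_length / 2.0) % 24.0

    # The sun moves 15 degrees of longitude per hour; noon at 12:00 UTC is 0 degrees.
    lon = (12.0 - solar_noon) * 15.0
    lon = (lon + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)

    decl = solar_declination_deg(day_of_year)
    if abs(decl) < min_declination_deg:
        return None, lon

    # Invert cos(H) = -tan(lat) * tan(decl), where H is half the day length as an angle.
    half_day_angle = math.radians(day_length * 15.0 / 2.0)
    tan_lat = -math.cos(half_day_angle) / math.tan(math.radians(decl))
    return math.degrees(math.atan(tan_lat)), lon
```

Averaging the estimates from many days, and discarding days close to an equinox, is one plausible way to make the latitude more stable.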

Search for Webcams

Here you are on your own. Presumably, you have your cameras indexed by latitude and longitude, but people don't want to enter locations at that level. You could have people enter a place name, and you could look it up in a big table. (How would you disambiguate?) Or people could enter a zip code (might work OK for the USA). Can you get any ideas from looking at mapping sites like MapQuest? One great resource is Dan Egnor's prize-winning entry to the Google programming competition, geographic search, which includes a geocoder (which uses TIGER/Line data to turn street addresses into latitude/longitude coordinates).
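
A minimal sketch of the "big table" approach follows, assuming cameras are already stored with latitude and longitude. The tiny `GAZETTEER` table, its coordinates, and the camera records are placeholders invented for illustration; a real system would load a full gazetteer or use a geocoder. Ambiguous names simply return every candidate, leaving the choice to the user interface.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Hypothetical gazetteer: lowercase place name -> list of (qualifier, lat, lon).
GAZETTEER = {
    "portland": [("Oregon, USA", 45.5152, -122.6784),
                 ("Maine, USA", 43.6591, -70.2568)],
    "seattle": [("Washington, USA", 47.6062, -122.3321)],
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def lookup_place(name):
    """Return all candidate (qualifier, lat, lon) entries for a place name."""
    return GAZETTEER.get(name.strip().lower(), [])


def nearest_webcams(lat, lon, webcams, k=3):
    """Return the k cameras closest to (lat, lon).

    Each camera is a dict with at least "lat" and "lon" keys.
    """
    ranked = sorted(webcams, key=lambda cam: haversine_km(lat, lon, cam["lat"], cam["lon"]))
    return ranked[:k]
```

Sorting every camera is fine for a few thousand entries; a larger index would probably want a spatial structure such as a grid or k-d tree.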

Project Plan - due 4/12/05 10:30am

When working out your schedule, keep in mind that you will need to write a report as well as code and test.

Reports - due 6/6/05 10:30am sharp

Each group should hand in two hardcopies of the report as well as emailing Alan a .pdf or .doc version.

To ensure that people can focus on their classmates' presentations, all reports are due by 10:30am on 6/6 - note this deadline is 100% firm (please don't make us deduct points for late arrivals).

Your reports should be in 12pt font with a minimum of 1" margins on all sides. The maximum allowed length is 8 pages, including all figures and graphs (but not counting references (if you include them) or the attribution appendix described below). Single-sided is fine. For the hardcopies: stapled copies with double-sided printing if possible, no fancy bindings please.

The first page should include the title, author names, and an abstract of fewer than 200 words. (See the papers on the readings page for example abstracts.)

Every report should include an appendix ("Attribution") which describes the distribution of work and role of each person in the project. A similar appendix ("Other Code") should describe any code you downloaded from the net or which was written by a non-group member. These appendices are not counted in the length restriction.

The tone of any paper is determined by the intended audience. You should write your paper for Dan and for your classmates - i.e. you can assume that every reader will have successfully completed CSE 454 and you may use terms like precision, recall, vector-space model, naive Bayes classifier, etc. without explaining how they work. The introduction of your paper may be brief and terse because of this shared context.

Your report should include a discussion of the following issues:

  1. A statement of the problem you tried to solve. (For most of you this will be a quick specification of a webcam search engine.)
  2. A description of your architecture, focussing on aspects where you think your classmates might have done things differently.
  3. A discussion of mistakes you may have made (do you wish you had designed things differently?) and of things you did right. What have you learned from the project?
  4. A quantitative analysis of at least one design decision or of the overall performance of the system or part of your system. You should choose the aspect of the system on which you wish to focus - there are many, many choices. One simple idea would be to evaluate your mechanism for distinguishing between webcam and non-webcam pages. You could plot precision / recall curves for your ultimate solution. Even better, if you implemented several methods for doing this classification, you could compare their precision and recall. For example, you might do an experiment to see how well discrimination based solely on aspect ratio performs. You could compare this to a more complex heuristic based on several features - or against a naive Bayes classifier.
    An alternative study would look at the effectiveness of different techniques for localization (determining the physical location of a webcam). Or of the harvest rate of a focussed crawler - how does your yield of webcams change over time? Does this yield change if you use different heuristics to guide a focussed crawler? Etc. Etc. Choose (at least one) issue that interests you and look at it empirically. Think carefully about the question you are asking, what kind of data you need to answer the question, your interpretation of the data, and the best way to present the data to convince the reader that your interpretation is correct. (Usually a graph). If you think the results are ambiguous or there are other questions in your mind after the experiment, discuss these as well.
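
For the precision / recall experiment suggested in item 4, the bookkeeping can be as small as the sketch below. The classifier shown is only an example of "discrimination based solely on aspect ratio": it assumes webcam images tend to be close to 4:3, which is a guess you would need to check against your own data. The function names and the `(width, height, is_webcam)` record layout are hypothetical.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for parallel lists of boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # true positives
    fp = sum(1 for p, a in pairs if p and not a)    # false positives
    fn = sum(1 for p, a in pairs if not p and a)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def near_four_thirds(width, height, tolerance):
    """Guess 'webcam' when the image aspect ratio is within tolerance of 4:3."""
    if height == 0:
        return False
    return abs(width / height - 4 / 3) <= tolerance


def sweep_tolerances(pages, tolerances):
    """Evaluate the aspect-ratio heuristic at several tolerances.

    pages is a list of (width, height, is_webcam) tuples with hand-labeled
    truth values. Returns a list of (tolerance, precision, recall) rows,
    which is the raw material for a precision / recall curve.
    """
    actual = [is_webcam for _, _, is_webcam in pages]
    rows = []
    for tol in tolerances:
        predicted = [near_four_thirds(w, h, tol) for w, h, _ in pages]
        rows.append((tol,) + precision_recall(predicted, actual))
    return rows
```

Running the same labeled pages through a second classifier and passing its predictions to `precision_recall` gives a direct comparison between methods.
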
Turn in your code by taking the following steps:
  1. Tar and gzip your src directory plus any additional non-Nutch code.
    Instructions on how to do this are in the ACM Unix tutorial.
  2. Put the archive in your project directory and email Alan, telling him where it is.
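
If you would rather script the archive step than type the commands by hand, the sketch below uses Python's standard `tarfile` module to produce a gzip-compressed tarball. The function name `make_archive` and the paths you pass to it are placeholders; this is just one way to do it, not a required procedure.

```python
import os
import tarfile


def make_archive(archive_path, paths):
    """Write the given files and directories into a .tar.gz archive.

    Directories are added recursively. Each entry is stored under its own
    base name, so /home/me/project/src ends up as src/ in the archive.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(os.path.normpath(path)))
    return archive_path
```

For example, `make_archive("project.tar.gz", ["src", "extras"])` would bundle two directories named `src` and `extras` from the current directory.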

Group presentations will be in the final exam slot Monday 6/6 10:30-12:30 in our normal room.

Time (best-case)   Group
10:30              G (Craig Atkinson, Chris Gillum, Tim Prouty)
10:40              H (Hoang Kha, Bo Lee, Jessica Blat)
10:50              D (Tanya Peters, Eric Kochhar, Mary Dang)
11:00              F (Constantinos Papadopoulos, Colin Pop, Razvan Gheoca)
11:10              I (Mike Chan, Lam Nguyen, Khalil El Haitami)
11:20              B (Tilak Pun, Vivek Kumar, Caesar Indra)
11:30              C (Lee Faris, Erik Bronnum, Su Shen)
11:40              E (Jim Li, Jack Hebert, Kiarash Ghadianipour)
11:50              A (Michael Lindmark, Jon Su, Jenny Yuen)

Each group will have 8 minutes to talk (and demo); then 2 minutes for questions. Ideally, each member of the group will present part of their project's presentation. Powerpoint (or .pdf) slides are great, but don't include more than 5-10. Do practice talks to hone the presentation and (amongst other things) time yourselves. Slides which use pictures to explain things work better than bulleted lists. Make sure your smallest font is 28pt or larger.

In terms of focus: you should discuss the same things as in the report:

  1. How it works (with emphasis on things you think you might have done differently from your classmates).
  2. Which choices you were happy with and which you regret.
  3. Any interesting experimental evaluation.
  4. A demo.
You'll have to be very tight and concise to cover each of these in 8 minutes. (Practice!)
