CSE 303 Homework 5

Out: Thursday, 10/27/2005
Due: Part A: Wednesday, November 2, 2:30 PM (how to turn in)
     Part B: Tuesday, November 8, 2:30 PM (how to turn in)
     Part C: Sunday, November 13 (not handed in)
     Part D: Wednesday, November 16, 2:30 PM (how to turn in)

Assignment Goals

Problem Overview

We're going to build the essence of Google, which is to say a search engine for web pages. There are two pieces to this:
  1. Building an index.
  2. Answering queries.
An index is a data structure that maps words to URLs. To build it, we begin with a list of URLs for pages to be scanned. We scan each page in that list, extracting from it all the words (the ones visible when the page is displayed in a browser) and all the links to other pages. For each word we encounter while scanning a page, we make an entry in the index associating that page with that word. For each link we encounter, we schedule the page that was linked to for eventual scanning itself. We keep scanning pages, and building the index, so long as there are any unvisited pages that we've encountered links to. This takes a long time.
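
To make that concrete, here is one possible shape for an index entry, sketched in C. (These names and fields are invented for illustration only; they are not part of any code we supply.)

/* One word's entry in the index: the word itself plus a list of
   (page, occurrence count) pairs, one for each page containing it. */
struct page_count {
    char              *page_name;   /* a page that contains the word         */
    int                count;       /* occurrences of the word on that page  */
    struct page_count *next;        /* next page containing the word         */
};

struct index_entry {
    char              *word;        /* the word this entry describes         */
    int                total_count; /* occurrences summed over all pages     */
    struct page_count *pages;       /* list of pages containing the word     */
};

Answering a one-word query is then just a matter of finding the word's index_entry and walking its pages list.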

The second phase is to answer queries. To keep things simple, our queries are only a single word. Our answer is (basically) the list of all pages that contain that word. We spent all that time building the index so that answering queries would be fast (because all we have to do is look up the answer in the index).

This page shows some output from the sample solution, to help make this concrete. The first many lines are from the index building phase; most are just progress indicators so we know the program hasn't gone into some kind of infinite loop. When the index building is done, some simple statistics are printed. The program then prompts for (single word) queries, and answers each with the total number of times the word was seen in the scanned pages (i.e., not the number of documents that have the word, as you might expect from using Google), the (full) list of pages that contain the word, and the number of times that word occurs on each of those pages.

That's the basic idea. To make this tractable, though, we're going to make a critical simplification: instead of fetching web pages as we scan, the pages have been pre-fetched and stored as local files. So, we're actually going to scan these local files. (The files are a good hunk of the pages at www.nytimes.com on 10/26/2005, about 65MB worth.)

Don't Panic

Absolutely the hardest part of all this is the job of scanning each page to extract the words and links. We're supplying that functionality, in the form of a library. It supports only a few calls, one of which is "give me the next interesting piece from the page, and tell me if it's a word or a link." Your part is to control the scanning of pages, to build the index, and to answer queries.

Solution Architecture

Let's start with a picture:

The greenish things are data structures. The dotted arrows and blue circles are related to control flow. The solid arrows are movement of data items. The blue numbers indicate a relationship with the numbered steps below. (A rough code sketch of the whole loop follows the numbered steps.)
  1. Seed the to-be-visited-list with the pages taken from the command line arguments. (Details below.)
  2. Fetch a page name from the to-be-visited-list. (If the list is empty, move to Step 4.) If it has already been visited, toss it out and re-execute this step. If it hasn't been visited, enter it into the already-visited-list and move to the next step.
  3. Scan the page whose name you just fetched:
    • For every word found, update the index: make sure there is an index entry for that word, and that it indicates that this page contains the word. Update the index entry to increment the count of the number of occurrences of this word on the page.
    • Enter into the to-be-visited-list each unvisited link scanned from the page.
    When the page has been completely scanned, return to Step 2.
  4. Transition to the query answering phase. Use the index to answer queries.
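
Pulling the numbered steps together, here is a tiny, self-contained C sketch of the crawl loop running over a hard-coded three-page "link graph" instead of real files and the lexer. Everything in it (names, pages, links) is made up purely to illustrate the to-be-visited / already-visited bookkeeping.

/* Toy demonstration of steps 1-4 on a hard-coded link graph. */
#include <stdio.h>
#include <string.h>

#define MAX_PAGES 16

static const char *pages[] = { "a.html", "b.html", "c.html" };
static const char *links[][2] = {            /* links[i] = pages that pages[i] links to */
    { "b.html", "c.html" },                  /* a.html -> b, c */
    { "c.html", NULL     },                  /* b.html -> c    */
    { "a.html", NULL     },                  /* c.html -> a    */
};

static const char *to_visit[MAX_PAGES]; static int n_to_visit = 0;
static const char *visited[MAX_PAGES];  static int n_visited  = 0;

static int was_visited(const char *p) {
    for (int i = 0; i < n_visited; i++)
        if (strcmp(visited[i], p) == 0) return 1;
    return 0;
}

static int page_index(const char *p) {
    for (int i = 0; i < 3; i++)
        if (strcmp(pages[i], p) == 0) return i;
    return -1;
}

int main(void) {
    to_visit[n_to_visit++] = "a.html";               /* step 1: seed the list    */

    while (n_to_visit > 0) {                         /* step 2: anything left?   */
        const char *page = to_visit[--n_to_visit];
        if (was_visited(page)) continue;             /* toss it, try again       */
        visited[n_visited++] = page;

        printf("scanning %s\n", page);               /* step 3: "scan" the page  */
        int pi = page_index(page);
        for (int j = 0; j < 2 && links[pi][j] != NULL; j++)
            if (!was_visited(links[pi][j]))
                to_visit[n_to_visit++] = links[pi][j];
    }
    return 0;                                        /* step 4: answer queries   */
}

In the real program, the toy arrays become your doubly linked lists, "scanning" uses the supplied lexer, and every word found also updates the index.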

What You'll Do

Theoretically, what you have to do is pretty simple: implement a doubly linked list and a binary search tree, and use them to crawl the pages, build the index, and answer queries. My sources indicate that you've seen both of these data structures in CSE 143, so the main challenges should have to do with C, rather than with data structure design.

Files We Provide

Start a shell and cd into the directory where you want your code to live. Then:
$ scp yourAttuLogin@attu:/cse/courses/cse303/05au/hw5/hw5dist.tar.gz .  # Note the period at the end
$ tar xzf hw5dist.tar.gz
Here's what you get:

Files We Somewhat Reluctantly Provide

You'll notice that the list above didn't include the data. The data is somewhat big: 13MB to fetch gzip'ed, and over 65MB expanded. What to do about it depends on where you're developing.
attu
Just use the path /cse/courses/cse303/05au/hw5/data whenever you have to specify the root of the data tree (i.e., in command line arguments to lexerdriver and your program). There's no reason to create your own copy of the data.

Lab PCs
You could just use the path /cygdrive/o/cse/courses/cse303/05au/hw5/data when you need to specify the root of the data tree. It will work, but that's a "mounted directory", and performance when retrieving files across the network is pretty bad. I recommend fetching the data and expanding it on the local drive, i.e., the C:\ drive (not your home directory). If it's not there when you sit down at a machine, just fetch and expand it again. That takes only a short while, much less than the time the local copy will save you compared to using the mounted directory, even if you run only once.

Other
Fetch the gzip'ed file and expand.

To get the data:
$ mkdir data
$ cd data
$ scp yourAttuLogin@attu:/cse/courses/cse303/05au/hw5/hw5-data.tar.gz .  # Note the period at the end
To expand:
$ tar xzf hw5-data.tar.gz
That creates a new subdirectory, data, and expands the data files into it.

Release History

Who knows, the code we distributed may have some bugs. Everything (except the data) has version information, so if we ever have to re-release anything you can tell whether you have the latest version.

  • Version 1.1 Release - 11/9/2005 12:04PM

About Bugs

One nice, real-world property we share with Google is this: when the user gives us a query, we can answer with pretty much anything we feel like and there's virtually no way the user can know whether or not we got the absolutely correct answer. The really nice thing for us about that is that we can concentrate on what the real point of the assignment is (see the Goals section at the top), and not get all wrapped up in producing a software artifact that exactly meets some detailed spec. (That would be particularly hard for this assignment because it gets unusually intimate with the operating system's file system. This has exposed some differences between running on cygwin/Windows and Linux, making it a lot of work to write code that runs identically on both. I haven't bothered to track down all these system dependencies -- it's way past the point of diminishing returns.)

The Lexer Routines

Looking at the code in lexerdriver.c may help answer questions you have after reading the following description. Additionally, lexer.h contains information for the programmer.

There are three routines that, together, give you a simple way to extract words and links from pages.
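
The real names and signatures are in lexer.h, so use that as your reference. Purely as a hypothetical illustration (none of these identifiers are real), the open / get-next-token / close pattern such an interface follows tends to produce a per-page loop shaped like this:

/* HYPOTHETICAL sketch only -- the real routines are declared in lexer.h. */
char *token;     /* the word or link just scanned               */
int   is_link;   /* nonzero if token is a link, zero if a word  */

lexer_open(page_path);                    /* start scanning one page         */
while (lexer_next(&token, &is_link)) {    /* returns 0 when the page is done */
    if (is_link) {
        /* schedule the linked-to page for scanning */
    } else {
        /* bump this word's count for this page in the index */
    }
}
lexer_close();                            /* finished with this page         */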

The Assignment Parts

The assignment is broken into parts, to make it more manageable. It's possible that in working on Part B, for instance, you'll want to change something in the code you completed for Part A, which you've already handed in. That's perfectly okay.

The amount of time allowed for each part reflects my guess about how difficult you'll find that part to be, as well as the midterm on 11/4 and the holiday on 11/11. (Part C's schedule is designed to roughly give you that holiday weekend off - I think you should be able to complete it before the 11th if you work on it immediately after handing in Part B, and I think you can complete Part D if you start on the 14th, assuming things have gone about averagely well on the earlier parts.)

Part A - Wednesday, 11/2, 2:30 PM
Fill in the missing parts of the doubly linked list implementation, in dllist/dllist.c. The .h file contains the full interface definition and specifications. When you're done with this part, dllistdriver should work, and should be getting correct results. (That it does so isn't proof that your code is correct, of course. As always, testing can only prove that there are bugs, not that there aren't.)
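
If you'd like a refresher on the data structure itself (the interface you must actually implement is the one specified in dllist/dllist.h, not this one), a doubly linked node and a front insertion conventionally look like this:

/* Generic doubly-linked-list refresher; NOT the dllist.h interface. */
struct node {
    void        *item;   /* the stored data           */
    struct node *prev;   /* NULL if this is the head  */
    struct node *next;   /* NULL if this is the tail  */
};

/* Insert new_node at the front of the list whose head pointer is *head. */
void push_front(struct node **head, struct node *new_node) {
    new_node->prev = NULL;
    new_node->next = *head;
    if (*head != NULL)
        (*head)->prev = new_node;
    *head = new_node;
}

The characteristic doubly-linked detail is keeping both prev and next consistent on every insertion and removal; that symmetry is where most bugs hide.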

Part B - Tuesday, 11/8, 2:30 PM
Implement most of steps 1, 2, and 3 from the Solution Architecture section above. The part you'll leave out is building the index. What that means, though, is you'll end up with code that "crawls" over a set of pages, scanning each page for links (and ignoring words for now), and following the links you find until there is nothing left unvisited.

Your program should end up visiting about the same number of pages as the sample executable when run on the same platform. (The sample executable gets different answers on different platforms, for reasons I'll guess have to do with some system dependencies in either flex or lexer.c.)

Part C - Sunday, 11/13, Not Handed In
Design and implement the binary search tree package. I recommend filling in btdriver.c to help test as you develop, but how to test is left to you.

Note: Your design should keep in mind the eventual use of the implementation. At the very least, you'll use it to build the word-to-page index.
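
Again as a generic refresher (your node contents, interface, and testing strategy are design decisions left to you), a string-keyed binary search tree insert-or-lookup can be as small as this:

/* Generic string-keyed BST refresher; adapt the payload to your own design. */
#include <stdlib.h>
#include <string.h>

struct bt_node {
    char           *key;     /* e.g., a word                          */
    void           *value;   /* e.g., a list of (page, count) records */
    struct bt_node *left;
    struct bt_node *right;
};

/* Return the node for key, creating a zero-initialized one if absent. */
struct bt_node *bt_lookup_or_insert(struct bt_node **root, const char *key) {
    if (*root == NULL) {
        struct bt_node *n = calloc(1, sizeof *n);   /* value/left/right start NULL */
        n->key = strdup(key);
        *root = n;
        return n;
    }
    int cmp = strcmp(key, (*root)->key);
    if (cmp == 0)
        return *root;
    return bt_lookup_or_insert(cmp < 0 ? &(*root)->left : &(*root)->right, key);
}

An insert-or-lookup operation like this maps naturally onto index building: look the word up, and bump the count stored in the node's value.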

Part D - Wednesday, 11/16, 2:30 PM
Complete the implementation.

Your program should take the same command line arguments as the sample executable, except that you do not need to support the -v switch. Those arguments are:

Example: Suppose I'm in some directory, hw5, and the data files are in a directory named data in the parent directory of hw5. Then I might say
$ ./hw5 ../data www.nytimes.com/index.html
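
A minimal way to pick those arguments apart in main, assuming (as in the example) that the first argument is the root of the data tree and every remaining argument is a starting page, is sketched below; the names are placeholders.

/* Sketch of argument handling, assuming: hw5 <dataRoot> <startPage>... */
#include <stdio.h>

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s dataRoot startPage...\n", argv[0]);
        return 1;
    }
    const char *data_root = argv[1];       /* root of the data tree          */
    for (int i = 2; i < argc; i++) {
        /* step 1: seed the to-be-visited list with argv[i] */
    }
    (void)data_root;                       /* used later when building paths */
    return 0;
}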
    

How To Build

This is slightly complicated, and is explained on a separate page.

Debugging

This program is quite a bit more complicated than the one for HW4. You can use printf() to debug, but I think you'll save a lot of time in the long run by learning at least the most essential aspects of gdb, a debugger. It will let you set breakpoints in your program, and examine variable values when a breakpoint is reached or the program crashes.
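
For instance, assuming the executable is named hw5 and was built with the -g flag (so it contains debugging information), a short session might look like this:

$ gdb ./hw5
(gdb) break main
(gdb) run ../data www.nytimes.com/index.html
(gdb) print argc
(gdb) backtrace
(gdb) continue

Here break sets a breakpoint, run starts the program under the debugger with its normal arguments, print shows a variable's value, backtrace shows the chain of calls that got you to the current spot (invaluable after a crash), and continue resumes execution.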

We'll talk a bit about gdb in class, but that's no substitute for actually trying it out. An easy way to do that is to go through the short exercises from an earlier offering of this class. They will show you enough about gdb to make a big difference in how hard you find it to debug this program (and future ones, well beyond this course). This can be done at any time, but there's no reason not to go through them as the very first thing you do for this assignment.

Links to more information about gdb can be found on the CSE 303 Documentation Links page.

What to Turn In

Create a gzip'ed tar file with a name like <yourname>-hw5-partA.tar.gz that contains everything we need to build and run your code at each turn-in step (all your source, any scripts you might have written, whatever), but do not tar up the data files.
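
One way to create it is something like the following, run from your hw5 directory; the file list here is just a placeholder for whatever your sources are actually called:

$ tar czf <yourname>-hw5-partA.tar.gz dllist/ *.c *.h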

You can check what it is you put in the tar file with

$ tar tzf <yourname>-hw5-partA.tar.gz | less

(In fact, you can create a new directory, expand your tar file into it, and then build and run there, as a way of making sure you have in fact included everything.)

Use the turnin command on attu to submit your tar.gz file. (We'll figure out how to name the various parts shortly...)