Out: Thursday, 10/27/2005
Due:
- Part A: Wednesday, November 2, 2:30 PM (How to turnin)
- Part B: Tuesday, November 8, 2:30 PM (How to turnin)
- Part C: Sunday, November 13 (not handed in)
- Part D: Wednesday, November 16, 2:30 PM (How to turnin)
- malloc() and free()
- structs
- gdb, make, ar, ...
We're going to build the essence of Google, which is to say a search engine for web pages. There are two pieces to this:
- Building an index.
- Answering queries.
An index is a data structure that maps words to URLs. To build it, we begin with a list of URLs for pages to be scanned. We scan each page in that list, extracting from it all the words (the ones visible when the page is displayed in a browser) and all the links to other pages. For each word we encounter while scanning a page, we make an entry in the index associating that page with that word. For each link we encounter, we schedule the page that was linked to for eventual scanning itself. We keep scanning pages, and building the index, as long as there are any unvisited pages that we've encountered links to. This takes a long time.
The second phase is to answer queries. To keep things simple, our queries are only a single word. Our answer is (basically) the list of all pages that contain that word. We spent all that time building the index so that answering queries would be fast (all we have to do is look up the answer in the index).
This page shows some output from the sample solution, to help make this concrete. The first many lines are from the index building phase; most are just progress indicators so we know the program hasn't gone into some kind of infinite loop. When the index building is done, some simple statistics are printed. The program then prompts for (single word) queries, and answers each with the total number of times the word was seen in the scanned pages (i.e., not the number of documents that have the word, as you might expect from using Google), the (full) list of pages that contain the word, and the number of times that word occurs on each of those pages.
That's the basic idea. To make this tractable, though, we're going to make a critical simplification: instead of fetching web pages as we scan, the pages have been pre-fetched and stored as local files. So, we're actually going to scan these local files. (The files are a good hunk of the pages at www.nytimes.com on 10/26/2005, about 65MB worth.)
Absolutely the hardest part of all this is the job of scanning each page to extract the words and links. We're supplying that functionality, in the form of a library. It supports only a few calls, one of which is "give me the next interesting piece from the page, and tell me if it's a word or a link." Your part is to control the scanning of pages, to build the index, and to answer queries.
Let's start with a picture: [architecture diagram]
The greenish things are data structures. The dotted arrows and blue circles are related to control flow. The solid arrows are movement of data items. The blue numbers indicate a relationship with the numbered steps below.
1. Seed the to-be-visited list with the pages taken from the command line arguments. (Details below.)
2. Fetch a page name from the to-be-visited list. (If the list is empty, move to Step 4.) If it has already been visited, toss it out and re-execute this step. If it hasn't been visited, enter it into the already-visited list and move to the next step.
3. Scan the page whose name you just fetched:
   - For every word found, update the index: make sure there is an index entry for that word, and that it indicates that this page contains the word. Update the index entry to increment the count of the number of occurrences of this word on the page.
   - Enter into the to-be-visited list each unvisited link scanned from the page.
   When the page has been completely scanned, return to Step 2.
4. Transition to the query-answering phase. Use the index to answer queries.
Theoretically, what you have to do is pretty simple:
- Finish the skeletal implementation of a set of routines that provide doubly linked lists.
- Design and implement a set of routines that provide binary search trees.
- Design and implement the main program - create the index and then use it to answer queries. This code uses the data structure implementations, of course.
My sources indicate that you've seen both of these data structures in CSE 143, so the main challenges should have to do with C, rather than with data structure design.
Start a shell and cd into the directory where you want your code to live. Then:
$ scp yourAttuLogin@attu:/cse/courses/cse303/05au/hw5/hw5dist.tar.gz .   # Note the period at the end
$ tar xzf hw5dist.tar.gz
Here's what you get:
hw5soln.exe
A cygwin executable for the sample solution. Just delete it if you're not working in cygwin - the sample solution executable for attu (and similar platforms) is available on that machine as /cse/courses/cse303/05au/hw5/hw5soln.

hw5.c
A virtually empty placeholder file, just to indicate that we're hoping you'll call your main code (that orchestrates building the index and then handles queries) by this name.

liblexer.a and lexer.h
These are the files you need to use the routines that know how to scan pages to extract words and links. The first is a library, containing the compiled form (.o files) for the routines. You need that to link with any code that uses those routines. You need the second to compile code calling those routines. (Note: This is a cygwin-only version of the library. If you are working elsewhere, delete it and then issue the command "make liblexer.a". Things happen, and it should appear. This process requires make and flex, so they must be installed on your system for it to work. If anything goes wrong, you can fetch the library from attu:/cse/courses/cse303/05au/hw5/liblexer.a)
lexerdriver.c
A program that I used for testing the lexer routines. It calls the lexer routines and prints what it gets from them. This code is not part of the Google implementation; it's just for testing.
- This might be of use in understanding what the lexer routines do.
- This program is a useful starting point for your main routine. When it comes time to start working on hw5.c, my recommendation is to copy this file to hw5.c (yes, overwriting the nearly empty file that you already have).
Example invocation:
$ ./lexerdriver ./data www.nytimes.com/index.html
The first argument is the "data root directory", i.e., a path to the directory in which the data files live. The second argument is a path relative to the data root directory that names a file to be scanned.

lexer/
A directory containing the code that can scan pages and return words and links. This is provided for completeness - with luck, you should never have to look in this directory. It's also provided in case it might help with debugging. Although I've tried to make the lexer code resilient to the kinds of errors that seem likely to come up in calls to it, it is impossible to check everything that might go wrong. This means that your program could well crash when executing lexer code. The lexer being open source, it's a lot easier to debug if that does happen than if I hid the source from you. (Note: it would be unfortunate, but reasonable, to look at the code in source file lexer.c while debugging. It's very unlikely that looking in lex.yy.c will be useful, even if the program is crashing in there: that code is machine generated, and isn't really human readable.)

dllist/
A directory containing the doubly linked list skeleton implementation:

dllist.h
A complete .h file for the doubly linked list implementation.

dllist.c
A partial implementation of doubly linked lists. You need to complete the code in here for Part A of the assignment.

dllistdriver.c
Another driver, this one for testing the doubly linked list implementation. Compile, link, and run it to help find bugs as you develop. It is not guaranteed to do exhaustive testing. Feel free to add tests to it, or otherwise modify it as you please. Example invocation:
$ ./dllistdriver

bt/
A directory containing the binary search tree skeleton implementation:
bt.h
A nearly empty .h file for the binary search tree implementation. Once you design the interface, put function prototypes and the like here. (Look at dllist.h as an example.)

bt.c
A nearly empty file that will eventually contain your binary search tree implementation.

btdriver.c
A nearly empty driver to help test your binary search tree implementation. I recommend taking the few minutes it will take to write this test code, as I think you'll save a lot of time in the long run by doing this sort of "unit testing". (Look at the code in dllistdriver.c to get an idea of what you might put in this file.)

appSettings.h
Provides some definitions used in much of the code (including the lexer). In particular, it defines a pre-processor macro that is used to test whether a function has been passed a null pointer. That's something you want to test for while debugging. To get maximum execution speed, though, you don't want that overhead once there are no more bugs. This .h file provides a simple way to "get rid of" the tests everywhere, by simply editing one line in it and then re-compiling/linking everything. Bottom line: You don't absolutely have to look at this, but you probably will end up doing so.
README.make
A README file in case you want to try using the makefiles. (See the next file description.)

makefile
makefiles do what your go script did in HW4, only better. I used them to develop this code. I'm distributing them because I think it'll be easy, and a good idea, for you to try using them as well, even before we get to them in class. However, (a) it's not required, and (b) I don't anticipate trying to support them. If you don't create any new files (and I don't think you should need to), they should work. Read README.make, and the section below on Building, for more info.
You'll notice that the list above didn't include the data. The data is somewhat big: 13MB to fetch gzip'ed, and over 65MB expanded. What to do about it depends on where you're developing. To get the data:

attu
Just use the path /cse/courses/cse303/05au/hw5/data whenever you have to specify the root of the data tree (i.e., in command line arguments to lexerdriver and your program). There's no reason to create your own copy of the data.
Lab PCs
You could just use the path /cygdrive/o/cse/courses/cse303/05au/hw5/data when you need to specify the root of the data tree. It will work, but that's a "mounted directory", and performance when you're retrieving files across the network is pretty bad. I recommend fetching the data and expanding it on the local drive, i.e., the C:\ drive (not your home directory). If it's not there when you sit down at a machine, just fetch and expand it again. It takes only a very short while (much less time than you'll save by having it local if you run even just once, compared to using the mounted directory).
Other
Fetch the gzip'ed file:
$ mkdir data
$ cd data
$ scp yourAttuLogin@attu:/cse/courses/cse303/05au/hw5/hw5-data.tar.gz .   # Note the period at the end
To expand:
$ tar xzf hw5-data.tar.gz
That creates a new subdirectory, data, and expands the data files into it.
Who knows, the code we distributed may have some bugs. Everything (except the data) has version information, so you can tell if you have the latest release if we have to re-release anything.
- All source files have a comment near the top with a version number.
- $ hw5soln -v will cause the sample solution to print its version number.
Version 1.1 Release - 11/9/2005 12:04PM
One nice, real-world property we share with Google is this: when the user gives us a query, we can answer with pretty much anything we feel like and there's virtually no way the user can know whether or not we got the absolutely correct answer. The really nice thing for us about that is that we can concentrate on what the real point of the assignment is (see the Goals section at the top), and not get all wrapped up in producing a software artifact that exactly meets some detailed spec. (That would be particularly hard for this assignment because it gets unusually intimate with the operating system's file system. This has exposed some differences between running on cygwin/Windows and Linux, making it a lot of work to write code that runs identically on both. I haven't bothered to track down all these system dependencies -- it's way past the point of diminishing returns.)
Looking at the code in lexerdriver.c may help answer questions you have after reading the following description. Additionally, lexer.h contains information for the programmer. There are three routines that, together, give you a simple way to extract words and links from pages.
int setRootDir( char* rootDir );
Call this routine exactly once. Call it before calling either of the other routines below. Pass it the file system path for the root directory of the data, which typically you get as the first command line argument to your program (e.g., /cse/courses/cse303/05au/hw5/data). The path is any valid path, given the current working directory - anything you could give to an ls command typed into the shell.
Returns: 1 for success, 0 for failure. Terminates your program's execution if called again after an invocation that has returned success.

int setPage( char* pFilename );
Must be called after setRootDir() and before getToken() (described next). Can be called repeatedly. Typically is called only near the beginning of execution and then each time getToken() returns end-of-file. Pass the name of a page you want to scan. Names are relative to the data root directory, so any command line arguments to your program that you pass to this routine should be given in that way. (See the example invocation of lexerdriver above.) When getToken() returns a link that it has found in a page, the link it hands back is always suitable for passing to setPage().
Returns: 1 for success, 0 for failure. Terminates your program's execution if called before a successful setRootDir() invocation.

char* getToken( int* type );
May be called only after a successful invocation of setPage(). Typically called repeatedly, until end-of-file is indicated by the returned value. Each successive call returns the next item of interest from the current page: a word, a link, or end-of-file. The kind of thing being returned is indicated by *type, which is an output parameter. A word or link is returned as a string through the function's return value.
Returns: Sets *type to one of LEXER_WORD, LEXER_LINK, LEXER_EOF, and LEXER_INTERNAL_ERROR (see lexer.h). (If it indicates an internal error -- and I've never seen it do that -- just ignore it and call getToken() again.) If returning a word or a link, the returned char* value is a pointer to a string that is the word or the name of the page linked to; for the two other cases, the returned char* has undefined value. When there is a returned string, it "belongs to" the caller -- the caller is responsible for free()'ing the memory it uses.
Terminates your program's execution if called before a successful invocation of setPage(). Behavior is undefined if an invocation returns LEXER_EOF and getToken() is called again before a successful setPage() invocation has occurred.
The assignment is broken into parts, to make it more manageable. It's possible that in working on part B, for instance, you'll want to change something in the code you completed for part A, which you've already handed in. That's perfectly okay. The amount of time allowed for each part reflects my guess about how difficult you'll find that part to be, as well as the midterm on 11/4 and the holiday on 11/11. (Part C's schedule is designed to roughly give you that holiday weekend off - I think you should be able to complete it before the 11th if you work on it immediately after handing in Part B, and I think you can complete Part D if you start on the 14th, assuming things have been going about averagely well on the earlier parts.)
Part A - Wednesday, 11/2, 2:30 PM
Fill in the missing parts of the doubly linked list implementation, in dllist/dllist.c. The .h file contains the full interface definition and specifications. When you're done with this part, dllistdriver should work, and should be getting correct results. (That it does so isn't proof that your code is correct, of course. As always, testing can only prove that there are bugs, not that there aren't.)
Part B - Tuesday, 11/8, 2:30 PM
Implement most of steps 1, 2, and 3 from the Solution Architecture section above. The part you'll leave out is building the index. What that means, though, is you'll end up with code that "crawls" over a set of pages, scanning each page for links (and ignoring words for now), and following the links you find until there is nothing left unvisited. Your program should end up visiting about the same number of pages as the sample executable when run on the same platform. (The sample executable gets different answers on different platforms, for reasons I'd guess have to do with some system dependencies in either flex or lexer.c.)
Part C - Sunday, 11/13, Not Handed In
Design and implement the binary search tree package. I recommend filling in btdriver.c to help test as you develop, but how to test is left to you. Note: Your design should keep in mind the eventual use of the implementation. At the very least, you'll use it to build the word-to-page index.
Part D - Wednesday, 11/16, 2:30 PM
Complete the implementation.
Your program should take the same command line arguments as the sample executable, except that you do not need to support the -v switch. Those arguments are:
- First, a path to the root data directory, the directory that is the root of the subtree holding the data files. The path should be specified in a way that is valid given your current working directory at the time the program is started.
- Then, one or more data files where the scan should start. These names are given relative to the root data directory, that is, they are names that would be valid if the current working directory were the root data directory.
Example: Suppose I'm in some directory, hw5, and the data files are in a directory named data in the parent directory of hw5. Then I might say
$ ./hw5 ../data www.nytimes.com/index.html
This is slightly complicated, and is explained on a separate page.
This program is quite a bit more complicated than the one for HW4. You can use printf() to debug, but I think you'll save a lot of time in the long run by learning at least the most essential aspects of gdb, a debugger. It will let you set breakpoints in your program, and let you examine variable values when they're reached or when the program crashes. We'll talk a bit about gdb in class, but that's no substitute for actually trying it out. An easy way to do that is to go through the short exercises from an earlier offering of this class. They will show you enough about gdb to make a big difference in how hard you'll find it to debug this program (and future ones, well beyond this course). This can be done at any time, but there's no reason not to go through them as the very first thing you do for this assignment. Links to more information about gdb can be found on the CSE 303 Documentation Links page.
Create a gzip'ed tar file with a name like <yourname>-hw5-partA.tar.gz that contains everything we need to build and run your code at each turn-in step (all your source, any scripts you might have written, whatever), but do not tar up the data files. You can check what it is you put in the tar file with
$ tar tzf <yourname>-hw5.tar.gz | less
(In fact, you can create a new directory, expand your tar file into it, and then build and run there, as a way of making sure you have in fact included everything.)
Use the turnin command on attu to submit your tar.gz file. (We'll figure out how to name the various parts shortly...)