CSE 351: The Hardware/Software Interface, Winter 2011
  CSE Home   About Us   Search   Contact Info 
 
Course Home
  Home
Administation
  Overview
  Course email
  Anonymous feedback
  View feedback
 
Assignment Utilities
  Home Virtual Machines
  Homework Turnin
  Class GoPost Forum
  Gradebook
 
Most Everything
  Schedule
    Homework 6 - Memory Mapped Files
Out: Wednesday February 23
Due: Tuesday March 8, 11:59PM
Turnin: Online
You may work in teams of up to 2.

Overview

This assignment involves nearly all the topics covered in this course so far. Measured in lines of code, it's quite small. But, there are many ways for things to go wrong. Don't delay getting started.
Estimated time to complete: 4-25 hours

In this assignment you implement a form of virtual memory, in particular, memory mapped files. The idea is based on a real facility, mmap. The essential idea is to map a region of virtual memory to a regular file. The file can then be read and written using normal memory operations (e.g., rmmovl), without having to do normal file system read/write operations. Because we're not the operating system, we can only approximate mmap, but the basic ideas are preserved in our version. (You do not need to read about or understand mmap to do this assignment.)

Beyond virtual memory, a second goal of this assignment is to gain some experience using pointers in C.

A final goal is to get some first-hand experience with code, compile, and run time errors, their symptoms, and how to fix them (all in C). After you've done the assignment, you should have a better understanding of the roles of .h files and the linker.

There are four parts to the assignment: fixing an intentionally introduced error; fixing another intentionally introduced error; implementing a recursive procedure that traverses a linked data structure; and implementing mapped files. These build on each other, in the sense that each provides experiences useful in the following steps. All of them involve a single application. I'll next describe the application, and then give specifics on the four parts.

The freedb Application

freedb is a community generated database of CD contents. Basically, it provides information like CD title, artist, and track names (among other things). The application in this assignment uses the freedb information to support three kinds of "queries": print, artists, and artist:
  • print prints the entire contents
  • artists prints the names of all artists
  • artist takes an artist name as an argument, and prints information on all CDs for that artist

To support the application, I've taken some of the freedb data and and organized it as a hierarchical set of binary search trees. Here's a sketch of the organization:

There is a node per artist in the main tree. Each artist node contains left and right pointers, leading to nodes with artists whose names are smaller or greater, respectively, in sorted order. Each artist node also contains a "child" pointer that points to the root of a sorted tree of CDs for that artist. (Those nodes are shown in blue in the diagram.) Finally, each CD node has a child pointer pointing the root of a tree of track titles (shown in red).

The roles of the nodes (artist, CD, or track) are as indicated above. In terms of the C implementation, though, all nodes have the same type, Record (declared in Record.h):

typedef unsigned int RecordAddress;

typedef struct {
  char          string[52];
  RecordAddress leftPtr;
  RecordAddress rightPtr;
  RecordAddress childPtr;
} Record;
That uniformity is intentional, as it allows a single C procedure to operate on any of the artist, CD, or track trees.

The entire tree (shown in the figure) is stored in a file. The file itself represents an address space, with addresses from 0 to the file size (minus one). Pointers in the tree are byte addresses in that address space -- that is, they're byte offsets within the file.

Records are laid out one after another in the file, except that the first record's worth of space (64 bytes, since records are 64 bytes long) does not contain a record. Instead, bytes 0-3 contain an integer giving the address of the root artist record. (Like all addresses in this virtual address space, the address in bytes.) You should assume that there is no simple relationship between a record's position in the tree and its location in the file. The only sensible way to access the records is by traversing the tree, starting at the root.

Running the freedb Application

$ ./hw6file freedb-tiny.dat 

command> artists
'Anibal Troilo'
'Bambini di Praga'
'Behle Brothers'
'Cathy Winter'
'Channel Faergolzia'
'Mambo Chachacha'
'the gourds'

command> artist Cathy Winter
Artist: 'Cathy Winter'

Next Sweet Time
    00 A Fine Romance
    01 Deep Waters
    02 James Bay
    03 Electrician Blues
    04 How Many Lights?
    05 Strong Hearts
    06 New House
    07 Next Sweet Time
    08 House of the Rising Sun
    09 The Wreck of the Henry Clay
    10 Heart to Heart With You
    11 Filter It Out
    12 Short Moments

command> 
Closes:1
Opens:1
Reads:23
Seeks:23
Stats:0
There is also a print command, which prints the entire contents of the file in sorted order. It's not shown here for space reasons.

You exit the application by generating EOF - type ctrl-d. On exit, the implementation prints some statistics on file system operations it performed.

The freedb Application Implementation

The freedb application implementation is layered, as shown here:
The mainline reads and parses command lines, and then invokes an appropriate command routine, each of which handles one kind of command (print, artists, or artist).

The command routines use a RecordStore, a layer that understands the fact that the file contains records, and that the address of the root record is in bytes 0-3.

The RecordStore layer uses a file reader. The file reader thinks of the file as an array of bytes - it doesn't know about the format of the data in it. Because in our application files are read-only (no updates to them are supported), the main interface provided by this layer is a read operation. Each read call specifies a byte address and length.

There will, eventually, be two versions of file reader implementation. The first, supplied in the skeleton code, handles read requests by fetching data using the file system (i.e., with lseek and read operations.) The second, which you will write, implements memory mapping. The file is thought of as a virtual address space (VAS). Your code allocates a chunk of memory to use as a cache of the contents of the file's VAS. It then handles read calls by copying data out of the cache, first bringing in a page of the file if necessary. Details are provided in the Part 4 instructions.

Obtaining the Code

The files are in hw6.tar.gz. To extract the files from this archive, use the command:

tar xzf hw6.tar.gz

and the files will be extracted into subdirectory hw6.

Part 1: Getting Build to Work

Part 1A
In your homework directory, issue the command make. It will fail.

make follows the directives given in the file makefile, which has this contents:

CFLAGS=-g -Wall
BASEOBJS=artistCDs.o listArtists.o main.o printTree.o RecordStore.o
FILEOBJS=$(BASEOBJS) FileFileReader.o

file: $(FILEOBJS)
	gcc $(FILEOBJS) -o hw6file

clean:
	rm -f *.o *~ hw6file
The first three lines define symbols. Those symbols are "expanded" in commands issued later in the file, using the $ operator.

The line beginning file: says that to build "target" file1, make must first build artistCDs.o, listArtists.o, etc. Make knows how to build .o files from .c files, so it looks for .c files with those names to compile. One is missing.

To get make closer to a successful build, you need to supply the missing file. You should create the file and write just enough of the contents to make it compile successfully, but no more. (That is, don't really try to implement what should eventually be the contents of that file, just implement enough to allow compilation.)

Part 1B
At this point, make should be printing commands indicating it is compiling individual .c files. But, when it's done doing that and tries to link (you'll see a list of .o files on the gcc command line it issues), it fails. There is at least one reason why it fails: the readline library is needed to link the final program, but it hasn't been specified on the command in the makefile that does linking:

	gcc $(FILEOBJS) -o hw6file
Figure out how to fix that2, and fix it. (Note: the makefile is sensitive to some whitespace. In particular, the command lines that follow a target must begin with a tab. Be careful not to eliminate the leading whitespace on the command line.)

It's also possible that you'll have linking errors related to the file that was missing in Part 1A. If so, edit that file to provide just enough C code in it to eliminate the linking errors.

When done with this part, make should successfully build executable hw6file. You can run hw6file, but it only somewhat works.

Part 2: Fixing a Bug

The program builds, but it has a bug. Here's one symptom.
$ ./hw6file freedb-tiny.dat 

command> print
Error printing record tree rooted at 8768: Unknown error 4294967295

Part 2A
The first task is to figure out what the bug is. We'll use gdb to help with that.

  • Launch gdb: $ gdb ./hw6file
  • Set a breakpoint on method storeRead (located in file RecordStore.c): (gdb) b storeRead
  • Run the program: (gdb) r freedb-tiny.dat
  • Issue the command: command> print.
    Execution will halt on entry to storeRead.
  • Now print the values of some variables, e.g.: (gdb) p addr
    Look for anything that doesn't seem quite right, and try to figure out why it's not right.
  • You can examine the call stack using the command bt. You can move up the call stack (so that you can look variable values in the caller) with the command up. (Guess how you move back down the stack...)
At the end of this step you should have identified a problem with the code, but not yet tried to fix it.

Part 2B
Having found the problem, you might wonder why the compiler didn't notice the problem for you. You're right, it should have.

Fix the problem that caused the compiler to miss the problem. Re-build the application to verify that the compiler now warns you of the problem.

Part 2C
Now fix the actual problem. Re-build the application, run, and verify that the print command now works.

Part 3: Implementing the artist command

The skeletal application recognizes the artist command, but does not implement it. You should do so. Your code should go in a file not distributed as part of the skeleton.

The artist command takes a single argument, the name of an artist. If it finds an artist with that name in the file, it prints the names of all CDs attributed to that artist, and the names of all tracks on each CD. White space between the first non-whitespace character of the artist name argument and the end of the line is significant.

$ ./hw6file freedb-tiny.dat 

command> artist   the gourds
Artist: 'the gourds'

Best of the Boots 2005 Disc 2
00 Black Tie and Boots Ball Intro
    01 Locomotive Breath / Take Me Back To Tulsa
    02 Boil My Strings
    03 I'm a Commie
    04 Raining In Port Arthur
    05 Mr Betty
    06 Gangsta Lean
    07 I Come Up / All The Labor
    08 If I Lose
    09 Shreveport
    10 Arapaho
    11 Bugs In The Whiskey
    12 Bolshevik Sugarcane
    13 Lucky Moon
    14 Rooty Tooty
    15 Ceiling's Leaking
    16 Omaha
    17 Shake The Chandeliers
    18 Whiskey and Blood
    19 Tex Mex Tattoo Pakistani Package Mile

command> artist  the    gourds
Error finding artist 'the    gourds': 0

Part 4: Memory Mapping

The goal of this part of the assignment is to replace the file system based file reader (FileFileReader.c) with a memory mapped file reader. Your implementation must implement exactly the same interface (defined in FileReader.h). You should place your code in a new file. It's probably easiest to begin by copying FileFileReader.c to a new file (say, MappedFileReader.c).

Your mapped file reader operates by implementing virtual memory. The data file (e.g., freedb-tiny.dat) is the backing store for the virtual memory. A C array of "page frames" serves as the main memory cache for that virtual memory. You create and maintain a page table that indicates, for each page in the virtual address space, whether that page is current cached in main memory and, if so, which frame (i.e., array element) it currently occupies. If the page is not in the memory cache, it's easily located from backing store based on the its address (page number), since the file can be thought of as an array of pages.

To keep this implementation reasonably simple, we let some of the semantics of this particular application leak into the file reader, which, abstractly, should support arbitrary read calls. In particular, we'll assume that all read invocations specify addresses that are 64-byte aligned (i.e., are multiples of 64-bytes) and specify lengths that are no larger than 64 bytes. This means that if your page size is a multiple of 64 bytes (which it should be), no single read request will ever involve data that resides on more than a single page. That assumption (no read spans multiple pages) is the only implication of the above assumptions that you should need to make. Your code should not otherwise depend on 64-byte alignment, for instance, nor on the length specified in the read call.

Details:

  • Your code should work for any page size that is a multiple of 64 bytes. Outside of that restriction, you are free to choose the page size.
  • Your main memory cache should be 128KB in size (128*1024 bytes).
  • Your page table has to be large enough to map the entire address space of whatever file your code is asked to run against (the one specified on the command line launching your application). You can assume that no file is larger than 260111552 bytes, which happens to be the size of freedb-classical.dat, a test file containing the entire contents of the freedb data for CDs whose genre is classical. (Click on that link to fetch the 249MB test file.)
    If you finish the assignment and have extra time, you could consider removing the file size restriction by "new"ing up the page table at run time, after you've determined the size of the file. To do this, you need to use malloc(), which is (very roughly) C's version of new. man malloc (or google) provide details on how to use it.
  • You should use first-in-first-out (FIFO) page replacement. That is, when a read() call is made for data on a page that is not currently in the cache, you should place it in the frame whose contents have been in the cache for the longest time. You can implement this by keeping an integer index that gives the next frame to replace. Each replacement, you increment the index by one, wrapping back to 0 when the index's value reaches the number of frames in the cache.
  • You should instrument your code to keep track of and print out page (cache) hits and misses, as well as file system operations. The sample solution supplied in the distribution, hw6mapped-solution, shows you what to print.
  • Typing the commands to build the application can be tedious. The makefile has been written to make it as easy as possible to create a new target to build your application, by cut-and-paste and a little editing. I'd consider doing that, and using make to build.
    This page looks to be reasonable introduction to make, sufficient for our use of it in this assignment.

Here's a picture of the basic setup, before any read calls have been made.

The figure shows the data file, which is the backing store for the virtual address space. Inside your C program you've created an array to serve as page frames. You've also created a page table (also an array). All page table entries currently indicate that no page is in "memory" (the page frame array).

Suppose now the first read() call asks for data that is in virtual page 3. That page is brought into memory, into the slot for the next frame to be replaced, and the page table entry for it is updated to indicate which frame of memory it occupies. The next page to replace index is updated. Finally, the data asked for by the read() invocation (shown in dark blue) is returned3.

Notes

  • The point of this assignment is to figure out some things, and then to demonstrate that you've figured them out by correctly implementing the various parts of the assignment. Avoid posting questions and answers that allow you or others to avoid figuring things out. Questions about how to do things that are not clearly part of this assignment are okay. (For example, "What does 'x = y?y:-1;' mean?" is ok.)
  • main.c is unfortunately complicated. I expect that you can look at it and extract the few particulars you might want to know about, though. You shouldn't have to understand it in detail, and I recommend that you not spend time trying. (In theory, you should never need to look at it at all.)
    You definitely should not edit main.c, or any other file you haven't been specifically asked to edit as a solution to one of the parts of the assignment.
  • This assignment leaves a lot unsaid. That's intentional, in part as a mechanism to motivate the "figure things out" exercise of the assignment, and in part to make the transition from CSE 142/3-style detailed instructions to what you can expect in the rest of the curriculum (and life), which requires you to do some design.
  • Much of the code you're asked to write has to do things that you can easily imagine must be done by code that was provided to you. Looking at working code that does something similar to what you want is a useful technique when learning any new language, and I recommend you apply that technique in this assignment.

Turnin

We will grade this assignment primarily by reading your code. We may run your code when necessary to clarify things, though.

You should prepare a short (2 page or so) report indicating how you "solved" each part of the assignment. Your discussion of Part 4 (mapped file implementation) should be more detailed than the other parts: give an overview of how your implementation works. If your implementation of any part doesn't work, say when it fails. If you have ideas about why it fails, include those as well.

Give the names of the members of your team at the top of the report.

Tell us how to build your Part 4 solution in an easily spotted portion of the report. Also tell us how to invoke it.

Hand in all of your files, following this procedure:

  • Put your report file in your hw6 directory.
  • In your hw6 directory, say make clean.
  • In your hw6 directory, delete freedb-classical.dat, if you have a copy there (e.g., rm freedb-classical.dat).
  • Now move to the parent of that directory: cd ..
  • Now: tar czf hw6-myname.tar.gz hw6
    where 'myname' is replaced by something identifiably the name of one of the team members.
  • Submit the hw6-xxx.tar.gz file online.

Footnotes

1 Because file is the first target in the makefile, it's the one make tries to build if you just say make without an explicit target argument. make clean would execute the commands that follow the clean: target, deleting files created by the build process.

2 Hint: The easiest fix involves using the -l switch on the gcc command line that links:

  • Type man gcc
  • Now type (without the apostrophes) '/-llibrary' and hit enter. That will search for the string you just typed.
  • Now type 'n' (no apostrophes) to repeat the search, until you get to the section of the man page that explains the use of the the -l switch.

3You can copy an arbitrary number of bytes in C using bcopy(): man bcopy. For example, bcopy(foo, bar, 20) copies 20 bytes starting at the address contained in foo to the address contained in bar.


Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to zahorjan at cs.washington.edu]