Project 4 - File Systems 

Out: Monday, February 25, 2002
Due: Friday, March 15, 2002

Assignment Goals

Overview

The starting point for this assignment is a simplified file system, cse451fs, the design of which imposes strict limits on both the number of files that can be stored and the maximum size of any one file.  In particular, no matter how big a disk you might have, this file system can hold only about 8,000 distinct files, and no file can be larger than 13KB.  These restrictions result from the choice of on-disk data structures used to find files and the data blocks of a given file, that is, the superblock and inode representations. 
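
To see how limits like these can fall out of the on-disk data structures, here is a minimal sketch in Python.  The specific numbers (1 KB blocks, 13 direct block pointers per inode, a one-block inode bitmap) are assumptions chosen because they reproduce the limits quoted above; they are not taken from the cse451fs source, so check the skeleton code for the real layout.

```python
# Hypothetical parameters: assumptions for illustration only,
# not values read from the cse451fs source.
BLOCK_SIZE = 1024          # bytes per disk block (assumed)
DIRECT_POINTERS = 13       # data-block pointers stored in each inode (assumed)
INODE_BITMAP_BLOCKS = 1    # blocks reserved for the inode allocation bitmap (assumed)


def max_file_size(block_size: int, direct_pointers: int) -> int:
    """Largest file, in bytes, when an inode holds only direct block pointers."""
    return block_size * direct_pointers


def max_file_count(block_size: int, bitmap_blocks: int) -> int:
    """Most inodes a fixed-size bitmap can track, at one bit per inode."""
    return block_size * 8 * bitmap_blocks


if __name__ == "__main__":
    print(max_file_size(BLOCK_SIZE, DIRECT_POINTERS))        # 13312 bytes, i.e. 13 KB
    print(max_file_count(BLOCK_SIZE, INODE_BITMAP_BLOCKS))   # 8192 inodes
```

Under these assumptions neither limit depends on the size of the disk, which is the property the assignment asks you to relax.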

Here are the major steps involved in this assignment:

  1. Get in touch with your team.

  2. Make sure that you, individually, understand the mechanical aspects of the "development environment" you'll be working in.  These include how to build the file system code, how to configure the raw disk device provided by VMware to host your file system, and how to run and test your file system.  A description of these mechanical aspects is here.  It's perfectly okay, and laudable, to work as a team on this so that each individual can get through this step with the least total effort. 

  3. As a team, design how you want to represent the file system on disk: what the superblock looks like, what the inodes look like, how you keep track of free/allocated space, etc.  (Those are the minimal changes needed to relax the size restrictions of the current implementation, and are probably sufficient for anything you'll want to do.) 

    There are many tradeoffs among factors such as maximum file size, the number of files you can store, the amount of raw disk space (blocks) spent on management data structures (and so not available to hold file data), and efficiency (measured, say, by the number of disk IOs needed to access the first byte of a file, or just the Nth byte, or all bytes).  An ideal (but unachievable) design would:

    • Be able to store a single file that was as big as the raw disk capacity, no matter how big the disk.
    • Be able to store as many distinct files as there are blocks on the physical device.
    • Be able to access the Nth byte of the file in one IO operation.

    You may decide to come very close to achieving one of these goals while compromising significantly on the others.  Or, you may decide to compromise somewhat on all of them, so that no one aspect is terribly bad. 

    The latter is the approach taken by ext2, the default Linux file system.  That system is motivated by measurements of real file systems showing that most files are small (under 8K or so) and read sequentially.  You can make these same assumptions in your design, or you can make other assumptions entirely.  For instance, you might want to design a file system for the explicit purpose of storing streaming media files (e.g., audio or video clips).  Those files are large, and when used for playback are read sequentially with the requirement that the times to access successive chunks must have low variance.
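
One way to compare candidate designs on the tradeoffs in step 3 is to compute, for a proposed inode layout, the maximum file size and the number of block reads needed to reach the Nth byte.  The sketch below does this for a hypothetical inode with some direct pointers, one single-indirect pointer, and one double-indirect pointer.  Every parameter (block size, pointer size, pointer counts) and every function name is an assumption made for illustration, not a description of cse451fs or of ext2's actual layout.

```python
def pointers_per_block(block_size: int = 1024, pointer_size: int = 4) -> int:
    """How many block pointers fit in one indirect block."""
    return block_size // pointer_size


def max_file_bytes(direct: int = 10, block_size: int = 1024, pointer_size: int = 4) -> int:
    """Largest file addressable with direct + single-indirect + double-indirect pointers."""
    ppb = pointers_per_block(block_size, pointer_size)
    return (direct + ppb + ppb * ppb) * block_size


def reads_to_reach(offset: int, direct: int = 10, block_size: int = 1024,
                   pointer_size: int = 4) -> int:
    """Block reads needed to fetch the data block holding byte `offset`.

    Assumes the inode itself is already in memory and nothing else is cached.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    ppb = pointers_per_block(block_size, pointer_size)
    index = offset // block_size          # logical block number within the file
    if index < direct:
        return 1                          # the data block only
    index -= direct
    if index < ppb:
        return 2                          # indirect block, then data block
    index -= ppb
    if index < ppb * ppb:
        return 3                          # two levels of indirection, then data block
    raise ValueError("offset is beyond the maximum file size for this layout")


if __name__ == "__main__":
    print(max_file_bytes())               # 67381248 bytes with the assumed parameters
    for off in (0, 10 * 1024, (10 + 256) * 1024):
        print(off, reads_to_reach(off))
```

Changing the defaults (for example, more direct pointers or larger blocks) shows how small files can stay at one read per block while large files pay for extra levels of indirection.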

  4. As a team, alter the skeleton code (spinlock:/cse451/Proj4FS-Vxxx.tar, where "xxx" is likely to change periodically, at least for the next day or so) to implement your file system.  There are two major components to this.  One is that the user-level program mkfs.cse451fs must be changed to initialize the raw disk device with a valid, empty file system using your new on-disk data structures.  The other is to change the file system source (fsSource/) itself.

  5. Again as a team, do a paper design of a "persistent file system."  By that term I mean a file system in which all old versions of files remain accessible.  As an example, suppose you edit file foo.txt and save it.  The next program to open foo.txt will see (by default) the results of your edits.  But, your file system also provides some way to still retrieve the contents of the file before the edits (and in fact every version of the file you've ever saved).  That could be done by creating a new open system call, or by extending the file system namespace - for example, foo.txt:0 might be the first version ever written, foo.txt:1 the next, etc.

    To do this design, one definitional problem you have to resolve is whether it is file names that are persistent or just data.  For example, suppose I rename foo.txt as newfoo.txt. Does newfoo.txt have all the old versions of foo.txt, or do those stay behind with the name foo.txt? What does it mean in your system to delete a file? What does it mean to move a directory?

    Another definitional problem is what constitutes a "new version" (and so what has to be recoverable).  The usual answer to this is that a version becomes "final" when the last process currently having it open closes it.  (So, if process A opens the file and starts writing, then B opens and starts writing, and then C opens and starts writing, you don't have to remember every write that took place, but only the final contents after all three have closed the file.)

    Your design should answer the question "what are we trying to achieve?" (e.g., the definitional problems above should be answered), plus indicate how file versions will be stored, and (briefly) indicate what mechanisms would have to be implemented to extend the file system you actually built to include persistence (e.g., can/would you use copy-on-write? how?).
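
To make the "final on last close" rule from step 5 concrete, here is a small in-memory model in Python.  It is only a sketch of one possible semantics, not a proposed implementation: the class VersionedFile, its methods, and the helper split_versioned_name are all hypothetical names, whole-file copies stand in for whatever storage scheme (e.g., copy-on-write) you design, and renames, deletes, and directories are deliberately left out.

```python
class VersionedFile:
    """Toy model: a new version becomes final when the last opener closes."""

    def __init__(self):
        self.versions = []       # finalized contents; list index is the version number
        self.working = None      # contents shared by current openers (None if closed)
        self.open_count = 0
        self.dirty = False       # whether any write happened since the first open

    def open(self):
        if self.open_count == 0:
            # Start the working copy from the latest final version (whole-file copy).
            latest = self.versions[-1] if self.versions else b""
            self.working = bytearray(latest)
            self.dirty = False
        self.open_count += 1

    def write(self, offset, data):
        if self.open_count == 0:
            raise ValueError("file is not open")
        end = offset + len(data)
        if len(self.working) < end:
            # Zero-fill any gap between the old end of file and the new data.
            self.working.extend(b"\0" * (end - len(self.working)))
        self.working[offset:end] = data
        self.dirty = True

    def close(self):
        if self.open_count == 0:
            raise ValueError("file is not open")
        self.open_count -= 1
        if self.open_count == 0:
            if self.dirty:
                self.versions.append(bytes(self.working))   # version is now final
            self.working = None

    def read_version(self, number=-1):
        """Return a finalized version; the default is the most recent one."""
        return self.versions[number]


def split_versioned_name(name):
    """Split 'foo.txt:1' into ('foo.txt', 1); names with no version give (name, None)."""
    base, sep, suffix = name.rpartition(":")
    if sep and suffix.isdecimal():
        return base, int(suffix)
    return name, None
```

For example, if A and B both have the file open and both write, only one new version is recorded, at the moment the second of them closes; the intermediate writes are not individually recoverable, which matches the definition given above.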

Details

Schedule and What to Hand In

I'd like to meet with each team at the end of next week for about 15 minutes.  Information on how to sign up for a meeting time will be sent shortly.  Think of the meeting as "Part 1 Turn-in," but without the hassle of having to write a report.  You should have (easily) gotten through your initial design by then, and have at least started implementation.

In class on March 15, hand in a single report from your team that includes:

  1. a description of what your file system is trying to accomplish
  2. a description of the design for the file system you implemented.  This might include a discussion of other approaches you considered but rejected, if any.
  3. a description (not code) of what you had to do to implement it (e.g., which files required major changes, and of what sort)
  4. a description of how you tested your file system (for functionality), and whether or not it works
  5. a description of the goals and design of your hypothetical persistent file system.  (Again, you might also describe other ideas you discussed but rejected, and say why.)

I imagine these reports will be in the range of 4-10 pages.

While we are not asking for code, please make sure to keep a copy as we may ask for it after reading your report.

Teams and Grading

Teamwork is a double-edged sword in classes.  On the one hand, I'm hoping that working in teams will let you learn more with less effort, and with more fun.  On the other hand, there are many problems that could in theory arise, not the least of which are disagreements over who is contributing and who isn't.  These sorts of problems are not too frequent, in my experience, but if you find yourself in one, or think you might be edging into one, please get in touch with me as early as possible.  I'll say in advance that the only way I can handle these problems is by speaking with everyone involved, to get all sides of the picture.  Often the problems are mainly communication issues, and can be resolved.  That, at least, is the outcome we'll hope for.

For grading, all team members are rowing the same boat, and the default is one grade per boat.