Project 4 - File Systems
Out: Monday, November 25, 2002
Due: Wednesday, December 11, 2002
Updates
Assignment Goals
- To understand the problems that file system implementations must solve,
and the range of approaches that might be taken
- To practice design (in this case of file systems)
- To experience working in a "more sophisticated" environment:
complex software, a variety of tools, and teams of designers/programmers
- To gain further experience with concurrent software
Overview
The starting point for this assignment is a simplified file system, cse451fs,
the design of which imposes strict limits on both the number of files that can
be stored and the maximum size of any one file. In particular, no matter
how big a disk you might have, this file system can hold only about 8,000
distinct files, and no file can be larger than 13KB. These restrictions
result from the choice of on-disk data structures used to find files and the
data blocks of a given file, that is, the superblock and inode
representations.
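To see how limits like these can fall out of on-disk data structures, here is a minimal sketch of the arithmetic. The block size, the number of block pointers per inode, and the single-block inode bitmap are all assumptions, chosen only because they reproduce the numbers quoted above; they are not taken from the skeleton code, so check the cse451fs description for the real values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical figures: assumed for illustration, not read from cse451fs. */
enum {
    ASSUMED_BLOCK_SIZE          = 1024, /* bytes per disk block */
    ASSUMED_DIRECT_POINTERS     = 13,   /* block pointers per inode, all direct */
    ASSUMED_INODE_BITMAP_BLOCKS = 1     /* blocks holding the inode bitmap */
};

/* Largest file, in bytes, when an inode holds only direct block pointers. */
uint64_t max_file_bytes(uint32_t block_size, uint32_t direct_pointers)
{
    assert(block_size > 0);
    return (uint64_t)block_size * direct_pointers;
}

/* Largest number of inodes when each inode is tracked by one bitmap bit. */
uint64_t max_inode_count(uint32_t block_size, uint32_t bitmap_blocks)
{
    assert(block_size > 0);
    return (uint64_t)block_size * 8u * bitmap_blocks;
}
```

Under these assumptions, max_file_bytes(1024, 13) is 13312 bytes (13KB) and max_inode_count(1024, 1) is 8192 (about 8,000). Neither result depends on the size of the disk, which is why a bigger disk does not help.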
Here are the major steps involved in this assignment:
- Find a new partner. You may work with your project 3 partner if you really
want to, but this is a good opportunity to get to know new people with
similar professional interests, so finding a new partner is recommended. Feel
free to send mail to the list for this. Alternatively, you can send mail to
Alex to be matched up with somebody at random. If you insist on working alone,
send Alex mail and we may be able to work something out, but you will end
up doing more work that way.
- Make sure that you, individually, understand the mechanical aspects of
the "development environment" you'll be working in. These
include how to build the file system code, how to configure the raw disk
device provided by VMware to host your file system, and how to run and test
your file system. A description of these mechanical aspects is here.
It's perfectly okay, and laudable, to work as a team on this so that each
individual can get through this step with the least total effort.
- Design how you want to represent the file system on disk: what
the superblock looks like, the inodes look like, how you keep track of
free/allocated space, etc. (Those are
the minimal changes needed to relax the size restrictions of the current
implementation, and are probably sufficient for anything you'll want to
do.)
There are many tradeoffs among factors such as maximum file size,
the number of files you can store, the amount of raw disk space (blocks)
spent on management data structures (and so not available to hold file
data), and efficiency (measured, say, by the number of disk IOs needed to
access the first byte of a file, the Nth byte alone, or all bytes).
An ideal (but unachievable) design would:
- Be able to store a single file that was as big as the raw disk capacity, no matter how big the disk.
- Be able to store as many distinct files as there are blocks on the physical device.
- Be able to access the Nth byte of the file in one IO operation.
You may decide to come very close to achieving one of these goals while
compromising significantly on the others. Or, you may decide to compromise somewhat on
all of them, so that no one aspect is terribly bad.
The latter is the
approach taken by ext2, the default Linux file system. That system is
motivated by measurements of
real file systems showing that most files are small (under 8K or so) and
read sequentially. You can make these same assumptions in your design,
or you can make other assumptions entirely. For instance, you might want to design a file system for
the explicit purpose of storing streaming media files (e.g., audio or video
clips). Those files are large, and when used for playback are read
sequentially with the requirement that the times to access successive chunks
must have low variance.
- Alter the skeleton code (/cse451/projects/Proj4FS-V1.0.0.tar.gz)
to implement your file system. There are two major
components to this. One is that the user-level program
mkfs.cse451fs
must be changed to initialize the raw disk device with a valid, empty file
system using your new on-disk data structures. The other is to change
the file system source (fsSource/) itself.
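When weighing the design tradeoffs described in the steps above, it can help to put numbers on a candidate inode layout before writing any file system code. The sketch below assumes an inode that mixes direct, single-indirect, and double-indirect block pointers, loosely in the spirit of ext2. The struct, its field names, and both functions are hypothetical and do not come from the skeleton; the IO count assumes nothing is cached and counts the inode read as one IO.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inode layout used only for back-of-the-envelope estimates. */
struct layout {
    uint32_t block_size; /* bytes per block */
    uint32_t ptr_size;   /* bytes per on-disk block pointer */
    uint32_t n_direct;   /* direct pointers in the inode */
    uint32_t n_single;   /* single-indirect pointers in the inode */
    uint32_t n_double;   /* double-indirect pointers in the inode */
};

/* Number of block pointers that fit in one indirect block. */
static uint64_t ptrs_per_block(const struct layout *l)
{
    assert(l->block_size > 0 && l->ptr_size > 0);
    assert(l->block_size % l->ptr_size == 0);
    return l->block_size / l->ptr_size;
}

/* Maximum file size, in bytes, that one inode can address. */
uint64_t layout_max_file_bytes(const struct layout *l)
{
    uint64_t p = ptrs_per_block(l);
    uint64_t blocks = l->n_direct + l->n_single * p + l->n_double * p * p;
    return blocks * l->block_size;
}

/* Disk reads needed to fetch the byte at `offset`: the inode, any indirect
 * blocks on the path, then the data block. Returns 0 if out of range. */
unsigned layout_reads_for_offset(const struct layout *l, uint64_t offset)
{
    uint64_t p = ptrs_per_block(l);
    uint64_t blk = offset / l->block_size;

    if (blk < l->n_direct)
        return 2;               /* inode + data */
    blk -= l->n_direct;
    if (blk < l->n_single * p)
        return 3;               /* inode + indirect + data */
    blk -= l->n_single * p;
    if (blk < l->n_double * p * p)
        return 4;               /* inode + two indirect levels + data */
    return 0;                   /* beyond what this layout can address */
}
```

For example, with 1024-byte blocks, 4-byte pointers, 10 direct pointers, and one pointer at each indirect level, the largest file is 67,381,248 bytes (about 64MB); byte 0 costs 2 reads, while the first byte past the direct blocks costs 3. Adding more direct pointers keeps small files cheap, and adding indirection levels raises the size limit at the cost of extra IOs for large offsets.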
Details
- As in project 3, you should plan to spend a lot of time with your partner
at the whiteboard. Figure on spending maybe half the time (or more) on design
and the rest on implementation.
- While real file systems are very concerned with performance, in your
implementation you can largely ignore it. That is, do not spend a
great deal of effort to produce a faster implementation. (For one
thing, because we're running on virtual machines on top of Windows, it's
unlikely you'd be able to measure much difference.)
- A description of the skeleton version of the cse451fs file system is
here. (1455 words, necessary reading)
- A description of the ext2 file system and vfs (Virtual File System)
is here. (8835 words, recommended)
- A description of how dynamically loaded modules are handled in Linux is here. (2279 words, not necessary reading but may answer odd questions that arise)
- rcs and cvs are two widely used "source control systems." These systems
offer two primary things. First, they keep track of modifications to individual files, and
let you "go back in time" to get an old version. So, for instance, if a project that was
working stops working, you can compare the most recent version with earlier ones to see what
has changed. (This is useful even if you're a one person project.) The other thing they
do is provide some help with concurrent editing by two or more users of a single file. In rcs,
the "natural" use is to enforce the restriction that only at most one user may have a writable
copy of a file at a time (and so at most one may be editing it, although all
can be viewing or compiling it). In cvs the natural approach
is the opposite: all users may have writable copies and may edit the file, and cvs will
try to merge the individual edits into a coherent whole. (Sometimes it works, sometimes it
doesn't.)
More information on rcs can be found here
or here (as well
as many other sources).
More information on cvs can be found
here or here (and many other places).
Source control systems are very useful, and in the long run well worth the
short while it would take to read enough to try one out.
Schedule and What to Hand In
By December 11, 2002 at 11:59 PM, please turn in a file that includes the
following.
- What objectives is your file system trying to accomplish?
- Describe the design for the file system you implemented.
This might include a discussion of other approaches you considered but
rejected, if any.
- Describe in English (not code) what you had to do to implement your
ideas. (e.g., which files required major changes, and of what sort)
- What concurrency-related issues did you encounter? How did you deal with them?
- How did you test your file system (for functionality)?
- Does your implementation work? If not, what parts work and what parts don't?
How would you fix it?
- What do you like best about your design?
- What would you improve about your design, given another week to
work on it?
I imagine these reports will be in the range of 3-6 pages. A
PDF, text file, HTML file, Word document, or
OpenOffice/StarOffice document is fine. Please include the usernames of both
partners in the filename. For example, rjpower-aquinn-proj4.pdf
would be good.
You will be graded primarily on your write-up. We will be looking at
the clarity of your ideas, viability of your ideas, depth of
your implementation, and completeness of your testing, based on your reports.
However, please do turn in a copy of your code, in case we need to refer to it.
Your code need not be neat or nicely packaged. Just turn in whatever you've
been working with, as is. The emphasis is on the write-up.
Teams and Grading
For this project, you will need a partner. Consider working with somebody you don't know very well. This often leads to better productivity. It always leads to making connections with people with similar professional interests who you might not otherwise get to know.
Collaboration can be a great thing in a project because it lets you tackle a bigger problem, learn more, and learn how to express your technical ideas. Being able to clearly express technical ideas is super-important for all scientists and engineers. Collaboration also solves the usual problem of wanting to discuss your solution with somebody without spoiling somebody else's fun. However, problems can arise. Maybe one person is contributing more than the other. Maybe one person is over-committed and unavailable. Maybe one person overwrites the other's files. Maybe the duo cannot decide on an acceptable solution. Usually these problems are few and far between, but if you have any such trouble, please get in touch with one of the TAs as soon as possible. All we can really do is try to facilitate communication, but usually that's all it takes.
As far as grading is concerned, both members of the duo are playing the same song, so generally we will award one score per performance.
Turnin
... is not yet enabled. Check back later.
Last updated 11-24-02, Alex Quinn (aquinn@cs)