From: Ian King (iking_at_killthewabbit.org)
Date: Sun Feb 22 2004 - 23:08:35 PST
The authors describe a design for a file system that is built on a log of data
written to the disk together with the resulting metadata; unlike file systems
that use logging for crash recovery, the log is the entirety of disk activity.
There is discussion of tuning parameters for the file system, followed by two
sets of performance data: one based on theoretical considerations, one
empirically derived from use of the file system by real-world users.
This paper is quite captivating because it takes something as seemingly prosaic
as filesystem structure and puts a whole new spin on it. (I had to read it more
than once.) As the authors describe, there are many file systems that use a
transaction log to support crash recovery, but these researchers saw that the
entirety of the data written to the disk was in fact in the log - so why not use
the log as the actual repository? With surprisingly little complexity, they
designed a strategy to keep this log current with changes to data and the
supporting metadata, such as directories and data pointers (a/k/a/ inodes -
borrowed from UNIX parlance). This approach borrows from the strategy used with
write-once media, which needs to update its metadata if new data is added to the
media.
However, unlike write-once media, rotating magnetic media may see its data
change or be deleted; with write-once media, this is seen as lost space, but
magnetic media needs to be more accommondating. Therefore, the authors
developed a strategy in which storage can be "cleansed" of no-longer-used blocks
and obsolete metadata; further, they placed the data/metadata in "segments" on
the device so that such cleansing can take place in small regions - with
introduced latency that can be hidden - rather than across the entire storage
device, introducing latency seen in conventional file systems of tens of minutes
or longer.
In further analysis, the authors noted that some blocks will rarely change,
while others are quite volatile. They designed two separate "cleansing"
strategies for each type. In developing their algorithms, the authors discover
some surprising results, e.g. that locality can sometimes work against the
intuitive approach. Their crash recovery is somewhat permissive of modest loss,
but is also amazingly quick; they admit to somewhat naive testing of this
feature, but the file system seems quite robust to random power failures.
(Obviously, this approach does not address media failures.)
The authors develop two sets of performance metrics: the first is based on the
theoretical considerations that drove their design, and performance is seen to
be modest, perhaps even discouraging in some cases. However, when evaluating
the experimental implementation in a real-world environment, with multiple users
of multiple programs, the results are much more encouraging, suggesting that
while actual usage patterns are not well-modeled by their (somewhat simplistic)
testing assumptions, the performance under actual usage vindicates the design.
The real-world scenario includes a wide variety of usage, including a swap
partition for a UNIX system. The authors point out that their "tuning"
parameters are likely not optimal, but nonetheless the file system deports
itself with distinction, at least matching if not outperforming a mature and
tuned UNIX file system implementation (the reference benchmark throughout the
paper).
The resulting file system is very well suited to the use of rotating magnetic
media and large in-memory file caches; by favoring writes of large multi-sector
data chunks with few seek operations, it optimizes the model of conventional
rotating media. By seeking a "good" solution - potentially allowing some loss
of recently added data - rather than a "perfect" solution, the solution is
suitable for use in most common cases addressed by the design. And the
partitioned approach to data usage patterns does not overly complicate the
design in either partition, nor seek an "ideal" approach to address both
problems - thus optimizing the overall design for real-world usage patterns.
This archive was generated by hypermail 2.1.6 : Sun Feb 22 2004 - 23:33:07 PST