Review: Rosenblum & Ousterhout, The Design and Implementation of a Log-Structured File System

From: Ian King (iking_at_killthewabbit.org)
Date: Sun Feb 22 2004 - 23:08:35 PST

  • Next message: Richard Jackson: "Review: Mendel Rosenblum and John K. Ousterhout. The Design and Implementation of a Log-Structured File System."

    The authors describe a design for a file system that is built on a log of data
    written to the disk together with the resulting metadata; unlike file systems
    that use logging for crash recovery, the log is the entirety of disk activity.
    There is discussion of tuning parameters for the file system, followed by two
    sets of performance data: one based on theoretical considerations, one
    empirically derived from use of the file system by real-world users.

    This paper is quite captivating because it takes something as seemingly prosaic
    as filesystem structure and puts a whole new spin on it. (I had to read it more
    than once.) As the authors describe, there are many file systems that use a
    transaction log to support crash recovery, but these researchers saw that the
    entirety of the data written to the disk was in fact in the log - so why not use
    the log as the actual repository? With surprisingly little complexity, they
    designed a strategy to keep this log current with changes to data and the
    supporting metadata, such as directories and data pointers (a/k/a/ inodes -
    borrowed from UNIX parlance). This approach borrows from the strategy used with
    write-once media, which needs to update its metadata if new data is added to the
    media.

    However, unlike write-once media, rotating magnetic media may see its data
    change or be deleted; with write-once media, this is seen as lost space, but
    magnetic media needs to be more accommondating. Therefore, the authors
    developed a strategy in which storage can be "cleansed" of no-longer-used blocks
    and obsolete metadata; further, they placed the data/metadata in "segments" on
    the device so that such cleansing can take place in small regions - with
    introduced latency that can be hidden - rather than across the entire storage
    device, introducing latency seen in conventional file systems of tens of minutes
    or longer.

    In further analysis, the authors noted that some blocks will rarely change,
    while others are quite volatile. They designed two separate "cleansing"
    strategies for each type. In developing their algorithms, the authors discover
    some surprising results, e.g. that locality can sometimes work against the
    intuitive approach. Their crash recovery is somewhat permissive of modest loss,
    but is also amazingly quick; they admit to somewhat naive testing of this
    feature, but the file system seems quite robust to random power failures.
    (Obviously, this approach does not address media failures.)

    The authors develop two sets of performance metrics: the first is based on the
    theoretical considerations that drove their design, and performance is seen to
    be modest, perhaps even discouraging in some cases. However, when evaluating
    the experimental implementation in a real-world environment, with multiple users
    of multiple programs, the results are much more encouraging, suggesting that
    while actual usage patterns are not well-modeled by their (somewhat simplistic)
    testing assumptions, the performance under actual usage vindicates the design.
    The real-world scenario includes a wide variety of usage, including a swap
    partition for a UNIX system. The authors point out that their "tuning"
    parameters are likely not optimal, but nonetheless the file system deports
    itself with distinction, at least matching if not outperforming a mature and
    tuned UNIX file system implementation (the reference benchmark throughout the
    paper).

    The resulting file system is very well suited to the use of rotating magnetic
    media and large in-memory file caches; by favoring writes of large multi-sector
    data chunks with few seek operations, it optimizes the model of conventional
    rotating media. By seeking a "good" solution - potentially allowing some loss
    of recently added data - rather than a "perfect" solution, the solution is
    suitable for use in most common cases addressed by the design. And the
    partitioned approach to data usage patterns does not overly complicate the
    design in either partition, nor seek an "ideal" approach to address both
    problems - thus optimizing the overall design for real-world usage patterns.


  • Next message: Richard Jackson: "Review: Mendel Rosenblum and John K. Ousterhout. The Design and Implementation of a Log-Structured File System."

    This archive was generated by hypermail 2.1.6 : Sun Feb 22 2004 - 23:33:07 PST