Jim Shearer's review of Log-Structured File System (Sprite)

From: shearerje_at_comcast.net
Date: Sat Feb 21 2004 - 17:58:50 PST

  • Next message: Reid Wilkes: "Logging File System Paper Review"

    Historical file systems such as Unix store file blocks in fixed locations on disk, so an update to a block always writes to the same disk location (note that a large file may still be fragmented and chained all over the platter). As a result, a disk access consists mostly of seeking to the desired block and its related data structures. Unix, for example, spends only 5-10% of a write actually writing data; the rest of the time is spent seeking.

    Sprite mitigates this problem by writing all changes to a running “log,” much the way a write-once CD just keeps adding onto the end. The only fixed-location artifacts are the superblock defining the media and a checkpoint region that contains a pointer to the end of the log. When a file block is “rewritten,” it is actually just written out at the end of the log along with a new copy of the file’s inode and an updated inode map. The most current inode map is near the end of the log and indicates which locations on disk hold the most current file data. Reading a file requires looking up its inode through the inode map and then following its block pointers, but this is not significantly more work than following the inode information in a conventional file system.
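
    To make the bookkeeping concrete, here is a minimal Python sketch of the
    append-only write path and the inode-map lookup. This is my own
    illustration, not Sprite code; the names (LogFS, imap) and the
    one-block-per-file simplification are assumptions.

        class LogFS:
            """Toy log-structured store: every write appends to the log tail."""
            def __init__(self):
                self.log = []    # the on-disk log, one block per entry
                self.imap = {}   # inode number -> log address of newest inode

            def write(self, ino, data):
                # New data always goes at the end of the log...
                self.log.append(("data", ino, data))
                data_addr = len(self.log) - 1
                # ...followed by a fresh inode that points at it.
                self.log.append(("inode", ino, data_addr))
                self.imap[ino] = len(self.log) - 1  # map now names the new copy

            def read(self, ino):
                # Reads go through the inode map; the newest copy always wins.
                _, _, data_addr = self.log[self.imap[ino]]
                return self.log[data_addr][2]

    Rewriting a file just appends a new data block and inode (fs.write(1, "v2")
    after fs.write(1, "v1")); the old copies remain in the log as garbage for
    the cleaner described next.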

    Two major challenges in Sprite were recovering free space and ensuring data integrity after a system crash.

    When a data item is rewritten, it lands in a new location (the end of the log), and its prior copy becomes garbage. Sprite runs a background cleaning mechanism that searches for segments containing no remaining “live” data and threads them back into the usable segment list. The cleaner also identifies mostly-garbage segments and copies their remaining live data to the end of the log as part of the usual write cycle; this empties the segment so it can be reused and packs the live data together. A segment summary block identifies each segment’s contents to facilitate live-data identification during cleaning.
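
    Continuing the toy LogFS above, a hedged sketch of how the summary block
    drives cleaning: clean_segment and the (address, block) summary format are
    my inventions, but the liveness test (is this still the copy the inode map
    points at?) matches the mechanism the paper describes.

        def clean_segment(fs, summary):
            """Re-append live blocks from a segment; the segment is then free.

            summary: list of (addr, ("data", ino, payload)) entries naming
            what the segment holds, as recorded in its summary block."""
            for addr, (kind, ino, payload) in summary:
                if kind != "data":
                    continue
                # Live iff the file's current inode still points at this address.
                inode_addr = fs.imap.get(ino)
                if inode_addr is not None and fs.log[inode_addr][2] == addr:
                    fs.write(ino, payload)  # re-append; inode and imap follow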

    The paper introduces a “write cost” metric for comparing cleaning policies and uses it to show that the best policy depends on the volatility of the data. The cleaner favors re-packing segments of slowly changing files because they won’t need to be re-packed again for a long time; highly dynamic files, if left alone for a little while, will probably die on their own and empty out the segment. The cleaner differentiates these cases by looking at how long it has been since the data last changed.
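
    The two formulas behind this, as I read the paper: cleaning a segment that
    is a fraction u live costs one segment read plus u rewritten for (1 - u) of
    reclaimed space, and the cleaner ranks candidate segments by a cost-benefit
    ratio weighing free space against the age of the data. A small sketch (the
    function names are mine):

        def write_cost(u):
            # Bytes moved per byte of new data: read the whole segment (1),
            # rewrite the live part (u), write new data into the rest (1 - u),
            # all to gain (1 - u) of space: (1 + u + (1 - u)) / (1 - u).
            return 2.0 / (1.0 - u) if u < 1.0 else float("inf")

        def cost_benefit(u, age):
            # Benefit/cost: space freed, weighted by how long it should stay
            # free (age of the data), over the read-plus-rewrite cost.
            return (1.0 - u) * age / (1.0 + u)

    Cold segments are worth cleaning even at fairly high utilization because
    their large age multiplier predicts the reclaimed space will stay free.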

    Crash recovery is supported in two steps. First, periodic checkpoints are added to the log to tag the end of a consistent set of transactions. Second, data is written out in an order that supports “roll-forward” recovery of much of the data written since the last checkpoint. The checkpoint data itself is written to alternating halves of a ping-pong buffer (the fixed-location region mentioned above), so a crash during a checkpoint write can itself be recovered from by rolling forward from the other checkpoint. Roll-forward is supported by always writing out the data itself first, then a “directory operation log,” then the inode, then the directory entry. As long as the write succeeded through the inode portion, the log up through that piece of data is recoverable.
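
    A miniature of the ping-pong checkpoint and roll-forward, again in toy
    form; the two-region layout follows the paper, but CKPT, write_checkpoint,
    and recover are illustrative names and the structures are simplified.

        CKPT = [None, None]   # the two fixed-location checkpoint regions
        which = 0

        def write_checkpoint(log_tail, imap_snapshot, timestamp):
            global which
            # Alternate regions so a crash mid-write leaves the other intact.
            CKPT[which] = (timestamp, log_tail, dict(imap_snapshot))
            which = 1 - which

        def recover(log):
            # Start from the newest complete checkpoint (assume one exists)...
            ts, tail, imap = max((c for c in CKPT if c), key=lambda c: c[0])
            # ...then roll forward: each inode found past the checkpoint was
            # written only after its data, so adopting it is always safe.
            for addr, (kind, ino, _ptr) in enumerate(log[tail:], start=tail):
                if kind == "inode":
                    imap[ino] = addr
            return imap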

    Testing shows that significant performance gains over conventional file systems are achievable on today’s fast-CPU, large-RAM machines. I expect that in the near future “conventional” file systems will become as rare as “conventional” sequence-based programming languages or “conventional” (third wheel in the rear) aircraft landing gear.

