From: Praveen Rao (psrao_at_windows.microsoft.com)
Date: Sun Feb 22 2004 - 18:59:39 PST
The LFS paper discusses the design of a file system that makes disk
writes very efficient in the face of a large number of small writes.
The authors begin by describing the motivation behind the work. As
processors get faster and memories get larger, disk access becomes the
bottleneck in the system. Among disk I/O, the effect of reads is
mitigated by large in-memory caches, so writes to disk become the
factor that limits performance. In addition, most file operations are
on small files. Making large-file access efficient has been well
studied and is limited by the bandwidth of the memory and I/O
subsystems. Thus, the focus of the LFS design is making writes of small
files efficient while not worsening performance in other scenarios.
The authors then cite the problems with current file systems. Current
file systems do too many disk I/Os for a single file creation because
file metadata is scattered around the disk. For example, Unix FFS makes
five separate accesses (two for the file's attributes, one for the
file's data, one for the directory's data, and one for the directory's
attributes). Consequently, less than 5% of the disk bandwidth is used
for new data; the rest of the time is spent seeking. Moreover, even
though file data is written asynchronously, file metadata is written
synchronously, and network file systems introduce even more synchronous
behavior. Such synchronous writes make application performance disk
bound.
The core idea in the LFS design is to buffer a sequence of file system
changes in the file cache and then write all the changes to disk
sequentially in a single disk write operation. The information written
to disk includes everything from file data to metadata attributes.
Effectively, for a workload that contains many small files, LFS
converts many small synchronous random writes into large asynchronous
sequential transfers that can utilize nearly 100% of the raw disk
bandwidth. This seems to be a powerful idea.
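To make the buffering concrete, here is a minimal sketch (my own toy
code, not Sprite LFS): small writes accumulate in an in-memory segment
buffer and reach the disk only as large sequential transfers. The
segment size and the file standing in for the raw disk are assumptions
for illustration.

    /* Toy sketch: stage many small writes in memory, flush sequentially. */
    #include <stdio.h>
    #include <string.h>

    #define SEGMENT_SIZE (512 * 1024)     /* assumed segment size */

    static char   seg_buf[SEGMENT_SIZE];  /* in-memory staging buffer */
    static size_t seg_used = 0;

    /* Append one small piece of data (file block or metadata) to the
     * buffer; when it fills, issue a single large sequential write. */
    static void log_append(FILE *disk, const void *data, size_t len)
    {
        if (seg_used + len > SEGMENT_SIZE) {
            fwrite(seg_buf, 1, seg_used, disk);
            seg_used = 0;
        }
        memcpy(seg_buf + seg_used, data, len);
        seg_used += len;
    }

    int main(void)
    {
        FILE *disk = fopen("log.img", "wb");   /* stand-in for the raw disk */
        if (!disk)
            return 1;
        char block[128] = "small file data";
        for (int i = 0; i < 10000; i++)
            log_append(disk, block, sizeof block);  /* many tiny writes...    */
        fwrite(seg_buf, 1, seg_used, disk);         /* ...few big disk writes */
        fclose(disk);
        return 0;
    }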
Two issues arise in such a design:
1) How to retrieve information from the log
2) How to manage free disk space so that large chunks are always
available for writing new data
The first problem is solved relatively simply. In Unix FFS, inodes sit
at fixed locations on disk (given an inode number, its location can be
computed directly), making them cheap to retrieve. In LFS, inodes are
written to the log and therefore move around. To achieve comparable
performance, LFS maintains an inode map that records the current
location of each inode; the map blocks are cached in memory.
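A minimal sketch of that lookup structure, assuming a simple flat array
(the names and limits are mine, not the paper's): the inode map turns
an inode number into the log address of the inode's latest copy, so a
cached map makes locating an inode about as cheap as FFS's fixed inode
locations.

    /* Toy inode map: inode number -> log address of the newest inode copy. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t disk_addr_t;              /* byte offset into the log */
    #define MAX_INODES 4096                    /* assumed limit */

    static disk_addr_t inode_map[MAX_INODES];  /* cached in memory */

    /* Record the new location whenever an inode is appended to the log. */
    static void imap_update(uint32_t ino, disk_addr_t addr)
    {
        inode_map[ino] = addr;
    }

    /* Locate an inode without any disk seek while the map is cached. */
    static disk_addr_t imap_lookup(uint32_t ino)
    {
        return inode_map[ino];
    }

    int main(void)
    {
        imap_update(42, 1048576);              /* inode 42 rewritten at 1 MB */
        printf("inode 42 is at log offset %llu\n",
               (unsigned long long)imap_lookup(42));
        return 0;
    }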
The second problem, free space management, is more involved. As the log
grows, free space on the disk becomes fragmented by file deletions and
overwrites. There are two choices for alleviating the fragmentation:
threading and copying. LFS uses a combination of the two by dividing
the disk into large fixed-size segments. A segment is always written
sequentially, and all live data must be copied out of a segment before
the segment can be rewritten. The segments themselves are threaded
together by the system.
The segment cleaning mechanism is a three-step process (a quick sketch
follows the list):
* read a number of segments into memory
* identify the live data in them
* write the live data back to a smaller number of clean segments
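Here is a rough sketch of those three steps over a toy in-memory model
(the structures and the liveness test are my simplifications, not the
Sprite code): a block counts as live only if the file still points at
that exact copy, mirroring the check LFS performs with segment summary
blocks and the inode map.

    /* Toy cleaner: copy live blocks out of dirty segments into a clean one. */
    #include <stdio.h>

    #define BLKS_PER_SEG 4
    #define NFILES 4
    #define NBLOCKS 4

    struct block   { int ino, blk, valid; };          /* summary info       */
    struct segment { struct block b[BLKS_PER_SEG]; int used; };

    /* Toy "inode map": where the newest copy of (ino, blk) currently lives. */
    static struct block *latest[NFILES][NBLOCKS];

    static int is_live(struct block *b)        /* step 2: identify live data */
    {
        return b->valid && latest[b->ino][b->blk] == b;
    }

    static void clean(struct segment *victims, int n, struct segment *out)
    {
        for (int i = 0; i < n; i++)            /* step 1: read the segments  */
            for (int j = 0; j < victims[i].used; j++) {
                struct block *b = &victims[i].b[j];
                if (!is_live(b))
                    continue;                  /* dead block: just drop it   */
                struct block *dst = &out->b[out->used++];  /* step 3: rewrite */
                *dst = *b;
                latest[b->ino][b->blk] = dst;  /* map now points at the copy */
            }
        /* the victims hold no live data and can be reused as clean segments */
    }

    int main(void)
    {
        struct segment dirty = { .used = 2 }, clean_seg = { .used = 0 };
        dirty.b[0] = (struct block){ .ino = 0, .blk = 0, .valid = 1 };
        dirty.b[1] = (struct block){ .ino = 1, .blk = 0, .valid = 1 };
        latest[0][0] = &dirty.b[0];            /* still referenced: live     */
        latest[1][0] = NULL;                   /* file deleted: block is dead */
        clean(&dirty, 1, &clean_seg);
        printf("live blocks copied: %d\n", clean_seg.used);  /* prints 1 */
        return 0;
    }

A real cleaner would spill into several clean segments and write fresh
summary blocks; the point here is just the read / identify / rewrite
loop.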
For segment cleaning, four policy issues must be addressed:
1) When to run the cleaner
2) How many segments to clean
3) Which segments to clean
4) How to regroup the live blocks
It turns out that system performance is not very sensitive to the first
two factors but is very sensitive to the last two. Experimental data
shows that LFS should use a cost-benefit cleaning policy to select
segments. Regrouping data for better locality, on the other hand,
turned out to hurt performance. This counter-intuitive effect is due to
the fact that regrouped segments hover just above the cleaning
threshold for a long time, don't get cleaned, and so drag performance
down. For cost-benefit segment selection, LFS maintains a segment usage
table whose two fields feed the ratio sketched below:
* the number of live bytes in the segment
* the most recent write time of any byte within the segment
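Those two fields are exactly what the paper's cost-benefit ratio needs:
with utilization u (live bytes divided by segment size) and the age of
the youngest data in the segment, cleaning costs roughly 1 + u (read
the segment, rewrite its live part) and frees 1 - u of a segment's
worth of space. A small sketch, with made-up numbers:

    /* Cost-benefit ratio from the paper: (1 - u) * age / (1 + u). */
    #include <stdio.h>

    static double cost_benefit(double u, double age)
    {
        /* cost = 1 (read the segment) + u (rewrite its live data);
         * benefit = (1 - u) free space, weighted by how long it is
         * likely to stay free (the age of the data). */
        return (1.0 - u) * age / (1.0 + u);
    }

    int main(void)
    {
        /* A cold, 75%-utilized segment can outrank a hot, 50%-utilized one. */
        printf("cold: %.1f\n", cost_benefit(0.75, 1000.0));  /* ~142.9 */
        printf("hot : %.1f\n", cost_benefit(0.50,  100.0));  /* ~33.3  */
        return 0;
    }

The effect is that cold segments are worth cleaning at a higher
utilization than hot ones, since the space they free stays free longer.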
Crash recovery is much simpler and faster in LFS. In a traditional file
system, the whole disk has to be scanned for inconsistencies; in LFS
the recent changes are all near the end of the log. LFS uses
checkpoints and roll-forward for recovery. Checkpointing happens at
regular intervals to mark a consistent state, so during recovery it is
sufficient to roll back to the last checkpoint. Roll-forward improves
on this by scanning the portion of the log written after the checkpoint
for data and metadata changes, recovering a larger amount of recent
work.
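A rough sketch of the two recovery steps over a toy in-memory model
(real LFS reads the checkpoint region from a fixed disk address and
walks segment summary blocks; these structures are simplifications of
mine): restore the inode map saved at the last checkpoint, then roll
forward over records written after it.

    /* Toy recovery: restore state from a checkpoint, then roll forward. */
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_INODES 16

    struct checkpoint {
        uint64_t imap[MAX_INODES];   /* inode map as of the checkpoint */
        uint64_t log_end;            /* log position when it was taken */
    };

    struct log_record { uint32_t ino; uint64_t addr; }; /* "inode rewritten here" */

    static uint64_t imap[MAX_INODES];                   /* map being rebuilt */

    static void recover(const struct checkpoint *cp,
                        const struct log_record *newer, int n)
    {
        for (int i = 0; i < MAX_INODES; i++)  /* 1. roll back to the checkpoint */
            imap[i] = cp->imap[i];
        for (int i = 0; i < n; i++)           /* 2. roll forward over records   */
            imap[newer[i].ino] = newer[i].addr;  /*   written past cp->log_end  */
    }

    int main(void)
    {
        struct checkpoint cp = { .imap = { [3] = 100 }, .log_end = 2048 };
        struct log_record newer[] = { { .ino = 3, .addr = 4096 } };
        recover(&cp, newer, 1);
        printf("inode 3 recovered at log offset %llu\n",
               (unsigned long long)imap[3]);  /* 4096, not the stale 100 */
        return 0;
    }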
Garbage collection in memory kept coming to mind while I was reading
about segment cleaning. The authors draw the same comparison, likening
segment cleaning in LFS to generational garbage collectors, though they
point out that in-memory collectors have the luxury of cheap random
access.