From: Greg Green (ggreen_at_cs.washington.edu)
Date: Mon Feb 23 2004 - 14:46:46 PST
This paper describes a file system designed to optimize writes to
disk, as opposed to general read/write performance. The rationale is
that main memories are now so large that most of the data needed can be
read from disk once and cached in memory, so reads rarely touch the
disk. Therefore, if writes are made as fast as possible, the bandwidth
between the disk and CPU is put to its best use.
The term log-structured means that new blocks are written out to the
disk sequentially into a free area, called a segment. The data blocks
are written first, then the inodes, then the directories. This is
different from conventional file systems, which keep the inodes at
fixed positions on the disk; since the inodes now move with every
write, a data structure (the inode map) is needed to map an inode
number to its current block on disk. A critical factor for performance
is that there must be sufficient contiguous free space on the disk, or
no performance gain will result. Files that are changed frequently
cause a lot of fragmentation, because the superseded blocks are
invalidated at scattered spots on the disk.
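To make the inode map idea concrete, here is a rough sketch of the
indirection as I picture it (my own illustration, not code from the
paper; the names and the flat-array layout are my assumptions):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical inode map: indexed by inode number, each entry holds
     * the disk address of the most recently written copy of that inode.
     * In LFS the inode moves every time it is rewritten, so this level
     * of indirection replaces the fixed inode locations of a
     * conventional file system. */
    struct inode_map {
        uint64_t *inode_daddr;   /* disk address of inode i, 0 if unused */
        size_t    n_inodes;
    };

    /* Where does inode `ino` currently live on disk? */
    static uint64_t imap_lookup(const struct inode_map *imap, size_t ino)
    {
        return (ino < imap->n_inodes) ? imap->inode_daddr[ino] : 0;
    }

    /* After a segment write completes, record the inode's new location. */
    static void imap_update(struct inode_map *imap, size_t ino, uint64_t daddr)
    {
        if (ino < imap->n_inodes)
            imap->inode_daddr[ino] = daddr;
    }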
The design criteria and experiments that went into finding the best
garbage collector of segments (the segment cleaner) were described
next. The cleaning policy was found to be a critical factor in the
performance of the system. The best policy weighs the benefit of
cleaning a segment against the cost of cleaning it: segments are
ranked by the ratio of the free space that would be generated,
multiplied by the age of the data, to the cost of reading and
rewriting the segment, which grows with the segment's utilization.
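Written out, that ratio comes down to one small function of the
segment's utilization u (the fraction of its blocks still live) and
the age of its data:

    /* Cost-benefit cleaning policy:
     *   benefit / cost = (free space generated * age of data) / cost
     *                  = ((1 - u) * age) / (1 + u)
     * where u is the segment's utilization.  The denominator is the
     * cost of reading the segment and writing back its live data, so
     * low utilization and old, stable data make a segment the most
     * attractive to clean. */
    static double clean_priority(double u, double age)
    {
        return ((1.0 - u) * age) / (1.0 + u);
    }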
The paper ended with a discussion of checkpointing and crash
recovery. Because of the log structure of the file system, after a
crash the system can roll forward from the last checkpoint on disk,
scanning the log written since then and reconstructing the file
system structures as far as possible.
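As a rough sketch of how I picture the roll-forward loop (the
structure, names, and stubs below are my guesses, not the paper's
code): start at the first segment written after the checkpoint and
replay each completely written segment in log order.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical per-segment summary: a sequence number plus a flag
     * saying whether the segment was completely written before the
     * crash.  A real summary would also list the blocks and inodes. */
    struct seg_summary { uint64_t seq; bool complete; };

    /* Stubs standing in for code that would read the summary block
     * from disk and update the inode map, directories, and free space. */
    static bool read_summary(uint64_t seg, struct seg_summary *out)
    {
        (void)seg; (void)out; return false;     /* stub */
    }
    static void apply_summary(const struct seg_summary *s) { (void)s; }

    /* Replay every complete segment written after the last checkpoint,
     * in log order, and stop at the first hole or torn write.  Whatever
     * lies beyond that point is the data the scheme accepts losing. */
    static void roll_forward(uint64_t first_seg_after_checkpoint)
    {
        struct seg_summary s;
        for (uint64_t seg = first_seg_after_checkpoint;
             read_summary(seg, &s) && s.complete; seg++)
            apply_summary(&s);
    }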
I had heard the term log-structured before, but had no real idea what
it meant. Most new file systems on Linux seem to have a log, but I
understand that to mean just a metadata log, so this appears to be
quite different. One of the fundamental decisions was that losing data
written since the last checkpoint (beyond what roll-forward can
recover) is acceptable. I wonder whether that is really true. I can
conceive of a lot of situations where this would be
disastrous. On the other hand, I don't understand what a traditional
file system would do during a crash, so maybe that is already taken
into account.
This area of file systems seems to me to need a lot of work. The
current ones work okay, but seem to have a lot of issues. The fact
that only about 5% of the disk's write bandwidth is actually used is
pretty bad. What is going to happen when machines have 1TB disks?
That day isn't too far away, it seems.
-- Greg Green