More FS Design
Log-Structured File System (LFS)
- optimizes for write performance on disk by amortizing seek time
- assumes most read requests can be served by the buffer cache
- didn't we already have a file system designed for good performance on disk?
- FFS: designed for disk, block group placement heuristics
- inode bitmap, inode, block bitmap, and data block are all located in different blocks within a block group
- accessing non-contiguous blocks within a block group => seek time is reduced but still present
- short seeks & rotational delays => falls short of utilizing peak transfer bandwidth
- LFS goal: optimizes write performance by mostly performing large sequential writes
- the larger the write size, the more we can amortize seek time and rotation delay
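A quick back-of-the-envelope on the amortization claim (the numbers are illustrative, not from the notes):

```latex
% effective bandwidth when writing D bytes after one positioning delay
R_{\mathrm{eff}} = \frac{D}{T_{\mathrm{pos}} + D / R_{\mathrm{peak}}}
% to reach a fraction F of peak bandwidth, solve for D:
D = \frac{F}{1 - F} \cdot R_{\mathrm{peak}} \cdot T_{\mathrm{pos}}
% e.g. R_peak = 100 MB/s, T_pos = 10 ms, F = 0.9
%   => D = 9 * 100 MB/s * 0.01 s = 9 MB per write to reach 90% of peak
```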
- but how can we almost only do sequential writes?
- when new blocks are needed, we can allocate a contiguous chunk if available
- maybe we can place the inode right next to its first data block, instead of in a separate inode table
- but will these always result in sequential writes? what happens when we extend a file?
- to achieve sequential writes, LFS appends all updates sequentially to a log
- normally a log is used to journal updates, which are later applied to their actual locations; in LFS the log is the only and final copy
- no fixed location for metadata or data in LFS; an inode's location is wherever its latest version sits in the log
- this is effectively copy on write!
- accumulates enough updates in memory before writing to disk to ensure a large write size
- each write is then a sequential write to a large number of blocks => a segment
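A minimal sketch of the segment-buffering idea, with made-up sizes and helper names (not LFS's real code): updates accumulate in an in-memory segment and hit the disk as one large sequential write.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define SEG_BLOCKS 256                 /* 1 MB segment (illustrative) */

struct segment {
    uint8_t data[SEG_BLOCKS][BLOCK_SIZE];
    int     used;                      /* blocks buffered so far */
};

static struct segment cur_seg;
static uint64_t       log_tail;        /* disk block where cur_seg will land */

/* stand-in for one big sequential disk I/O */
static void disk_write(uint64_t addr, const void *buf, int nblocks) {
    (void)buf;
    printf("write %d blocks sequentially at block %llu\n",
           nblocks, (unsigned long long)addr);
}

/* append one block of data/metadata to the log; returns its new disk address */
uint64_t lfs_append(const void *block) {
    if (cur_seg.used == SEG_BLOCKS) {          /* segment full: flush it */
        disk_write(log_tail, cur_seg.data, SEG_BLOCKS);
        log_tail += SEG_BLOCKS;
        cur_seg.used = 0;
    }
    memcpy(cur_seg.data[cur_seg.used], block, BLOCK_SIZE);
    return log_tail + cur_seg.used++;          /* blocks are never overwritten in place */
}
```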
- supporting copy on write updates
- ok... but how do we find inodes if they keep changing locations and are scattered across the log?
- worse yet, when an inode moves, the fs also needs to update the dir entry pointing to the inode
- when the dir entry is updated, the dir's own inode needs to be updated too... updates all the way up!
- recursive update problem!
- can be solved by introducing a level of indirection
- recall that SSD also avoids in-place update and has a similar problem
- uses logical block address as a level of indirection to hide the actual location of a page
- LFS uses inode map (inum => block #) as the indirection, pointer to inode uses inum
- when an inode changes, inode map's mapping changes as well (part of the update)
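A tiny sketch of the imap indirection, simplified to a flat in-memory array (the real imap is chunked, and its pieces are themselves written to the log):

```c
#include <stdint.h>

#define MAX_INODES 65536

/* imap[inum] = disk address of the latest version of that inode */
static uint64_t imap[MAX_INODES];

/* directory entries store only the inum, so when an inode moves in the log
 * the only thing that must change is this mapping -- no recursive updates */
void imap_update(uint32_t inum, uint64_t new_inode_addr) {
    imap[inum] = new_inode_addr;
}

uint64_t imap_lookup(uint32_t inum) {
    return imap[inum];
}
```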
- but wait, how do we find the inode map?
- there is a structure that lives at a fixed location on disk after all: the checkpoint region (CR)
- CR tracks the inode map loc, segment usage table loc, and the last checkpoint segment in the log (rough struct sketch below)
- CR is updated periodically (30s), checkpoints the state of the file system at that time
- the entire inode map is cached in memory; the cached map has the latest mappings
- is it possible for the cached inode map to be different from the one tracked by the CR?
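A rough guess at what the CR holds; field names and the fixed array size are assumptions, not the on-disk format:

```c
#include <stdint.h>

struct checkpoint_region {
    uint64_t timestamp;              /* when this checkpoint was taken */
    uint64_t imap_piece_addr[32];    /* where the imap pieces live in the log */
    uint64_t seg_usage_table_addr;   /* segment usage table location */
    uint64_t last_checkpoint_seg;    /* last segment covered by this checkpoint */
};
```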
- given what we know so far, how does LFS support a read given an inode number?
- checks cached inode map to find latest location of the inode, reads the inode
- finds data block location using the inode's data layout info
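Putting the read path in code, a sketch assuming the imap_lookup helper above and a stub disk_read (only direct pointers shown):

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NDIRECT    12

struct inode {
    uint64_t size;
    uint64_t direct[NDIRECT];               /* data block addresses */
};

uint64_t imap_lookup(uint32_t inum);        /* from the cached imap sketch above */
static void disk_read(uint64_t addr, void *buf) { (void)addr; (void)buf; }

/* read the block containing byte `off` of file `inum` into buf */
void lfs_read(uint32_t inum, uint64_t off, void *buf) {
    uint8_t block[BLOCK_SIZE];
    struct inode ino;

    disk_read(imap_lookup(inum), block);            /* 1. latest inode location */
    memcpy(&ino, block, sizeof ino);
    disk_read(ino.direct[off / BLOCK_SIZE], buf);   /* 2. follow the data layout */
}
```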
- given what we know so far, how does LFS support a write given an inode number?
- writes the new data block, the new inode, and a new piece of the imap (tracking the inode's new location) to the log
- what about the checkpoint region? not updated on every segment write, updated per checkpoint interval
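And the corresponding write path as a sketch, reusing the hypothetical helpers from the earlier sketches: everything becomes a fresh copy appended to the log, and only the in-memory imap entry changes in place (the affected imap piece is itself appended at segment flush).

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NDIRECT    12

struct inode { uint64_t size; uint64_t direct[NDIRECT]; };

uint64_t lfs_append(const void *block);            /* segment buffering sketch */
uint64_t imap_lookup(uint32_t inum);               /* imap sketch */
void     imap_update(uint32_t inum, uint64_t addr);
void     disk_read(uint64_t addr, void *buf);      /* assumed block read */

/* overwrite one block of file `inum` at byte offset `off` */
void lfs_write(uint32_t inum, uint64_t off, const void *buf) {
    uint8_t block[BLOCK_SIZE];
    struct inode ino;

    disk_read(imap_lookup(inum), block);           /* fetch latest inode */
    memcpy(&ino, block, sizeof ino);

    ino.direct[off / BLOCK_SIZE] = lfs_append(buf);    /* 1. new data block  */

    memset(block, 0, sizeof block);
    memcpy(block, &ino, sizeof ino);
    imap_update(inum, lfs_append(block));              /* 2. new inode copy,
                                                          3. new imap mapping */
}
```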
- segment usage management
- LFS doesn't have a data bitmap, so how does it track which segments on disk are free?
- the log is circular and written sequentially; as long as the tail hasn't reached the head, the segments in between are free
- but what happens when log is full? garbage collection (GC)!
- since all updates are written to the log, log contains different versions of metadata & data
- old versions (from before the last checkpoint) are garbage and can be reclaimed to give the log more free segments
- within a single segment, some blocks might be live and others might be garbage
- need to identify live blocks within each segment, then compact live blocks from multiple segments into a new one
- this would free up the old segments for the log to write in the future
- GC needs space to perform compaction, so actually, don't wait till the log is full to GC
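A sketch of what the cleaner loop could look like, assuming a made-up on-disk segment layout; block_is_live is sketched below, after the liveness discussion:

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 4096
#define SEG_BLOCKS 256

struct seg_summary_entry { uint32_t inum; uint32_t offset; };

struct segment_on_disk {
    struct seg_summary_entry summary[SEG_BLOCKS];   /* one entry per block */
    uint8_t blocks[SEG_BLOCKS][BLOCK_SIZE];
};

bool     block_is_live(uint64_t addr, const struct seg_summary_entry *e);
uint64_t lfs_append(const void *block);             /* segment buffering sketch */
void     mark_segment_free(uint64_t seg_start);     /* assumed helper */

void clean_segment(const struct segment_on_disk *seg, uint64_t seg_start) {
    for (int i = 0; i < SEG_BLOCKS; i++) {
        if (block_is_live(seg_start + i, &seg->summary[i]))
            lfs_append(seg->blocks[i]);   /* live block moves to the log tail;
                                             its inode and imap entry get new
                                             copies pointing at the new address */
    }
    mark_segment_free(seg_start);         /* whole old segment is now reusable */
}
```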
- how to identify live blocks within a segment?
- blocks might be: inode block, inode map block, data block
- inode & inode map block: check whether it's referenced by inode map in CR
- data block: live if referenced by its inode's data layout
- needs a back pointer from data block to inode and file offset
- segment summary: identifies the contents of a segment, tracks the inum & offset for each block
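The liveness check itself, as a sketch using a segment summary entry and the helpers assumed earlier: a block is live iff the inode it claims to belong to still points at this exact address.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NDIRECT    12

struct inode { uint64_t size; uint64_t direct[NDIRECT]; };
struct seg_summary_entry { uint32_t inum; uint32_t offset; };

uint64_t imap_lookup(uint32_t inum);
void     disk_read(uint64_t addr, void *buf);

bool block_is_live(uint64_t block_addr, const struct seg_summary_entry *e) {
    uint8_t block[BLOCK_SIZE];
    struct inode ino;

    disk_read(imap_lookup(e->inum), block);       /* latest version of the inode */
    memcpy(&ino, block, sizeof ino);
    return ino.direct[e->offset] == block_addr;   /* still referenced => live */
}
```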
- crash recovery
- on start up, use CR to restore a consistent filesys state
- how is CR persisted?
- CR may span over a number of blocks, updating CR as a unit is like writing a txn
- CR begin (timestamp) | CR updates | CR end (timestamp)
- if the end and begin timestamps don't match, it's not a valid CR
- updating CR in place could cause a previously consistent CR to be inconsistent
- so LFS actually keeps 2 CRs at fixed locations
- when writing out a new CR, write to the other location
- if crash while writing a new CR => new CR is invalid, old CR is consistent and used
- if crash after writing a new CR => new CR is used because of newer timestamp
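A small sketch of how mount could pick between the two CR copies, assuming matching begin/end timestamps mean the CR write completed:

```c
#include <stdint.h>
#include <stddef.h>

struct cr_copy {
    uint64_t begin_ts;
    /* ... imap locations, segment usage table location, last checkpoint ... */
    uint64_t end_ts;    /* written last; equals begin_ts iff the write completed */
};

static int cr_valid(const struct cr_copy *cr) {
    return cr->begin_ts == cr->end_ts;
}

/* returns the CR to recover from, or NULL if neither copy is usable */
const struct cr_copy *pick_cr(const struct cr_copy *a, const struct cr_copy *b) {
    int av = cr_valid(a), bv = cr_valid(b);
    if (av && bv) return a->begin_ts > b->begin_ts ? a : b;   /* newer wins */
    if (av) return a;
    if (bv) return b;
    return NULL;
}
```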
- the CR is only updated periodically, so even though it's consistent, the fs state it describes might be stale
- CR tells us last checkpoint segment
- if more segments are logged but not checkpointed, see if they are valid
- if valid, we can roll forward the CR by applying updates (inode map loc) logged by the newer segments
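A sketch of roll-forward; the helper names here are assumptions, not LFS's real interfaces:

```c
#include <stdint.h>
#include <stdbool.h>

bool read_next_segment(uint64_t *seg_addr);      /* advance to next segment; false at end of log */
bool segment_is_valid(uint64_t seg_addr);        /* e.g. segment was written completely */
void apply_imap_updates_from(uint64_t seg_addr); /* re-apply imap updates carried by this segment */

/* scan the log forward from the last checkpointed segment and recover
 * updates that were written after the CR was taken */
void roll_forward(uint64_t last_checkpointed_seg) {
    uint64_t seg = last_checkpointed_seg;
    while (read_next_segment(&seg)) {
        if (!segment_is_valid(seg))
            break;                         /* torn write: stop, discard the rest */
        apply_imap_updates_from(seg);      /* state is now newer than the CR */
    }
}
```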