More FS Design
Log-Structured File System (LFS)
- optimizes for write performance on disk by amortizing seek time
- assumes most read requests can be served by the buffer cache
- didn't we already have a file system designed for good performance on disk?
- FFS: designed for disk, block group placement heuristics
- inode bitmap, inode, block bitmap, and data block are all located in different blocks within a block group
- accessing non-contiguous blocks within a block group => seek time is reduced but still present
- short seeks & rotational delays => falls short of utilizing peak transfer bandwidth
- LFS goal: optimizes write performance by mostly performing large sequential writes
- the larger the write size, the more we can amortize seek time and rotation delay
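A quick back-of-the-envelope on the amortization claim (the numbers are illustrative, not from the notes):

```latex
% effective bandwidth when writing D bytes after one positioning delay
R_{\mathrm{eff}} = \frac{D}{T_{\mathrm{pos}} + D / R_{\mathrm{peak}}}
% to reach a fraction F of peak bandwidth, solve for D:
D = \frac{F}{1 - F} \cdot R_{\mathrm{peak}} \cdot T_{\mathrm{pos}}
% e.g. R_peak = 100 MB/s, T_pos = 10 ms, F = 0.9
%   => D = 9 * 100 MB/s * 0.01 s = 9 MB per write to reach 90% of peak
```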
- but how can we almost only do sequential writes?
- when new blocks are needed, we can allocate a contiguous chunk if available
- maybe we can place the inode right next to its first data block, instead of in a separate inode table
- but will these always result in sequential writes? what happens when we extend a file?
- to achieve sequential writes, LFS appends all updates sequentially to a log
- normally a log is used to journal updates, which are later applied to their actual locations; in LFS the log is the only and final copy
- no fixed location for metadata or data in LFS; an inode's location is wherever its latest version sits in the log
- this is effectively copy on write!
- accumulates enough updates in memory before writing to disk to ensure a large write size
- each write is then a sequential write to a large number of blocks => a segment
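A minimal sketch of the segment-buffering idea, with made-up sizes and helper names (not LFS's real code): updates accumulate in an in-memory segment and hit the disk as one large sequential write.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define SEG_BLOCKS 256                 /* 1 MB segment (illustrative) */

struct segment {
    uint8_t data[SEG_BLOCKS][BLOCK_SIZE];
    int     used;                      /* blocks buffered so far */
};

static struct segment cur_seg;
static uint64_t       log_tail;        /* disk block where cur_seg will land */

/* stand-in for one big sequential disk I/O */
static void disk_write(uint64_t addr, const void *buf, int nblocks) {
    (void)buf;
    printf("write %d blocks sequentially at block %llu\n",
           nblocks, (unsigned long long)addr);
}

/* append one block of data/metadata to the log; returns its new disk address */
uint64_t lfs_append(const void *block) {
    if (cur_seg.used == SEG_BLOCKS) {          /* segment full: flush it */
        disk_write(log_tail, cur_seg.data, SEG_BLOCKS);
        log_tail += SEG_BLOCKS;
        cur_seg.used = 0;
    }
    memcpy(cur_seg.data[cur_seg.used], block, BLOCK_SIZE);
    return log_tail + cur_seg.used++;          /* blocks are never overwritten in place */
}
```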
- supporting copy on write updates
- ok... but how do we find inodes if they keep changing locations and are scattered across the log?
- worse yet, when an inode moves, the fs also needs to update the dir entry pointing to the inode
- when the dir entry is updated, the dir's own inode needs to be updated too... updates all the way up!
- recursive update problem!
- can be solved by introducing a level of indirection
- recall that SSD also avoids in-place update and has a similar problem
- uses logical block address as a level of indirection to hide the actual location of a page
- LFS uses inode map (inum => block #) as the indirection, pointer to inode uses inum
- when an inode changes, inode map's mapping changes as well (part of the update)
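A tiny sketch of the imap indirection, simplified to a flat in-memory array (the real imap is chunked, and its pieces are themselves written to the log):

```c
#include <stdint.h>

#define MAX_INODES 65536

/* imap[inum] = disk address of the latest version of that inode */
static uint64_t imap[MAX_INODES];

/* directory entries store only the inum, so when an inode moves in the log
 * the only thing that must change is this mapping -- no recursive updates */
void imap_update(uint32_t inum, uint64_t new_inode_addr) {
    imap[inum] = new_inode_addr;
}

uint64_t imap_lookup(uint32_t inum) {
    return imap[inum];
}
```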
- but wait, how do we find the inode map?
- there is a structure that lives at a fixed location on disk after all: the checkpoint region (CR)
- CR tracks the inode map loc, segment usage table loc, and the last checkpoint segment in the log (rough struct sketch below)
- CR is updated periodically (30s), checkpoints the state of the file system at that time
- the entire inode map is cached in memory; the cached map has the latest mappings
- is it possible for the cached inode map to be different from the one tracked by the CR?
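A rough guess at what the CR holds; field names and the fixed array size are assumptions, not the on-disk format:

```c
#include <stdint.h>

struct checkpoint_region {
    uint64_t timestamp;              /* when this checkpoint was taken */
    uint64_t imap_piece_addr[32];    /* where the imap pieces live in the log */
    uint64_t seg_usage_table_addr;   /* segment usage table location */
    uint64_t last_checkpoint_seg;    /* last segment covered by this checkpoint */
};
```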
- given what we know so far, how does LFS support a read given an inode number?
- checks cached inode map to find latest location of the inode, reads the inode
- finds data block location using the inode's data layout info
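Putting the read path in code, a sketch assuming the imap_lookup helper above and a stub disk_read (only direct pointers shown):

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NDIRECT    12

struct inode {
    uint64_t size;
    uint64_t direct[NDIRECT];               /* data block addresses */
};

uint64_t imap_lookup(uint32_t inum);        /* from the cached imap sketch above */
static void disk_read(uint64_t addr, void *buf) { (void)addr; (void)buf; }

/* read the block containing byte `off` of file `inum` into buf */
void lfs_read(uint32_t inum, uint64_t off, void *buf) {
    uint8_t block[BLOCK_SIZE];
    struct inode ino;

    disk_read(imap_lookup(inum), block);            /* 1. latest inode location */
    memcpy(&ino, block, sizeof ino);
    disk_read(ino.direct[off / BLOCK_SIZE], buf);   /* 2. follow the data layout */
}
```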
- given what we know so far, how does LFS support a write given an inode number?
- writes the new data block, the new inode, and a new piece of the imap (tracking the inode's new location) to the log
- what about the checkpoint region? not updated on every segment write, updated per checkpoint interval
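And the corresponding write path as a sketch, reusing the hypothetical helpers from the earlier sketches: everything becomes a fresh copy appended to the log, and only the in-memory imap entry changes in place (the affected imap piece is itself appended at segment flush).

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NDIRECT    12

struct inode { uint64_t size; uint64_t direct[NDIRECT]; };

uint64_t lfs_append(const void *block);            /* segment buffering sketch */
uint64_t imap_lookup(uint32_t inum);               /* imap sketch */
void     imap_update(uint32_t inum, uint64_t addr);
void     disk_read(uint64_t addr, void *buf);      /* assumed block read */

/* overwrite one block of file `inum` at byte offset `off` */
void lfs_write(uint32_t inum, uint64_t off, const void *buf) {
    uint8_t block[BLOCK_SIZE];
    struct inode ino;

    disk_read(imap_lookup(inum), block);           /* fetch latest inode */
    memcpy(&ino, block, sizeof ino);

    ino.direct[off / BLOCK_SIZE] = lfs_append(buf);    /* 1. new data block  */

    memset(block, 0, sizeof block);
    memcpy(block, &ino, sizeof ino);
    imap_update(inum, lfs_append(block));              /* 2. new inode copy,
                                                          3. new imap mapping */
}
```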
- segment usage management
- LFS doesn't have a data bitmap, so how does it track which segments on disk are free?
- the log is circular and written sequentially; as long as the tail hasn't reached the head, the segments in between are free
- but what happens when log is full? garbage collection (GC)!
- since all updates are written to the log, log contains different versions of metadata & data
- old versions (from before the last checkpoint) are garbage and can be reclaimed to give the log more free segments
- within a single segment, some blocks might be live and others might be garbage
- need to identify live blocks within each segment, then compact live blocks from multiple segments into a new one
- this would free up the old segments for the log to write in the future
- GC needs space to perform compaction, so actually, don't wait till the log is full to GC
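A sketch of what the cleaner loop could look like, assuming a made-up on-disk segment layout; block_is_live is sketched below, after the liveness discussion:

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 4096
#define SEG_BLOCKS 256

struct seg_summary_entry { uint32_t inum; uint32_t offset; };

struct segment_on_disk {
    struct seg_summary_entry summary[SEG_BLOCKS];   /* one entry per block */
    uint8_t blocks[SEG_BLOCKS][BLOCK_SIZE];
};

bool     block_is_live(uint64_t addr, const struct seg_summary_entry *e);
uint64_t lfs_append(const void *block);             /* segment buffering sketch */
void     mark_segment_free(uint64_t seg_start);     /* assumed helper */

void clean_segment(const struct segment_on_disk *seg, uint64_t seg_start) {
    for (int i = 0; i < SEG_BLOCKS; i++) {
        if (block_is_live(seg_start + i, &seg->summary[i]))
            lfs_append(seg->blocks[i]);   /* live block moves to the log tail;
                                             its inode and imap entry get new
                                             copies pointing at the new address */
    }
    mark_segment_free(seg_start);         /* whole old segment is now reusable */
}
```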
- how to identify live blocks within a segment?
- blocks might be: inode block, inode map block, data block
- inode & inode map block: check whether it's referenced by inode map in CR
- data block: live if referenced by its inode's data layout
- needs a back pointer from data block to inode and file offset
- segment summary: identifies the contents of a segment, tracks the inum & offset for each block
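The liveness check itself, as a sketch using a segment summary entry and the helpers assumed earlier: a block is live iff the inode it claims to belong to still points at this exact address.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NDIRECT    12

struct inode { uint64_t size; uint64_t direct[NDIRECT]; };
struct seg_summary_entry { uint32_t inum; uint32_t offset; };

uint64_t imap_lookup(uint32_t inum);
void     disk_read(uint64_t addr, void *buf);

bool block_is_live(uint64_t block_addr, const struct seg_summary_entry *e) {
    uint8_t block[BLOCK_SIZE];
    struct inode ino;

    disk_read(imap_lookup(e->inum), block);       /* latest version of the inode */
    memcpy(&ino, block, sizeof ino);
    return ino.direct[e->offset] == block_addr;   /* still referenced => live */
}
```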
- crash recovery
- on start up, use CR to restore a consistent filesys state
- how is CR persisted?
- CR may span over a number of blocks, updating CR as a unit is like writing a txn
- CR begin (timestamp) | CR updates | CR end (timestamp)
- if the end and begin timestamps don't match, it's not a valid CR
- updating CR in place could cause a previously consistent CR to be inconsistent
- so LFS actually keeps 2 CRs at fixed locations
- when writing out a new CR, write to the other location
- if crash while writing a new CR => new CR is invalid, old CR is consistent and used
- if crash after writing a new CR => new CR is used because of newer timestamp
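A small sketch of how mount could pick between the two CR copies, assuming matching begin/end timestamps mean the CR write completed:

```c
#include <stdint.h>
#include <stddef.h>

struct cr_copy {
    uint64_t begin_ts;
    /* ... imap locations, segment usage table location, last checkpoint ... */
    uint64_t end_ts;    /* written last; equals begin_ts iff the write completed */
};

static int cr_valid(const struct cr_copy *cr) {
    return cr->begin_ts == cr->end_ts;
}

/* returns the CR to recover from, or NULL if neither copy is usable */
const struct cr_copy *pick_cr(const struct cr_copy *a, const struct cr_copy *b) {
    int av = cr_valid(a), bv = cr_valid(b);
    if (av && bv) return a->begin_ts > b->begin_ts ? a : b;   /* newer wins */
    if (av) return a;
    if (bv) return b;
    return NULL;
}
```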
- the CR is only updated periodically, so even though it's consistent, the fs state it describes might be stale
- CR tells us last checkpoint segment
- if more segments are logged but not checkpointed, see if they are valid
- if valid, we can roll forward the CR by applying updates (inode map loc) logged by the newer segments
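A sketch of roll-forward; the helper names here are assumptions, not LFS's real interfaces:

```c
#include <stdint.h>
#include <stdbool.h>

bool read_next_segment(uint64_t *seg_addr);      /* advance to next segment; false at end of log */
bool segment_is_valid(uint64_t seg_addr);        /* e.g. segment was written completely */
void apply_imap_updates_from(uint64_t seg_addr); /* re-apply imap updates carried by this segment */

/* scan the log forward from the last checkpointed segment and recover
 * updates that were written after the CR was taken */
void roll_forward(uint64_t last_checkpointed_seg) {
    uint64_t seg = last_checkpointed_seg;
    while (read_next_segment(&seg)) {
        if (!segment_is_valid(seg))
            break;                         /* torn write: stop, discard the rest */
        apply_imap_updates_from(seg);      /* state is now newer than the CR */
    }
}
```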