Crash Consistency
Persistence
- filesystem components
- block bitmap, inodes, data blocks
- how are these affected upon an append? (assume one new data block is allocated as a result)
- block bitmap: marks the newly allocated block as in use
- file inode: updates file size, modified time, data layout
- data block: writes appended data to the newly allocated data block
- changes must be written back to the disk to be persisted
- how many disk writes are needed to perform these updates?
- what if the computer crashes before finishing all the writes?
- what if only the block bitmap is written successfully?
- what if only the file inode is written successfully?
- what if only the data block is written successfully?
- may result in inconsistent file system... how do we resolve this?
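- concretely, a minimal C sketch of the three writes, assuming a hypothetical helper bwrite() that writes one 512-byte block to disk:

```c
#include <stdint.h>

/* Assumed helper: writes one 512-byte block to disk at blockno. */
void bwrite(uint32_t blockno, const void *buf);

/* The three independent writes behind a one-block append. A crash
 * after any strict subset of them leaves the file system inconsistent,
 * e.g. only the bitmap write lands => a block is leaked; only the
 * inode write lands => the inode points at a block that is neither
 * marked used nor filled with the new data. */
void append_writeback(uint32_t bitmap_blockno, const void *bitmap_block,
                      uint32_t inode_blockno,  const void *inode_block,
                      uint32_t data_blockno,   const void *data_block)
{
    bwrite(bitmap_blockno, bitmap_block); /* 1. bitmap: new block marked used  */
    bwrite(inode_blockno,  inode_block);  /* 2. inode: size, mtime, block ptrs */
    bwrite(data_blockno,   data_block);   /* 3. the appended data itself       */
}
```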
Providing FS Consistency Across Crashes
- contracts with storage devices
- page/sector level atomicity
- if a write to a sector succeeds, all 512 bytes have been updated; if it fails, no bytes have been updated
- we will never see a partial write (part of the sector has new data while the rest has old data)
- concurrent disk requests might be reordered
- if we tell disk to do 3 writes (updating each structure), disk can do them in any order
- when a disk request is served, an interrupt is issued to signal its completion
- resolve inconsistency: the file system checker (fsck)
- scans all metadata on the storage device and resolves inconsistencies
- done on startup or during recovery
- superblock: sanity check, make sure it's not corrupted
- block bitmap: use information from the inodes to build a consistent bitmap
- inodes: validate metadata format (could be garbled by disk corruption), clear out invalid inodes
- filesys invariants: no two inodes should reference the same data block, every directory should contain . and .., and so on
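- for instance, the bitmap repair could look like this sketch, treating the inodes as the source of truth (layout and names are illustrative, not any real fsck's code):

```c
#include <stdint.h>
#include <string.h>

#define NDIRECT 12
#define NBLOCKS 1024

/* Illustrative on-disk inode with direct block pointers only. */
struct dinode {
    uint16_t type;               /* 0 = free inode slot              */
    uint32_t size;
    uint32_t addrs[NDIRECT];     /* direct block numbers; 0 = unused */
};

/* fsck-style pass: treat the inodes as the source of truth and rebuild
 * the block bitmap from them instead of trusting the on-disk bitmap. */
void rebuild_bitmap(const struct dinode *inodes, int ninodes,
                    uint8_t bitmap[NBLOCKS / 8])
{
    memset(bitmap, 0, NBLOCKS / 8);              /* start: all blocks free */
    for (int i = 0; i < ninodes; i++) {
        if (inodes[i].type == 0)
            continue;                            /* skip free inode slots  */
        for (int j = 0; j < NDIRECT; j++) {
            uint32_t b = inodes[i].addrs[j];
            if (b != 0 && b < NBLOCKS)
                bitmap[b / 8] |= (uint8_t)(1u << (b % 8)); /* mark used */
        }
    }
}
```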
- avoid inconsistency: journaling / write-ahead logging (WAL)
- the root cause of inconsistency is a mismatch of atomicity
- we would like to persist all updates from an operation as an atomic unit
- but disk only provides page/sector level atomicity
- what if we build an abstraction to group a number of changes as one unit?
- transaction
- API: tx_begin, updated version of b1, updated version of b2, updated version of b3, tx_commit
- instead of applying changes directly, record changes elsewhere
- only when the entire transaction is persisted would we actually apply the changes to their locations
- why does this approach avoid inconsistency?
- one transaction can span across any number of blocks
- what happens if a crash occurs while writing the transaction?
- what happens if a crash occurs while applying the actual changes?
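- a sketch of how one logged transaction might be laid out on disk (names and formats are illustrative; real journals such as ext4's jbd2 differ in detail):

```c
#include <stdint.h>

#define TX_BEGIN_MAGIC  0x74786267u   /* illustrative magic numbers */
#define TX_COMMIT_MAGIC 0x7478636du
#define TX_MAX_BLOCKS   32

/* On-disk layout of one transaction: a begin record, the new contents
 * of every updated block, then a commit record; each record is assumed
 * to be padded out to a full block. The transaction only counts once
 * the commit record is durable. */
struct tx_begin {
    uint32_t magic;                /* TX_BEGIN_MAGIC                     */
    uint32_t txid;                 /* transaction id                     */
    uint32_t nblocks;              /* number of updated blocks logged    */
    uint32_t dest[TX_MAX_BLOCKS];  /* home location of each logged block */
};

struct tx_commit {
    uint32_t magic;                /* TX_COMMIT_MAGIC                    */
    uint32_t txid;                 /* must match the begin record's txid */
};
```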
- journaling/WAL
- reserve space for log on disk
- for each operation, write its updates as a transaction to the log first
- persist the log; once it's on disk, the changes can be applied to their actual locations
- on startup/recovery, apply the transactions recorded in the log
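- a sketch of the commit and recovery paths, building on the record layout above; bread()/bwrite()/disk_flush() are assumed helpers:

```c
#include <stdint.h>
#include <string.h>

/* Assumed helpers: read/write one 512-byte block, and wait until all
 * previously issued writes are durable on the device. */
void bread(uint32_t blockno, void *buf);
void bwrite(uint32_t blockno, const void *buf);
void disk_flush(void);

/* Commit: log the transaction, make it durable, then apply in place.
 * hdr and cmt point at block-sized, zero-padded copies of the records
 * from the layout sketch above. */
void tx_commit_and_apply(const struct tx_begin *hdr, void *const *blocks,
                         const struct tx_commit *cmt, uint32_t log_start)
{
    uint32_t n = hdr->nblocks;
    bwrite(log_start, hdr);                   /* 1. begin record            */
    for (uint32_t i = 0; i < n; i++)
        bwrite(log_start + 1 + i, blocks[i]); /* 2. new block contents      */
    disk_flush();                             /* begin+blocks durable...    */
    bwrite(log_start + 1 + n, cmt);           /* 3. ...before the commit    */
    disk_flush();                             /* tx now atomic on disk      */
    for (uint32_t i = 0; i < n; i++)
        bwrite(hdr->dest[i], blocks[i]);      /* 4. apply to home locations */
}

/* Recovery: replay every fully committed transaction in the log; a
 * transaction with no matching commit record is discarded entirely. */
void recover(uint32_t log_start, uint32_t log_len)
{
    uint8_t raw[512];
    struct tx_begin hdr;
    struct tx_commit cmt;
    uint32_t pos = log_start;

    while (pos < log_start + log_len) {
        bread(pos, raw);
        memcpy(&hdr, raw, sizeof hdr);
        if (hdr.magic != TX_BEGIN_MAGIC || hdr.nblocks > TX_MAX_BLOCKS)
            break;                            /* end of valid log           */
        bread(pos + 1 + hdr.nblocks, raw);
        memcpy(&cmt, raw, sizeof cmt);
        if (cmt.magic != TX_COMMIT_MAGIC || cmt.txid != hdr.txid)
            break;                            /* crashed mid-log: discard   */
        for (uint32_t i = 0; i < hdr.nblocks; i++) {
            bread(pos + 1 + i, raw);          /* logged new content         */
            bwrite(hdr.dest[i], raw);         /* replay to home location    */
        }
        pos += 2 + hdr.nblocks;               /* skip to next transaction   */
    }
}
```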
- other solutions
- soft updates: order updates to metadata to avoid inconsistency
- copy on write: never overwrite data in place; write updated copies elsewhere and flip a pointer to apply the whole group of updates at once
More on Journaling
- what to log?
- what do we record for each transaction?
- data journaling: log changes for data and metadata
- for each operation, we now have to write the data twice: once to the journal, once to the actual location
- metadata journaling: only log changes for metadata
- writing data twice can be very expensive (data can be much larger than metadata)
- only the metadata needs to be consistent for the filesys to function properly
- inconsistent user data is not the end of the world...
- issue writes for data blocks directly to their actual locations
- once all data blocks are written (we hear back from the disk), log the metadata changes
- persist the log, then apply the logged changes to their actual locations (see the sketch after this list)
- which one is more commonly used?
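- the metadata-journaling ordering above, sketched with the helpers and records from the earlier journaling sketch (names illustrative):

```c
/* Metadata journaling for a one-block append, reusing the sketches
 * above: the data block skips the journal entirely; only the metadata
 * (bitmap, inode) goes through a logged transaction. */
void append_md_journaling(uint32_t data_blockno, const void *data,
                          const struct tx_begin *hdr, void *const *md_blocks,
                          const struct tx_commit *cmt, uint32_t log_start)
{
    bwrite(data_blockno, data);   /* 1. data straight to its home location */
    disk_flush();                 /* 2. data durable before the metadata
                                  *    commits, so the inode never ends up
                                  *    pointing at garbage                 */
    tx_commit_and_apply(hdr, md_blocks, cmt, log_start);  /* 3. metadata tx */
}
```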
- request reordering
- recall that concurrent disk writes can be reordered
- to persist the log, we might need to issue a number of writes
- a write containing tx_begin
- writes containing the updated versions of the blocks
- a write containing tx_commit
- the writes containing tx_begin and tx_commit may complete before the updated blocks are logged
- what if the computer crashes at this point?
- is the transaction still atomic? how can we fix this?
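- one fix is the disk_flush() between the logged blocks and the commit record, as in the commit sketch earlier; another known approach (ext4 offers journal checksums) stores a checksum of the logged blocks in the commit record, so recovery can detect a commit record that raced ahead of its blocks. a toy sketch of the checksum idea, assuming a hypothetical checksum field added to tx_commit (the checksum itself is illustrative, not real crc32c):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy checksum standing in for the real thing (ext4 uses crc32c). */
static uint32_t checksum32(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + p[i];
    return sum;
}

/* At commit: fold every logged block into a checksum stored in the
 * commit record (assumes a hypothetical `checksum` field added to
 * struct tx_commit). At recovery: recompute it over the logged blocks;
 * a mismatch means the commit record was written before some logged
 * block, so the whole transaction is discarded as uncommitted. */
uint32_t tx_checksum(void *const *blocks, uint32_t nblocks)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < nblocks; i++)
        sum ^= checksum32(blocks[i], 512);   /* combine per-block sums */
    return sum;
}
```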
- interaction with buffer cache
- purpose of buffer cache: cache disk blocks in memory for faster access
- when a block is requested for the first time, it's brought into the cache
- future reads/writes can be done on the cached block instead of going to the disk each time
- this is why reads sometimes block (when the block is not in the cache)
- what happens when the cache is full? eviction? what happens to dirty blocks?
- where does the log live? are log blocks cached?
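- a sketch of one buffer-cache frame and the dirty-block rule on eviction (names illustrative; bwrite() as before):

```c
#include <stdint.h>

void bwrite(uint32_t blockno, const void *buf);  /* assumed helper */

/* One buffer-cache frame; names are illustrative. */
struct buf {
    uint32_t blockno;     /* which disk block this frame caches       */
    int      refcnt;      /* currently in use? (not evictable if > 0) */
    int      dirty;       /* modified since it was read from disk?    */
    uint8_t  data[512];   /* cached block contents                    */
};

/* On eviction, a dirty block must be written back before the frame is
 * reused, or the modification is silently lost. */
void evict(struct buf *b)
{
    if (b->dirty) {
        bwrite(b->blockno, b->data);   /* flush the modification first */
        b->dirty = 0;
    }
    /* the frame may now be reassigned to a different block */
}
```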
- interaction with fsync
- if every file system call persisted its changes to disk immediately, the system would be very slow (why?)
- purpose of fsync: let processes decide when changes to files/directories must reflect on disk
- other filesys operations (write, create, rename, etc.) can be done purely on cached blocks, no disk I/O needed
- the filesys periodically (every 10-30s) flushes changes to disk, but processes can also explicitly request persistence with fsync
- fsync(fd): a per-file/directory operation
- flushes the data and metadata of the file onto disk
- calling fsync on a newly created file does not necessarily persist changes to its parent directory (man page)
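- a standard usage pattern in C (paths are illustrative); note the second fsync on the parent directory to persist the new directory entry itself:

```c
#include <fcntl.h>
#include <unistd.h>

/* Durably create and write a small file. Paths are illustrative. */
int durable_create(void)
{
    int fd = open("/tmp/dir/newfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, "hello", 5) != 5 || fsync(fd) < 0) { /* file data+metadata */
        close(fd);
        return -1;
    }
    close(fd);

    /* fsync the parent directory too, or the entry for "newfile" may
     * not survive a crash even though the file's contents would. */
    int dirfd = open("/tmp/dir", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;
    int rc = fsync(dirfd);
    close(dirfd);
    return rc;
}
```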
- how does this interact with the log?
- on fsync, all transactions related to the file must be written to the log and persisted
- with metadata journaling, data blocks are written out to their actual locations first, and then the log is persisted
- once the log is persisted, fsync can safely return to the caller; no need to wait for the changes to be applied to their actual locations (why?)
- transaction size
- so far we've been assuming one transaction per file system operation
- every file system operation writes a transaction to the log, on fsync, all txns for the file must be persisted
- actually, on fsync we don't pick and choose to persist only the txns for that file
- all txns up to the file's last txn should be persisted, why? (hint: shared metadata state!)
- if we write a txn for each filesys op, then we need to log the state of the metadata for each operation
- what would the log look like if a process keeps appending 1 byte to a file?
- many versions of the file inode are written to the log, each with a different file size and modified time
- when log is applied, only the latest version of the inode matters
- what if we group multiple operations into a single txn?
- no need to track multiple versions of inodes from each op, just the latest version
- shared metadata (inode bitmap, block bitmap) only needs to be logged once
- ext4 has one global active txn at a time; the txn commits periodically (every 5s by default, configurable) or when fsync is called
- active txn tracks pointers to (cached) changed metadata blocks (bitmaps, inodes)
- less to log and replay!
- any downside to this?
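- a heavily simplified sketch of the grouping idea, reusing struct buf from the buffer-cache sketch above (ext4/jbd2 is far more involved):

```c
#include <stdint.h>

/* One global running transaction: instead of logging a fresh copy of
 * an inode or bitmap per operation, remember which cached metadata
 * blocks are dirty; their latest contents are logged once, at commit. */
struct active_tx {
    uint32_t    txid;
    struct buf *dirty[128];   /* pointers into the buffer cache */
    int         ndirty;
};

void tx_mark_dirty(struct active_tx *tx, struct buf *b)
{
    for (int i = 0; i < tx->ndirty; i++)
        if (tx->dirty[i] == b)
            return;               /* already tracked: another op touching
                                   * the same block adds nothing to log */
    tx->dirty[tx->ndirty++] = b;
}
```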