Crash Consistency

File System Persistence Model

fs operations often update multiple persistent structures

create a new file

inode bitmap: allocate a new inode
inode table: initialize the new inode, update size for parent directory
data block: modify parent directory's data block to add the new directory entry

append to a file

data bitmap: allocate new blocks, bits changed to 1
data block: fill with the new data
inode table: update the file's inode with size, new data block, and other info (e.g. modified time)

interaction with block cache & inode cache

for performance reasons, most fs operations only update the cached blocks/inodes
if kernel provides strong consistency guarantees => make every operation persist synchronously

every file op is persistent on disk upon returning, write through caches
super slow! lots of disk writes to bitmaps and inodes

or let the user explicitly request persistence!

sync: write/persist all cached dirty blocks to disk
fsync(fd): persist all cached changes (metadata & data) of a file to disk
all other operations can be done purely on cached blocks without disk I/O
kernel periodically (e.g. every 10-30s) flushes dirty blocks to disk
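
a minimal sketch of explicitly requesting persistence from user space, using Python's os.write / os.fsync wrappers around the system calls. the helper name durable_append and the file name log.txt are made up for illustration. note that fsync on the file alone may not persist a newly created directory entry; a careful program would also fsync the parent directory.

```python
import os
import tempfile

def durable_append(path, data):
    """Append data to path and return only after fsync has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, data)  # typically only updates cached blocks
        os.fsync(fd)        # ask the kernel to persist this file's data and metadata
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "log.txt")
    durable_append(path, b"hello\n")
    durable_append(path, b"world\n")
    with open(path, "rb") as f:
        contents = f.read()
```

without the os.fsync call the function would usually return sooner, but the appended bytes could still be lost if the machine crashed before the kernel's periodic flush.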

Crash Consistency

upon an fsync, we write out changes related to a file
given a file append, 3 disk writes (inode, data bitmap, data block) are needed to persist the file

each write is in a different sector
writes may be re-ordered by the disk (inode may be written prior to data bitmap and vice versa)

what happens if the computer crashes before finishing all the writes?

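what if only the data bitmap is written successfully? the block is marked allocated but no inode points to it (leaked block)
what if only the file inode is written successfully? the inode points to a block that holds garbage and that the bitmap still considers free
what if only the data block is written successfully? metadata is still consistent, but the new data is unreachable (as if the append never happened)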

may result in inconsistent file system... how do we resolve this?
resolve inconsistency: the file system checker (fsck)
- scans all metadata on the storage device and repairs any inconsistencies it finds
avoid inconsistency: journaling / write-ahead logging (WAL)
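
a toy sketch of the kind of cross-check a file system checker performs, applied to the partial-write cases above. the data layout (a list of bits for the data bitmap, a dict from inode number to block numbers) and the function name fsck_check are assumptions for illustration; a real fsck checks far more than this.

```python
def fsck_check(data_bitmap, inodes):
    """Cross-check the data bitmap against the block pointers of all inodes.

    data_bitmap -- list of 0/1, one entry per data block
    inodes      -- dict: inode number -> list of data block numbers
    Returns a list of human-readable problems (empty if consistent).
    """
    problems = []
    referenced = set()
    for ino, blocks in sorted(inodes.items()):
        for b in blocks:
            if b in referenced:
                problems.append(f"block {b} referenced more than once")
            referenced.add(b)
    for b, bit in enumerate(data_bitmap):
        if bit and b not in referenced:
            problems.append(f"block {b} marked allocated but unreferenced (leak)")
        if not bit and b in referenced:
            problems.append(f"block {b} referenced but marked free")
    return problems

# File with inode 2 owns block 0; an append wants to add block 1.
before = fsck_check([1, 0, 0, 0], {2: [0]})
only_bitmap = fsck_check([1, 1, 0, 0], {2: [0]})     # bitmap written, inode not
only_inode = fsck_check([1, 0, 0, 0], {2: [0, 1]})   # inode written, bitmap not
only_data = fsck_check([1, 0, 0, 0], {2: [0]})       # only the data block written
```

the only_data case reports nothing because the metadata is unchanged; a checker like this cannot tell whether a data block holds the intended contents.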

the root cause of inconsistency is a mismatch of atomicity

we would like to persist all updates from an operation as an atomic unit
but disk only provides page/sector level atomicity
what if we build an abstraction to group changes to multiple blocks/sectors as one unit?

transaction

APIs: tx_begin, tx_write(block), tx_commit
example: tx_begin, tx_write(updated_inode), tx_write(updated_bitmap), tx_write(updated_data), tx_commit
reserve space for log on disk
instead of applying changes to their actual disk location, record transaction to the log first
only after the entire transaction (including its commit record) is persisted in the log do we apply the changes to their actual locations
replay the committed transaction from the log on recovery

why does this approach avoid inconsistency?

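what if a crash happens while writing the transaction to the log?

no changes have been made to the actual inode, bitmap and data block
the incomplete (uncommitted) transaction is discarded on recovery, as if the operation never happened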

what if a crash happens while applying the actual changes?

transaction must have been committed in the log before we would apply the changes
on startup, recover by replaying the logged transactions to their actual locations
replaying is safe even if some changes were already applied: writing the same block contents again is idempotent
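
the two crash cases above can be sketched with a toy in-memory model. this is an illustration under stated assumptions, not how any particular file system implements its journal: the class JournaledDisk, the crash parameter, and the string block names are all hypothetical; only tx_begin / tx_write / tx_commit come from the notes.

```python
class JournaledDisk:
    """Toy disk: `blocks` are the actual locations, `log` is the reserved journal."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)  # persistent: block name -> contents
        self.log = []               # persistent: journal records
        self.pending = []           # in memory only: writes of the open transaction

    def tx_begin(self):
        self.pending = []

    def tx_write(self, block, contents):
        self.pending.append((block, contents))

    def tx_commit(self, crash=None):
        # Step 1: record the whole transaction in the log.
        for block, contents in self.pending:
            self.log.append(("write", block, contents))
        if crash == "logging":
            return  # simulated crash before the commit record is written
        self.log.append(("commit",))
        # Step 2: apply the changes to their actual locations.
        if crash == "applying":
            block, contents = self.pending[0]
            self.blocks[block] = contents  # only the first change got applied
            return
        self.recover()  # normal path: apply everything, then free the log

    def recover(self):
        """Replay committed transactions; discard trailing uncommitted records."""
        batch = []
        for rec in self.log:
            if rec[0] == "write":
                batch.append(rec[1:])
            else:  # commit record: everything in the batch is safe to apply
                for block, contents in batch:
                    self.blocks[block] = contents
                batch = []
        self.log.clear()

OLD = {"inode": "size=0", "bitmap": "0", "data": ""}
NEW = {"inode": "size=1", "bitmap": "1", "data": "hello"}

def append_with_crash(crash):
    disk = JournaledDisk(OLD)
    disk.tx_begin()
    for block, contents in NEW.items():
        disk.tx_write(block, contents)
    disk.tx_commit(crash=crash)
    disk.recover()  # what would run on the next startup
    return disk.blocks

after_log_crash = append_with_crash("logging")
after_apply_crash = append_with_crash("applying")
after_no_crash = append_with_crash(None)
```

in this model a crash while logging leaves the old state and a crash while applying ends in the new state; no run ends with a mix of old and new blocks.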

cost of logging

for each operation, we now have to write the data twice: once to journal, once to the actual location

this is called data journaling (writing down both data and metadata)

the fs mainly cares about the consistency of its metadata, so just log the metadata!

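metadata journaling: only metadata is written twice; data is written directly to its location
this is the default ext4 journaling mode (ordered): data blocks are written to their final location before the metadata transaction commits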
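
a back-of-the-envelope comparison of the write cost, as a simplified model that counts only block writes and ignores the journal's own begin/commit records. the function name and mode labels are made up for illustration.

```python
def block_writes(metadata_blocks, data_blocks, mode):
    """Count block writes for one operation under a given journaling mode."""
    if mode == "none":      # no journal: every block is written once
        return metadata_blocks + data_blocks
    if mode == "data":      # data journaling: everything is written twice
        return 2 * (metadata_blocks + data_blocks)
    if mode == "metadata":  # metadata journaling: only metadata is written twice
        return 2 * metadata_blocks + data_blocks
    raise ValueError(f"unknown mode: {mode}")

# File append from the notes: 2 metadata blocks (inode, data bitmap), 1 data block.
costs = {mode: block_writes(2, 1, mode) for mode in ("none", "data", "metadata")}
```

for a small append the saving is modest, but it grows with the amount of data: with 2 metadata blocks and 100 data blocks the same model gives 204 writes for data journaling versus 104 for metadata journaling.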