Crash Consistency
File System Persistence Model
- fs operations often update multiple persistent structures
- create a new file
- inode bitmap: allocate a new inode
- inode table: initialize the new inode, update size for parent directory
- data block: modify parent directory's data block to add the new directory entry
- append to a file
- data bitmap: allocate new blocks, bits changed to 1
- data block: fill with the new data
- inode table: update the file's inode with size, new data block, and other info (e.g. modified time)
- interaction with block cache & inode cache
- for performance reasons, most fs operations only update the cached blocks/inodes
- if the kernel provides strong consistency guarantees => every operation must persist synchronously
- every file op is persistent on disk by the time it returns (write-through caches)
- super slow! lots of disk writes to bitmaps and inodes
- or let the user explicitly request persistence!
- sync: write/persist all cached dirty blocks to disk
- fsync(fd): persist all cached changes (metadata & data) of a file to disk
- all other operations can be done purely on cached blocks without disk I/O
- kernel periodically (10-30s) flushes dirty blocks to disk
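A minimal sketch of explicitly requesting persistence from user space, using Python's wrappers around the write and fsync system calls. The helper name durable_append is hypothetical and only illustrates the pattern: the write normally lands in the cache, and fsync blocks until the kernel reports that the file's dirty data and metadata have been flushed.

```python
import os

def durable_append(path: str, data: bytes) -> None:
    """Append data to path and return only after fsync has completed.

    Hypothetical helper for illustration; not part of any library.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # os.write may write fewer bytes than requested, so loop.
            # At this point the bytes typically sit only in cached blocks.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to persist this file's cached changes
        # (data and metadata) before we return.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Without the os.fsync call, the appended bytes would only reach the disk on the kernel's next periodic flush, so a crash in between could lose them. (os.sync() is the whole-system counterpart on Unix.)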
Crash Consistency
- upon an fsync, we write out the changes related to a file
- given a file append, 3 disk writes (inode, data bitmap, data block) are needed to persist the file
- each write is in a different sector
- writes may be re-ordered by the disk (inode may be written prior to data bitmap and vice versa)
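- what happens if the computer crashes before finishing all the writes?
- what if only the data bitmap is written successfully? (block is marked used but no inode points to it: leaked space)
- what if only the file inode is written successfully? (inode points to a block holding garbage, and the bitmap still says the block is free)
- what if only the data block is written successfully? (no inconsistency, but the data is lost as if the append never happened)
- may result in an inconsistent file system... how do we resolve this?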
- resolve inconsistency: the file system checker (fsck)
- scans all metadata on the storage device and resolves inconsistencies
- done on start up or recovery
- superblock: sanity check, make sure it's not corrupted
- inodes: validate metadata format (could be damaged by disk corruption), clear out invalid inodes
- block bitmap: use information from the inodes to build a consistent bitmap
- file system invariants: two inodes shouldn't point to the same data block, every directory should have . and .. entries, and so on
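The bitmap-rebuilding and duplicate-block checks above can be sketched as follows. This is a toy in-memory model, not how a real fsck is implemented: the Inode class, the fsck function, and the message strings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Inode:
    """Toy inode: a validity flag and the data blocks it points to."""
    valid: bool = True
    blocks: list = field(default_factory=list)

def fsck(inodes, bitmap):
    """Rebuild the block bitmap from the inodes and report problems.

    inodes: dict mapping inode number -> Inode
    bitmap: list of bools as found on disk (True = block in use)
    Returns (rebuilt_bitmap, problems).
    """
    problems = []
    owner = {}                          # block number -> first inode claiming it
    rebuilt = [False] * len(bitmap)
    for ino, inode in sorted(inodes.items()):
        if not inode.valid:
            continue                    # invalid inodes are ignored (cleared)
        for b in inode.blocks:
            if not 0 <= b < len(bitmap):
                problems.append(f"inode {ino}: block {b} out of range")
            elif b in owner:
                problems.append(
                    f"block {b} claimed by inodes {owner[b]} and {ino}")
            else:
                owner[b] = ino
                rebuilt[b] = True
    # Compare the on-disk bitmap against what the inodes imply.
    for b, (on_disk, implied) in enumerate(zip(bitmap, rebuilt)):
        if on_disk and not implied:
            problems.append(f"block {b}: marked used but unreferenced (leak)")
        elif implied and not on_disk:
            problems.append(f"block {b}: referenced but marked free")
    return rebuilt, problems
```

The sketch trusts the inodes over the bitmap, matching the notes: the bitmap is derived state that can be rebuilt, whereas a block claimed by two inodes cannot be fixed automatically and is only reported.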
- avoid inconsistency: journaling / write-ahead logging (WAL)
- the root cause of inconsistency is a mismatch of atomicity
- we would like to persist all updates from an operation as an atomic unit
- but disk only provides page/sector level atomicity
- what if we build an abstraction to group changes to multiple blocks/sectors as one unit?
- transaction
- APIs: tx_begin, tx_write(block), tx_commit
- example: tx_begin, tx_write(updated_inode), tx_write(updated_bitmap), tx_write(updated_data), tx_commit
- reserve space for log on disk
- instead of applying changes to their actual disk location, record transaction to the log first
- only when the entire transaction is persisted would we actually apply the changes to their locations
- replay the committed transaction from the log on recovery
- why does this approach avoid inconsistency?
- what if a crash happens while writing the transaction to the log?
- no changes have been made to the actual inode, bitmap, or data block; the uncommitted transaction is simply discarded
- what if a crash happens while applying the actual changes?
- the transaction must already have been committed in the log before we apply any changes
- on start up, recover by replaying the logged transactions to their actual location
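A toy simulation of the two crash cases above, using the tx_begin / tx_write / tx_commit names from these notes. Everything else is an assumption of the sketch: the JournaledDisk class, the crash_* flags that simulate power loss, Python lists standing in for the disk and the log region, and the commit record being written atomically (it fits in one sector).

```python
class CrashError(Exception):
    """Simulated power loss."""

class JournaledDisk:
    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks   # final on-disk locations
        self.log = []                   # reserved on-disk log region
        self.pending = None             # in-memory only, lost on a crash

    def tx_begin(self):
        self.pending = []

    def tx_write(self, blockno, data):
        self.pending.append((blockno, data))

    def tx_commit(self, crash_before_commit=False, crash_during_apply=False):
        # 1. Persist every update to the log, not to its real location.
        for blockno, data in self.pending:
            self.log.append(("write", blockno, data))
        if crash_before_commit:
            raise CrashError()
        # 2. The commit record makes the transaction durable (assumed atomic).
        self.log.append(("commit",))
        # 3. Only now apply the changes to their actual locations.
        for i, (blockno, data) in enumerate(self.pending):
            if crash_during_apply and i == 1:
                raise CrashError()
            self.blocks[blockno] = data
        # 4. Everything is in place, so the log space can be reused.
        self.log.clear()
        self.pending = None

    def recover(self):
        """Replay committed transactions; drop any uncommitted tail."""
        self.pending = None
        txn = []
        for record in self.log:
            if record[0] == "write":
                txn.append(record[1:])
            else:                       # commit record
                for blockno, data in txn:
                    self.blocks[blockno] = data
                txn = []
        self.log.clear()
```

A crash before the commit record leaves the real blocks untouched, and recovery discards the partial log. A crash during the apply step leaves the blocks half-updated, but replaying the committed log overwrites them with the full set of changes; replay is safe to repeat because each record just rewrites a whole block.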
- cost of logging
- for each operation, we now have to write the data twice: once to journal, once to the actual location
- this is called data journaling (writing down both data and metadata)
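- the fs mainly cares about the consistency of its metadata, so just log the metadata!
- metadata journaling: only metadata is written twice; data is written directly to its final location
- this is the default ext4 journaling mode (ordered), which writes data blocks to their final location before the metadata transaction commits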
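To make the cost difference concrete, the sketch below lists the block writes issued for one operation under each mode. The function write_plan, the block labels, and the single commit block per transaction are assumptions for illustration; real journals also write descriptor blocks and batch several operations into one transaction.

```python
def write_plan(data_blocks, meta_blocks, mode):
    """Return the ordered list of (destination, block) writes for one operation.

    mode "data":    journal both data and metadata, then apply in place.
    mode "ordered": write data in place first, journal only the metadata.
    """
    if mode == "data":
        plan = [("journal", b) for b in meta_blocks + data_blocks]
        plan.append(("journal", "commit"))
        plan += [("in-place", b) for b in meta_blocks + data_blocks]
    elif mode == "ordered":
        # Data reaches its final location before the metadata commits,
        # so committed metadata never points at unwritten data.
        plan = [("in-place", b) for b in data_blocks]
        plan += [("journal", b) for b in meta_blocks]
        plan.append(("journal", "commit"))
        plan += [("in-place", b) for b in meta_blocks]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return plan

# Appending 100 data blocks, touching the inode and the data bitmap:
data = [f"data{i}" for i in range(100)]
meta = ["inode", "bitmap"]
print(len(write_plan(data, meta, "data")))     # 205 block writes
print(len(write_plan(data, meta, "ordered")))  # 105 block writes
```

Under these assumptions, data journaling roughly doubles the write traffic for large appends, while metadata journaling adds only a few extra blocks per transaction regardless of how much data is written.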