Crash Consistency
Persistence
- filesystem components
- block bitmap, inodes, data blocks
- how are these affected upon an append? (assume one new data block is allocated as a result)
- block bitmap: marks the newly allocated block as in use
- file inode: updates file size, modified time, data layout
- data block: writes appended data to the newly allocated data block
- changes must be written back to the disk to be persisted
- how many disk writes are needed to perform these updates?
- what if the computer crashes before finishing all the writes?
- what if only the block bitmap is written successfully?
- what if only the file inode is written successfully?
- what if only the data block is written successfully?
- may result in inconsistent file system... how do we resolve this?
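- concretely, a minimal C sketch of the three writes, assuming a hypothetical helper bwrite() that writes one 512-byte block to disk:

```c
#include <stdint.h>

/* Assumed helper: writes one 512-byte block to disk at blockno. */
void bwrite(uint32_t blockno, const void *buf);

/* The three independent writes behind a one-block append. A crash
 * after any strict subset of them leaves the file system inconsistent,
 * e.g. only the bitmap write lands => a block is leaked; only the
 * inode write lands => the inode points at a block that is neither
 * marked used nor filled with the new data. */
void append_writeback(uint32_t bitmap_blockno, const void *bitmap_block,
                      uint32_t inode_blockno,  const void *inode_block,
                      uint32_t data_blockno,   const void *data_block)
{
    bwrite(bitmap_blockno, bitmap_block); /* 1. bitmap: new block marked used  */
    bwrite(inode_blockno,  inode_block);  /* 2. inode: size, mtime, block ptrs */
    bwrite(data_blockno,   data_block);   /* 3. the appended data itself       */
}
```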
Providing FS Consistency Across Crashes
- contracts with storage devices
- page/sector level atomicity
- if a write to a sector succeeds, all 512 bytes have been updated; if it fails, no bytes have been updated
- we will never see a partial write (part of the sector has new data while the rest has old data)
- concurrent disk requests might be reordered
- if we tell disk to do 3 writes (updating each structure), disk can do them in any order
- when a disk request is served, an interrupt is issued to signal its completion
- resolve inconsistency: the file system checker (fsck)
- scans all metadata on the storage device and resolves inconsistencies
- done on startup or during recovery
- superblock: sanity check, make sure it's not corrupted
- block bitmap: use information from the inodes to build a consistent bitmap
- inodes: validate metadata format (could be garbled by disk corruption), clear out invalid inodes
- filesys invariants: no two inodes should reference the same data block, every directory should contain . and .., and so on
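- for instance, the bitmap repair could look like this sketch, treating the inodes as the source of truth (layout and names are illustrative, not any real fsck's code):

```c
#include <stdint.h>
#include <string.h>

#define NDIRECT 12
#define NBLOCKS 1024

/* Illustrative on-disk inode with direct block pointers only. */
struct dinode {
    uint16_t type;               /* 0 = free inode slot              */
    uint32_t size;
    uint32_t addrs[NDIRECT];     /* direct block numbers; 0 = unused */
};

/* fsck-style pass: treat the inodes as the source of truth and rebuild
 * the block bitmap from them instead of trusting the on-disk bitmap. */
void rebuild_bitmap(const struct dinode *inodes, int ninodes,
                    uint8_t bitmap[NBLOCKS / 8])
{
    memset(bitmap, 0, NBLOCKS / 8);              /* start: all blocks free */
    for (int i = 0; i < ninodes; i++) {
        if (inodes[i].type == 0)
            continue;                            /* skip free inode slots  */
        for (int j = 0; j < NDIRECT; j++) {
            uint32_t b = inodes[i].addrs[j];
            if (b != 0 && b < NBLOCKS)
                bitmap[b / 8] |= (uint8_t)(1u << (b % 8)); /* mark used */
        }
    }
}
```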
- avoid inconsistency: journaling / write-ahead logging (WAL)
- the root cause of inconsistency is a mismatch of atomicity
- we would like to persist all updates from an operation as an atomic unit
- but disk only provides page/sector level atomicity
- what if we build an abstraction to group a number of changes as one unit?
- transaction
- API: tx_begin, updated version of b1, updated version of b2, updated version of b3, tx_commit
- instead of applying changes directly, record changes elsewhere
- only when the entire transaction is persisted would we actually apply the changes to their locations
- why does this approach avoid inconsistency?
- one transaction can span across any number of blocks
- what happens if a crash occurs while writing the transaction?
- what happens if a crash occurs while applying the actual changes?
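- a sketch of how one logged transaction might be laid out on disk (names and formats are illustrative; real journals such as ext4's jbd2 differ in detail):

```c
#include <stdint.h>

#define TX_BEGIN_MAGIC  0x74786267u   /* illustrative magic numbers */
#define TX_COMMIT_MAGIC 0x7478636du
#define TX_MAX_BLOCKS   32

/* On-disk layout of one transaction: a begin record, the new contents
 * of every updated block, then a commit record; each record is assumed
 * to be padded out to a full block. The transaction only counts once
 * the commit record is durable. */
struct tx_begin {
    uint32_t magic;                /* TX_BEGIN_MAGIC                     */
    uint32_t txid;                 /* transaction id                     */
    uint32_t nblocks;              /* number of updated blocks logged    */
    uint32_t dest[TX_MAX_BLOCKS];  /* home location of each logged block */
};

struct tx_commit {
    uint32_t magic;                /* TX_COMMIT_MAGIC                    */
    uint32_t txid;                 /* must match the begin record's txid */
};
```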
- journaling/WAL
- reserve space for log on disk
- for each operation, write its updates as a transaction to the log first
- persist the log; once it's on disk, the changes can be applied to their actual locations
- on startup/recovery, apply the transactions recorded in the log
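- a sketch of the commit and recovery paths, building on the record layout above; bread()/bwrite()/disk_flush() are assumed helpers:

```c
#include <stdint.h>
#include <string.h>

/* Assumed helpers: read/write one 512-byte block, and wait until all
 * previously issued writes are durable on the device. */
void bread(uint32_t blockno, void *buf);
void bwrite(uint32_t blockno, const void *buf);
void disk_flush(void);

/* Commit: log the transaction, make it durable, then apply in place.
 * hdr and cmt point at block-sized, zero-padded copies of the records
 * from the layout sketch above. */
void tx_commit_and_apply(const struct tx_begin *hdr, void *const *blocks,
                         const struct tx_commit *cmt, uint32_t log_start)
{
    uint32_t n = hdr->nblocks;
    bwrite(log_start, hdr);                   /* 1. begin record            */
    for (uint32_t i = 0; i < n; i++)
        bwrite(log_start + 1 + i, blocks[i]); /* 2. new block contents      */
    disk_flush();                             /* begin+blocks durable...    */
    bwrite(log_start + 1 + n, cmt);           /* 3. ...before the commit    */
    disk_flush();                             /* tx now atomic on disk      */
    for (uint32_t i = 0; i < n; i++)
        bwrite(hdr->dest[i], blocks[i]);      /* 4. apply to home locations */
}

/* Recovery: replay every fully committed transaction in the log; a
 * transaction with no matching commit record is discarded entirely. */
void recover(uint32_t log_start, uint32_t log_len)
{
    uint8_t raw[512];
    struct tx_begin hdr;
    struct tx_commit cmt;
    uint32_t pos = log_start;

    while (pos < log_start + log_len) {
        bread(pos, raw);
        memcpy(&hdr, raw, sizeof hdr);
        if (hdr.magic != TX_BEGIN_MAGIC || hdr.nblocks > TX_MAX_BLOCKS)
            break;                            /* end of valid log           */
        bread(pos + 1 + hdr.nblocks, raw);
        memcpy(&cmt, raw, sizeof cmt);
        if (cmt.magic != TX_COMMIT_MAGIC || cmt.txid != hdr.txid)
            break;                            /* crashed mid-log: discard   */
        for (uint32_t i = 0; i < hdr.nblocks; i++) {
            bread(pos + 1 + i, raw);          /* logged new content         */
            bwrite(hdr.dest[i], raw);         /* replay to home location    */
        }
        pos += 2 + hdr.nblocks;               /* skip to next transaction   */
    }
}
```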
- other solutions
- soft updates: order updates to metadata to avoid inconsistency
- copy on write: never overwrite data in place; write updated copies elsewhere and flip a pointer to apply the whole group of updates at once
More on Journaling
- what to log?
- what do we record for each transaction?
- data journaling: log changes for data and metadata
- for each operation, we now have to write the data twice: once to the journal, once to the actual location
- metadata journaling: only log changes for metadata
- writing data twice can be very expensive (data can be much larger than metadata)
- only the metadata needs to be consistent for the filesys to function properly
- inconsistent user data is not the end of the world...
- issue writes for data blocks directly to their actual locations
- once all data blocks are written (we hear back from the disk), log the metadata changes
- persist the log, then apply the logged changes to their actual locations (see the sketch after this list)
- which one is more commonly used?
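- the metadata-journaling ordering above, sketched with the helpers and records from the earlier journaling sketch (names illustrative):

```c
/* Metadata journaling for a one-block append, reusing the sketches
 * above: the data block skips the journal entirely; only the metadata
 * (bitmap, inode) goes through a logged transaction. */
void append_md_journaling(uint32_t data_blockno, const void *data,
                          const struct tx_begin *hdr, void *const *md_blocks,
                          const struct tx_commit *cmt, uint32_t log_start)
{
    bwrite(data_blockno, data);   /* 1. data straight to its home location */
    disk_flush();                 /* 2. data durable before the metadata
                                  *    commits, so the inode never ends up
                                  *    pointing at garbage                 */
    tx_commit_and_apply(hdr, md_blocks, cmt, log_start);  /* 3. metadata tx */
}
```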
- request reordering
- recall that concurrent disk writes can be reordered
- to persist the log, we might need to issue a number of writes
- a write containing tx_begin
- writes containing the updated versions of the blocks
- a write containing tx_commit
- the writes containing tx_begin and tx_commit may complete before the updated blocks are logged
- what if the computer crashes at this point?
- is the transaction still atomic? how can we fix this?
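- one fix is the disk_flush() between the logged blocks and the commit record, as in the commit sketch earlier; another known approach (ext4 offers journal checksums) stores a checksum of the logged blocks in the commit record, so recovery can detect a commit record that raced ahead of its blocks. a toy sketch of the checksum idea, assuming a hypothetical checksum field added to tx_commit (the checksum itself is illustrative, not real crc32c):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy checksum standing in for the real thing (ext4 uses crc32c). */
static uint32_t checksum32(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + p[i];
    return sum;
}

/* At commit: fold every logged block into a checksum stored in the
 * commit record (assumes a hypothetical `checksum` field added to
 * struct tx_commit). At recovery: recompute it over the logged blocks;
 * a mismatch means the commit record was written before some logged
 * block, so the whole transaction is discarded as uncommitted. */
uint32_t tx_checksum(void *const *blocks, uint32_t nblocks)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < nblocks; i++)
        sum ^= checksum32(blocks[i], 512);   /* combine per-block sums */
    return sum;
}
```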
- interaction with buffer cache
- purpose of buffer cache: cache disk blocks in memory for faster access
- when a block is requested for the first time, it's brought into the cache
- future reads/writes can be done on the cached block instead of going to the disk each time
- this is why reads sometimes block (when the block is not in the cache)
- what happens when the cache is full? eviction? what happens to dirty blocks?
- where does the log live? are log blocks cached?
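- a sketch of one buffer-cache frame and the dirty-block rule on eviction (names illustrative; bwrite() as before):

```c
#include <stdint.h>

void bwrite(uint32_t blockno, const void *buf);  /* assumed helper */

/* One buffer-cache frame; names are illustrative. */
struct buf {
    uint32_t blockno;     /* which disk block this frame caches       */
    int      refcnt;      /* currently in use? (not evictable if > 0) */
    int      dirty;       /* modified since it was read from disk?    */
    uint8_t  data[512];   /* cached block contents                    */
};

/* On eviction, a dirty block must be written back before the frame is
 * reused, or the modification is silently lost. */
void evict(struct buf *b)
{
    if (b->dirty) {
        bwrite(b->blockno, b->data);   /* flush the modification first */
        b->dirty = 0;
    }
    /* the frame may now be reassigned to a different block */
}
```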
- interaction with fsync
- if every file system call persisted its changes to disk immediately, the system would be very slow (why?)
- purpose of fsync: let processes decide when changes to files/directories must reflect on disk
- other filesys operations (write, create, rename, etc.) can be done purely on cached blocks, no disk I/O needed
- the filesys periodically (every 10-30s) flushes changes to disk, but processes can also explicitly request persistence with fsync
- fsync(fd): a per-file/directory operation
- flushes the data and metadata of the file onto disk
- calling fsync on a newly created file does not necessarily persist changes to its parent directory (man page)
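- a standard usage pattern in C (paths are illustrative); note the second fsync on the parent directory to persist the new directory entry itself:

```c
#include <fcntl.h>
#include <unistd.h>

/* Durably create and write a small file. Paths are illustrative. */
int durable_create(void)
{
    int fd = open("/tmp/dir/newfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, "hello", 5) != 5 || fsync(fd) < 0) { /* file data+metadata */
        close(fd);
        return -1;
    }
    close(fd);

    /* fsync the parent directory too, or the entry for "newfile" may
     * not survive a crash even though the file's contents would. */
    int dirfd = open("/tmp/dir", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;
    int rc = fsync(dirfd);
    close(dirfd);
    return rc;
}
```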
- how does this interact with the log?
- on fsync, all transactions related to the file must be written to the log and persisted
- with metadata journaling, data blocks are written out to their actual locations first, and then the log is persisted
- once the log is persisted, fsync can safely return to the caller; no need to wait for the changes to be applied to their actual locations (why?)
- transaction size
- so far we've been assuming one transaction per file system operation
- every file system operation writes a transaction to the log, on fsync, all txns for the file must be persisted
- actually, on fsync we don't pick and choose to persist only the txns for that file
- all txns up to the file's last txn should be persisted, why? (hint: shared metadata state!)
- if we write a txn for each filesys op, then we need to log the state of the metadata for each operation
- what would the log look like if a process keeps appending 1 byte to a file?
- many versions of the file inode are written to the log, each with a different file size and modified time
- when log is applied, only the latest version of the inode matters
- what if we group multiple operations into a single txn?
- no need to track multiple versions of inodes from each op, just the latest version
- shared metadata (inode bitmap, block bitmap) only needs to be logged once
- ext4 has one global active txn at a time; the txn commits periodically (every 5s by default, configurable) or when fsync is called
- active txn tracks pointers to (cached) changed metadata blocks (bitmaps, inodes)
- less to log and replay!
- any downside to this?
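- a heavily simplified sketch of the grouping idea, reusing struct buf from the buffer-cache sketch above (ext4/jbd2 is far more involved):

```c
#include <stdint.h>

/* One global running transaction: instead of logging a fresh copy of
 * an inode or bitmap per operation, remember which cached metadata
 * blocks are dirty; their latest contents are logged once, at commit. */
struct active_tx {
    uint32_t    txid;
    struct buf *dirty[128];   /* pointers into the buffer cache */
    int         ndirty;
};

void tx_mark_dirty(struct active_tx *tx, struct buf *b)
{
    for (int i = 0; i < tx->ndirty; i++)
        if (tx->dirty[i] == b)
            return;               /* already tracked: another op touching
                                   * the same block adds nothing to log */
    tx->dirty[tx->ndirty++] = b;
}
```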