Crash Consistency
Persistence
- filesystem data structures
- inode bitmap, data bitmap, inodes, data blocks
- stored on disk, but cached in memory, why?
- filesystem operations
- what structures are updated upon an overwrite? what about append?
- changes to cached blocks need to be written back to disk
- what if the computer crashes during the write back?
Case study: Append
- storage promises
- block/sector level atomicity
- concurrent disk requests might be reordered
- when a request completes, an interrupt is issued to the host
- what changes are needed?
- data bitmap: allocate new blocks
- inode: update size of file and new data location
- data block: write out actual data
- updates go to different disk blocks, and the system can crash (or lose power) at any point
- what's the consequence of crashing after just updating one thing?
- metadata and/or data inconsistency
- the crash consistency problem!
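The append case can be enumerated directly: three block writes (data bitmap, inode, data block), and a crash may leave any subset of them on disk. The sketch below is only an illustration; the names `WRITES`, `classify`, and `crash_states` are hypothetical, and the outcome labels are informal summaries rather than fixed terminology.

```python
from itertools import combinations

# The three block writes an append needs (labels are illustrative).
WRITES = ("bitmap", "inode", "data")

def classify(persisted):
    """Describe the on-disk state if only the writes in `persisted` landed."""
    p = set(persisted)
    if not p or p == set(WRITES):
        # Nothing landed, or everything did.
        return "consistent"
    if p == {"data"}:
        # The block was written but nothing points to it:
        # to the file system the append never happened.
        return "consistent (data lost)"
    if "inode" in p and "bitmap" not in p:
        return "inconsistent: inode points to a block marked free"
    if "bitmap" in p and "inode" not in p:
        return "inconsistent: block marked used but no inode owns it (space leak)"
    # Bitmap and inode agree, but the data block was never written.
    return "metadata consistent, but file reads garbage"

def crash_states():
    """Map every subset of persisted writes to its classification."""
    states = {}
    for k in range(len(WRITES) + 1):
        for subset in combinations(WRITES, k):
            states[subset] = classify(subset)
    return states

if __name__ == "__main__":
    for subset, verdict in crash_states().items():
        print(subset, "->", verdict)
```

Of the eight possible states, only "nothing landed", "everything landed", and "data only" leave the metadata in agreement without exposing garbage to readers.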
Crash Consistency
- goal: keep file system metadata consistent
- solution 1: The File System Checker (fsck)
- scans all metadata on the storage device, resolves inconsistencies
- superblock: sanity check, make sure it's not corrupted
- inodes: consistency between inode array and inode bitmap, valid inodes (format, link count)
- free blocks: since inodes track the locations of data blocks, we can derive usage info from inodes
- fs specific checks: two inodes shouldn't point to the same data block, every directory should have . and .. entries, and so on
- solution 2: Journaling / Write-ahead Logging
- fundamental problem: atomicity mismatch
- want the changes from one operation to either all persist or none persist
- but disk only provides block level atomicity
- need an abstraction to group changes: transaction
- transaction API: [begin] op1 op2 op3 ... [end/commit]
- how to implement transactions on disk?
- reserve a log space on disk (why?)
- write the transaction to the log first, persist it on disk
- then apply the changes to their actual disk location
- what happens if we crash while applying the actual changes?
- what happens if we crash while writing the transaction?
- other solutions
- soft updates: order updates to metadata to avoid inconsistency
- copy on write: never overwrite data in place; write new copies, then change a pointer to reflect the whole group of updates at once
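The write-ahead logging steps in solution 2 can be tried out on a toy model. This is a minimal sketch under stated assumptions, not how any real file system implements its journal: `Disk`, `CrashNow`, `run_transaction`, and `recover` are hypothetical names, the log holds one transaction at a time, and writes reach the disk in the order they are issued (the Journaling section questions exactly that assumption).

```python
class CrashNow(Exception):
    """Raised by the toy disk to simulate power loss."""

class Disk:
    def __init__(self, crash_after=None):
        self.blocks = {}                # home locations: block number -> contents
        self.log = []                   # reserved log region, one entry per block write
        self.writes_left = crash_after  # None means the disk never crashes

    def _tick(self):
        # Each block write is atomic; the crash happens between writes.
        if self.writes_left is not None:
            if self.writes_left == 0:
                raise CrashNow()
            self.writes_left -= 1

    def write_log(self, entry):
        self._tick()
        self.log.append(entry)

    def write_home(self, blockno, value):
        self._tick()
        self.blocks[blockno] = value

def run_transaction(disk, updates):
    """Log every update, commit, then apply the updates in place."""
    disk.write_log(("begin",))
    for blockno, value in updates.items():
        disk.write_log(("update", blockno, value))
    disk.write_log(("commit",))          # the transaction is durable from here on
    for blockno, value in updates.items():
        disk.write_home(blockno, value)  # apply to the actual disk locations
    disk.log.clear()                     # free the log (not modeled as a disk write)

def recover(disk):
    """After a reboot: replay a committed transaction, discard a partial one."""
    disk.writes_left = None
    if disk.log and disk.log[-1] == ("commit",):
        for entry in disk.log:
            if entry[0] == "update":
                disk.blocks[entry[1]] = entry[2]
    # The log is cleared only after replay, so a crash during recovery
    # just means the same replay runs again on the next boot.
    disk.log.clear()

if __name__ == "__main__":
    updates = {1: "bitmap'", 2: "inode'", 3: "data"}
    for n in range(9):
        disk = Disk(crash_after=n)
        try:
            run_transaction(disk, updates)
        except CrashNow:
            pass
        recover(disk)
        print(n, disk.blocks)
```

In this model a crash before the commit entry leaves the home locations untouched and the partial log is thrown away; a crash after it is repaired by replaying the log. Either way the three updates appear all together or not at all.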
Journaling
- request reordering: how to order I/O for journaling?
- recall that concurrent I/Os can be reordered
- if we issue log writes with blocks containing begin, ops, end, what if the end block is written first?
- how do we restrict ordering of I/O requests?
- do we have to restrict ordering? is there anything else we can do?
- recovery process: how to restore consistent state from the log
- replay completed transactions (those with a tx_end) from the log
- what if we crash in the middle of recovering?
- data vs metadata journaling: what to log?
- data: for each operation, journal changes to data and metadata
- metadata: for each operation, journal changes to metadata only
- what about directory data? metadata or data?
- but what about the data blocks? when do we write them?
- which one is more commonly used?
- other problems
- granularity of transaction
- what to do if the log is full?
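One possible answer to the reordering question is to let the log writes land in any order and make the end block self-validating: it stores a checksum over the transaction's contents, and recovery accepts the transaction only if the checksum matches what is actually in the log. The alternative is to enforce order, for example by waiting for the begin and op blocks to complete before issuing the end block. The sketch below illustrates the checksum approach only; all names are hypothetical, SHA-256 stands in for the cheaper checksum a real journal would likely use, and the log slots are assumed to hold stale contents from an earlier transaction of the same size.

```python
import hashlib
from itertools import permutations

def digest(payloads):
    """Checksum over the transaction's op payloads (SHA-256 as a stand-in)."""
    return hashlib.sha256(b"".join(payloads)).digest()

def make_tx(payloads):
    """Log blocks for one transaction: begin, one block per op, end with checksum."""
    blocks = [("begin",)]
    blocks += [("op", p) for p in payloads]
    blocks.append(("end", digest(payloads)))
    return blocks

def committed_payloads(slots):
    """Return the op payloads if the log holds a complete transaction, else None."""
    begin, *ops, end = slots
    if begin is None or end is None:
        return None
    payloads = [op[1] for op in ops]
    if digest(payloads) != end[1]:
        return None   # some op block is still stale: treat as not committed
    return payloads

def simulate_all_orders(new_payloads, stale_payloads):
    """Try every write order and every crash point; collect recovery outcomes."""
    tx = make_tx(new_payloads)
    outcomes = set()
    for order in permutations(range(len(tx))):
        for crash_after in range(len(tx) + 1):
            # Log region before the transaction: no begin/end, stale op blocks.
            slots = [None] + [("op", p) for p in stale_payloads] + [None]
            for idx in order[:crash_after]:
                slots[idx] = tx[idx]     # this block reached the disk
            result = committed_payloads(slots)
            outcomes.add(None if result is None else tuple(result))
    return outcomes

if __name__ == "__main__":
    print(simulate_all_orders([b"bmap", b"inod"], [b"old0", b"old1"]))
```

Across every ordering and crash point, recovery sees either no transaction or the complete new one; an end block that lands before its op blocks never causes stale contents to be replayed.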