Crash Consistency

Persistence

filesystem data structures

inode bitmap, data bitmap, inodes, data blocks
stored on disk, but cached in memory, why?

filesystem operations

what structures are updated upon an overwrite? what about append?
changes to cached blocks need to be written back to disk
what if the computer crashes during the write-back?

Case study: Append

storage promises

block/sector-level atomicity: a single block write either completes fully or not at all
concurrent disk requests might be reordered
when a request completes, an interrupt is issued to the host

what changes are needed?
- data bitmap: allocate new blocks
- inode: update size of file and new data location
- data block: write out actual data
updates go to different disk blocks, and the system can crash at any point between them
what's the consequence of crashing after just updating one thing?
- metadata and/or data inconsistency
- the crash consistency problem!
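
A minimal sketch that enumerates which of the three append writes reached disk before a crash and labels the resulting state. The names (WRITES, classify, crash_outcomes) and the wording of each outcome are illustrative assumptions, not taken from any real file system:

```python
from itertools import combinations

# The three block writes needed for an append, as listed above.
WRITES = ("bitmap", "inode", "data")

def classify(persisted):
    """Describe the on-disk state if only the writes in `persisted` reached disk."""
    done = frozenset(persisted)
    outcomes = {
        frozenset(): "consistent: append never happened",
        frozenset(WRITES): "consistent: append fully applied",
        frozenset({"data"}): "consistent: data sits in an unallocated block, append lost",
        frozenset({"bitmap"}): "inconsistent: block marked used, no inode points to it (leak)",
        frozenset({"inode"}): "inconsistent: inode points to a block the bitmap says is free",
        frozenset({"bitmap", "data"}): "inconsistent: block marked used, no inode points to it (leak)",
        frozenset({"inode", "data"}): "inconsistent: inode points to a block the bitmap says is free",
        frozenset({"bitmap", "inode"}): "metadata consistent, but the file reads garbage data",
    }
    return outcomes[done]

def crash_outcomes():
    """Map every subset of the three writes to its consequence."""
    return {
        subset: classify(subset)
        for r in range(len(WRITES) + 1)
        for subset in combinations(WRITES, r)
    }

if __name__ == "__main__":
    for subset, outcome in crash_outcomes().items():
        print(sorted(subset), "->", outcome)
```

Note the bitmap + inode case: the metadata agrees with itself, yet the file contents are wrong. This is why the goal is later stated in terms of metadata consistency.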

Crash Consistency

goal: keep file system metadata consistent
solution 1: The File System Checker (fsck)
- scans all metadata on the storage device and resolves inconsistencies
solution 2: Journaling / Write-ahead Logging

fundamental problem: atomicity mismatch
- want all changes from one operation to persist, or none of them
- but the disk only provides block-level atomicity
- need an abstraction to group changes: transaction
transaction API: [begin] op1 op2 op3 ... [end/commit]
how to implement transactions on disk?

reserve a log space on disk (why?)
write the transaction to the log first and persist it on disk
then apply the changes to their actual disk locations
what happens if we crash while applying the actual changes?
what happens if we crash while writing the transaction?
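
The two crash questions above can be explored with a toy model. This is a sketch under assumptions: Disk, CrashNow, run_transaction, recover, and the crash_after parameter are all hypothetical, the "disk" is a Python dict, and the log holds at most one transaction.

```python
class Disk:
    """Toy disk: in-place blocks plus a reserved log region."""
    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})
        self.log = []

class CrashNow(Exception):
    """Raised to simulate losing power before the next write."""

def run_transaction(disk, updates, crash_after=None):
    """Apply `updates` (block name -> new value) through a write-ahead log.

    crash_after: how many individual writes succeed before a simulated
    crash; None means no crash.
    """
    remaining = [crash_after]

    def write(action):
        if remaining[0] is not None:
            if remaining[0] == 0:
                raise CrashNow()
            remaining[0] -= 1
        action()

    # 1. write the whole transaction to the log
    write(lambda: disk.log.append(("begin",)))
    for blk, val in updates.items():
        write(lambda blk=blk, val=val: disk.log.append(("update", blk, val)))
    write(lambda: disk.log.append(("commit",)))
    # 2. apply the changes to their actual locations
    for blk, val in updates.items():
        write(lambda blk=blk, val=val: disk.blocks.__setitem__(blk, val))
    # 3. the transaction is fully applied, so free the log space
    write(disk.log.clear)

def recover(disk):
    """After a crash: redo a committed transaction, discard an incomplete one."""
    if disk.log and disk.log[-1] == ("commit",):
        for record in disk.log:
            if record[0] == "update":
                disk.blocks[record[1]] = record[2]
    disk.log.clear()

if __name__ == "__main__":
    old = {"bitmap": 0, "inode": 0, "data": 0}
    new = {"bitmap": 1, "inode": 1, "data": 1}
    for crash_point in range(10):
        disk = Disk(old)
        try:
            run_transaction(disk, new, crash_after=crash_point)
        except CrashNow:
            pass
        recover(disk)
        print(crash_point, disk.blocks)
```

In this model a crash while writing the transaction leaves no commit record, so recovery discards the log and the old state survives. A crash while applying the changes leaves a committed log, so recovery redoes it and the new state survives.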

other solutions

soft updates: order updates to metadata to avoid inconsistency
copy-on-write: never overwrite data in place; write new copies, then switch a pointer so the whole group of updates becomes visible at once

Journaling

request reordering: how to order I/O for journaling?

recall that concurrent I/Os can be reordered
if we issue the log writes (begin, ops, end) together, what if the end block reaches disk first?
how do we restrict ordering of I/O requests?
do we have to restrict ordering? is there anything else we can do?
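
One possible answer to the last question, sketched under assumptions: instead of forcing the end block to be written last, store a checksum of the transaction's blocks in the end block, and have recovery treat a mismatch as "transaction incomplete". The function names and the block contents below are made up for illustration; zlib.crc32 is the standard-library CRC-32.

```python
import zlib

def transaction_checksum(payload_blocks):
    """CRC-32 over all blocks of the transaction, in log order."""
    crc = 0
    for block in payload_blocks:
        crc = zlib.crc32(block, crc)
    return crc

def transaction_is_complete(blocks_in_log, checksum_in_end_block):
    """Recovery-time check: do the blocks found in the log match the end block?"""
    return transaction_checksum(blocks_in_log) == checksum_in_end_block

if __name__ == "__main__":
    intended = [b"tx_begin", b"inode v2", b"bitmap v2"]
    end_block = transaction_checksum(intended)

    # Every block reached disk: the transaction can be replayed.
    print(transaction_is_complete(intended, end_block))

    # The end block reached disk first and the crash hit before the inode
    # block was written, so the log slot still holds stale contents.
    stale = [b"tx_begin", b"inode v1", b"bitmap v2"]
    print(transaction_is_complete(stale, end_block))
```

With this check the begin, ops, and end blocks can be issued together, since an end block that lands early no longer makes a partial transaction look complete.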

recovery process: how to restore consistent state from the log

replay completed transactions (those with a tx_end) from the log
what if we crash in the middle of recovering?
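
A sketch of replay, assuming the log records full new block contents (so redoing an update twice gives the same result as doing it once). The record layout ("tx_begin", "update", "tx_end") and the function names are hypothetical:

```python
def committed_transactions(log):
    """Yield the update list of each transaction that has a matching tx_end."""
    current = None
    for record in log:
        kind = record[0]
        if kind == "tx_begin":
            current = []
        elif kind == "update" and current is not None:
            current.append((record[1], record[2]))
        elif kind == "tx_end" and current is not None:
            yield current
            current = None
    # A trailing transaction with no tx_end is never yielded: it is skipped.

def replay(log, blocks):
    """Redo every committed update, in log order, onto `blocks`."""
    for updates in committed_transactions(log):
        for block, value in updates:
            blocks[block] = value
    return blocks

if __name__ == "__main__":
    log = [
        ("tx_begin",), ("update", "inode", 1), ("update", "bitmap", 1), ("tx_end",),
        ("tx_begin",), ("update", "inode", 2),  # crashed before tx_end
    ]
    blocks = {"inode": 0, "bitmap": 0}
    replay(log, blocks)
    print(blocks)
    replay(log, blocks)  # crash during recovery: just run recovery again
    print(blocks)
```

Because each update overwrites a block with a fixed value, a crash in the middle of recovery is handled by starting recovery over from the beginning of the log.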

data vs metadata journaling: what to log?

data: for each operation, journal changes to data and metadata
metadata: for each operation, journal changes to metadata only

but with metadata journaling, what about the data blocks? when do we write them?
what about directory data? is it metadata or data?

which one is more commonly used?
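
To compare the two options, here is a sketch of the write schedule each one implies. It assumes that metadata journaling writes data blocks to their final location before the metadata transaction is committed (often called ordered mode); write_plan, times_written, and the block names are illustrative only.

```python
def write_plan(mode, data_blocks, metadata_blocks):
    """Return the phases of I/O, in order, as (destination, blocks) pairs."""
    if mode == "data":
        # Data journaling: everything goes through the log, then in place.
        return [
            ("log", ["tx_begin", *metadata_blocks, *data_blocks]),
            ("log", ["tx_end"]),
            ("in_place", [*metadata_blocks, *data_blocks]),
        ]
    if mode == "ordered":
        # Metadata journaling: data goes in place first, so committed
        # metadata never points at a block that was not yet written.
        return [
            ("in_place", list(data_blocks)),
            ("log", ["tx_begin", *metadata_blocks]),
            ("log", ["tx_end"]),
            ("in_place", list(metadata_blocks)),
        ]
    raise ValueError(f"unknown mode: {mode}")

def times_written(plan, block):
    """How many times `block` is written across the whole plan."""
    return sum(blocks.count(block) for _, blocks in plan)

if __name__ == "__main__":
    for mode in ("data", "ordered"):
        plan = write_plan(mode, ["D"], ["inode", "bitmap"])
        print(mode, "-> data block written", times_written(plan, "D"), "time(s)")
        for phase in plan:
            print("   ", phase)
```

The trade-off visible here is write traffic: data journaling writes every data block twice, while the metadata-only schedule writes it once.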

other problems

granularity of transaction
what to do if the log is full?
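
One common way to handle a full log is to treat it as a circular buffer: once a transaction's changes have been applied in place, its log space can be reused. A minimal sketch of the bookkeeping, with a hypothetical CircularLog class that tracks only block counts, not contents:

```python
class CircularLog:
    """Tracks used space in a fixed-size log with absolute head/tail counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.head = 0  # oldest block not yet applied in place
        self.tail = 0  # next block position to write

    def free_blocks(self):
        return self.capacity - (self.tail - self.head)

    def append(self, nblocks):
        """Reserve space for a transaction; False means the log is full."""
        if nblocks > self.capacity:
            raise ValueError("transaction larger than the whole log")
        if nblocks > self.free_blocks():
            return False  # caller must apply older transactions first
        self.tail += nblocks
        return True

    def checkpoint(self, nblocks):
        """Mark the oldest `nblocks` as applied in place, freeing their space."""
        self.head = min(self.head + nblocks, self.tail)

if __name__ == "__main__":
    log = CircularLog(capacity=8)
    print(log.append(5))   # fits
    print(log.append(5))   # log full: only 3 blocks free
    log.checkpoint(5)      # first transaction applied in place
    print(log.append(5))   # now it fits
```

The ValueError case ties back to granularity: a transaction can never be larger than the log itself.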