Lecture 16: Crash consistency
Preparation
- Read OSPP §14, Reliable Storage.
administrivia
- Lab X proposal due today
- Lab 5 posted
- Lab 4 - know how to debug/get unstuck
- sections & office hours change
recap from last lecture
- block device
- general abstractions of storage devices
- read/write blocks
- flush - will see in this lecture
- on-disk data structures
- superblock, free map, inode blocks, data blocks
- example: how to locate /d/f
- example: how to update the file
- example: how to append to the file
- example: how to create a new file
- problem: multi-step operations
- need to update multiple blocks
- example: file create
- allocate a new block
- add dirent entry to parent directory
- write file inode
- which operations should happen first?
- what happens if the operations don’t complete?
- dirent entry points to uninitialized inode - reliability & security
- block allocated but not used - waste
- compare to memory bugs: dangling pointers and leaks
- how to tell if multi-block updates are incomplete?
- recovery: need to either continue with or undo incomplete updates
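The ordering question for file create can be explored with a small sketch. It is a toy model, not real file system code: each step is treated as one atomic block write, and the step names and the `FREE_IN_USE` case are assumptions added for illustration (the notes only list the first two problems). It tries every write order and every crash point and reports which inconsistencies can be left on disk.

```python
from itertools import permutations

# Hypothetical names for the three block writes of a file create.
ALLOC, INODE, DIRENT = "allocate block", "write inode", "add dirent"

DANGLING = "dirent points to uninitialized inode"
LEAK = "block allocated but not used"
FREE_IN_USE = "block in use but still marked free"  # extra case, not in the notes

def problems(done):
    """Inconsistencies visible on disk when only the steps in `done` completed."""
    found = set()
    if DIRENT in done and INODE not in done:
        found.add(DANGLING)
    if DIRENT in done and ALLOC not in done:
        found.add(FREE_IN_USE)
    if ALLOC in done and DIRENT not in done:
        found.add(LEAK)
    return found

def crash_outcomes(order):
    """Union of problems over every crash point (after 0..len(order) writes)."""
    found = set()
    for k in range(len(order) + 1):
        found |= problems(set(order[:k]))
    return found

if __name__ == "__main__":
    for order in permutations((ALLOC, INODE, DIRENT)):
        print(" -> ".join(order), ":", sorted(crash_outcomes(order)) or "ok")
```

In this model, every order that writes the dirent last can only leak a block, which is wasteful but harmless and can be reclaimed later. Any order that writes the dirent earlier can expose an uninitialized inode or a block that is in use but still marked free.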
overview
- assumptions
- the system can go down at any arbitrary point
- fail-stop: the disk may miss writes
- no arbitrary writes, no disk corruption
- power failures, hardware failures, kernel panics
- single sector/block write is often atomic (by hw)
- ordering: can two writes be reordered? - more on this later
- goal: crash safety
- the file system must be “usable” after reboot and recovery
- invariant: maintain consistent file system state
- approaches
- introduce redundancy: replication, checksums
- best effort repair
- sync metadata change + fsck (garbage collection)
- soft updates
- most of today’s file systems
- write-ahead logging/journaling (today)
- copy on write (guest lecture on btrfs)
logging/journaling
- basic idea: “transaction”
- log everything you intend to do before making any destructive changes
- mark “done” in the log
- do the changes
- delete the log
- if crash
- “done” in log: redo writes from the log
- no “done” in log: simply discard the log
- how about a crash while replaying the log?
- ensure disk won’t reorder these steps - need to flush the disk in between
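The transaction steps above can be sketched as a small simulation. Everything here is an assumption made for illustration: the `Disk` class models a volatile write cache where, on a crash, any subset of unflushed writes may have reached the disk; single block writes are atomic; the log lives at hypothetical addresses `("log", i)` and the "done" mark is a `"commit"` block holding the number of log entries. `crash_safe` crashes the commit at every point, runs recovery, and checks that the data blocks are either all old or all new.

```python
from itertools import combinations

class Crash(Exception):
    """Raised by the toy disk to simulate the machine going down."""

class Disk:
    """Toy disk: writes sit in a volatile cache until flush() makes them durable."""
    def __init__(self, blocks, crash_after=None):
        self.durable = {"commit": 0, **blocks}
        self.cache = []                 # pending (addr, value) writes
        self.crash_after = crash_after  # operations that succeed before the crash

    def _tick(self):
        if self.crash_after is not None:
            if self.crash_after == 0:
                raise Crash()
            self.crash_after -= 1

    def read(self, addr):
        for a, v in reversed(self.cache):   # newest cached write wins
            if a == addr:
                return v
        return self.durable.get(addr)

    def write(self, addr, value):
        self._tick()
        self.cache.append((addr, value))

    def flush(self):
        self._tick()
        self.durable.update(self.cache)
        self.cache = []

def install(disk):
    """Redo the logged writes, then delete the log. Safe to repeat."""
    for i in range(disk.read("commit")):
        entry = disk.read(("log", i))
        if entry is None:
            raise RuntimeError("commit record is durable but a log block is not")
        disk.write(*entry)
    disk.flush()
    disk.write("commit", 0)
    disk.flush()

def commit(disk, updates, barrier=True):
    for i, entry in enumerate(updates.items()):  # 1. log the intended writes
        disk.write(("log", i), entry)
    if barrier:
        disk.flush()                             #    log must be durable before "done"
    disk.write("commit", len(updates))           # 2. mark "done" (one atomic block)
    disk.flush()
    install(disk)                                # 3. do the changes, 4. delete the log

def recover(disk):
    if disk.read("commit"):   # "done" in log: redo the writes
        install(disk)
    # no "done" in log: nothing to do, stale log blocks are ignored

def crash_safe(initial, updates, barrier=True):
    """True if every crash point and every surviving subset recovers to old or new."""
    old = {a: initial.get(a) for a in updates}
    for k in range(2 * len(updates) + 7):
        disk = Disk(initial, crash_after=k)
        try:
            commit(disk, updates, barrier)
        except Crash:
            pass
        for r in range(len(disk.cache) + 1):
            for survived in combinations(disk.cache, r):
                after = Disk({**disk.durable, **dict(survived)})
                try:
                    recover(after)
                except RuntimeError:
                    return False
                if {a: after.read(a) for a in updates} not in (old, updates):
                    return False
    return True

if __name__ == "__main__":
    print(crash_safe({"A": 1, "B": 2}, {"A": 10, "B": 20}))
    print(crash_safe({"A": 1, "B": 2}, {"A": 10, "B": 20}, barrier=False))
```

With the flush between the log writes and the "done" mark, every crash in this model recovers cleanly, including a crash during replay, because redoing the log is idempotent. With `barrier=False` the "done" mark can become durable before a log block does, and recovery fails.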
- example: xv6 (begin_op, log_write, end_op)
- discussions
- why “write-ahead logging” style works
- performance pros/cons
- batch writes?
- concurrency?
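For the batching and concurrency questions, here is a simplified sketch loosely modeled on how the xv6 log groups operations. It is single-threaded and writes nothing to disk; the class and its fields are illustrative assumptions, and the real xv6 code additionally uses locks, sleeps when the log may run out of space, and performs the actual commit. It shows two ideas: repeated writes to the same block take one log slot (absorption), and the last outstanding operation to finish commits on behalf of all of them.

```python
class Log:
    """Simplified, single-threaded model of an xv6-style log (no locks, no disk)."""
    def __init__(self):
        self.outstanding = 0  # file system operations currently in progress
        self.blocks = []      # block numbers logged in the current transaction
        self.commits = []     # each commit: the block numbers written together

    def begin_op(self):
        self.outstanding += 1

    def log_write(self, blockno):
        if self.outstanding < 1:
            raise RuntimeError("log_write outside of a transaction")
        if blockno not in self.blocks:  # absorption: one log slot per block
            self.blocks.append(blockno)

    def end_op(self):
        self.outstanding -= 1
        if self.outstanding == 0 and self.blocks:  # last one out commits for all
            self.commits.append(self.blocks)
            self.blocks = []

if __name__ == "__main__":
    log = Log()
    log.begin_op()      # operation 1 starts
    log.begin_op()      # operation 2 overlaps with it
    log.log_write(3)
    log.log_write(7)
    log.log_write(3)    # absorbed: block 3 is already in the log
    log.end_op()        # one operation still outstanding, no commit yet
    log.end_op()        # last operation finishes: one commit for both
    print(log.commits)
```

Batching this way trades latency for throughput: an operation's changes are not durable until the whole group commits, but the group pays for the log flushes once.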
- quiz
- suppose block 0 has value 1 initially
- read the value, bump it by one, write the new value back to disk
- repeat the above step
- what are the possible values if the machine crashes in the middle?
- how about a crash after the last step?
- more examples
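One way to reason about the quiz is to enumerate outcomes under an explicit model. The sketch below assumes a single block whose writes are atomic, a volatile write cache, and a crash after which any subset of unflushed writes may have reached the disk. The function name and the `flush_after_write` parameter are hypothetical; whether the code in the quiz flushes is exactly the thing to ask.

```python
def values_after_crash(completed_steps, flush_after_write, initial=1):
    """Values block 0 may hold if the crash comes after `completed_steps` bumps."""
    durable, pending = initial, []
    for _ in range(completed_steps):
        current = pending[-1] if pending else durable  # read sees the newest write
        pending.append(current + 1)                    # bump and write back
        if flush_after_write:
            durable, pending = pending[-1], []         # write is now on disk
    # Any unflushed write may or may not have reached the disk before the crash.
    return {durable, *pending}

if __name__ == "__main__":
    for flush in (False, True):
        for steps in range(3):
            print("flush" if flush else "no flush", steps,
                  sorted(values_after_crash(steps, flush)))
```

Under these assumptions, a crash after the last step can still leave 1, 2, or 3 on disk when nothing is flushed, and only 3 when each write is followed by a flush.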