Lecture 16: Crash consistency

Preparation

block device
- general abstractions of storage devices
- read/write blocks
- flush - will see in this lecture
on-disk data structures
- superblock, free map, inode blocks, data blocks
- example: how to locate /d/f
- example: how to update the file
- example: how to append to the file
- example: how to create a new file
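
A minimal sketch of the "/d/f" lookup example, assuming a toy layout that is not from the lecture: an in-memory inode table, directory blocks that hold (name, inode number) pairs, and a root directory at a fixed inode number. The names `inodes`, `blocks`, `lookup`, and `read_file` are hypothetical.

```python
ROOT_INUM = 1  # assumption: the root directory lives at a fixed inode number

# inode table: inode number -> (type, list of data block numbers)
inodes = {
    1: ("dir", [10]),   # /
    2: ("dir", [11]),   # /d
    3: ("file", [12]),  # /d/f
}

# data blocks: directory blocks hold (name, inode number) entries,
# file blocks hold raw bytes
blocks = {
    10: [("d", 2)],
    11: [("f", 3)],
    12: b"hello",
}

def lookup(path):
    """Walk the path one component at a time, starting from the root inode."""
    inum = ROOT_INUM
    for name in [p for p in path.split("/") if p]:
        kind, data_blocks = inodes[inum]
        if kind != "dir":
            raise NotADirectoryError(name)
        # Scan every directory block of the current inode for the name.
        entries = [e for b in data_blocks for e in blocks[b]]
        match = [child for (n, child) in entries if n == name]
        if not match:
            raise FileNotFoundError(path)
        inum = match[0]
    return inum

def read_file(path):
    """Resolve the path, then concatenate the file's data blocks."""
    kind, data_blocks = inodes[lookup(path)]
    return b"".join(blocks[b] for b in data_blocks)
```

Each path component costs one inode read plus at least one directory block read, so locating /d/f touches several blocks before any file data is read.
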
problem: multi-step operations
- need to update multiple blocks
  - example: file create
    - allocate a new block
    - add dirent to parent directory
    - write file inode
  - which operations should happen first?
  - what happens if the operations don’t complete?
    - dirent points to uninitialized inode - reliability & security
    - block allocated but not used - waste
    - compare to memory bugs: dangling pointers and leaks
- how to tell if multi-block updates are incomplete?
- recovery: need to either continue with or undo incomplete updates
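
The file create example can be simulated by stopping after each of its three writes and checking what is left on disk. This is a sketch under assumptions: the dictionary-based disk, the write order (free map, then dirent, then inode), and the names `fresh_disk`, `create_file`, and `check` are all hypothetical.

```python
def fresh_disk():
    return {
        "free_map": {20: False},  # block number -> allocated?
        "inodes": {},             # inode number -> list of data blocks
        "root_dir": {},           # name -> inode number
    }

def create_file(disk, name, inum, block, crash_after):
    """Apply the three writes in order, stopping after `crash_after` of them."""
    def alloc_block():
        disk["free_map"][block] = True
    def add_dirent():
        disk["root_dir"][name] = inum
    def write_inode():
        disk["inodes"][inum] = [block]
    for step in [alloc_block, add_dirent, write_inode][:crash_after]:
        step()

def check(disk):
    """Report the two inconsistencies listed above."""
    problems = []
    for name, inum in disk["root_dir"].items():
        if inum not in disk["inodes"]:
            problems.append(f"dangling dirent: {name} -> inode {inum}")
    referenced = {b for blks in disk["inodes"].values() for b in blks}
    for block, allocated in disk["free_map"].items():
        if allocated and block not in referenced:
            problems.append(f"leaked block: {block}")
    return problems

def problems_after_crash(crash_after):
    disk = fresh_disk()
    create_file(disk, "f", 5, 20, crash_after)
    return check(disk)
```

With this order, a crash after the first write leaks a block, and a crash after the second also leaves a dirent that points to an uninitialized inode. Writing the inode before the dirent would avoid the dangling dirent but not the leak, which is why ordering alone still needs a repair pass.
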

assumptions
- the system can go down at any arbitrary point
- fail-stop: the disk may miss writes
  - no arbitrary writes, no disk corruption
  - power failures, hardware failures, kernel panics
- single sector/block write is often atomic (by hw)
- ordering: can two writes be reordered? - more on this later
goal: crash safety
- the file system must be “usable” after reboot and recovery
- invariant: maintain consistent file system state
approaches
- introduce redundancy: replication, checksums
- best effort repair
- sync metadata change + fsck (garbage collection)
- soft updates
- most of today’s file systems
  - write-ahead logging/journaling (today)
  - copy on write (guest lecture on btrfs)

basic idea: “transaction”
- log everything you intend to do before making any destructive changes
- mark “done” in the log
- do the changes
- delete the log
- if crash
  - “done” in log: redo writes from the log
  - no “done” in log: simply discard the log
- what about a crash while replaying the log? - replay is idempotent, so just replay again
- ensure disk won’t reorder these steps - need to flush the disk in between
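
The transaction steps above can be sketched as a small simulation. It assumes a toy disk where each block write is atomic and writes reach the disk in program order, so `flush` is only a marker for where a real disk would need a barrier. The `Disk` class, the `writes_left` crash counter, and the function names are hypothetical, not xv6 code.

```python
class Crash(Exception):
    pass

class Disk:
    def __init__(self, blocks):
        self.blocks = dict(blocks)  # home locations: block number -> value
        self.log = []               # log region: (block number, new value)
        self.committed = False      # the "done" mark in the log
        self.writes_left = None     # None = never crash

    def tick(self):
        """Count one block write; crash once the budget is used up."""
        if self.writes_left is not None:
            if self.writes_left == 0:
                raise Crash()
            self.writes_left -= 1

    def flush(self):
        pass  # no-op here; a real disk needs a barrier at these points

def install(disk):
    """Copy logged values to their home locations (safe to repeat)."""
    for blockno, value in disk.log:
        disk.tick()
        disk.blocks[blockno] = value

def commit(disk, updates):
    for blockno, value in updates.items():  # 1. log everything first
        disk.tick()
        disk.log.append((blockno, value))
    disk.flush()
    disk.tick()
    disk.committed = True                   # 2. mark "done"
    disk.flush()
    install(disk)                           # 3. do the changes
    disk.flush()
    disk.tick()
    disk.log, disk.committed = [], False    # 4. delete the log

def recover(disk):
    disk.writes_left = None
    if disk.committed:
        install(disk)                       # "done" in log: redo
    disk.log, disk.committed = [], False    # otherwise: discard

def run_with_crash(k):
    """Let k block writes succeed, crash, recover, return the blocks."""
    disk = Disk({1: "A1", 2: "A2"})
    disk.writes_left = k
    try:
        commit(disk, {1: "B1", 2: "B2"})
    except Crash:
        pass
    recover(disk)
    return disk.blocks
```

Trying every crash point with `run_with_crash` yields either both old values or both new values, never a mix. The commit write is the single point where the outcome switches from "discard" to "redo".
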
example: xv6
- begin_op
- log_write
- end_op
discussions
- why “write-ahead logging” style works
- performance pros/cons
  - batch writes?
  - concurrency?
- quiz
  - suppose block 0 has value 1 initially
  - read the value, bump it by one, write the new value back to disk
  - repeat the above step
  - what are the possible values if the machine crashes in the middle?
  - what about a crash after the last step?
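
One way to explore the quiz is to enumerate crash points. This sketch assumes that a single-block write is atomic and that writes reach the disk in program order with no volatile write cache; `value_after_crash` is a hypothetical helper, not part of the lecture.

```python
def value_after_crash(writes_completed, rounds=2, initial=1):
    """Return block 0's on-disk value if only `writes_completed` writes land."""
    disk = {0: initial}
    done = 0
    for _ in range(rounds):
        value = disk[0]   # read the value
        value += 1        # bump it in memory; this copy is lost on a crash
        if done == writes_completed:
            break         # crash before this write reaches the disk
        disk[0] = value   # atomic single-block write (assumption)
        done += 1
    return disk[0]

# Crash before the first write, between the two writes, or after both.
possible = sorted({value_after_crash(k) for k in range(3)})
```

Under these assumptions `possible` is `[1, 2, 3]`. If the disk has a volatile write cache and no flush was issued, a crash even after the last step may still leave an older value on disk.
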
more examples
- ext4, NTFS
- log-structured file systems: conceptually, the entire file system is a log
- leveldb: key-value store - not a file system, but shares many ideas