More on Journaling
Cost of Journaling
- a log contains transactions; each transaction groups multi-block updates into an atomic unit
- data journaling: log data and metadata
- good data integrity, strong guarantees
- but need to write all updates twice: once to journal, once to their actual location on disk
- metadata journaling: only log metadata
- writing data twice can be very expensive (can be much larger than metadata)
- only the metadata needs to be consistent for the file system to function properly
- inconsistent file data is not the end of the world...
- but data still needs to be written somehow!
- upon an fsync, issue writes for data blocks directly to their actual location
- once all data blocks are written (hear back from disk), persist the log with metadata txn
- respond to fsync, apply logged changes
- this is the default ext4 configuration (ordered mode)
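The ordering of writes in metadata journaling can be sketched as a small simulation. This is a minimal illustration, not ext4 code: `Disk`, `fsync_ordered`, and the block numbers are hypothetical names and values chosen for the example. The point is that data blocks reach their home location before the metadata txn is committed to the journal.

```python
# Sketch of metadata journaling in ordered mode.
# Disk and fsync_ordered are hypothetical names, not ext4 interfaces.

class Disk:
    def __init__(self):
        self.blocks = {}    # block number -> contents at its home location
        self.journal = []   # committed metadata transactions
        self.trace = []     # order in which writes became durable

    def write_block(self, blockno, content):
        self.blocks[blockno] = content
        self.trace.append(("home", blockno))

    def commit_txn(self, metadata_updates):
        self.journal.append(dict(metadata_updates))
        self.trace.append(("journal", tuple(sorted(metadata_updates))))


def fsync_ordered(disk, dirty_data, dirty_metadata):
    # 1. Data blocks go straight to their final location; they are never logged.
    for blockno, content in dirty_data.items():
        disk.write_block(blockno, content)
    # 2. Only after every data write has completed, persist the metadata txn.
    disk.commit_txn(dirty_metadata)
    # 3. fsync may return here; the logged metadata is then applied in place.
    for blockno, content in dirty_metadata.items():
        disk.write_block(blockno, content)


disk = Disk()
fsync_ordered(
    disk,
    dirty_data={100: b"hello", 101: b"world"},
    dirty_metadata={7: b"inode: size=10, blocks=[100, 101]"},
)
print(disk.trace)
# [('home', 100), ('home', 101), ('journal', (7,)), ('home', 7)]
```

If the crash happens before step 2, the journal holds no txn, so the old metadata stays in effect and never points at the newly written data.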
Transaction Granularity
- easiest: one fs operation = one txn
- every file system operation writes a txn to the log
- what would the log look like if a process keeps appending 1 byte to a file?
- many versions of the file inode are written to the log with different file size and modified time
- but when the log is applied, only the latest version of the inode matters
- inefficient logging :(
- what if we group multiple operations into a single txn?
- compound transaction: bundle several atomic updates into a single global transaction
- no need to track multiple versions of inodes from each op
- track a list of modified metadata (bitmaps, inodes), updates buffered in cache
- upon fsync, write the latest version of modified blocks to the log
- ext4 has one global active txn at a time, txn commits every 5s by default or when fsync is called
- less to log and replay! any downside to this?
- recall that we still need to write data blocks out before persisting the txn in the log
- compound txn means that the caller of fsync is responsible for persisting all files in the txn
- what are some bad pairings of applications that can cause lots of interference?
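The 1-byte append example can be made concrete with a toy model comparing the two granularities. This is a sketch under simplifying assumptions: `log_per_operation` and `CompoundTxn` are hypothetical names, and an inode is reduced to a dictionary holding only its size.

```python
# Hypothetical model: each append produces a new version of the file's inode.

def log_per_operation(ops):
    """One txn per operation: every inode version reaches the log."""
    log = []
    for inode_no, new_size in ops:
        log.append({("inode", inode_no): {"size": new_size}})
    return log


class CompoundTxn:
    """One active txn: keep only the latest version of each modified block."""

    def __init__(self):
        self.modified = {}  # block id -> latest buffered contents

    def record(self, inode_no, new_size):
        # A later update to the same block overwrites the earlier one.
        self.modified[("inode", inode_no)] = {"size": new_size}

    def commit(self):
        # Called on fsync or when the commit timer fires.
        txn, self.modified = self.modified, {}
        return [txn]


# 1000 one-byte appends to inode 12
appends = [(12, size) for size in range(1, 1001)]

per_op_log = log_per_operation(appends)

active = CompoundTxn()
for inode_no, size in appends:
    active.record(inode_no, size)
compound_log = active.commit()

print(len(per_op_log), len(compound_log))
# 1000 1
```

The single compound txn holds only the final inode (size 1000). The cost is that `commit` covers every file touched since the last commit, which is the source of the interference discussed above.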
- logical vs physical logging
- so far we've only considered txns that log content of updated blocks (physical logging)
- this makes recovery simple and idempotent but difficult to disentangle changes from individual file ops
- what if we log a higher-level operation instead of the changed block itself?
- logical logging!
- example: write down new extent that's allocated, use that to derive changes to bitmaps
- smaller txn, easier to separate changes from different files
- longer and more complex recovery procedure
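The extent example can be sketched as follows. Assumptions made for illustration only: a 4096-byte bitmap block, least-significant-bit-first bit numbering, and hypothetical record layouts (`"bitmap_block"`, `"alloc_extent"`) that do not correspond to any real on-disk format.

```python
BLOCK_SIZE = 4096  # assumed bitmap block size in bytes


def set_bits(bitmap, start, length):
    """Return a copy of bitmap with bits [start, start + length) set."""
    out = bytearray(bitmap)
    for bit in range(start, start + length):
        out[bit // 8] |= 1 << (bit % 8)
    return bytes(out)


def physical_record(bitmap, start, length):
    # Physical logging: store the full new contents of the bitmap block.
    return ("bitmap_block", set_bits(bitmap, start, length))


def logical_record(inode_no, start, length):
    # Logical logging: store only the operation "inode got this extent".
    return ("alloc_extent", inode_no, start, length)


def replay_physical(bitmap, record):
    # Recovery just copies the logged block over the old one.
    return record[1]


def replay_logical(bitmap, record):
    # Recovery must re-derive the bitmap change from the logged operation.
    _, _inode_no, start, length = record
    return set_bits(bitmap, start, length)


old_bitmap = bytes(BLOCK_SIZE)
phys = physical_record(old_bitmap, start=10, length=4)
logi = logical_record(inode_no=12, start=10, length=4)

# Both records lead to the same bitmap after recovery.
assert replay_physical(old_bitmap, phys) == replay_logical(old_bitmap, logi)
print(len(phys[1]), "bytes logged physically vs", len(logi) - 1, "integers logically")
```

The logical record is far smaller and names the inode it belongs to, but `replay_logical` has to know how to recompute every affected structure, which is where the extra recovery complexity comes from.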
Request Reordering
- recall that concurrent disk writes can be reordered
- to persist the log, we might need to issue a number of writes
- a write containing tx_begin
- writes containing updated version of blocks
- a block containing tx_end
- blocks containing begin and end may be written before the updated blocks are logged
- what if the computer crashes at this point?
- is the transaction still atomic? how can we fix this?
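The crash scenario can be simulated with a toy journal. This is a sketch, not a real journal format: the slot layout and function names are hypothetical. It also shows one possible fix, which is an assumption here rather than something stated in the notes: store a checksum of the logged blocks in tx_end so recovery can detect a txn whose blocks never reached the disk. Another option is to wait for the logged blocks to be durable before issuing tx_end.

```python
import zlib


def make_txn_writes(txn_id, blocks):
    """Return the journal writes for one txn as (slot, payload) pairs."""
    checksum = zlib.crc32(b"".join(blocks))
    writes = [(0, ("tx_begin", txn_id, len(blocks)))]
    writes += [(i + 1, ("block", data)) for i, data in enumerate(blocks)]
    writes.append((len(blocks) + 1, ("tx_end", txn_id, checksum)))
    return writes


def crash_after(writes, persisted_slots):
    """Model a crash where only the writes to persisted_slots reached disk."""
    journal = [("block", b"stale")] * len(writes)  # leftovers from an old txn
    for slot, payload in writes:
        if slot in persisted_slots:
            journal[slot] = payload
    return journal


def recover_naive(journal):
    # Trusts the txn as soon as matching begin and end records are present.
    begin, end = journal[0], journal[-1]
    return begin[0] == "tx_begin" and end[0] == "tx_end" and begin[1] == end[1]


def recover_checked(journal):
    # Additionally verifies the logged blocks against the checksum in tx_end.
    if not recover_naive(journal):
        return False
    logged = b"".join(data for _, data in journal[1:-1])
    return zlib.crc32(logged) == journal[-1][2]


writes = make_txn_writes(txn_id=1, blocks=[b"new inode", b"new bitmap"])

# The disk reordered the writes: begin, end, and one block landed, then a crash.
torn = crash_after(writes, persisted_slots={0, 1, 3})
complete = crash_after(writes, persisted_slots={0, 1, 2, 3})

print(recover_naive(torn), recover_checked(torn), recover_checked(complete))
```

In the torn case the naive recovery would replay a stale block as if it were part of the txn, so the txn is no longer atomic; the checksum comparison rejects it.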