More on Journaling

Cost of Journaling

the journal is a log of transactions; each transaction groups a multi-block update into an atomic unit
data journaling: log data and metadata

good data integrity, strong guarantees
but every update is written twice: once to the journal, once to its actual location on disk

metadata journaling: only log metadata

writing data twice can be very expensive (data can be much larger than metadata)
only the metadata needs to be consistent for the file system to function properly
inconsistent file data is not the end of the world...
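a rough back-of-the-envelope sketch of the write cost in each mode. the 4 KiB block size, the block counts, and the helper name bytes_written are assumptions for illustration; begin/end records are ignored.

```python
BLOCK = 4096  # assumed block size in bytes


def bytes_written(data_blocks, metadata_blocks, mode):
    """Total bytes sent to disk for one update, ignoring begin/end records."""
    # Every block is eventually written to its actual location.
    home = (data_blocks + metadata_blocks) * BLOCK
    if mode == "data":
        # Data journaling: data and metadata both go through the log first.
        journal = (data_blocks + metadata_blocks) * BLOCK
    elif mode == "metadata":
        # Metadata journaling: only metadata goes through the log.
        journal = metadata_blocks * BLOCK
    else:
        raise ValueError(mode)
    return journal + home


# Hypothetical update: 1 MiB of file data plus 3 metadata blocks.
full = bytes_written(256, 3, "data")
meta = bytes_written(256, 3, "metadata")
print(full, meta)
```

with these made-up numbers, data journaling writes roughly twice as many bytes as metadata journaling, because the data dominates the update.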

but data still needs to be written somehow!
upon an fsync, issue writes for data blocks directly to their actual location
once all data blocks are written (the disk has acknowledged them), persist the metadata txn in the log
respond to fsync, apply logged changes

this is the default ext4 configuration (ordered mode)
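the ordering above can be sketched with a toy disk that records when each write becomes durable. this is only an illustration of the ordering, not ext4 code; Disk, write_and_wait, and fsync_ordered are hypothetical names.

```python
class Disk:
    """Toy disk that records the order in which writes become durable."""

    def __init__(self):
        self.durable = []

    def write_and_wait(self, label):
        # Stands in for issuing a write and waiting for the disk to acknowledge it.
        self.durable.append(label)


def fsync_ordered(disk, data_blocks, metadata_blocks):
    # 1. Data blocks go directly to their actual location.
    for b in data_blocks:
        disk.write_and_wait(("home", b))
    # 2. Only after all data is durable, persist the metadata txn in the log.
    disk.write_and_wait(("log", "tx_begin"))
    for b in metadata_blocks:
        disk.write_and_wait(("log", b))
    disk.write_and_wait(("log", "tx_end"))
    # 3. fsync can respond here; the logged changes are applied afterwards.
    acked_at = len(disk.durable)
    for b in metadata_blocks:
        disk.write_and_wait(("home", b))
    return acked_at


disk = Disk()
acked = fsync_ordered(disk, ["D1", "D2"], ["inode", "bitmap"])
for i, w in enumerate(disk.durable):
    print(i, w, "<- fsync returns after this write" if i == acked - 1 else "")
```

if a crash happens before tx_end is durable, the metadata txn is ignored and the file keeps its old metadata; the data blocks already written are simply not referenced.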

Transaction Granularity

easiest: one fs operation = one txn

every file system operation writes a txn to the log
what would the log look like if a process keeps appending 1 byte to a file?

many versions of the file inode are written to the log with different file size and modified time
but when the log is applied, only the latest version of the inode matters
inefficient logging :(
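a small sketch of the 1-byte append case, assuming each txn logs a full copy of the inode. the record layout and function names are made up for illustration.

```python
def append_one_byte_per_txn(n_appends):
    """Each append is its own txn that logs a fresh copy of the inode."""
    log = []
    size = 0
    for t in range(n_appends):
        size += 1
        # New inode version: updated file size and modified time.
        log.append({"txn": t, "inode": {"size": size, "mtime": t}})
    return log


def replay(log):
    """Apply txns in order; each inode version overwrites the previous one."""
    inode = None
    for rec in log:
        inode = rec["inode"]
    return inode


log = append_one_byte_per_txn(1000)
final = replay(log)
print(len(log), "inode versions logged; only the last one matters:", final)
```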

what if we group multiple operations into a single txn?

compound transaction: bundle several atomic updates into a single global transaction
no need to track multiple versions of inodes from each op

track a list of modified metadata (bitmaps, inodes), updates buffered in cache
upon fsync, write the latest version of modified blocks to the log
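a minimal sketch of a compound txn, assuming modified metadata is tracked in a dictionary keyed by block. CompoundTxn and the block names are hypothetical, not a real file system API.

```python
class CompoundTxn:
    """Single active txn: keeps only the newest version of each dirty metadata block."""

    def __init__(self):
        self.dirty = {}  # block id -> latest contents, buffered in cache

    def update(self, block_id, contents):
        # A newer version replaces the older one; nothing is logged yet.
        self.dirty[block_id] = contents

    def commit(self, log):
        # Called on fsync (or by a periodic timer): log one copy of each dirty block.
        log.append(dict(self.dirty))
        self.dirty.clear()


log = []
txn = CompoundTxn()
for size in range(1, 1001):  # the same 1000 one-byte appends
    txn.update("inode 7", {"size": size})
txn.update("block bitmap", "1 new block allocated")
txn.commit(log)
print(len(log), "txn logged with", len(log[0]), "blocks")
```

the 1000 appends now produce one txn containing a single copy of the inode, instead of 1000 logged versions.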

ext4 has one global active txn at a time; the txn commits every 5s by default (tunable with the commit= mount option) or when fsync is called
less to log and replay! any downside to this?

recall that we still need to write data blocks out before persisting the txn in the log
compound txn means that the caller of fsync is responsible for persisting all files in the txn
what are some bad pairings of applications that can cause lots of interference?
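one possible pairing, sketched with made-up numbers: a process that writes lots of data and never syncs, next to a process that writes little but calls fsync often. the application names and block counts are purely hypothetical.

```python
def fsync_cost(dirty_data_blocks, caller):
    """Data blocks that must reach disk before the caller's fsync can return.

    Assumes one global txn in ordered mode: the data of every file in the
    txn has to be written before the txn itself can be persisted.
    """
    total = sum(dirty_data_blocks.values())
    own = dirty_data_blocks.get(caller, 0)
    return total, own


# Hypothetical dirty data per application, in blocks.
dirty = {"bulk_writer": 50_000, "small_db": 2}
total, own = fsync_cost(dirty, "small_db")
print(f"small_db dirtied {own} blocks but waits for {total}")
```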

Logical vs. Physical Logging

so far we've only considered txns that log content of updated blocks (physical logging)
this makes recovery simple and idempotent, but makes it difficult to disentangle changes from individual file ops
what if we log a higher-level operation instead of the changed block itself?

logical logging!
example: write down new extent that's allocated, use that to derive changes to bitmaps

smaller txn, easier to separate changes from different files
longer and more complex recovery procedure
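a toy comparison of the two record types for the extent example. the record formats are invented for illustration, and the bitmap is modeled as a Python list with one entry per block.

```python
def physical_record(bitmap, start, length):
    """Physical logging: log the full new contents of the bitmap block."""
    new_bitmap = list(bitmap)
    for i in range(start, start + length):
        new_bitmap[i] = 1
    return {"type": "block", "contents": new_bitmap}


def logical_record(inode_no, start, length):
    """Logical logging: log only the operation (an extent was allocated)."""
    return {"type": "alloc_extent", "inode": inode_no, "start": start, "length": length}


def replay_logical(bitmap, rec):
    """Recovery has to re-derive the bitmap change from the logged operation."""
    new_bitmap = list(bitmap)
    for i in range(rec["start"], rec["start"] + rec["length"]):
        new_bitmap[i] = 1
    return new_bitmap


bitmap = [0] * 32768  # assumed: one 4 KiB bitmap block covers 32768 blocks
phys = physical_record(bitmap, 100, 8)
logi = logical_record(7, 100, 8)
print(len(phys["contents"]), "bitmap entries logged vs", len(logi), "fields")
```

both records lead to the same bitmap after recovery, but the logical record is a handful of fields tied to one inode, while the physical record is a whole block that may also carry changes from other files. the price is the extra replay_logical step during recovery.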

Request Reordering

recall that concurrent disk writes can be reordered
to persist the log, we might need to issue a number of writes

a write containing tx_begin
writes containing updated version of blocks
a write containing tx_end

the writes containing tx_begin and tx_end may reach the disk before the writes containing the updated blocks
what if the computer crashes at this point?
is the transaction still atomic? how can we fix this?
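a small simulation of the crash scenario, assuming a naive recovery rule that treats a txn as committed whenever both tx_begin and tx_end are on disk. all names and the log layout are hypothetical.

```python
def recover(log_on_disk):
    """Naive recovery: replay the txn if both tx_begin and tx_end are present."""
    if "tx_begin" in log_on_disk and "tx_end" in log_on_disk:
        # Replay whatever sits in the logged block slots, durable or not.
        return [log_on_disk.get(slot) for slot in ("block_1", "block_2")]
    return None  # incomplete txn is ignored


# The four writes issued to persist the txn.
issued = {
    "tx_begin": "B",
    "block_1": "new inode",
    "block_2": "new bitmap",
    "tx_end": "E",
}

# The disk reorders them; the crash hits before block_2 becomes durable.
durable_before_crash = ["tx_begin", "tx_end", "block_1"]
on_disk = {k: issued[k] for k in durable_before_crash}

replayed = recover(on_disk)
print(replayed)
```

recovery sees a txn that looks complete, yet the slot for block_2 was never written (modeled here as None; on a real disk it would hold whatever stale bytes were there before). replaying it would apply only part of the update.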