Lecture: File system

preparation

read the xv6 book: §7, File system
optional: file system crash-consistency model

administrivia

proposal due this Wednesday
lab alarm: don’t directly jump to a user-controlled address from the kernel - why?
lab fs is out

overview

goals
- data persistence across reboots
- resource naming & sharing
user-space programming interfaces
- file-system syscalls: open/close/read/write/fsync/link/unlink/…
- memory-mapped files: mmap/munmap/msync, then ordinary memory reads/writes
- others: async I/O (e.g., Windows)
questions
- why fd? how about syscalls using file names only?
- given a fd, can you get the corresponding file name?
- can multiple directories “contain” the same file?
- what happens on disk after running the following code?
- you should be able to describe the details after this lecture

int fd = open("/d/f", O_RDWR | O_CREAT, 0644);
write(fd, "hello", 5);
close(fd);
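
For the question of whether multiple directories can "contain" the same file, here is a small POSIX sketch using hard links. The function name same_file_in_two_dirs and the names d1, d2, f, g are made up for illustration; it assumes a file system that supports hard links.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create dir/d1/f, hard-link it as dir/d2/g, and compare the two names.
 * Returns the file's link count if both names refer to the same inode,
 * or -1 on any error. `dir` must be an existing, writable directory. */
int same_file_in_two_dirs(const char *dir)
{
    char d1[256], d2[256], f[300], g[300];
    struct stat sf, sg;

    assert(dir != NULL);
    snprintf(d1, sizeof d1, "%s/d1", dir);
    snprintf(d2, sizeof d2, "%s/d2", dir);
    snprintf(f, sizeof f, "%s/f", d1);
    snprintf(g, sizeof g, "%s/g", d2);

    if (mkdir(d1, 0755) < 0 || mkdir(d2, 0755) < 0)
        return -1;

    int fd = open(f, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, "hello", 5) != 5) {
        close(fd);
        return -1;
    }
    close(fd);

    if (link(f, g) < 0)              /* second name for the same inode */
        return -1;
    if (stat(f, &sf) < 0 || stat(g, &sg) < 0)
        return -1;
    if (sf.st_ino != sg.st_ino)      /* both names should share one inode */
        return -1;
    return (int)sg.st_nlink;
}
```

Called on an empty scratch directory, this should return 2: one inode, two directory entries. The same fact hints at why a fd does not map back to a single file name.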

I/O stacks

real-world I/O stack: Linux
xv6 stack:
- per-process fd (kernel/sysfile.c)
- file layer using struct file (kernel/file.c)
- fs layers using struct inode (kernel/fs.c)
- block I/O (kernel/bio.c)
- disk driver (kernel/virtio_disk.c)
walkthrough: write() → sys_write() → filewrite() → writei() → bread()/bwrite() → virtio_disk_rw()
- what does argfd() do in sys_write()?
- how does filewrite() call different functions depending on the fd type?
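
A rough model of the dispatch in filewrite(), as a hint for the last question. The FD_* tags match xv6's struct file; the *_stub functions, last_backend, and filewrite_sketch are hypothetical stand-ins, and the real code does more (locking, logging, splitting large writes).

```c
#include <assert.h>
#include <stddef.h>

/* Same type tags as xv6's struct file; everything else is simplified. */
enum fdtype { FD_NONE, FD_PIPE, FD_INODE, FD_DEVICE };

struct file {
    enum fdtype type;
    int writable;
};

/* Records which back end ran last, so the dispatch is observable. */
static enum fdtype last_backend = FD_NONE;

/* Hypothetical stand-ins for the pipe, device, and inode write paths. */
static int pipe_write_stub(const char *buf, int n)
{
    (void)buf;
    last_backend = FD_PIPE;
    return n;
}

static int device_write_stub(const char *buf, int n)
{
    (void)buf;
    last_backend = FD_DEVICE;
    return n;
}

static int inode_write_stub(const char *buf, int n)
{
    (void)buf;
    last_backend = FD_INODE;
    return n;
}

/* Pick a write path based on the type tag stored in the open file. */
int filewrite_sketch(struct file *f, const char *buf, int n)
{
    assert(buf != NULL && n >= 0);
    if (f == NULL || !f->writable)
        return -1;
    switch (f->type) {
    case FD_PIPE:   return pipe_write_stub(buf, n);
    case FD_DEVICE: return device_write_stub(buf, n);
    case FD_INODE:  return inode_write_stub(buf, n);
    default:        return -1;   /* FD_NONE: not an open file */
    }
}
```

The point is that the fd layer only sees a struct file; the type tag decides whether the bytes go to a pipe, a device, or the file system proper.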

block devices

storage medium
- tape
- floppy disk
- HDD - hard disk drive
- SSD - solid-state drive
- others:
  - 3D XPoint
  - blu-ray w/ robotic arms
  - DNA
disk controller
- connect storage devices to the bus
- may perform virtual-to-physical translation (e.g., Flash Translation Layer)
- examples
  - VirtIO: see kernel/virtio.h & kernel/virtio_disk.c
  - NVM Express / PATA (IDE) / SATA/AHCI / …
  - read specs & write drivers
device abstraction: block device
- terminology: block vs sector
- data ← read(blockno)
- write(blockno, data)
- flush (ignore for now)
- trim (ignore for now)
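
The read/write part of the block device abstraction can be sketched with an in-memory "disk". The names struct ramdisk, rd_read, rd_write and the size NBLOCKS are assumptions for illustration; BSIZE follows xv6's 1024-byte blocks. A real driver talks to hardware and completes asynchronously.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BSIZE   1024   /* block size in bytes, as in xv6 */
#define NBLOCKS 64     /* hypothetical device size, in blocks */

/* A block device modeled as an array of fixed-size blocks. */
struct ramdisk {
    uint8_t blocks[NBLOCKS][BSIZE];
};

/* data <- read(blockno): copy one whole block out of the device. */
int rd_read(const struct ramdisk *d, uint32_t blockno, uint8_t *data)
{
    assert(d != NULL && data != NULL);
    if (blockno >= NBLOCKS)
        return -1;
    memcpy(data, d->blocks[blockno], BSIZE);
    return 0;
}

/* write(blockno, data): overwrite one whole block on the device. */
int rd_write(struct ramdisk *d, uint32_t blockno, const uint8_t *data)
{
    assert(d != NULL && data != NULL);
    if (blockno >= NBLOCKS)
        return -1;
    memcpy(d->blocks[blockno], data, BSIZE);
    return 0;
}
```

Note the granularity: changing 5 bytes still means reading, modifying, and writing back a full block.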

data structures

questions
- how to locate /d/f: how to locate the root directory, directory d, and file f
- how to append “hello” to /d/f
- how to overwrite /d/f with new data “world”
- how to create a new file /newfile
xv6 disk layout: [ ... | superblock | log | inode blocks | free bitmap | data blocks ]
- ignore log for now
superblock: struct superblock in kernel/fs.h
- metadata about the entire file system
- often located at a pre-defined location (xv6: block 1)
- contains pointers to other parts of the file system
free bitmap: balloc/bfree in kernel/fs.c
- track free/in-use data blocks
- bit i in the free bitmap indicates whether data block i is free
- questions
  - how to check if data block i is free
  - how to change it from free to in-use
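
One way to answer the two bitmap questions, as a sketch that works on an in-memory copy of the bitmap. It assumes xv6's convention (bit set = in use, bit clear = free); NBLOCKS and the function names are hypothetical. xv6's balloc/bfree do the same bit arithmetic on bitmap blocks read from disk.

```c
#include <assert.h>
#include <stdint.h>

#define NBLOCKS 64   /* hypothetical number of blocks tracked */

/* Bit i lives in byte i/8, at bit position i%8. 1 = in use, 0 = free. */
int block_is_free(const uint8_t *bitmap, uint32_t i)
{
    assert(i < NBLOCKS);
    return (bitmap[i / 8] & (1u << (i % 8))) == 0;
}

/* Change block i from free to in-use. */
void block_mark_used(uint8_t *bitmap, uint32_t i)
{
    assert(i < NBLOCKS);
    bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* Change block i from in-use back to free. */
void block_mark_free(uint8_t *bitmap, uint32_t i)
{
    assert(i < NBLOCKS);
    bitmap[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* First-fit allocation: returns a block number, or -1 if none is free. */
int block_alloc(uint8_t *bitmap)
{
    for (uint32_t i = 0; i < NBLOCKS; i++) {
        if (block_is_free(bitmap, i)) {
            block_mark_used(bitmap, i);
            return (int)i;
        }
    }
    return -1;
}
```

On disk, the modified bitmap byte must be written back as part of its block, which is one of the multi-block updates discussed under crash safety.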
inode: struct dinode in kernel/fs.h
- file metadata: type, size, (permission, time, …)
- contains pointers to data blocks: 12 direct blocks and 1 (single) indirect block
  - goal: map (inode number, offset) to disk block number
  - why indirect blocks? compare to page tables
- questions
  - does inode have a “name” field? why not?
  - what’s the maximum file size
  - given a file’s inode, how to read its n-th block - see readi/bmap in kernel/fs.c
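
A sketch of the block mapping and the maximum-file-size arithmetic, assuming xv6's 1024-byte blocks and 4-byte block numbers. struct inode_sketch, bmap_lookup, and max_file_bytes are made-up names; the indirect block's contents are passed in as an array instead of being read from disk, and unlike xv6's bmap nothing is allocated on demand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BSIZE     1024
#define NDIRECT   12
#define NINDIRECT (BSIZE / sizeof(uint32_t))   /* 256 entries per indirect block */
#define MAXFILE   (NDIRECT + NINDIRECT)        /* 268 blocks */

struct inode_sketch {
    uint32_t size;
    uint32_t addrs[NDIRECT + 1];   /* last slot: block number of the indirect block */
};

/* Map the n-th block of a file to a disk block number.
 * `indirect` stands in for the contents of the indirect block.
 * Returns 0 if n is out of range or the block is not allocated. */
uint32_t bmap_lookup(const struct inode_sketch *ip,
                     const uint32_t *indirect, uint32_t n)
{
    assert(ip != NULL);
    if (n < NDIRECT)
        return ip->addrs[n];
    n -= NDIRECT;
    if (n < NINDIRECT && ip->addrs[NDIRECT] != 0 && indirect != NULL)
        return indirect[n];
    return 0;
}

/* Largest file this layout can describe: (12 + 256) * 1024 bytes. */
uint32_t max_file_bytes(void)
{
    return (uint32_t)(MAXFILE * BSIZE);
}
```

Under these assumptions the limit works out to 268 blocks, or 274432 bytes. Adding a double-indirect block would raise it, at the cost of one more disk read for large offsets, much like adding a level to a page table.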
file: inode + data blocks
directory: a special “file”
- content: an array of struct dirent - (inode number, name) pairs
- map name to inode number
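
Name lookup in a directory can be sketched as a linear scan over its entries. This follows the shape of xv6's struct dirent (16-bit inode number, 14-byte name, inum 0 meaning an unused slot), but struct dirent_sketch and dir_lookup are illustrative names, and the entries are given as an in-memory array rather than read through the directory's inode.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIRSIZ 14

struct dirent_sketch {
    uint16_t inum;        /* 0 marks an unused slot */
    char name[DIRSIZ];    /* not NUL-terminated when exactly DIRSIZ bytes long */
};

/* Scan n directory entries for `name`.
 * Returns the matching inode number, or 0 if there is no such entry. */
uint16_t dir_lookup(const struct dirent_sketch *ents, int n, const char *name)
{
    assert(ents != NULL && name != NULL);
    for (int i = 0; i < n; i++) {
        if (ents[i].inum == 0)
            continue;                       /* skip free slots */
        if (strncmp(ents[i].name, name, DIRSIZ) == 0)
            return ents[i].inum;
    }
    return 0;
}
```

Resolving /d/f is then two lookups: "d" in the root directory (whose inode number is fixed by convention), then "f" in the directory that the first lookup returned.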

crash safety

problem: multi-step operations
- need to update multiple blocks
  - example: file create
    - allocate a new block
    - add dirent entry to parent directory
    - write file inode
  - which operations should happen first?
  - what happens if the operations don’t complete?
    - dirent points to an uninitialized inode - reliability & security
    - block allocated but not used - waste
    - compare to memory bugs: dangling pointers and leaks
- how to tell if multi-block updates are incomplete?
- recovery: need to either continue with or undo incomplete updates
assumptions
- the system can crash at any arbitrary point
- fail-stop: the disk may miss writes
  - no arbitrary writes, no disk corruption
  - power failures, hardware failures, kernel panics
- single sector/block write is often atomic (by hw)
- ordering: can two writes be reordered? - more on this later
goal: crash safety
- the file system must be “usable” after reboot and recovery
- invariant: maintain consistent file system state
approaches
- write-ahead logging/journaling (xv6)
- other approaches
  - introduce redundancy: replications, checksums
  - best effort repair
  - sync metadata change + fsck (garbage collection)
  - copy on write
  - soft updates
write-ahead logging/journaling
- basic idea: “transaction”
  - log everything you intend to do before making any destructive changes
  - mark “done” in the log
  - make the changes
  - delete the log
  - if crash - try recovery upon reboot
    - “done” in log: redo writes from the log
    - no “done” in log: simply discard the log
  - what about a crash while replaying the log?
  - ensure the disk won’t reorder these steps - need to flush the disk in between
  - xv6: begin_op/log_write/end_op (kernel/log.c)
- example: begin_op(); write(a1, v1); write(a2, v2); end_op();
  - what if the system crashes before end_op()?
  - what if the system crashes after end_op()?
  - what if the system crashes during recovery?
- quiz: see IV. Filesystem (solution)
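
The transaction steps above can be simulated in a few lines. This is a toy model, not xv6's log.c: each "block" holds a single int, crashes are injected by returning early from tx_commit, disk flushes and reordering are not modeled, and all names (struct disk, struct txn, tx_*, recover) are made up. As in xv6, a nonzero count in the log header plays the role of the "done" mark.

```c
#include <assert.h>

#define NBLOCKS 8
#define LOGSIZE 4

struct disk {
    int data[NBLOCKS];          /* home locations */
    int log_data[LOGSIZE];      /* logged copies of new contents */
    int log_blockno[LOGSIZE];   /* home block for each logged copy */
    int log_n;                  /* log header: nonzero means committed */
};

struct txn {                    /* in-memory record of intended writes */
    int n;
    int blockno[LOGSIZE];
    int value[LOGSIZE];
};

enum crash_point { NO_CRASH, CRASH_BEFORE_COMMIT, CRASH_AFTER_COMMIT };

void tx_begin(struct txn *t) { t->n = 0; }

/* Record an intended write; nothing reaches the disk yet. */
int tx_write(struct txn *t, int blockno, int value)
{
    if (t->n == LOGSIZE || blockno < 0 || blockno >= NBLOCKS)
        return -1;
    t->blockno[t->n] = blockno;
    t->value[t->n] = value;
    t->n++;
    return 0;
}

/* Copy committed log entries to their home locations. Idempotent. */
static void install(struct disk *d)
{
    for (int i = 0; i < d->log_n; i++)
        d->data[d->log_blockno[i]] = d->log_data[i];
}

void tx_commit(struct disk *d, const struct txn *t, enum crash_point cp)
{
    assert(t->n <= LOGSIZE);
    for (int i = 0; i < t->n; i++) {        /* 1. write the log body */
        d->log_data[i] = t->value[i];
        d->log_blockno[i] = t->blockno[i];
    }
    if (cp == CRASH_BEFORE_COMMIT)
        return;
    d->log_n = t->n;                        /* 2. mark "done" (commit point) */
    if (cp == CRASH_AFTER_COMMIT)
        return;
    install(d);                             /* 3. make the changes */
    d->log_n = 0;                           /* 4. delete the log */
}

/* Run at boot: redo a committed log, or discard an uncommitted one. */
void recover(struct disk *d)
{
    install(d);     /* does nothing when log_n == 0 */
    d->log_n = 0;
}
```

In this model, a crash before the commit point leaves the home locations untouched, a crash after it is repaired by replaying the log, and running recover() twice is harmless because installing the same values again changes nothing. That last property is what makes a crash during recovery safe.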