Lecture: File system

preparation

read the xv6 book: §8, File system
optional: file system crash-consistency model

overview

goals
- resource naming & sharing
- data persistence across reboots
user-space programming interfaces
- review CSE 333 on low-level I/O
- file-system syscalls: open/close/read/write/fsync/link/unlink/…
- memory-mapped files: mmap/munmap/msync, with plain memory reads/writes in between
questions
- why fd? how about syscalls using file names only?
- given a fd, can you get the corresponding file name?
- can multiple directories “contain” the same file?
- what happens on disk after running the following code? you should be able to describe the details after this/next week

int fd = open("/f", O_RDWR | O_CREAT, 0644);
write(fd, "hello", 5);
close(fd);
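
One way to explore the fd and naming questions above is with hard links. The sketch below is a POSIX example (not xv6 code); the function name and the paths passed to it are hypothetical. It creates a file, adds a second name for it, and asks the fd (not either name) how many names the file now has.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Create file `a`, add a second name `b` for the same file, and return the
// link count reported through the fd. Returns -1 on any error.
long link_count_after_link(const char *a, const char *b) {
  int fd = open(a, O_RDWR | O_CREAT, 0644);
  if (fd < 0) return -1;
  if (link(a, b) < 0) {          // b may live in a different directory than a
    close(fd);
    return -1;
  }
  struct stat st;
  long n = (fstat(fd, &st) == 0) ? (long)st.st_nlink : -1;
  close(fd);
  return n;
}
```

The fd refers to the file itself, so fstat has no single name to report back: after link() both names are equally valid.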

storage devices

storage medium
- tape
- floppy disk
- HDD - hard disk drive
- SSD - solid-state drive
- others:
  - 3D XPoint
  - blu-ray w/ robotic arms
  - DNA
disk controller
- connect storage devices to the bus
- may perform logical-to-physical address translation (e.g., the Flash Translation Layer in SSDs)
- examples
  - VirtIO: see kernel/virtio.h & kernel/virtio_disk.c
  - NVM Express / PATA (IDE) / SATA/AHCI / …
  - read specs & write drivers
device abstraction: block device
- terminology: block vs sector
- data ← read(blockno)
- write(blockno, data)
- flush (ignore for now)
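
A minimal sketch of the block-device abstraction, assuming a hypothetical in-memory disk (struct ramdisk, rd_read, rd_write, and NBLOCKS are made up for illustration; a real driver such as kernel/virtio_disk.c talks to hardware instead):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BSIZE   1024   // block size in bytes (xv6 uses 1-KiB blocks)
#define NBLOCKS 64     // hypothetical disk size for this sketch

// Hypothetical in-memory "disk": an array of fixed-size blocks.
struct ramdisk {
  uint8_t blocks[NBLOCKS][BSIZE];
};

// data <- read(blockno): copy one whole block out of the device.
// Returns 0 on success, -1 if blockno is out of range.
int rd_read(const struct ramdisk *d, uint32_t blockno, uint8_t *data) {
  assert(d != NULL && data != NULL);
  if (blockno >= NBLOCKS) return -1;
  memcpy(data, d->blocks[blockno], BSIZE);
  return 0;
}

// write(blockno, data): overwrite one whole block on the device.
// Returns 0 on success, -1 if blockno is out of range.
int rd_write(struct ramdisk *d, uint32_t blockno, const uint8_t *data) {
  assert(d != NULL && data != NULL);
  if (blockno >= NBLOCKS) return -1;
  memcpy(d->blocks[blockno], data, BSIZE);
  return 0;
}
```

Note that the unit of transfer is always a whole block; everything above this layer has to build byte-granularity files out of that.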

I/O stacks

how to provide the file-system API on top of storage devices?
real-world I/O stack: Linux
file read on xv6: read() → sys_read() → fileread() → readi() → bread() → virtio_disk_rw()
- I/O stack: see Figure 8.1: Layers of the xv6 file system of the xv6 book
  - per-process fd (kernel/sysfile.c)
  - file layer using struct file (kernel/file.c)
  - fs layer using struct inode (kernel/fs.c)
  - block I/O (kernel/bio.c)
  - disk driver (kernel/virtio_disk.c)

data structures

questions
- how to locate file /f and read its content? (in lecture)
- how to append “hello” to /f?
- how to overwrite /f with new data “world”?
- how to create a new file /newf?
- will cover the first question in lecture; do the rest yourself
xv6 disk layout (kernel/fs.h):
- split the disk into 1-KiB blocks
- [ ... | superblock | log | inode blocks | free bitmap | data blocks ]
- ignore log for now (next lecture)
superblock: struct superblock in kernel/fs.h
- metadata about the entire file system
- often located at a pre-defined location (xv6: block 1)
- contains pointers to other parts of the file system
inode: struct dinode in kernel/fs.h (see Figure 8.3: The representation of a file on disk of the xv6 book)
- file metadata: type, size, (permission, time, …), packed in the region of inode blocks
- inode number (ino): index into inode blocks
  - unique name for files on disk
  - ino 1 usually refers to the root directory
- addrs contain pointers to data blocks: 12 direct blocks and 1 (singly) indirect block
  - why indirect blocks? hold more block addresses
  - how many block addresses does one indirect block hold? 1024 / 4 = 256
- how to read the n-th block of an inode: see readi/bmap in kernel/fs.c
  - n = 0..11: read the block at address ip->addrs[n]
  - n >= 12: load the block address from the indirect block, and read block at that address
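
The direct/indirect lookup can be sketched as follows. This is a simplified, read-only illustration, not xv6's bmap (which also allocates missing blocks); struct mini_inode and block_addr are hypothetical names, and the indirect block is assumed to be already loaded into memory.

```c
#include <assert.h>
#include <stdint.h>

#define BSIZE     1024
#define NDIRECT   12
#define NINDIRECT (BSIZE / sizeof(uint32_t))   // 1024 / 4 = 256
#define MAXFILE   (NDIRECT + NINDIRECT)        // 268 blocks

// Simplified inode: only the block-address array.
struct mini_inode {
  uint32_t addrs[NDIRECT + 1];   // 12 direct addresses + 1 indirect block address
};

// Return the disk address of the n-th block of the file.
// `indirect` holds the contents of the indirect block; real code would first
// read it from disk at address ip->addrs[NDIRECT].
uint32_t block_addr(const struct mini_inode *ip, const uint32_t *indirect,
                    uint32_t n) {
  assert(n < MAXFILE);
  if (n < NDIRECT)
    return ip->addrs[n];          // direct: address stored in the inode
  return indirect[n - NDIRECT];   // indirect: address stored in the indirect block
}
```
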
types of inodes
- regular file: inode + data blocks
- directory: a special type of file with predefined format
  - content: an array of struct dirent, i.e., (inode number, name) pairs
  - map names to inode numbers
  - look at how user/ls.c lists files under a directory
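
A sketch of the name-to-ino mapping, assuming the directory content has already been read into memory. The struct mirrors the 16-byte (inode number, name) layout but is renamed xv6_dirent here to avoid clashing with the host's own struct dirent; dir_lookup is a hypothetical helper, loosely modeled on what a directory scan has to do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIRSIZ 14

// One directory entry: (inode number, name).
struct xv6_dirent {
  uint16_t inum;        // 0 marks an unused entry
  char name[DIRSIZ];    // NUL-padded; not NUL-terminated when exactly DIRSIZ long
};

// Scan `n` entries for `name`. Returns the inode number, or 0 if not found.
uint16_t dir_lookup(const struct xv6_dirent *ents, size_t n, const char *name) {
  assert(ents != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    if (ents[i].inum == 0)
      continue;                                   // skip unused slots
    if (strncmp(ents[i].name, name, DIRSIZ) == 0)
      return ents[i].inum;
  }
  return 0;
}
```
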
q: how to locate & read /f?
- locate the root directory /:
  - start from superblock: block 1 (well-known address)
  - find inode blocks from the superblock
  - find the inode of the root directory (ino 1) from inode blocks
- locate /f:
  - read the data blocks of the root directory
  - find the struct dirent where the name is f - now we have the ino of /f
  - find the inode of /f using the ino from the previous step
- read /f:
  - convert offset to in-file block address
  - read the corresponding data block given a block address
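
The arithmetic behind "find the inode from inode blocks" and "convert offset to in-file block address" might look like the sketch below. It assumes a 64-byte on-disk inode (so 16 inodes per 1-KiB block); struct location and both function names are hypothetical, and inodestart stands for the first inode block as recorded in the superblock.

```c
#include <assert.h>
#include <stdint.h>

#define BSIZE       1024
#define DINODE_SIZE 64                      // assumed size of one on-disk inode
#define IPB         (BSIZE / DINODE_SIZE)   // inodes per block = 16

struct location {
  uint32_t block;    // block number (on disk, or within the file)
  uint32_t offset;   // byte offset inside that block
};

// Where inode `ino` lives on disk, given the first inode block.
struct location inode_location(uint32_t ino, uint32_t inodestart) {
  struct location loc = { inodestart + ino / IPB, (ino % IPB) * DINODE_SIZE };
  return loc;
}

// Which in-file block holds byte `off`, and where inside that block.
struct location file_offset_location(uint32_t off) {
  struct location loc = { off / BSIZE, off % BSIZE };
  return loc;
}
```

The in-file block number from file_offset_location still has to go through the direct/indirect lookup to become a disk block address.
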
q: what’s the maximum file size on xv6? (12 + 256) blocks × 1 KiB = 268 KiB
q: how can we increase the limit on file size?
- can use doubly indirect blocks
- say we change each inode to hold 11 direct blocks, 1 singly indirect block, and 1 doubly indirect block: 11 + 256 + 256 * 256 = 65,803 KiB ≈ 64 MiB
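
The same arithmetic as a small helper (max_file_blocks is a hypothetical name). With 1-KiB blocks the result in blocks is also the size in KiB.

```c
#include <assert.h>
#include <stdint.h>

#define BSIZE          1024
#define PTRS_PER_BLOCK (BSIZE / sizeof(uint32_t))   // 256 addresses per block

// Maximum file size in blocks for an inode with the given number of
// direct, singly indirect, and doubly indirect slots.
uint64_t max_file_blocks(uint64_t ndirect, uint64_t nsingle, uint64_t ndouble) {
  return ndirect
       + nsingle * PTRS_PER_BLOCK
       + ndouble * PTRS_PER_BLOCK * PTRS_PER_BLOCK;
}
```

max_file_blocks(12, 1, 0) gives the current xv6 limit of 268; max_file_blocks(11, 1, 1) gives 65,803.
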
free bitmap: balloc/bfree in kernel/fs.c
- track free/in-use data blocks
- bit i in the free bitmap indicates whether data block i is free
- questions
  - how to check if data block i is free
  - how to change it from free to in-use
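
One possible answer to both questions, assuming the bitmap is a byte array already in memory and that a set bit means in-use (the function names are hypothetical; balloc/bfree do the same bit arithmetic on a buffer read from disk):

```c
#include <assert.h>
#include <stdint.h>

// Bit i lives in byte i / 8, at bit position i % 8.
// Convention assumed here: 1 = in use, 0 = free.

int block_is_free(const uint8_t *bitmap, uint32_t i) {
  return (bitmap[i / 8] & (1u << (i % 8))) == 0;
}

void mark_in_use(uint8_t *bitmap, uint32_t i) {
  bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
}

void mark_free(uint8_t *bitmap, uint32_t i) {
  bitmap[i / 8] &= (uint8_t)~(1u << (i % 8));
}
```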

crash safety

recap: a file system with file /f under the root directory
- ino 1: root directory /
  - inode contains data block addresses for the content
  - data block contains a struct dirent (2, “f”)
- ino 2: /f (empty file)

      inodes
      +---------------+
      | inode 1:      |
      | type = T_DIR  |
      | sz = 48       |
      | addrs[0] = 0  | --+
      | ...           |   |
      +---------------+   |
  +-> | inode 2:      |   |
  |   | type = T_FILE |   |
  |   | sz = 0        |   |
  |   | ...           |   |
  |   ~~~~~~~~~~~~~~~~~   |
  |                       |
  |   data blocks         |
  |   +---------------+   |
  |   | data block 0: | <-+
  |   | ...           |
  +-- | (2, "f")      |
      | ...           |
      ~~~~~~~~~~~~~~~~~

example: how to create a new file /newf?
- require updating multiple places on disk:
  - inode 1: sz (accounting for the size of a new struct dirent)
  - inode 3: new inode for /newf
  - data block 0: a new entry (3, “newf”)
- which write should go first?
  - goal: crash safety; the file system must be “usable” after reboot and recovery
  - assumptions:
    - the system can crash (poweroff) at any arbitrary point
    - the disk may miss writes, but won’t write arbitrary data
  - what if we write inode 3 first, and the system crashes after that (before we write inode 1 or data block 0)? leaking inode 3
  - what if we write inode 1 and data block 0 first, and the system crashes after that (before we write inode 3)? dangling pointer
  - what’s the maximum number of disk writes required for creating a new file?

      inodes
      +---------------+
      | inode 1:      |
      | type = T_DIR  |
      | sz = 64       |
      | addrs[0] = 0  | --+
      | ...           |   |
      +---------------+   |
  +-> | inode 2:      |   |
  |   | type = T_FILE |   |
  |   | sz = 0        |   |
  |   | ...           |   |
  |   +---------------+   |
+---> | inode 3:      |   |
| |   | type = T_FILE |   |
| |   | sz = 0        |   |
| |   | ...           |   |
| |   ~~~~~~~~~~~~~~~~~   |
| |                       |
| |   data blocks         |
| |   +---------------+   |
| |   | data block 0: | <-+
| |   | ...           |
| +-- | (2, "f")      |
+---- | (3, "newf")   |
      | ...           |
      ~~~~~~~~~~~~~~~~~
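
The two crash scenarios above can be summarized in a tiny model. This is only a thought-experiment aid with hypothetical names; it treats the inode 1 update and the data block 0 update as a single step, as the questions above do.

```c
#include <assert.h>
#include <stdbool.h>

enum fs_state {
  OLD_STATE,        // nothing reached the disk: /newf does not exist
  NEW_STATE,        // everything reached the disk: /newf exists
  LEAKED_INODE,     // inode 3 is allocated but no directory entry names it
  DANGLING_ENTRY    // (3, "newf") exists but inode 3 was never initialized
};

// inode3_written: the new inode 3 reached the disk before the crash.
// dir_written:    inode 1's sz and the (3, "newf") entry reached the disk.
enum fs_state state_after_crash(bool inode3_written, bool dir_written) {
  if (inode3_written && dir_written) return NEW_STATE;
  if (inode3_written) return LEAKED_INODE;
  if (dir_written) return DANGLING_ENTRY;
  return OLD_STATE;
}
```

A leaked inode wastes space but leaves the namespace intact; a dangling entry is worse, since a lookup of /newf follows a pointer to garbage.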

problem: multi-step operations
- need to update multiple blocks atomically
- approaches
  - best effort repair: fsck
  - redundancy: replications, coding
  - xv6: write-ahead logging/journaling
write-ahead logging/journaling
- log everything you intend to do before making any destructive changes
- mark “done” in the log
- make the changes
- reset the log after changes are done
- if crash - recover from log upon reboot
  - “done” in log: redo writes from the log
  - no “done” in log: simply discard the log
- xv6: begin_op/ … /end_op (kernel/log.c)
example: write to block address a1 with data v1 and a2 with v2
- begin_op(); write(a1, v1); write(a2, v2); end_op();
- what if the system crashes after each step?
- what if the system crashes during recovery?

log header     | payload (LOGSIZE blocks)
+--------------+-------------------------
| n=0          | ...
+--------------+-------------------------

write data v1 and v2 (log is still considered empty given n=0 in header):
+--------------+----+----+---------------
| n=0          | v1 | v2 | ...
+--------------+----+----+---------------

update log header with # of writes and addresses (log is considered complete):
+--------------+----+----+---------------
| n=2 (a1, a2) | v1 | v2 | ...
+--------------+----+----+---------------

... write a1 with v1 and a2 with v2 ...

reset log:
+--------------+-------------------------
| n=0          | ...
+--------------+-------------------------
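
The diagrams above can be turned into a small simulation to check the two crash questions. This is a sketch under loud assumptions, not the code in kernel/log.c: each "block" holds a single int, the log header (n plus addresses) is written in one atomic step, and crash_after is a hypothetical knob that stops the operation after that many steps.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS 16
#define LOGSIZE 4

// Hypothetical disk: log header, log payload, and home locations.
struct disk {
  int n;                  // log header: number of logged writes (0 = empty log)
  int addr[LOGSIZE];      // log header: destination address of each write
  int payload[LOGSIZE];   // log payload: the data to install
  int data[NBLOCKS];      // home locations
};

// Copy every logged write to its home location. Idempotent.
static void install(struct disk *d) {
  for (int i = 0; i < d->n; i++)
    d->data[d->addr[i]] = d->payload[i];
}

// Write v1 to a1 and v2 to a2 atomically, stopping ("crashing") after
// `crash_after` steps; pass 4 to run to completion.
void log_write2(struct disk *d, int a1, int v1, int a2, int v2, int crash_after) {
  assert(a1 >= 0 && a1 < NBLOCKS && a2 >= 0 && a2 < NBLOCKS);
  if (crash_after == 0) return;
  d->payload[0] = v1;                 // step 1: write data to the log (n is still 0)
  d->payload[1] = v2;
  if (crash_after == 1) return;
  d->addr[0] = a1;                    // step 2: commit by updating the header
  d->addr[1] = a2;
  d->n = 2;
  if (crash_after == 2) return;
  install(d);                         // step 3: make the real changes
  if (crash_after == 3) return;
  d->n = 0;                           // step 4: reset the log
}

// Run on reboot: redo a committed log, then reset it.
// With n == 0 the install is a no-op, i.e. the log is discarded.
void recover(struct disk *d) {
  install(d);
  d->n = 0;
}
```

Running recover() after a crash at any step leaves both blocks old or both new, never a mix. Because install only copies from the log, a crash during recovery is handled by simply running recover() again.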

Q&A

real-world I/O stack is complex
- Linux
- more sophisticated in cloud systems
syscalls (POSIX)
- semantics are tricky
- example: update a file with new content
  - why not write directly to the file? a crash in the middle can leave an incomplete new file
  - does the write-to-temp-then-rename sequence below guarantee that users will see either the old or the new file?
- POSIX is underspecified
  - the behavior may vary across file systems
  - safe fix: insert fsync(fd) before close
  - maybe even fsync the directory
- if you are interested, learn more about crash consistency and O_TMPFILE (Linux)

int fd = open("file.tmp", ...);
write(fd, newdata, newdatasize);
close(fd);
rename("file.tmp", "file");
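
A more defensive variant of the sequence above, following the two suggested fixes (fsync the file before close, then fsync the directory). This is a sketch rather than a guarantee: replace_file and its parameters are hypothetical, the temporary file is assumed to live in the same directory as the target, and the actual durability still depends on the file system and the device.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Replace `path` with `len` bytes of `data`, going through temporary file
// `tmp`; both must be inside directory `dir`. Returns 0 on success, -1 on error.
int replace_file(const char *dir, const char *tmp, const char *path,
                 const void *data, size_t len) {
  assert(dir && tmp && path && (data || len == 0));
  int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return -1;

  const char *p = data;
  while (len > 0) {                       // write() may be partial
    ssize_t n = write(fd, p, len);
    if (n < 0) { close(fd); unlink(tmp); return -1; }
    p += n;
    len -= (size_t)n;
  }
  // Force the new contents to disk before the rename makes them visible.
  if (fsync(fd) < 0) { close(fd); unlink(tmp); return -1; }
  if (close(fd) < 0) { unlink(tmp); return -1; }

  if (rename(tmp, path) < 0) { unlink(tmp); return -1; }

  // Persist the directory entry change as well.
  int dfd = open(dir, O_RDONLY);
  if (dfd < 0) return -1;
  int rc = fsync(dfd);
  close(dfd);
  return rc;
}
```

For example, replace_file("/tmp", "/tmp/file.tmp", "/tmp/file", "hello", 5) either leaves the old /tmp/file in place or swaps in the complete new one.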

journaling
- ideas
  - step 1: write a “todo” list before destructive updates
  - step 2: replay the “todo list”
  - can you use an “undo” list instead?
- examples: ext3/ext4 (Linux), NTFS (Windows), HFS+ (macOS)
- example: LevelDB, a key-value store: not a file system, but it shares many ideas
- downsides
  - write twice (once in the journal and once for the actual data)
  - performance (log)
  - how about running LevelDB on top of ext4?
copy-on-write (COW)
- don’t do destructive updates; reduce updates to one single write
- example: log-structured file systems
  - conceptually, the entire file system is a log
  - over-simplified example: see Figure 2
  - pros and cons?
  - see the original paper: The Design and Implementation of a Log-Structured File System, from SOSP 1991
  - also see this LWN article: Log-structured file systems: There’s one in every SSD
- example: COW btrees
- examples: Btrfs (Linux), ReFS (Windows), APFS (macOS)