Lecture: File system
preparation
overview
  - goals
      - resource naming & sharing
      - data persistence across reboots
  - user-space programming interfaces
      - review CSE 333 on low-level I/O
      - file-system syscalls: open/close/read/write/fsync/link/unlink/…
      - mmap syscalls: memory rw/mmap/munmap/msync
  - questions
      - why fd? how about syscalls using file names only?
      - given a fd, can you get the corresponding file name?
      - can multiple directories “contain” the same file?
      - what happens on disk after running the following code? you should be able to describe the details after this/next week
 
int fd = open("/f", O_RDWR | O_CREAT, 0644);
write(fd, "hello", 5);
close(fd);
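
The questions about fds versus names, and about several directories “containing” the same file, can be tried out from user space. The following is a minimal sketch, assuming a POSIX system whose file system supports hard links; the helper name same_file_after_link and the paths passed to it are hypothetical, not from the lecture.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not part of any library): create `path`, give it a
 * second name `alias` with link(), and check that both names refer to the
 * same inode. Returns 1 if they do, 0 if they do not, -1 on error. */
int same_file_after_link(const char *path, const char *alias)
{
    struct stat by_fd, by_name;

    assert(path != NULL && alias != NULL);
    unlink(path);   /* start clean; errors are ignored on purpose */
    unlink(alias);

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, "hello", 5) != 5 || link(path, alias) < 0 ||
        fstat(fd, &by_fd) < 0 ||      /* fd -> inode, no name involved */
        stat(alias, &by_name) < 0) {  /* name -> inode */
        close(fd);
        return -1;
    }
    close(fd);
    return by_fd.st_dev == by_name.st_dev &&
           by_fd.st_ino == by_name.st_ino &&
           by_fd.st_nlink == 2;       /* two directory entries, one inode */
}
```

Note that fstat() works from the fd alone: the fd identifies an inode, and with two names pointing at that inode there is no single “the” file name to recover.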
storage devices
  - storage medium
  - disk controller
      - connect storage devices to the bus
      - may perform logical-to-physical address translation (e.g., Flash Translation Layer)
      - examples
  - device abstraction: block device
      - terminology: block vs sector
      - data ← read(blockno)
      - write(blockno, data)
      - flush (ignore for now)
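
To make the block-device abstraction concrete, here is a minimal sketch that models a disk as an in-memory array of fixed-size blocks. The struct blockdev type, the bdev_read/bdev_write names, and the device size are assumptions made for illustration; a real driver would talk to hardware instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BSIZE   1024   /* block size in bytes (1 KiB, as on xv6) */
#define NBLOCKS 64     /* hypothetical device size, in blocks */

/* Hypothetical in-memory stand-in for a storage device. */
struct blockdev {
    uint8_t blocks[NBLOCKS][BSIZE];
};

/* data <- read(blockno): copy one whole block out of the device. */
void bdev_read(const struct blockdev *dev, uint32_t blockno, uint8_t *data)
{
    assert(blockno < NBLOCKS);
    memcpy(data, dev->blocks[blockno], BSIZE);
}

/* write(blockno, data): overwrite one whole block on the device. */
void bdev_write(struct blockdev *dev, uint32_t blockno, const uint8_t *data)
{
    assert(blockno < NBLOCKS);
    memcpy(dev->blocks[blockno], data, BSIZE);
}
```

The interface has no byte offsets: every transfer is a whole block, which is why the layers above it must translate file offsets into block numbers.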
 
I/O stacks
  - how to provide the file-system API on top of storage devices?
  - real-world I/O stack: Linux
  - file read on xv6: read() → sys_read() → fileread() → readi() → bread() → virtio_disk_rw()
      - I/O stack: see Figure 8.1 (Layers of the xv6 file system) of the xv6 book
          - per-process fd (kernel/sysfile.c)
          - file layer using struct file (kernel/file.c)
          - fs layer using struct inode (kernel/fs.c)
          - block I/O (kernel/bio.c)
          - disk driver (kernel/virtio_disk.c)
data structures
  - questions
      - how to locate file /f and read its content? (in lecture)
      - how to append “hello” to /f?
      - how to overwrite /f with new data “world”?
      - how to create a new file /newf?
      - will cover the first question in lecture; do the rest yourself
  - xv6 disk layout (kernel/fs.h):
      - split the disk into 1-KiB blocks
      - [ ... | superblock | log | inode blocks | free bitmap | data blocks ]
      - ignore log for now (next lecture)
  - superblock: struct superblock in kernel/fs.h
      - metadata about the entire file system
      - often located at a pre-defined location (xv6: block 1)
      - contains pointers to other parts of the file system
  - inode: struct dinode in kernel/fs.h (see Figure 8.3: The representation of a file on disk of the xv6 book)
      - file metadata: type, size, (permission, time, …), packed in the region of inode blocks
      - inode number (ino): index into inode blocks
          - unique name for files on disk
          - ino 1 usually refers to the root directory
      - addrs contains pointers to data blocks: 12 direct blocks and 1 (singly) indirect block
          - why indirect blocks? hold more block addresses
          - how many block addresses does one indirect block hold? 1024-byte block / 4-byte address = 256
      - how to read the n-th block of an inode: see readi/bmap in kernel/fs.c
          - n = 0..11: read block at address ->addrs[n]
          - n >= 12: load block address number n - 12 from the indirect block, and read block at that address
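
The direct/indirect lookup can be sketched as follows. This is a simplified, read-only illustration and not the xv6 bmap code (which also allocates missing blocks): struct inode_sketch, block_addr, and the flat in-memory disk argument are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BSIZE     1024
#define NDIRECT   12
#define NINDIRECT (BSIZE / sizeof(uint32_t))   /* 256 addresses per block */

/* Simplified inode: 12 direct addresses plus one indirect block address. */
struct inode_sketch {
    uint32_t addrs[NDIRECT + 1];
};

/* Return the disk block address of the n-th block of the file.
 * `disk` is a hypothetical in-memory disk: block b starts at disk + b*BSIZE. */
uint32_t block_addr(const uint8_t *disk, const struct inode_sketch *ip, uint32_t n)
{
    uint32_t addr;

    if (n < NDIRECT)
        return ip->addrs[n];          /* n = 0..11: address stored in the inode */

    n -= NDIRECT;                     /* index within the indirect block */
    assert(n < NINDIRECT);
    const uint8_t *indirect = disk + (size_t)ip->addrs[NDIRECT] * BSIZE;
    memcpy(&addr, indirect + n * sizeof(addr), sizeof(addr));
    return addr;
}
```

Reading a block past the first 12 therefore costs one extra disk read, for the indirect block itself.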
 
 
  - types of inodes
      - regular file: inode + data blocks
      - directory: a special type of file with predefined format
          - content: an array of struct dirent entries, i.e., (inode number, name) pairs
          - map names to inode numbers
          - look at how user/ls.c lists files under a directory
  - q: how to locate & read /f?
      - locate the root directory /:
          - start from superblock: block 1 (well-known address)
          - find inode blocks from the superblock
          - find the inode of the root directory (ino 1) from inode blocks
      - locate /f:
          - read the data blocks of the root directory
          - find the struct dirent where the name is f; now we have the ino of /f
          - find the inode of /f using the ino from the previous step
      - read /f:
          - convert the file offset to an in-file block number, then to a block address using addrs
          - read the corresponding data block given a block address
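
The “locate /f” step boils down to a linear scan over directory entries. Below is a minimal sketch of that scan, assuming the directory content has already been read into memory. The 16-byte entry layout (2-byte inode number plus 14-byte name) follows xv6's struct dirent as far as I know; the names dirent_sketch and dir_lookup are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIRSIZ 14   /* maximum name length in a directory entry */

/* One directory entry: an (inode number, name) pair. */
struct dirent_sketch {
    uint16_t inum;          /* 0 marks an unused slot */
    char     name[DIRSIZ];  /* not NUL-terminated when exactly DIRSIZ long */
};

/* Scan `n` entries for `name`; return its inode number, or 0 if absent. */
uint16_t dir_lookup(const struct dirent_sketch *ents, size_t n, const char *name)
{
    assert(ents != NULL && name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (ents[i].inum == 0)
            continue;                              /* skip free slots */
        if (strncmp(ents[i].name, name, DIRSIZ) == 0)
            return ents[i].inum;                   /* name -> ino */
    }
    return 0;
}
```

With this layout a directory holding ".", "..", and "f" occupies 3 * 16 = 48 bytes. Resolving a longer path such as /a/b/c would repeat the same lookup once per component, starting from the root inode.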
 
 
  - q: what’s the maximum file size on xv6? 12 direct + 256 indirect = 268 blocks = 268 KiB (1-KiB blocks)
  - q: how can we increase the limit on file size?
      - can use doubly indirect blocks
      - say we change each inode to hold 11 direct blocks, 1 singly indirect block, and 1 doubly indirect block: 11 + 256 + 256 * 256 = 65,803 blocks = 65,803 KiB ≈ 64 MiB
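
The two size limits can be checked mechanically. This is a small sketch under the same assumptions as the text (1-KiB blocks, 4-byte block addresses); max_file_blocks and check_lecture_numbers are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define BSIZE 1024u            /* bytes per block */
#define NADDR (BSIZE / 4u)     /* 4-byte addresses per block: 256 */

/* Maximum file size in blocks for an inode with the given number of
 * direct, singly indirect, and doubly indirect slots. */
uint64_t max_file_blocks(uint64_t ndirect, uint64_t nsingle, uint64_t ndouble)
{
    return ndirect + nsingle * NADDR + ndouble * NADDR * NADDR;
}

/* Re-derive the two numbers from the notes; with 1-KiB blocks,
 * a count of blocks is also a size in KiB. */
void check_lecture_numbers(void)
{
    assert(max_file_blocks(12, 1, 0) == 268);     /* current xv6 layout */
    assert(max_file_blocks(11, 1, 1) == 65803);   /* with one doubly indirect block */
}
```

Giving up a single direct slot for a doubly indirect block multiplies the limit by roughly 245, at the cost of up to two extra disk reads per access to the largest files.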
 
  - free bitmap: balloc/bfree in kernel/fs.c
      - track free/in-use data blocks
      - bit i in the free bitmap indicates whether data block i is free
      - questions
          - how to check if data block i is free
          - how to change it from free to in-use
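
One way to answer the two bitmap questions is with byte and bit indexing: block i lives in byte i / 8, bit i % 8. The sketch below assumes a set bit means “in use” (the convention xv6's balloc appears to follow); the function names and NBLOCKS are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NBLOCKS 64   /* hypothetical number of blocks tracked by the bitmap */

/* Nonzero if block i is free (its bit is clear). */
int block_is_free(const uint8_t *bitmap, uint32_t i)
{
    assert(i < NBLOCKS);
    return (bitmap[i / 8] & (1u << (i % 8))) == 0;
}

/* Change block i from free to in-use by setting its bit. */
void block_mark_used(uint8_t *bitmap, uint32_t i)
{
    assert(i < NBLOCKS);
    bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* Change block i back to free by clearing its bit. */
void block_mark_free(uint8_t *bitmap, uint32_t i)
{
    assert(i < NBLOCKS);
    bitmap[i / 8] &= (uint8_t)~(1u << (i % 8));
}
```

On a real file system the bitmap lives in disk blocks, so each of these operations is a read-modify-write of the bitmap block that contains bit i.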
 
 
crash safety
  - recap: a file system with file /f under the root directory
      - ino 1: root directory /
          - inode contains data block addresses for the content
          - data block contains a struct dirent (2, “f”)
      - ino 2: /f (empty file)
 
      inodes
      +---------------+
      | inode 1:      |
      | type = T_DIR  |
      | sz = 48       |
      | addrs[0] = 0  | --+
      | ...           |   |
      +---------------+   |
  +-> | inode 2:      |   |
  |   | type = T_FILE |   |
  |   | sz = 0        |   |
  |   | ...           |   |
  |   ~~~~~~~~~~~~~~~~~   |
  |                       |
  |   data blocks         |
  |   +---------------+   |
  |   | data block 0: | <-+
  |   | ...           |
  +-- | (2, "f")      |
      | ...           |
      ~~~~~~~~~~~~~~~~~
  - example: how to create a new file /newf?
      - require updating multiple places on disk:
          - inode 1: sz (accounting for the size of a new struct dirent)
          - inode 3: new inode for /newf
          - data block 0: a new entry (3, “newf”)
      - which write should go first?
          - goal: crash safety; the file system must be “usable” after reboot and recovery
          - assumptions:
              - the system can crash (poweroff) at any arbitrary point
              - the disk may miss writes, but won’t write arbitrary data
          - what if we write inode 3 first, and the system crashes after that (before we write inode 1 or data block 0)? inode 3 is leaked: allocated, but no directory entry refers to it
          - what if we write inode 1 and data block 0 first, and the system crashes after that (before we write inode 3)? dangling pointer: the entry (3, “newf”) refers to an inode that was never initialized
          - what’s the maximum number of disk writes required for creating a new file?
 
      inodes
      +---------------+
      | inode 1:      |
      | type = T_DIR  |
      | sz = 64       |
      | addrs[0] = 0  | --+
      | ...           |   |
      +---------------+   |
  +-> | inode 2:      |   |
  |   | type = T_FILE |   |
  |   | sz = 0        |   |
  |   | ...           |   |
  |   +---------------+   |
+---> | inode 3:      |   |
| |   | type = T_FILE |   |
| |   | sz = 0        |   |
| |   | ...           |   |
| |   ~~~~~~~~~~~~~~~~~   |
| |                       |
| |   data blocks         |
| |   +---------------+   |
| |   | data block 0: | <-+
| |   | ...           |
| +-- | (2, "f")      |
+---- | (3, "newf")   |
      | ...           |
      ~~~~~~~~~~~~~~~~~
  - problem: multi-step operations
      - need to update multiple blocks atomically
      - approaches
          - best effort repair: fsck
          - redundancy: replications, coding
          - xv6: write-ahead logging/journaling
  - write-ahead logging/journaling
      - log everything you intend to do before making any destructive changes
      - mark “done” in the log
      - make the changes
      - reset the log after changes are done
      - if crash - recover from log upon reboot
          - “done” in log: redo writes from the log
          - no “done” in log: simply discard the log
      - xv6: begin_op/ … /end_op (kernel/log.c)
  - example: write to block address a1 with data v1 and a2 with v2
      - begin_op(); write(a1, v1); write(a2, v2); end_op();
      - what if the system crashes after each step?
      - what if the system crashes during recovery?
 
log header     | payload (LOGSIZE blocks)
+--------------+-------------------------
| n=0          | ...
+--------------+-------------------------
write data v1 and v2 (log is still considered empty given n=0 in header):
+--------------+----+----+---------------
| n=0          | v1 | v2 | ...
+--------------+----+----+---------------
update log header with # of writes and addresses (log is considered complete):
+--------------+----+----+---------------
| n=2 (a1, a2) | v1 | v2 | ...
+--------------+----+----+---------------
... write a1 with v1 and a2 with v2 ...
reset log:
+--------------+-------------------------
| n=0          | ...
+--------------+-------------------------
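
The four stages in the diagram can be simulated on an in-memory disk, which makes the “crash after each step” questions easy to experiment with. This is a sketch under assumptions, not the xv6 log.c code: the block size, the log location, and the names log_transaction and log_recover are all made up, and the steps argument stands in for a crash by stopping early.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BSIZE   64   /* hypothetical small block size, to keep the sketch tiny */
#define NBLOCKS 32
#define LOGSIZE 4    /* max blocks per transaction */
#define LOGHDR  1    /* block holding the log header */
#define LOGDATA 2    /* first payload block of the log */

static uint8_t disk[NBLOCKS][BSIZE];   /* in-memory stand-in for the disk */

struct loghdr {
    uint32_t n;               /* 0 = log empty; > 0 = committed ("done") */
    uint32_t addr[LOGSIZE];   /* home block address of each logged block */
};

static void read_hdr(struct loghdr *h)        { memcpy(h, disk[LOGHDR], sizeof(*h)); }
static void write_hdr(const struct loghdr *h) { memcpy(disk[LOGHDR], h, sizeof(*h)); }

/* Copy every logged block to its home address. */
static void install(const struct loghdr *h)
{
    for (uint32_t i = 0; i < h->n; i++)
        memcpy(disk[h->addr[i]], disk[LOGDATA + i], BSIZE);
}

/* Run at boot: redo a committed log, then clear it. Safe to repeat. */
void log_recover(void)
{
    struct loghdr h;
    read_hdr(&h);
    install(&h);     /* does nothing when h.n == 0 */
    h.n = 0;
    write_hdr(&h);
}

/* Write n blocks (data holds n * BSIZE bytes) to addrs[] as one transaction.
 * `steps` simulates a crash: only the first `steps` of the 4 steps run. */
void log_transaction(const uint32_t *addrs, const uint8_t *data, uint32_t n, int steps)
{
    struct loghdr h = { 0 };

    assert(n <= LOGSIZE);
    if (steps < 1) return;
    for (uint32_t i = 0; i < n; i++)                   /* 1: write payload */
        memcpy(disk[LOGDATA + i], data + (size_t)i * BSIZE, BSIZE);
    if (steps < 2) return;
    h.n = n;                                           /* 2: commit point */
    memcpy(h.addr, addrs, n * sizeof(addrs[0]));
    write_hdr(&h);
    if (steps < 3) return;
    install(&h);                                       /* 3: install to home blocks */
    if (steps < 4) return;
    h.n = 0;                                           /* 4: reset the log */
    write_hdr(&h);
}
```

Calling log_transaction with steps = 1 and then log_recover leaves a1 and a2 untouched, while steps = 2 followed by log_recover applies both writes. Because installing only copies blocks from the log, a crash during recovery is handled by running recovery again.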
Q&A
  - real-world I/O stack is complex
      - Linux
      - more sophisticated in cloud systems
  - syscalls (POSIX)
      - semantics are tricky
      - example: replace a file with new content by writing a temporary file and renaming it (code below)
          - why not just write directly to the file? a crash in the middle can leave an incomplete new file
          - does this guarantee that users will see either the old or the new file?
      - POSIX is underspecified
          - the behavior may vary across file systems
          - safe fix: insert fsync(fd) before close
          - maybe even fsync the directory
      - if you are interested, learn more about crash consistency and O_TMPFILE (Linux)
int fd = open("file.tmp", ...);
write(fd, newdata, newdatasize);
close(fd);
rename("file.tmp", "file");
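
A version of this pattern with the suggested fsync calls added might look like the sketch below. It assumes a POSIX system where fsync on a directory fd is permitted (as on Linux); the helper name replace_file and its parameters are hypothetical, and whether this is sufficient on a given file system is exactly the underspecified part.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `buf` by writing
 * `tmp` (in the same directory `dir`), syncing it, and renaming it over
 * `path`. Returns 0 on success, -1 on error. */
int replace_file(const char *dir, const char *path, const char *tmp,
                 const void *buf, size_t len)
{
    const char *p = buf;
    size_t off = 0;

    assert(dir != NULL && path != NULL && tmp != NULL);
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    while (off < len) {                      /* write() may be partial */
        ssize_t n = write(fd, p + off, len - off);
        if (n < 0) { close(fd); unlink(tmp); return -1; }
        off += (size_t)n;
    }
    if (fsync(fd) < 0) {                     /* data durable before rename */
        close(fd); unlink(tmp); return -1;
    }
    if (close(fd) < 0) { unlink(tmp); return -1; }
    if (rename(tmp, path) < 0) { unlink(tmp); return -1; }

    int dfd = open(dir, O_RDONLY);           /* best effort: persist the rename */
    if (dfd >= 0) {
        fsync(dfd);
        close(dfd);
    }
    return 0;
}
```

For example, replace_file("/tmp", "/tmp/file", "/tmp/file.tmp", "world", 5) leaves /tmp/file containing "world" and no /tmp/file.tmp behind. The temporary file must be on the same file system as the target, since rename does not work across file systems.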
  - journaling
      - ideas
          - step 1: write a “todo” list before destructive updates
          - step 2: replay the “todo” list
          - can you use an “undo” list instead?
      - examples: ext3/ext4 (Linux), NTFS (Windows), HFS+ (macOS)
      - example: LevelDB, a key-value store - not a file system, but shares many ideas
      - downsides
          - write twice (once in the journal and once for the actual data)
          - performance (log)
          - how about running LevelDB on top of ext4?
  - copy-on-write (COW)
      - don’t do destructive updates; reduce updates to one single write
      - example: log-structured file systems
      - example: COW btrees
      - examples: Btrfs (Linux), ReFS (Windows), APFS (macOS)