Lecture: File system
preparation
administrivia
- proposal due this Wednesday
- lab alarm: don’t directly jump to a user-controlled address from the kernel - why?
- lab fs is out
overview
- goals
- data persistence across reboots
- resource naming & sharing
- user-space programming interfaces
- file-system syscalls: open/close/read/write/fsync/link/unlink/…
- mmap syscalls: memory rw/mmap/munmap/msync
- others: async I/O (e.g., Windows)
- questions
- why fd? how about syscalls using file names only?
- given a fd, can you get the corresponding file name?
- can multiple directories “contain” the same file?
- what happens on disk after running the following code?
- you should be able to describe the details after this lecture
I/O stacks
- real-world I/O stack: Linux
- xv6 stack:
- per-process fd (kernel/sysfile.c)
- file layer using struct file (kernel/file.c)
- fs layers using struct inode (kernel/fs.c)
- block I/O (kernel/bio.c)
- disk driver (kernel/virtio_disk.c)
- walkthrough: write() → sys_write() → filewrite() → writei() → bread()/bwrite() → virtio_disk_rw()
- what does argfd() do in sys_write()?
- how does filewrite() call different functions depending on the fd type?
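One way to think about the last question: the file object carries a type tag, and the write path switches on it. The sketch below is a simplified stand-in, not the xv6 code; `file_sketch`, `ftype_sketch`, `last_handler`, and the three `*_write_sketch` handlers are hypothetical names invented for illustration, and the handlers only record which path was taken.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tag; a real kernel keeps a similar tag in its file object. */
enum ftype_sketch { FT_NONE, FT_PIPE, FT_INODE, FT_DEVICE };

struct file_sketch {
  enum ftype_sketch type;
  int writable;
};

/* Records which handler ran last, so the dispatch can be observed. */
enum ftype_sketch last_handler = FT_NONE;

static int pipe_write_sketch(const char *buf, int n)
{ (void)buf; last_handler = FT_PIPE; return n; }

static int inode_write_sketch(const char *buf, int n)
{ (void)buf; last_handler = FT_INODE; return n; }

static int device_write_sketch(const char *buf, int n)
{ (void)buf; last_handler = FT_DEVICE; return n; }

/* Dispatch on the type tag; returns bytes written or -1 on error. */
int filewrite_sketch(struct file_sketch *f, const char *buf, int n)
{
  assert(n >= 0);
  if (f == NULL || !f->writable)
    return -1;
  switch (f->type) {
  case FT_PIPE:   return pipe_write_sketch(buf, n);
  case FT_INODE:  return inode_write_sketch(buf, n);
  case FT_DEVICE: return device_write_sketch(buf, n);
  default:        return -1;
  }
}
```

The caller never needs to know what kind of object sits behind the fd; compare this with the real filewrite() in kernel/file.c.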
block devices
- storage medium
- disk controller
- connect storage devices to the bus
- may perform virtual-to-physical translation (e.g., Flash Translation Layer)
- examples
- device abstraction: block device
- terminology: block vs sector
data ← read(blockno)
write(blockno, data)
flush (ignore for now)
trim (ignore for now)
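The read/write interface above can be sketched with an in-memory array standing in for the storage medium. Everything here is an assumption made for illustration: the names (`blockdev_sketch`, `bd_read`, `bd_write`, `first_sector_of_block`), the 16-block capacity, and the sizes (1024-byte blocks on top of 512-byte sectors). The point is only that a file system addresses whole blocks by number, while the device may use a smaller sector unit underneath.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE_SK  1024  /* assumed file-system block size */
#define SECTOR_SIZE_SK 512   /* assumed device sector size */
#define NBLK_SK        16    /* assumed device capacity in blocks */

struct blockdev_sketch {
  unsigned char store[NBLK_SK][BLOCK_SIZE_SK];
};

/* data <- read(blockno): copy one whole block out; -1 if out of range. */
int bd_read(const struct blockdev_sketch *d, unsigned blockno, unsigned char *out)
{
  if (blockno >= NBLK_SK)
    return -1;
  memcpy(out, d->store[blockno], BLOCK_SIZE_SK);
  return 0;
}

/* write(blockno, data): copy one whole block in; -1 if out of range. */
int bd_write(struct blockdev_sketch *d, unsigned blockno, const unsigned char *in)
{
  if (blockno >= NBLK_SK)
    return -1;
  memcpy(d->store[blockno], in, BLOCK_SIZE_SK);
  return 0;
}

/* Block vs sector: one block spans BLOCK_SIZE_SK / SECTOR_SIZE_SK sectors. */
unsigned first_sector_of_block(unsigned blockno)
{
  return blockno * (BLOCK_SIZE_SK / SECTOR_SIZE_SK);
}
```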
data structures
- questions
- how to locate /d/f: how to locate the root directory, directory d, and file f
- how to append “hello” to /d/f
- how to overwrite /d/f with new data “world”
- how to create a new file /newfile
- xv6 disk layout:
[ ... | superblock | log | inode blocks | free bitmap | data blocks ]
- superblock: struct superblock in kernel/fs.h
- metadata about the entire file system
- often located at a pre-defined location (xv6: block 1)
- contain pointers to other parts of the file system
- free bitmap: balloc/bfree in kernel/fs.c
- track free/in-use data blocks
- bit i in the free bitmap indicates whether data block i is free
- questions
- how to check if data block i is free
- how to change it from free to in-use
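Both bitmap questions come down to byte-and-bit arithmetic: bit i lives in byte i / 8 at position i % 8. A minimal sketch, assuming a 64-block in-memory bitmap where a 0 bit means free; the names (`bitmap_sk`, `block_is_free`, `block_mark_used`, `block_mark_free`, `block_alloc`) are hypothetical, and a real file system would read and write the bitmap block on disk rather than a global array.

```c
#include <assert.h>
#include <stdint.h>

#define NBLOCKS_SK 64  /* assumed number of data blocks */

/* One bit per data block; 0 = free, 1 = in use. */
static uint8_t bitmap_sk[NBLOCKS_SK / 8];

int block_is_free(uint32_t i)
{
  assert(i < NBLOCKS_SK);
  return (bitmap_sk[i / 8] & (1u << (i % 8))) == 0;
}

void block_mark_used(uint32_t i)
{
  assert(i < NBLOCKS_SK);
  bitmap_sk[i / 8] |= (uint8_t)(1u << (i % 8));
}

void block_mark_free(uint32_t i)
{
  assert(i < NBLOCKS_SK);
  bitmap_sk[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* First-fit allocation: return the lowest free block number, or -1 if full. */
int block_alloc(void)
{
  for (uint32_t i = 0; i < NBLOCKS_SK; i++) {
    if (block_is_free(i)) {
      block_mark_used(i);
      return (int)i;
    }
  }
  return -1;
}
```

Compare with balloc/bfree in kernel/fs.c, which perform the same bit manipulation on a buffered bitmap block.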
- inode: struct dinode in kernel/fs.h
- file metadata: type, size, (permission, time, …)
- contain pointers to data blocks: 12 direct blocks and 1 (single) indirect block
- goal: map (inode number, offset) to disk block number
- why indirect blocks? compare to page tables
- questions
- does inode have a “name” field? why not?
- what’s the maximum file size
- given a file’s inode, how to read its n-th block - see readi/bmap in kernel/fs.c
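The maximum-file-size and n-th-block questions can be worked through with a small sketch. It assumes 1024-byte blocks and 4-byte block numbers, so one indirect block holds 256 pointers and a file can span 12 + 256 = 268 blocks. The names (`inode_sketch`, `bmap_sketch`, `max_file_bytes`) are hypothetical, and the indirect block is passed in as an array instead of being read from disk, which the real code would have to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BSIZE_SK     1024u                                      /* assumed block size */
#define NDIRECT_SK   12u
#define NINDIRECT_SK ((uint32_t)(BSIZE_SK / sizeof(uint32_t)))  /* 256 */
#define MAXFILE_SK   (NDIRECT_SK + NINDIRECT_SK)                /* 268 */

struct inode_sketch {
  uint32_t size;                   /* file size in bytes */
  uint32_t addrs[NDIRECT_SK + 1];  /* 12 direct slots + 1 indirect slot */
};

/*
 * Map a file-relative block index bn to a disk block number.
 * `indirect` holds the contents of the block named by addrs[NDIRECT_SK].
 * Returns 0 if bn is out of range or the needed pointer is missing.
 */
uint32_t bmap_sketch(const struct inode_sketch *ip, const uint32_t *indirect,
                     uint32_t bn)
{
  if (bn < NDIRECT_SK)
    return ip->addrs[bn];
  bn -= NDIRECT_SK;
  if (bn < NINDIRECT_SK && ip->addrs[NDIRECT_SK] != 0 && indirect != NULL)
    return indirect[bn];
  return 0;
}

/* A byte offset selects block index off / BSIZE_SK. */
uint32_t block_for_offset(const struct inode_sketch *ip, const uint32_t *indirect,
                          uint32_t off)
{
  return bmap_sketch(ip, indirect, off / BSIZE_SK);
}

/* Largest file under these assumptions: 268 blocks * 1024 bytes. */
uint32_t max_file_bytes(void)
{
  return MAXFILE_SK * BSIZE_SK;
}
```

Like a one-level page table, the indirect block trades one extra disk read for a much larger addressable range without growing the inode itself.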
- file: inode + data blocks
- directory: a special “file”
- content: struct dirent entries - (inode number, name) pairs
- map name to inode number
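Name lookup in a directory is a linear scan over its entries. The sketch below assumes a fixed 14-byte name field and treats inode number 0 as an unused slot; `dirent_sketch` and `dirlookup_sketch` are hypothetical names, and the entries are passed as an in-memory array rather than read from the directory's data blocks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIRSIZ_SK 14  /* assumed maximum name length */

struct dirent_sketch {
  uint16_t inum;            /* 0 marks an unused slot */
  char name[DIRSIZ_SK];     /* not NUL-terminated when exactly 14 bytes long */
};

/* Return the inode number bound to `name`, or 0 if there is no such entry. */
uint16_t dirlookup_sketch(const struct dirent_sketch *ents, int n, const char *name)
{
  for (int i = 0; i < n; i++) {
    if (ents[i].inum == 0)
      continue;  /* skip free slots */
    if (strncmp(ents[i].name, name, DIRSIZ_SK) == 0)
      return ents[i].inum;
  }
  return 0;
}
```

Resolving /d/f is then two lookups: find "d" in the root directory's entries, load that inode, and find "f" in its entries. Because names live only in directory entries, two entries in different directories can carry the same inode number.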
crash safety
- problem: multi-step operations
- need to update multiple blocks
- example: file create
- allocate a new block
- add dirent entry to parent directory
- write file inode
- which operations should happen first?
- what happens if the operations don’t complete?
- dirent entry points to uninitialized inode - reliability & security
- block allocated but not used - waste
- compare to memory bugs: dangling pointers and leak
- how to tell if multi-block updates are incomplete?
- recovery: need to either continue with or undo incomplete updates
- assumptions
- the system can crash at any arbitrary point
- fail-stop: the disk may miss writes
- no arbitrary writes, no disk corruption
- power failures, hardware failures, kernel panics
- single sector/block write is often atomic (by hw)
- ordering: can two writes be reordered? - more on this later
- goal: crash safety
- the file system must be “usable” after reboot and recovery
- invariant: maintain consistent file system state
- approaches
- write-ahead logging/journaling (xv6)
- other approaches
- introduce redundancy: replications, checksums
- best effort repair
- sync metadata change + fsck (garbage collection)
- copy on write
- soft updates
- write-ahead logging/journaling
- basic idea: “transaction”
- log everything you intend to do before making any destructive changes
- mark “done” in the log
- make the changes
- delete the log
- if crash - try recovery upon reboot
- “done” in log: redo writes from the log
- no “done” in log: simply discard the log
- how about crash while replaying the log
- ensure disk won’t reorder these steps - need to flush the disk in between
- xv6: begin_op/log_write/end_op (kernel/log.c)
- example: begin_op(); write(a1, v1); write(a2, v2); end_op();
- what if the system crashes before end_op()?
- what if the system crashes after end_op()?
- what if the system crashes during recovery?
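The three crash questions can be explored with a toy model of the log / mark done / apply / delete-log sequence. This is a sketch under heavy assumptions, not the xv6 log: blocks are single ints, the "disk" is a struct in memory, a nonzero `log_n` plays the role of the "done" mark, and crashes are simulated by returning early from `txn_commit`. All names (`disk_sketch`, `txn_sketch`, `txn_begin`, `txn_write`, `txn_commit`, `recover`) are hypothetical. It does not model disk write reordering or flushes.

```c
#include <assert.h>
#include <string.h>

#define NDATA_SK   8  /* assumed number of data blocks */
#define LOGSIZE_SK 4  /* assumed log capacity */

struct disk_sketch {
  int data[NDATA_SK];
  int log_n;                    /* nonzero = committed ("done" in the log) */
  int log_blockno[LOGSIZE_SK];
  int log_value[LOGSIZE_SK];
};

struct txn_sketch {             /* in-memory list of intended writes */
  int n;
  int blockno[LOGSIZE_SK];
  int value[LOGSIZE_SK];
};

void txn_begin(struct txn_sketch *t) { t->n = 0; }

void txn_write(struct txn_sketch *t, int blockno, int value)
{
  assert(t->n < LOGSIZE_SK);
  assert(blockno >= 0 && blockno < NDATA_SK);
  t->blockno[t->n] = blockno;
  t->value[t->n] = value;
  t->n++;
}

/* Copy every committed log entry to its home location (idempotent). */
static void install(struct disk_sketch *d)
{
  for (int i = 0; i < d->log_n; i++)
    d->data[d->log_blockno[i]] = d->log_value[i];
}

/*
 * crash_step: 0 = no crash, 1 = crash after logging but before the commit mark,
 * 2 = crash after the commit mark, 3 = crash after install but before clearing.
 */
void txn_commit(struct disk_sketch *d, const struct txn_sketch *t, int crash_step)
{
  for (int i = 0; i < t->n; i++) {      /* step 1: write the log entries */
    d->log_blockno[i] = t->blockno[i];
    d->log_value[i] = t->value[i];
  }
  if (crash_step == 1) return;
  d->log_n = t->n;                      /* step 2: commit point */
  if (crash_step == 2) return;
  install(d);                           /* step 3: make the changes */
  if (crash_step == 3) return;
  d->log_n = 0;                         /* step 4: delete the log */
}

/* On reboot: redo a committed log, otherwise there is nothing to install. */
void recover(struct disk_sketch *d)
{
  install(d);
  d->log_n = 0;
}
```

In this model a crash before the commit mark leaves the data blocks untouched, a crash after it is repaired by replaying the log, and a crash during recovery is harmless because replaying writes the same absolute values again.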
- quiz: see IV. Filesystem (solution)