Lecture: File system
preparation
overview
- goals
- resource naming & sharing
- data persistence across reboots
- user-space programming interfaces
- review CSE 333 on low-level I/O
- file-system syscalls: open/close/read/write/fsync/link/unlink/…
- mmap syscalls: memory rw/mmap/munmap/msync
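As a quick refresher on the fd-based interface, here is a minimal sketch of a write/fsync/read round trip. The helper name `write_then_read` and any path passed to it are made up for illustration. Note that the file name is used only by `open`; every later call names the file by its fd.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write msg to path, force it to stable storage,
 * then read it back into buf. Returns the number of bytes read back,
 * or -1 on error. */
static ssize_t write_then_read(const char *path, const char *msg,
                               char *buf, size_t buflen)
{
    assert(path != NULL && msg != NULL && buf != NULL);
    size_t len = strlen(msg);

    /* The name is resolved once, here; the kernel hands back an fd. */
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    /* write() may only reach the page cache; fsync() pushes it to disk. */
    if (write(fd, msg, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    close(fd);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, buflen);
    close(fd);
    return n;
}
```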
- questions
- why fd? how about syscalls using file names only?
- given a fd, can you get the corresponding file name?
- can multiple directories “contain” the same file?
- what happens on disk after running the following code? you should be able to describe the details after this/next week
storage devices
- storage medium
- disk controller
- connect storage devices to the bus
- may perform virtual-to-physical translation (e.g., Flash Translation Layer)
- examples
- device abstraction: block device
- terminology: block vs sector
data ← read(blockno)
write(blockno, data)
flush (ignore for now)
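The block-device abstraction above can be sketched with an in-memory array standing in for the storage medium. This is only an illustration: `struct blockdev`, `blk_read`, `blk_write`, and the capacity `NUM_BLOCKS` are hypothetical names, and the 1-KiB block size is assumed to match xv6.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 1024   /* assumed block size (xv6 uses 1-KiB blocks) */
#define NUM_BLOCKS 64     /* hypothetical device capacity */

/* A toy in-memory "disk": an array of fixed-size blocks. */
struct blockdev {
    uint8_t blocks[NUM_BLOCKS][BLOCK_SIZE];
};

/* data <- read(blockno): copy one whole block out of the device. */
static void blk_read(const struct blockdev *dev, uint32_t blockno,
                     uint8_t *data)
{
    assert(blockno < NUM_BLOCKS);
    memcpy(data, dev->blocks[blockno], BLOCK_SIZE);
}

/* write(blockno, data): replace one whole block on the device. */
static void blk_write(struct blockdev *dev, uint32_t blockno,
                      const uint8_t *data)
{
    assert(blockno < NUM_BLOCKS);
    memcpy(dev->blocks[blockno], data, BLOCK_SIZE);
}
```

The point of the interface is that callers can only transfer whole blocks by number; anything finer-grained (files, offsets, names) has to be built on top.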
I/O stacks
- how to provide the file-system API on top of storage devices?
- real-world I/O stack: Linux
- file read on xv6: read() → sys_read() → fileread() → readi() → bread() → virtio_disk_rw()
- I/O stack: see Figure 8.1 (Layers of the xv6 file system) in the xv6 book
- per-process fd (kernel/sysfile.c)
- file layer using struct file (kernel/file.c)
- fs layer using struct inode (kernel/fs.c)
- block I/O (kernel/bio.c)
- disk driver (kernel/virtio_disk.c)
data structures
- questions
- how to locate file /f and read its content? (in lecture)
- how to append “hello” to /f?
- how to overwrite /f with new data “world”?
- how to create a new file /newf?
- will cover the first question in lecture; do the rest yourself
- xv6 disk layout (kernel/fs.h):
- split the disk into 1-KiB blocks
[ ... | superblock | log | inode blocks | free bitmap | data blocks ]
- ignore log for now (next lecture)
- superblock: struct superblock in kernel/fs.h
- metadata about the entire file system
- often located at a pre-defined location (xv6: block 1)
- contains pointers to other parts of the file system
- inode: struct dinode in kernel/fs.h (see Figure 8.3, The representation of a file on disk, in the xv6 book)
- file metadata: type, size, (permission, time, …), packed in the region of inode blocks
- inode number (ino): index into inode blocks
- unique name for files on disk
- ino 1 usually refers to the root directory
- addrs contains pointers to data blocks: 12 direct blocks and 1 (singly) indirect block
- why indirect blocks? hold more block addresses
- how many block addresses does one indirect block hold? 1024 / 4 = 256
- how to read the n-th block of an inode: see readi/bmap in kernel/fs.c
- n = 0..11: read block at address ->addrs[n]
- n >= 12: load the block address from the indirect block, and read block at that address
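The direct/indirect lookup can be sketched as follows. This is a simplified illustration, not xv6's `bmap`: `struct toy_dinode` and `block_addr` are hypothetical names, the contents of the indirect block are assumed to be already loaded into memory, and on-demand block allocation (which the real `bmap` also does) is left out.

```c
#include <assert.h>
#include <stdint.h>

#define BSIZE     1024                        /* block size in bytes */
#define NDIRECT   12                          /* direct addresses per inode */
#define NINDIRECT (BSIZE / sizeof(uint32_t))  /* addresses per indirect block */

static_assert(NINDIRECT == 256, "1024-byte block / 4-byte address");

/* Simplified on-disk inode: only the address array is modeled. */
struct toy_dinode {
    uint32_t addrs[NDIRECT + 1];  /* 12 direct + 1 indirect block address */
};

/* Return the disk address of the n-th block of the file, or 0 if n is
 * beyond the maximum file size. `indirect` holds the contents of the
 * block at ip->addrs[NDIRECT], assumed to have been read already. */
static uint32_t block_addr(const struct toy_dinode *ip, uint32_t n,
                           const uint32_t *indirect)
{
    if (n < NDIRECT)
        return ip->addrs[n];      /* n = 0..11: direct block */
    n -= NDIRECT;
    if (n < NINDIRECT)
        return indirect[n];       /* n = 12..267: via the indirect block */
    return 0;                     /* out of range */
}
```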
- types of inodes
- regular file: inode + data blocks
- directory: a special type of file with predefined format
- content: an array of struct dirent, i.e., (inode number, name) pairs
- map names to inode numbers
- look at how user/ls.c lists files under a directory
- q: how to locate & read /f?
- locate the root directory /:
- start from superblock: block 1 (well-known address)
- find inode blocks from the superblock
- find the inode of the root directory (ino 1) from inode blocks
- locate /f:
- read the data blocks of the root directory
- find the struct dirent where the name is f
- now we have the ino of /f
- find the inode of /f using the ino from the previous step
- read /f:
- convert offset to in-file block address
- read the corresponding data block given a block address
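Two steps of the walk above lend themselves to a short sketch: scanning a directory's entries for a name, and turning a file offset into an in-file block index. The entry layout mirrors xv6's `struct dirent` (a 2-byte inode number plus a 14-byte name), but `toy_dirent`, `dir_lookup`, and `offset_to_block` are hypothetical names, and the directory contents are assumed to be already in memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIRSIZ 14     /* max name length in an xv6 directory entry */
#define BSIZE  1024   /* block size in bytes */

/* Same shape as xv6's struct dirent; renamed to avoid clashing with libc. */
struct toy_dirent {
    uint16_t inum;        /* 0 means the slot is unused */
    char name[DIRSIZ];    /* not NUL-terminated when exactly DIRSIZ long */
};

static_assert(sizeof(struct toy_dirent) == 16, "16-byte directory entry");

/* Scan a directory's entries for name; return its ino, or 0 if absent. */
static uint16_t dir_lookup(const struct toy_dirent *ents, size_t n,
                           const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (ents[i].inum != 0 &&
            strncmp(ents[i].name, name, DIRSIZ) == 0)
            return ents[i].inum;
    }
    return 0;
}

/* Convert a byte offset in a file to (in-file block index, offset in block). */
static void offset_to_block(uint32_t off, uint32_t *blk, uint32_t *blkoff)
{
    *blk = off / BSIZE;
    *blkoff = off % BSIZE;
}
```

The block index produced by `offset_to_block` is what the direct/indirect lookup then maps to a disk address.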
- q: what’s the maximum file size on xv6? (12 + 256) blocks × 1 KiB = 268 KiB
- q: how can we increase the limit on file size?
- can use doubly indirect blocks
- say we change each inode to hold 11 direct blocks, 1 singly indirect block, and 1 doubly indirect block: (11 + 256 + 256 * 256) blocks × 1 KiB = 65,803 KiB ≈ 64 MiB
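The file-size arithmetic can be checked with a few lines of C. `max_file_blocks` is a hypothetical helper; it assumes 1-KiB blocks and 4-byte block addresses, so a block count is also a size in KiB.

```c
#include <assert.h>
#include <stdint.h>

#define BSIZE 1024u
#define ADDRS_PER_BLOCK (BSIZE / 4u)   /* 4-byte block addresses */

static_assert(ADDRS_PER_BLOCK == 256, "256 addresses per 1-KiB block");

/* Maximum file size in blocks for an inode with the given number of
 * direct, singly indirect, and doubly indirect pointers. */
static uint64_t max_file_blocks(uint64_t ndirect, uint64_t nsingle,
                                uint64_t ndouble)
{
    uint64_t per_single = ADDRS_PER_BLOCK;
    uint64_t per_double = per_single * per_single;
    return ndirect + nsingle * per_single + ndouble * per_double;
}
```

With these assumptions, `max_file_blocks(12, 1, 0)` gives 268 and `max_file_blocks(11, 1, 1)` gives 65,803, matching the two figures above.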
- free bitmap: balloc/bfree in kernel/fs.c
- track free/in-use data blocks
- bit i in the free bitmap indicates whether data block i is free
- questions
- how to check if data block i is free
- how to change it from free to in-use
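One possible answer to the two bitmap questions, as a minimal sketch. It assumes the convention used by xv6's `balloc` (a set bit means in use, and bit i lives in byte i / 8 at position i % 8). The helper names and the `nbits` bounds parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Is block i in use? Returns 1 if its bit is set, 0 if the block is free. */
static int block_in_use(const uint8_t *bitmap, uint32_t nbits, uint32_t i)
{
    assert(i < nbits);
    return (bitmap[i / 8] >> (i % 8)) & 1;
}

/* Change block i from free to in-use by setting its bit. */
static void mark_in_use(uint8_t *bitmap, uint32_t nbits, uint32_t i)
{
    assert(i < nbits);
    bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* Change block i back to free by clearing its bit. */
static void mark_free(uint8_t *bitmap, uint32_t nbits, uint32_t i)
{
    assert(i < nbits);
    bitmap[i / 8] &= (uint8_t)~(1u << (i % 8));
}
```

On a real disk the bitmap lives in its own blocks, so each of these operations also involves reading the right bitmap block and, for updates, writing it back.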
crash safety
- recap: a file system with file /f under the root directory
- ino 1: root directory /
- inode contains data block addresses for the content
- data block contains a struct dirent (2, “f”)
- ino 2: /f (empty file)
- example: how to create a new file /newf?
- require updating multiple places on disk:
- inode 1: sz (accounting for the size of a new struct dirent)
- inode 3: new inode for /newf
- data block 0: a new entry (3, “newf”)
- which write should go first?
- goal: crash safety; the file system must be “usable” after reboot and recovery
- assumptions:
- the system can crash (e.g., power off) at any point
- the disk may miss writes, but won’t write arbitrary data
- what if we write inode 3 first, and the system crashes after that (before we write inode 1 or data block 0)? inode 3 is leaked: allocated, but no directory entry refers to it
- what if we write inode 1 and data block 0 first, and the system crashes after that (before we write inode 3)? dangling pointer: the directory entry refers to an inode that was never initialized
- what’s the maximum number of disk writes required for creating a new file?
- problem: multi-step operations
- need to update multiple blocks atomically
- approaches
- best effort repair: fsck
- redundancy: replications, coding
- xv6: write-ahead logging/journaling
- write-ahead logging/journaling
- log everything you intend to do before making any destructive changes
- mark “done” in the log
- make the changes
- reset the log after changes are done
- after a crash: recover from the log upon reboot
- “done” in log: redo writes from the log
- no “done” in log: simply discard the log
- xv6: begin_op / … / end_op (kernel/log.c)
- example: write to block address a1 with data v1 and a2 with v2
begin_op(); write(a1, v1); write(a2, v2); end_op();
- what if the system crashes after each step?
- what if the system crashes during recovery?
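To reason about the two crash questions, it can help to play with a toy model of the protocol. The sketch below is not xv6's kernel/log.c: `struct toy_disk` and the functions `log_write`, `log_commit`, and `recover` are hypothetical, each "block" holds a single int, and a crash is simulated simply by calling `recover` at the point of interest.

```c
#include <assert.h>
#include <stdint.h>

#define NBLOCKS 8    /* hypothetical toy disk size */
#define LOGSIZE 4    /* hypothetical max writes per transaction */

/* Toy disk: home locations plus a log region, all assumed persistent. */
struct toy_disk {
    int data[NBLOCKS];        /* home locations of the blocks */
    int log_n;                /* number of logged writes */
    int log_addr[LOGSIZE];    /* logged write: destination address */
    int log_val[LOGSIZE];     /* logged write: new value */
    int committed;            /* the "done" mark; setting it is the commit */
};

/* Log an intended write; the home location is not touched yet. */
static void log_write(struct toy_disk *d, int addr, int val)
{
    assert(d->log_n < LOGSIZE && addr >= 0 && addr < NBLOCKS);
    d->log_addr[d->log_n] = addr;
    d->log_val[d->log_n] = val;
    d->log_n++;
}

/* Mark "done" in the log: from here on the transaction must take effect. */
static void log_commit(struct toy_disk *d)
{
    d->committed = 1;
}

/* Make the changes and reset the log. Also run on reboot:
 * "done" present -> redo the logged writes; absent -> discard the log. */
static void recover(struct toy_disk *d)
{
    if (d->committed) {
        for (int i = 0; i < d->log_n; i++)
            d->data[d->log_addr[i]] = d->log_val[i];
    }
    d->committed = 0;
    d->log_n = 0;
}
```

Because redoing a logged write twice leaves the same value in place, running `recover` again after a crash in the middle of recovery is harmless in this model.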
Q&A
- real-world I/O stack is complex
- Linux
- more sophisticated in cloud systems
- syscalls (POSIX)
- semantics are tricky
- example: update a new file
- why not just write directly to file?
- can leave an incomplete new file if the system crashes midway
- does this guarantee that users will see either the old or the new file?
- POSIX is underspecified
- the behavior may vary across file systems
- safe fix: insert fsync(fd) before close
- maybe even fsync the directory
- if you are interested, learn more about crash consistency and O_TMPFILE (Linux)
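The code for this example is not reproduced in these notes, so the following is only an assumed sketch of a commonly used pattern rather than the lecture's own code: write the new contents to a temporary file, fsync it, rename it over the target, and then fsync the containing directory. The helper `replace_file` and its parameters are hypothetical, and, as noted above, how much POSIX guarantees here varies across file systems.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>    /* rename */
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `data` by going through `tmp`
 * (a temp file in the same directory `dir`). Returns 0 on success, -1 on
 * error. */
static int replace_file(const char *dir, const char *path,
                        const char *tmp, const char *data)
{
    assert(dir != NULL && path != NULL && tmp != NULL && data != NULL);
    size_t len = strlen(data);

    int fd = open(tmp, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    /* fsync before close: the new contents must be on disk before the
     * rename makes them visible under the final name. */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0 || rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }

    /* fsync the directory so that the rename itself is persisted. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```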
- journaling
- ideas
- step 1: write a “todo” list before destructive updates
- step 2: replay the “todo” list
- can you use an “undo” list instead?
- examples: ext3/ext4 (Linux), NTFS (Windows), HFS+ (macOS)
- example: LevelDB, a key-value store: not a file system, but shares many ideas
- downsides
- write twice (once in the journal and once for the actual data)
- performance (log)
- how about running LevelDB on top of ext4?
- copy-on-write (COW)
- don’t do destructive updates; reduce updates to one single write
- example: log-structured file systems
- example: COW btrees
- examples: Btrfs (Linux), ReFS (Windows), APFS (macOS)