Lecture: Virtual memory

preparation

read the xv6 book: §3, Page tables

administrivia

how’s lab util?
- warmup, no kernel code
- your grade is what make grade says. no hidden tests, style points, etc.
- don’t worry about the 10% “exercises” for now - they are for paper questions later this quarter
lab alloc (physical memory management)
- physical memory management (kernel/kalloc.c): a free list of physical pages, by embedding a struct run into each free page
- slab allocation: feel free to design your own data structures for allocating small objects
- attend sections and/or refer to online sources & textbook
- don’t get discouraged by tech articles/papers; learn to skim
project: lab mmap/net or your own project; feel free to talk to us about the scope; start early (forming groups, etc.)

overview

virtual memory
- popular in modern OSes for isolation (and more)
- example: two user processes write to the same virtual address (e.g., pointers) 0x1000
  - the two writes go to different physical addresses 0x80001000 and 0x80002000
  - isolation is achieved through naming: one process cannot name a memory address privately owned by another process
- today’s focus: how is this achieved?
workflow
- both VA (virtual address) and PA (physical address) spaces divided into fixed-size chunks (e.g., 4KB) called pages
- hardware (CPU): perform VA → PA translation based on a data structure called page table
- software (OS): set up page tables for translation
  - isolate user processes from each other: switch to per-process page tables
  - isolate kernel memory from user: switch to kernel page table (esp. post-Meltdown)
review virtual memory from CSE 351: motivation, TLB
- MMU (memory management unit)
  - given VA (e.g., load/store instructions), consult page table to translate to PA
  - translation failure (due to no mapping or permission): raise an exception (page fault)
- how to make VA → PA lookup faster
  - caching: TLB (translation lookaside buffer)
  - larger pages, TLB tagging/selective flushing
hardware support (RISC-V)
- registers
  - satp (Supervisor Address Translation and Protection) register holds the (physical) address of the page-table root
  - stval (Supervisor Trap Value) register holds the page-fault (virtual) address
- instructions
  - use csrw (control/status register write) to set satp
  - use sfence.vma to order page-table updates and flush stale TLB entries
- other architectures have similar registers (e.g., %cr3 and %cr2 on x86)
software support (xv6)
- kernel page table: kernel_pagetable (kernel/vm.c)
- per-process page table: pagetable in struct proc (kernel/proc.h)
- kernel uses kernel page table upon booting (kvminithart() in kernel/vm.c)
- kernel saves kernel page table in p->trapframe->kernel_satp (usertrapret() in kernel/trap.c)
- kernel->user: switch to per-process page table (userret in kernel/trampoline.S)
- user->kernel: switch to kernel page table (uservec in kernel/trampoline.S)

page table

review virtual memory from CSE 351: address translation, multi-level page tables
satp: Figure 4.12 of the RISC-V privileged spec
- MODE: xv6 uses 8 (sv39, three-level page table), see Table 4.3
- ASID: ignore (set to 0)
- PPN: physical page number of the page-table root
- example: #define MAKE_SATP(pagetable) ((8L << 60) | (((uint64)pagetable) >> 12)) in xv6’s kernel/riscv.h
page table (sv39): Figure 3.2 of the xv6 book
- three-level page table
- 64-bit VA (→ 56-bit PA)
  - top 25 VA bits unused
  - next 27 VA bits used to index into page table (9 bits for each level)
  - bottom 12 bits untranslated and copied to PA
- each level (one page) contains 512 page-table entries (PTEs)
  - PTE = 44-bit PPN (for the next level) + 10-bit permission flags
  - V/R/W/X/U: valid/read/write/execute/user
  - A/D: accessed/dirty
- see PTE* macros in xv6’s kernel/riscv.h
questions
- why not just one-level page table?
  - 2^27 PTEs * 8 bytes = 1GB per page table (per process)
  - multiple levels efficiently encode sparse address space
- can a page table have more levels?
  - see RISC-V’s sv48 in the privileged spec
  - x86’s 4-level page table, 5-level page table
- it makes sense for user processes to use virtual addresses, but why does the kernel also use virtual addresses?

page faults

what happens if address translation VA → PA fails?
- no valid entries in the page table for translating VA
- CPU: raise an exception - more on this next week
- xv6: terminate with error messages (this lecture) or do something smart (labs lazy and cow)
example in user space: add *(volatile char *)0xdeadbeef = 0; to main (user/echo.c)

$ echo
usertrap(): unexpected scause 0x000000000000000f (store/AMO page fault) pid=3
            sepc=0x0000000000000016 stval=0x00000000deadbeef

example in kernel: add *(volatile char *)0xdeadbeef = 0; to exec (kernel/exec.c); can use make qemu-trace to print out function names and offsets in stack trace

xv6 kernel is booting

hart 2 starting
hart 1 starting
scause 0x000000000000000f (store/AMO page fault)
sepc=0x0000000080004cba stval=0x00000000deadbeef
PANIC: kerneltrap
[<0x00000000800005ac>] panic+0x46/0x62
[<0x0000000080002af2>] kerneltrap+0xb4/0xd8
[<0x0000000080005ca4>] kernelvec+0x44/0x90
[<0x0000000080005b20>] sys_exec+0xe0/0x110
[<0x0000000080002cca>] syscall+0x3e/0x6c
[<0x0000000080002996>] usertrap+0x60/0x108

useful information for debugging: refer to the RISC-V privileged spec
- scause: is this page fault caused by memory load, memory store, or instruction fetch
- sepc: which instruction caused the page fault
- stval: which memory address caused the page fault
- similar to segmentation fault / kernel panic on Linux
- can use sepc to locate source file (e.g., search for the value in user/echo.asm or kernel/kernel.asm)
what page tables does xv6 set up to make the page faults happen?

address spaces in xv6

kernel address space: Figure 3.3 of the xv6 book
- mostly identity mapping
- except for the trampoline page and the kernel stack - why not just identity mapping?
- read kvminit() (kernel/vm.c) and understand how kvmmap() is implemented
- can be quite different on other architectures/OS kernels - see Linux/x86-64
process address space: Figure 3.4 of the xv6 book
- PTE_U set for user pages
- sbrk for growing the heap: sys_sbrk, growproc, uvmalloc, mappages
- share the trampoline page with the kernel
exercise: implement vmprint to print page table (lab lazy)
- use it to show kernel and process address spaces
- example
  - insert a call to vmprint in kvminit(), right after the first kvmmap
  - use it to translate VA 0x1000_0000 - what’s the resulting PA?

page table 0x0000000087fff000
 ..0: pte 0x0000000021fff801 pa 0x0000000087ffe000
 .. ..128: pte 0x0000000021fff401 pa 0x0000000087ffd000
 .. .. ..0: pte 0x0000000004000007 pa 0x0000000010000000

applications

more on TLB
- hardware vs. software-managed TLB
- TLB shootdown in multiprocessors
- two-dimensional paging: see figure 1, Accelerating Two-Dimensional Page Walks for Virtualized Systems
- how to reduce the overhead?
protect against stack overflow
- see Michael Barr’s Bookout v. Toyota, “Toyota’s major stack mistakes”
- trick: put an unmapped guard page right below the stack
- xv6: kernel/memlayout.h
implement null pointer dereference exception
- how would you implement this for Java, say obj.field?
- trick: put a non-mapped page at VA zero
  - useful for catching program bugs
  - limitations?
limited physical memory
- applications need more memory than physical memory
  - early days: two floppy drives
  - strawman: applications store part of state to disk and load back later
  - hard to write applications
- virtual memory: offer the illusion of a large, contiguous memory
  - swap space: OS pages out some pages to disk transparently
  - distributed shared memory: access other machines’ memory across network
grow stack on demand (lab lazy)
- sbrk currently allocates memory upon invocation
- allocate memory lazily (upon page fault)
copy-on-write fork (lab cow)
- strawman fork: copy all pages from parent to child
- observation: child and parent share most of the data
  - mark pages as copy-on-write
  - make a copy on page fault
- other sharing
  - multiple guest OSes running inside the same hypervisor
  - shared objects: .so/.dll files
memory-mapped files (lab mmap)
- mmap(): map files, read/write files like memory
- simple programming interface
- when to page-in/page-out content?
- avoid data copying: send an mmaped file to network
  - compare to using read/write
  - no data transfer from kernel to user

Q&A

can we simply return from main() in xv6 user-space programs without calling exit()?
- no proper crt0 setup, no caller to return to
- one possible fix:

diff --git a/Makefile b/Makefile
index 44ad5f2..679657c 100644
--- a/Makefile
+++ b/Makefile
@@ -123,13 +123,16 @@ $U/initcode: $U/initcode.S
 tags: $(OBJS) _init
 	etags *.S *.c
 
-ULIB = $U/ulib.o $U/usys.o $U/printf.o $U/umalloc.o
+ULIB = $U/crt0.o $U/ulib.o $U/usys.o $U/printf.o $U/umalloc.o
 
 _%: %.o $(ULIB)
-	$(LD) $(LDFLAGS) -N -e main -Ttext 0 -o $@ $^
+	$(LD) $(LDFLAGS) -N -e _start -Ttext 0 -o $@ $^
 	$(OBJDUMP) -S $@ > $*.asm
 	$(OBJDUMP) -t $@ | sed '1,/SYMBOL TABLE/d; s/ .* / /; /^$$/d' > $*.sym
 
+$U/crt0.o : $U/crt0.S
+	$(CC) $(CFLAGS) -I kernel -c -o $U/crt0.o $U/crt0.S
+
 $U/usys.S : $U/usys.pl
 	perl $U/usys.pl > $U/usys.S
 
diff --git a/user/crt0.S b/user/crt0.S
new file mode 100644
index 0000000..a132c9b
--- /dev/null
+++ b/user/crt0.S
@@ -0,0 +1,10 @@
+#include "syscall.h"
+
+.globl _start
+_start:
+ call main
+
+exit:
+ li a7, SYS_exit
+ ecall
+ jal exit