Lecture: Virtual memory
preparation
read the xv6 book: §3, Page tables
administrivia
how’s lab util?
warmup, no kernel code
your grade is what make grade says: no hidden tests, style points, etc.
don’t worry about the 10% “exercises” for now - they are for paper questions later this quarter
lab alloc (physical memory management)
physical memory management (kernel/kalloc.c):
a free list of physical pages, built by embedding a struct run into each free page
slab allocation: feel free to design your own data structures for allocating small objects
attend sections and/or refer to online sources & textbook
don’t get discouraged by tech articles/papers; learn to skim
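A minimal user-space sketch of the free-list idea, modeled loosely on kernel/kalloc.c. The static pool standing in for physical RAM and the NPAGES constant are assumptions of this sketch; the real allocator also takes a spinlock, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PGSIZE 4096
#define NPAGES 8   // hypothetical pool size for this sketch

// A free page stores the list link in its own first bytes,
// so the free list needs no memory besides the free pages themselves.
struct run {
  struct run *next;
};

// Stand-in for physical RAM (an assumption; the kernel uses real memory).
static _Alignas(PGSIZE) unsigned char pool[NPAGES][PGSIZE];
static struct run *freelist;

// Return one page to the free list.
void kfree(void *pa) {
  assert(((uintptr_t)pa % PGSIZE) == 0);  // must be page-aligned
  memset(pa, 1, PGSIZE);                  // junk fill to expose dangling references
  struct run *r = (struct run *)pa;
  r->next = freelist;
  freelist = r;
}

// Take one page off the free list; returns NULL when no page is left.
void *kalloc(void) {
  struct run *r = freelist;
  if (r) {
    freelist = r->next;
    memset(r, 5, PGSIZE);                 // junk fill, overwrites the old link
  }
  return r;
}

// Put every page of the pool on the free list.
void kinit(void) {
  freelist = NULL;
  for (int i = 0; i < NPAGES; i++)
    kfree(pool[i]);
}
```

Both operations are O(1), and the list is LIFO: the most recently freed page is the next one handed out.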
project:
lab mmap/net or your own project;
feel free to talk to us about the scope; start early (forming groups, etc.)
overview
virtual memory
popular in modern OSes for isolation (and more)
example: two user processes write to the same virtual address (e.g., through pointers) 0x1000
the two writes go to different physical addresses, 0x80001000 and 0x80002000
isolation is achieved through naming: one process cannot name a memory address privately owned by another process
today’s focus: how is this achieved?
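The effect can be observed from user space on any POSIX system. This is a sketch under that assumption (it uses fork/waitpid rather than xv6 system calls), and the function and variable names are made up for illustration: after fork, parent and child write to the same virtual address, yet neither sees the other's value.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Lives at the same virtual address in the parent and in the child.
static volatile int value = 0;

// Returns 1 if each process saw only its own write, 0 if not, -1 on error.
int same_va_private_memory(void) {
  pid_t pid = fork();
  if (pid < 0)
    return -1;
  if (pid == 0) {            // child: write 2, report what it reads back
    value = 2;
    _exit(value == 2 ? 2 : 0);
  }
  value = 1;                 // parent: write 1 to the same virtual address
  int status;
  if (waitpid(pid, &status, 0) < 0)
    return -1;
  int child_saw = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
  // The child has finished its write by now; the parent must still read 1.
  return value == 1 && child_saw == 2;
}
```

The two stores use identical addresses; only the page tables differ, so the stores reach different physical pages.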
workflow
both VA (virtual address) and PA (physical address) spaces
divided into fixed-size chunks (e.g., 4KB) called pages
hardware (CPU): perform VA → PA translation based on a data structure called page table
software (OS): set up page tables for translation
isolate user processes from each other: switch to per-process page tables
isolate kernel memory from user: switch to kernel page table (esp. post-Meltdown)
review virtual memory from CSE 351: motivation, TLB
MMU (memory management unit)
given VA (e.g., load/store instructions), consult page table to translate to PA
translation failure (no mapping or insufficient permission): raise an exception (page fault)
how to make VA → PA lookup faster
caching: TLB (translation lookaside buffer)
larger pages, TLB tagging/selective flushing
hardware support (RISC-V)
registers
satp (Supervisor Address Translation and Protection) register
holds the physical page number of the page-table root (plus the MODE and ASID fields)
stval (Supervisor Trap Value) register
holds the page-fault (virtual) address
instructions
use csrw (control/status register write) to set satp
use sfence.vma to order page-table changes before later translations (flushes stale TLB entries)
other architectures have similar registers (e.g., %cr3 and %cr2 on x86)
software support (xv6)
kernel page table: kernel_pagetable (kernel/vm.c)
per-process page table: pagetable in struct proc (kernel/proc.h)
kernel uses kernel page table upon booting (kvminithart() in kernel/vm.c)
kernel saves kernel page table in p->trapframe->kernel_satp (usertrapret() in kernel/trap.c)
kernel->user: switch to per-process page table (userret in kernel/trampoline.S)
user->kernel: switch to kernel page table (uservec in kernel/trampoline.S)
page table
review virtual memory from CSE 351: address translation, multi-level page tables
satp: Figure 4.12 of the RISC-V privileged spec
MODE: xv6 uses 8 (sv39, three-level page table), see Table 4.3
ASID: ignore (set to 0)
PPN: physical page number of the page-table root
example: #define MAKE_SATP(pagetable) ((8L << 60) | (((uint64)pagetable) >> 12)) in xv6’s kernel/riscv.h
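A sketch of the satp layout as plain C, to make the three fields concrete. The helper names are hypothetical (xv6 only defines MAKE_SATP), and unsigned constants are used here so the shift into bit 63 is well defined in standard C.

```c
#include <assert.h>
#include <stdint.h>

#define SATP_MODE_SV39 8ULL
#define SATP_PPN_MASK  ((1ULL << 44) - 1)

// Build a satp value: MODE in bits 63..60, ASID in bits 59..44 (left at 0),
// PPN of the root page-table page in bits 43..0.
static inline uint64_t make_satp(uint64_t root_pa) {
  assert((root_pa & 0xfff) == 0);   // the root page must be page-aligned
  return (SATP_MODE_SV39 << 60) | (root_pa >> 12);
}

// Field accessors (hypothetical helpers, not part of xv6).
static inline uint64_t satp_mode(uint64_t satp) { return satp >> 60; }
static inline uint64_t satp_asid(uint64_t satp) { return (satp >> 44) & 0xffff; }
static inline uint64_t satp_root_pa(uint64_t satp) {
  return (satp & SATP_PPN_MASK) << 12;
}
```

For a root page at PA 0x87fff000, the PPN field is 0x87fff and the MODE field is 8.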
page table (sv39): Figure 3.2 of the xv6 book
three-level page table
64-bit VA (→ 56-bit PA)
top 25 VA bits unused
next 27 VA bits used to index into page table (9 bits for each level)
bottom 12 bits untranslated and copied to PA
each level (one page) contains 512 page-table entries (PTEs)
PTE = 44-bit PPN (of the next-level table, or of the mapped page at the last level) + 10-bit flags
V/R/W/X/U: valid/read/write/execute/user
A/D: accessed/dirty
see the PTE_* macros in xv6’s kernel/riscv.h
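The bit layout above can be written as a handful of macros. This sketch is modeled on the macros in kernel/riscv.h but uses unsigned constants; pte_is_leaf is a hypothetical helper added for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PGSHIFT 12                                  // bottom 12 bits: page offset
#define PXMASK  0x1ffULL                            // 9 index bits per level
#define PXSHIFT(level) (PGSHIFT + 9 * (level))
// Index into the page-table page at the given level (2 = root, 0 = last).
#define PX(level, va) ((((uint64_t)(va)) >> PXSHIFT(level)) & PXMASK)

#define PTE_V (1ULL << 0)   // valid
#define PTE_R (1ULL << 1)   // readable
#define PTE_W (1ULL << 2)   // writable
#define PTE_X (1ULL << 3)   // executable
#define PTE_U (1ULL << 4)   // accessible from user mode
#define PTE_A (1ULL << 6)   // accessed
#define PTE_D (1ULL << 7)   // dirty

// The PPN sits above the 10 flag bits of a PTE.
#define PA2PTE(pa)     ((((uint64_t)(pa)) >> 12) << 10)
#define PTE2PA(pte)    ((((uint64_t)(pte)) >> 10) << 12)
#define PTE_FLAGS(pte) (((uint64_t)(pte)) & 0x3ffULL)

// Hypothetical helper: a valid PTE with R=W=X=0 points to the next level;
// any of R/W/X set makes it a leaf.
static inline int pte_is_leaf(uint64_t pte) {
  return (pte & PTE_V) && (pte & (PTE_R | PTE_W | PTE_X));
}
```

For VA 0x10000000 the three indices are 0, 128, and 0 (levels 2, 1, 0).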
questions
why not just a one-level page table?
2^27 entries * 8 bytes = 1GB per page table (per process);
multiple levels efficiently encode a sparse address space
can a page table have more levels?
it makes sense for user processes to use virtual addresses,
but why does the kernel also use virtual addresses?
page faults
what happens if address translation VA → PA fails?
no valid entries in the page table for translating VA
cpu: raise an exception - more on this next week
xv6: terminate with error messages (this lecture) or do something smart (labs lazy and cow)
example in user space: add *(volatile char *)0xdeadbeef = 0; to main (user/echo.c)
$ echo
usertrap(): unexpected scause 0x000000000000000f (store/AMO page fault) pid=3
sepc=0x0000000000000016 stval=0x00000000deadbeef
example in kernel: add *(volatile char *)0xdeadbeef = 0; to exec (kernel/exec.c);
can use make qemu-trace to print out function names and offsets in stack trace
xv6 kernel is booting
hart 2 starting
hart 1 starting
scause 0x000000000000000f (store/AMO page fault)
sepc=0x0000000080004cba stval=0x00000000deadbeef
PANIC: kerneltrap
[<0x00000000800005ac>] panic+0x46/0x62
[<0x0000000080002af2>] kerneltrap+0xb4/0xd8
[<0x0000000080005ca4>] kernelvec+0x44/0x90
[<0x0000000080005b20>] sys_exec+0xe0/0x110
[<0x0000000080002cca>] syscall+0x3e/0x6c
[<0x0000000080002996>] usertrap+0x60/0x108
useful information for debugging: refer to the RISC-V privileged spec
scause: was the page fault caused by a memory load, a memory store, or an instruction fetch?
sepc: which instruction caused the page fault
stval: which memory address caused the page fault
similar to segmentation fault / kernel panic on Linux
can use sepc to locate the source line (e.g., search for the value in user/echo.asm or kernel/kernel.asm)
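A small sketch of how the scause value in the messages above maps to a fault kind. The exception codes 12, 13, and 15 come from the RISC-V privileged spec; the function name is hypothetical and not part of xv6.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Page-fault exception codes (scause with the interrupt bit, bit 63, clear).
enum {
  SCAUSE_INST_PAGE_FAULT  = 12,
  SCAUSE_LOAD_PAGE_FAULT  = 13,
  SCAUSE_STORE_PAGE_FAULT = 15,
};

// Describe a trap cause; anything that is not a page fault is lumped together.
const char *page_fault_kind(uint64_t scause) {
  if (scause >> 63)
    return "interrupt";
  switch (scause) {
  case SCAUSE_INST_PAGE_FAULT:  return "instruction page fault";
  case SCAUSE_LOAD_PAGE_FAULT:  return "load page fault";
  case SCAUSE_STORE_PAGE_FAULT: return "store/AMO page fault";
  default:                      return "not a page fault";
  }
}
```

Both example traces report scause 0xf, the store/AMO case, which matches the injected write to 0xdeadbeef.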
what page tables does xv6 set up to make the page faults happen?
address spaces in xv6
kernel address space: Figure 3.3 of the xv6 book
mostly identity mapping
except for the trampoline page and the kernel stack - why not just identity mapping?
read kvminit() (kernel/vm.c) and understand how kvmmap() is implemented
can be quite different on other architectures/OS kernels - see Linux/x86-64
process address space: Figure 3.4 of the xv6 book
PTE_U set for user pages
sbrk for growing the heap: sys_sbrk, growproc, uvmalloc, mappages
share the trampoline page with the kernel
exercise: implement vmprint to print page table (lab lazy)
use it to show kernel and process address spaces
example
insert a call to vmprint in kvminit(), right after the first kvmmap
use it to translate VA 0x1000_0000 - what’s the resulting PA?
page table 0x0000000087fff000
..0: pte 0x0000000021fff801 pa 0x0000000087ffe000
.. ..128: pte 0x0000000021fff401 pa 0x0000000087ffd000
.. .. ..0: pte 0x0000000004000007 pa 0x0000000010000000
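The translation can be checked by walking the printed table in software. This is a sketch, not xv6 code: the three page-table pages from the printout are simulated in a small host array (sim_pages and pa_to_host are assumptions of this sketch), and superpage leaves and permission checks are ignored.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE 4096ULL
#define PTE_V (1ULL << 0)
#define PX(level, va) (((va) >> (12 + 9 * (level))) & 0x1ff)
#define PTE2PA(pte) (((pte) >> 10) << 12)
#define NO_MAPPING UINT64_MAX

// Simulated physical memory: page-table pages at PAs
// 0x87ffd000, 0x87ffe000, and 0x87fff000 (taken from the printout).
#define SIM_BASE 0x87ffd000ULL
static uint64_t sim_pages[3][512];

static uint64_t *pa_to_host(uint64_t pa) {
  assert(pa >= SIM_BASE && pa < SIM_BASE + 3 * PGSIZE && pa % PGSIZE == 0);
  return sim_pages[(pa - SIM_BASE) / PGSIZE];
}

// Install the three PTEs shown in the vmprint output.
void sim_init(void) {
  pa_to_host(0x87fff000ULL)[0]   = 0x21fff801ULL;  // root  -> level-1 table
  pa_to_host(0x87ffe000ULL)[128] = 0x21fff401ULL;  // level 1 -> level-0 table
  pa_to_host(0x87ffd000ULL)[0]   = 0x04000007ULL;  // leaf, flags V|R|W
}

// Three-level walk: follow valid PTEs down, then append the page offset.
uint64_t translate(uint64_t root_pa, uint64_t va) {
  uint64_t *table = pa_to_host(root_pa);
  for (int level = 2; level > 0; level--) {
    uint64_t pte = table[PX(level, va)];
    if (!(pte & PTE_V))
      return NO_MAPPING;
    table = pa_to_host(PTE2PA(pte));
  }
  uint64_t pte = table[PX(0, va)];
  if (!(pte & PTE_V))
    return NO_MAPPING;
  return PTE2PA(pte) | (va & (PGSIZE - 1));
}
```

With the printed entries, VA 0x10000000 uses indices 0, 128, 0 and resolves to PA 0x10000000, i.e., an identity mapping.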
applications
more on TLB
hardware vs. software-managed TLB
TLB shootdown in multiprocessors
two-dimensional paging: see figure 1, Accelerating Two-Dimensional Page Walks for Virtualized Systems
how to reduce the overhead?
protect against stack overflow
see Michael Barr’s Bookout v. Toyota, “Toyota’s major stack mistakes”
trick: put an unmapped guard page right below the stack
xv6: kernel/memlayout.h
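A sketch of the kernel-stack layout, assuming the KSTACK and TRAMPOLINE definitions found in recent xv6-riscv sources (the exact macros may differ in the lab version). KSTACK_GUARD and in_kstack_guard are hypothetical helpers added for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE 4096ULL
#define MAXVA (1ULL << (9 + 9 + 9 + 12 - 1))   // one bit less than Sv39 allows
#define TRAMPOLINE (MAXVA - PGSIZE)            // highest page of the address space

// Each kernel stack gets a two-page slot below the trampoline:
// one mapped stack page and one unmapped guard page.
#define KSTACK(p) (TRAMPOLINE - ((uint64_t)(p) + 1) * 2 * PGSIZE)
#define KSTACK_GUARD(p) (KSTACK(p) - PGSIZE)   // hypothetical helper

// Returns 1 if va lies in the guard page below one of the first nproc stacks.
int in_kstack_guard(uint64_t va, int nproc) {
  for (int p = 0; p < nproc; p++) {
    uint64_t g = KSTACK_GUARD(p);
    if (va >= g && va < g + PGSIZE)
      return 1;
  }
  return 0;
}
```

A stack grows downward, so an overflow of stack p first touches KSTACK_GUARD(p), which has no valid PTE; the result is a page fault instead of silent corruption of the next stack.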
implement null pointer dereference exception
how would you implement this for Java, say obj->field?
trick: put a non-mapped page at VA zero
useful for catching program bugs
limitations?
limited physical memory
applications need more memory than physical memory
early days: two floppy drives
strawman: applications store part of state to disk and load back later
hard to write applications
virtual memory: offer the illusion of a large, contiguous memory
swap space: OS pages out some pages to disk transparently
distributed shared memory: access other machines’ memory across network
grow stack on demand (lab lazy)
sbrk currently allocates memory upon invocation
allocate memory lazily (upon page fault)
copy-on-write fork (lab cow)
strawman fork: copy all pages from parent to child
observation: child and parent share most of the data
mark pages as copy-on-write
make a copy on page fault
other sharing
multiple guest OSes running inside the same hypervisor
shared objects: .so/.dll files
memory-mapped files (lab mmap)
mmap(): map files, read/write files like memory
simple programming interface
when to page-in/page-out content?
avoid data copying: send an mmaped file to network
compare to using read/write
no data transfer from kernel to user
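A sketch of the "read files like memory" interface using POSIX mmap on a Unix host (not the xv6 lab interface). The function name is hypothetical; the file is scanned through a pointer, with no read() calls and no user-space buffer.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Count occurrences of byte c in a file by mapping it; returns -1 on error.
long count_byte_mmap(const char *path, char c) {
  int fd = open(path, O_RDONLY);
  if (fd < 0)
    return -1;
  struct stat st;
  if (fstat(fd, &st) < 0) {
    close(fd);
    return -1;
  }
  if (st.st_size == 0) {      // mmap rejects a zero-length mapping
    close(fd);
    return 0;
  }
  const char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);                  // the mapping stays valid after close
  if (p == MAP_FAILED)
    return -1;
  long n = 0;
  for (off_t i = 0; i < st.st_size; i++)   // pages are faulted in on first touch
    if (p[i] == c)
      n++;
  munmap((void *)p, (size_t)st.st_size);
  return n;
}
```

The kernel decides when to page content in; the program only dereferences a pointer.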
Q&A
can we simply return from main() in xv6 user-space programs without calling exit()?
no proper crt0 setup, no caller to return to
one possible fix:
diff --git a/Makefile b/Makefile
index 44ad5f2..679657c 100644
--- a/Makefile
+++ b/Makefile
@@ -123,13 +123,16 @@ $U/initcode: $U/initcode.S
 tags: $(OBJS) _init
 	etags *.S *.c
 
-ULIB = $U/ulib.o $U/usys.o $U/printf.o $U/umalloc.o
+ULIB = $U/crt0.o $U/ulib.o $U/usys.o $U/printf.o $U/umalloc.o
 
 _%: %.o $(ULIB)
-	$(LD) $(LDFLAGS) -N -e main -Ttext 0 -o $@ $^
+	$(LD) $(LDFLAGS) -N -e _start -Ttext 0 -o $@ $^
 	$(OBJDUMP) -S $@ > $*.asm
 	$(OBJDUMP) -t $@ | sed '1,/SYMBOL TABLE/d; s/ .* / /; /^$$/d' > $*.sym
 
+$U/crt0.o : $U/crt0.S
+	$(CC) $(CFLAGS) -I kernel -c -o $U/crt0.o $U/crt0.S
+
 $U/usys.S : $U/usys.pl
 	perl $U/usys.pl > $U/usys.S
 
diff --git a/user/crt0.S b/user/crt0.S
new file mode 100644
index 0000000..a132c9b
--- /dev/null
+++ b/user/crt0.S
@@ -0,0 +1,10 @@
+#include "syscall.h"
+
+.globl _start
+_start:
+	call main
+
+exit:
+	li a7, SYS_exit
+	ecall
+	jal exit