Lecture: virtual machines
preparation
OS virtualization recap
- review OS organization
- virtual memory
- virtual CPU/time
- virtual file system
- file descriptor & name space
- a key scheme: naming
- (pid, va) -> pa
- (file, offset) -> disk address
- other naming examples: DNS, Linux namespace (Docker etc.)
- what if we want to run multiple OSes?
- 60s: IBM
- 90s: VMware for x86
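The naming scheme in the recap above can be sketched as two lookup functions. This is a minimal illustration, not how any particular kernel stores its tables: the page size, block size, table contents, and the function names va_to_pa and file_to_disk are all assumptions made up for the example.

```python
# Hypothetical illustration of "naming": each layer maps a (context, name)
# pair to a lower-level name. All tables and numbers are made up.

PAGE_SIZE = 4096
BLOCK_SIZE = 512

# Per-process page tables: pid -> {virtual page number -> physical page number}
page_tables = {
    1: {0: 7, 1: 3},
    2: {0: 5},
}

# Per-file block maps: file name -> list of disk block numbers
block_maps = {
    "a.txt": [40, 41, 90],
}

def va_to_pa(pid, va):
    """(pid, va) -> pa; raises KeyError for an unmapped page (a 'page fault')."""
    vpn, offset = divmod(va, PAGE_SIZE)
    return page_tables[pid][vpn] * PAGE_SIZE + offset

def file_to_disk(name, offset):
    """(file, offset) -> (disk block, byte within that block)."""
    index, within = divmod(offset, BLOCK_SIZE)
    return block_maps[name][index], within

# The same virtual address names different physical memory in different processes.
print(hex(va_to_pa(1, 0x10)))        # 0x7010
print(hex(va_to_pa(2, 0x10)))        # 0x5010
print(file_to_disk("a.txt", 1030))   # (90, 6)
```

The point of the sketch is that the name on the left is only meaningful together with its context (the pid or the file), which is what lets the OS give every process its own address space.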
overview
- terminology
- goals
- fidelity: (almost) identical to execution on hw
- performance: low overhead
- safety: isolation, multiplexing
- strawman plans
- directly run guest kernels on hw: safety?
- emulation (e.g., QEMU/Bochs): performance?
- run guest kernels in user mode: what can go wrong?
- classically virtualizable ISA
- trap-and-emulate: recall changing div by zero to return 42
- sensitive instructions ⊆ privileged instructions (Popek-Goldberg)
- x86 (before vmx/svm)
- not virtualizable: 17 sensitive, unprivileged instructions
- example: pushf/popf/iret behave differently in user mode
- except for v8086
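The trap-and-emulate idea, and why a sensitive but unprivileged instruction breaks it, can be sketched with a toy model. This is not real x86: ToyCPU, ToyVMM, and the instruction name popf_clear_if are hypothetical names invented for the example. The model only assumes what the bullets above state, namely that a privileged instruction traps in user mode while popf silently behaves differently.

```python
# Toy model, not real x86: a guest kernel runs deprivileged (user mode) and
# the VMM emulates whatever traps. Instruction names are simplified.

class Trap(Exception):
    """Raised by the toy CPU when user-mode code runs a privileged instruction."""

class ToyCPU:
    def __init__(self):
        self.interrupts_enabled = True   # the real (hardware) interrupt flag

    def execute(self, instr, user_mode):
        if instr == "cli":
            # Privileged: traps in user mode, takes effect in kernel mode.
            if user_mode:
                raise Trap(instr)
            self.interrupts_enabled = False
        elif instr == "popf_clear_if":
            # Sensitive but unprivileged: in user mode the interrupt-flag
            # update is silently dropped and no trap is raised.
            if not user_mode:
                self.interrupts_enabled = False

class ToyVMM:
    def __init__(self):
        self.cpu = ToyCPU()
        self.virtual_if = True           # the guest's view of the interrupt flag

    def run_guest(self, instr):
        try:
            self.cpu.execute(instr, user_mode=True)
        except Trap:
            # Emulate the effect on the guest's virtual state only.
            if instr == "cli":
                self.virtual_if = False

vmm = ToyVMM()
vmm.run_guest("cli")
print(vmm.virtual_if, vmm.cpu.interrupts_enabled)   # False True

vmm2 = ToyVMM()
vmm2.run_guest("popf_clear_if")
print(vmm2.virtual_if)                               # True
```

In the first run the VMM sees the trap and updates the guest's virtual flag while the hardware flag stays untouched. In the second run the guest meant to disable interrupts, but nothing trapped, so the VMM never learned about it and the guest's virtual state is now wrong.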
x86 virtualization
- what machine state to virtualize: CPL, GDT, IDT, page table, control registers, MSRs, etc.
- virtualize x86 without vmx/svm
- idea: modify the guest code to avoid the 17 instructions
- binary translation (e.g., VMware)
- goal: replace the 17 instructions with trapping ones (e.g., int $3)
- challenge: how to do this efficiently
- paravirtualization (e.g., Xen)
- originally from the Denali isolation kernel (OSDI 2002)
- replace the 17 instructions in guest kernels with hypercalls
- hardware support: Intel’s VT-x (vmx) / AMD’s AMD-V (svm)
- host: root mode (sometimes called ring -1)
- guest: non-root mode
- VM entry: host to guest
- VM exit: guest to host
- overhead: 4000 (older) to 500 (newer) cycles (syscalls: 60-70 cycles)
- virtual memory
- shadow page tables
- simulate %cr3 in software
- write-protect guest’s page table pages
- downside: performance
- extended page tables (EPT) / nested page tables (NPT): guest pa -> host pa
- devices
- take a look at the lvisor exercises from CSE 481A if you are interested
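For the virtual memory bullets above, the relationship between shadow page tables and EPT/NPT can be sketched with dictionaries standing in for page tables. This is a simplified model under stated assumptions: single-level tables, made-up page numbers, and hypothetical helper names (build_shadow, nested_walk, guest_updates_pte). Real hardware uses multi-level tables and a TLB.

```python
# Toy single-level page tables as dicts: page number -> page number.
# All mappings are made up for illustration.

guest_pt = {0: 2, 1: 5}        # guest va page -> guest pa page (guest-managed)
ept = {2: 11, 5: 17, 6: 23}    # guest pa page -> host pa page (VMM-managed)

def build_shadow(guest_pt, ept):
    """Shadow page table: precomputed composition guest va -> host pa."""
    return {gva: ept[gpa] for gva, gpa in guest_pt.items() if gpa in ept}

def nested_walk(gva, guest_pt, ept):
    """EPT/NPT style: both tables are consulted at translation time."""
    return ept[guest_pt[gva]]

shadow = build_shadow(guest_pt, ept)

# Both schemes must agree on every mapped guest virtual page.
assert all(shadow[v] == nested_walk(v, guest_pt, ept) for v in guest_pt)

def guest_updates_pte(gva, new_gpa):
    """With shadow paging, guest page-table pages are write-protected, so this
    write traps and the VMM resynchronizes the shadow entry."""
    guest_pt[gva] = new_gpa        # the trapped guest write
    shadow[gva] = ept[new_gpa]     # VMM fixes up the shadow copy

guest_updates_pte(1, 6)
print(shadow)                      # {0: 11, 1: 23}
```

The sketch shows where the cost of shadow paging comes from: every guest page-table update goes through the VMM, whereas with EPT/NPT the guest edits guest_pt freely and the second mapping is applied during the walk.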
Dune
- goal
- process abstraction for privileged instructions
- Dune process mode
- ring 0, non-root
- use hypercalls as syscalls
- applications
- fast faults
- CPU delivers page faults to user space: compared to JOS lab4
- how about division by zero: should the kernel be involved?
- direct access to page tables, etc.
- example: GC
- use dirty bits to track if memory has been touched
- better page table management
- better TLB control
- performance: pros and cons
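The GC example above (using dirty bits to find memory that has been touched) can be sketched as follows. The PageTable class and its methods are hypothetical and do not correspond to Dune's actual API. The sketch only models the idea that a write sets a per-page dirty bit which the collector can later read and clear.

```python
# Toy model of dirty-bit tracking; the PageTable class is hypothetical and
# does not correspond to Dune's actual interface.

class PageTable:
    def __init__(self, num_pages):
        self.dirty = [False] * num_pages
        self.memory = [0] * num_pages

    def write(self, page, value):
        self.memory[page] = value
        self.dirty[page] = True      # models hardware setting the dirty bit

    def collect_and_clear_dirty(self):
        """Return pages written since the last call, then reset their bits."""
        pages = [p for p, d in enumerate(self.dirty) if d]
        self.dirty = [False] * len(self.dirty)
        return pages

pt = PageTable(8)
pt.write(2, 42)
pt.write(5, 7)
print(pt.collect_and_clear_dirty())   # [2, 5]
pt.write(5, 8)
print(pt.collect_and_clear_dirty())   # [5]
```

A collector with direct page-table access only needs to rescan the pages returned by each call. On real hardware, clearing dirty bits also requires invalidating the matching TLB entries, which is where the "better TLB control" bullet comes in.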