From: Greg Green (ggreen_at_cs.washington.edu)
Date: Mon Mar 01 2004 - 16:32:38 PST
Disco is a virtual machine monitor aimed at running commodity operating
systems on a NUMA multiprocessor. The goal is to substantially increase
the performance of the operating system without it being aware of the
NUMA architecture. This would allow applications to run and get the
benefit of new architectures without having to wait for an operating
system optimized for the machine to be produced.
The implementation of Disco is discussed. The MIPS R10000 processor is
virtualized. Instructions that cannot safely run inside a VM are
reimplemented in the monitor. Each VM is given an abstraction of real
memory starting at offset 0, and a set of virtual devices: input,
output, hard disks, and network devices. The code runs on the actual
CPU, and only privileged instructions and direct access to physical
memory and devices are intercepted. Disco runs in kernel mode, the
commodity operating system runs in supervisor mode, and applications
run in user mode: basically a 3-ring protection mechanism. A scheduler
maps virtual processors onto the real processors with a round-robin
algorithm.
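The round-robin mapping of virtual processors onto real processors can be sketched roughly as below. This is a minimal illustration with invented names (Scheduler, next_assignment), not the actual Disco scheduler, which also accounts for locality and fairness.

```python
from collections import deque

class Scheduler:
    """Hypothetical sketch: time-share virtual CPUs across physical CPUs
    round-robin. All class and method names are illustrative."""

    def __init__(self, num_physical_cpus):
        self.num_physical = num_physical_cpus
        self.runnable = deque()          # virtual CPUs waiting for a slot

    def add_vcpu(self, vcpu_id):
        self.runnable.append(vcpu_id)

    def next_assignment(self):
        """Return a {physical_cpu: vcpu} mapping for the next time slice."""
        assignment = {}
        for pcpu in range(self.num_physical):
            if not self.runnable:
                break
            vcpu = self.runnable.popleft()
            assignment[pcpu] = vcpu
            self.runnable.append(vcpu)   # rejoin the queue for a later slice
        return assignment
```

Each call hands out the next batch of virtual CPUs in arrival order, so every virtual CPU eventually gets a physical CPU.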
To support the virtual physical memory, there is a clever scheme that
intercepts TLB modifications and rewrites the guest's physical address
into the real machine address. For entries not currently in the TLB,
a pmap structure maintains the mapping from each physical page to its
machine page, so when a TLB miss is trapped the monitor can quickly
insert the proper TLB entry from this second table.
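The two-level translation on a TLB miss can be sketched as follows. This is an illustrative toy (the function and table names are invented), assuming a software-reloaded TLB: the guest OS maps virtual pages to "physical" pages, and the monitor's pmap remaps those to machine pages.

```python
PAGE = 4096  # assumed page size for the sketch

def handle_tlb_miss(vaddr, guest_page_table, pmap, tlb):
    """Fill the TLB with a virtual -> machine translation and
    return the machine address for vaddr."""
    vpage = vaddr // PAGE
    ppage = guest_page_table[vpage]   # guest's notion of a physical page
    mpage = pmap[ppage]               # monitor remaps to the real machine page
    tlb[vpage] = mpage                # insert the corrected TLB entry
    return mpage * PAGE + vaddr % PAGE
```

The guest never sees the pmap; it believes its physical addresses are real, while the TLB actually holds virtual-to-machine translations.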
NUMA memory management is facilitated by a dynamic page migration and
replication scheme. This scheme attempts to maintain locality between
a virtual CPU and the memory pages it is accessing. The hardware
provides per-processor cache-miss counts that the monitor can use to
decide which pages to move or copy. The I/O devices are virtualized
by special device drivers that hand the data to the monitor, which
interacts directly with the devices.
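A policy of that shape might look like the sketch below. The threshold and function names are invented for illustration; the real policy in the paper is more involved (it also considers whether a page is read-shared, writable, etc.).

```python
MIGRATE_THRESHOLD = 100  # assumed tuning constant, not from the paper

def page_policy(miss_counts, home_node):
    """Decide what to do with one page given per-node remote-miss counts.
    miss_counts: {node: cache-miss count}; home_node: where the page lives."""
    remote = {n: c for n, c in miss_counts.items() if n != home_node}
    if not remote:
        return ("stay", home_node)
    hot_node, hot = max(remote.items(), key=lambda kv: kv[1])
    if hot < MIGRATE_THRESHOLD:
        return ("stay", home_node)
    # Heavily used by a single remote node: migrate the page there.
    # Heavily used by several nodes: replicate (for read-mostly pages)
    # so each node gets a local copy.
    others = sum(c for n, c in remote.items() if n != hot_node)
    if others < MIGRATE_THRESHOLD:
        return ("migrate", hot_node)
    return ("replicate", hot_node)
```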
Disk blocks are shared copy-on-write among all VMs that need them. As
long as no VM writes a block, every VM uses the same machine page for
that block. Memory is likewise shared by mapping the same machine page
into each VM. The copy-on-write mechanism only works on non-persistent
disks; the modified block is stored in main memory. A real writable
disk can only be mounted by one VM at a time, so it doesn't need this
virtualization. NFS was optimized so that server pages requested by a
VM on the same machine were mapped into the client's memory without a
copy being made.
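The copy-on-write block sharing can be sketched as below, assuming invented names (CowDisk and its methods): all VMs read the same shared page until one writes, at which point that VM alone gets a private, in-memory copy.

```python
class CowDisk:
    """Toy model of copy-on-write sharing of non-persistent disk blocks."""

    def __init__(self, blocks):
        self.shared = dict(blocks)   # block number -> shared machine page
        self.private = {}            # (vm, block) -> private modified copy

    def read(self, vm, block):
        # A VM sees its own modified copy if it has one, else the shared page.
        return self.private.get((vm, block), self.shared[block])

    def write(self, vm, block, data):
        # First write breaks sharing: the modified block lives in main
        # memory, private to this VM; other VMs keep the original page.
        self.private[(vm, block)] = data
```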
The changes that needed to be made to IRIX 5.3 for Disco were
itemized. Some calls to the monitor were inserted so that it had a
better picture of resource utilization. Most of the changes were in
the HAL of the operating system.
The last third of the paper has experimental measurements of the
performance on a simulator. The overhead added for 5 separate
applications is shown and analyzed. The worst overhead was for pmake,
which used a large number of privileged system instructions that had
to be emulated; this had a 16% overhead. The memory footprint is also
shown for the applications with various numbers of VMs. There is a
measurement of the performance gains created by the NUMA-awareness of
Disco for the Verilog and raytracing applications.
I was very impressed by this paper. The benefits of the system seem
quite good for very little cost. There was a good selection of
applications used for metrics. I think that it would be very difficult
to get the OS vendors to put in the hooks that the virtual machines
require, however. Is there any consensus on that? It was interesting
how you could optimize pages for the NUMA architecture without the OS
knowing about it.
-- Greg Green
This archive was generated by hypermail 2.1.6 : Mon Mar 01 2004 - 16:34:42 PST