From: Brian Milnes (brianmilnes_at_qwest.net)
Date: Mon Mar 01 2004 - 14:33:56 PST
Disco: Running Commodity Operating Systems on Scalable Multiprocessors -
Bugnion, Devine and Rosenblum
This is an unbelievably meaty paper, full of subtle implementation
discussion. I'm not sure I'm getting all the details of the virtual machine
and disk emulation right, and would like you to walk through some examples
in class. It also does not lend itself to a short review.
The authors propose using virtual machine monitors to simplify the
construction of operating systems for large shared memory
multiprocessors. Their system DISCO reduces some of the costs of virtual
machines by sharing memory such as the buffer cache across virtual machines,
and couples the virtual machines using standard protocols such as TCP/IP and
NFS. NFS seems a pretty poor choice, as it produces lockups in every kernel
that I've used it with. Their uniprocessor overhead is at most 16%, and they
speed up some workloads by 40%, presumably by improving the scalability of
the operating system and hiding the NUMA-ness of these systems.
Their approach has a whole host of advantages and disadvantages.
It allows fault containment, running different operating systems
simultaneously, running different versions of the same operating system
simultaneously, hiding the NUMA aspects of the machine, and running
lighter-weight specialized operating systems. It requires more memory,
different exception processing, remapping of privileged instructions and I/O
devices, tolerating the lack of a coherent global scheduling and memory
management policy, and additional communication overhead.
DISCO virtualizes MIPS R10000 processors by emulating instructions,
the MMU and the trap architecture. They use loads and stores to special
addresses to optimize frequent kernel operations. They dynamically relocate
pages, virtualize disks and provide a large-packet virtual network between
virtual machines. DISCO is implemented as a shared-memory threaded program,
a mere 13K lines of cache-conscious, low-synchronization code.
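To make the special-address idea concrete, here is a minimal sketch of how a
frequent privileged operation such as disabling interrupts could become a
plain store; the page layout and every name (DISCO_PRIV_BASE, disco_cli,
disco_monitor_call) are my own assumptions for illustration, not the paper's
interface.

/* Hypothetical sketch: frequent privileged operations rewritten as ordinary
 * loads/stores to a per-virtual-machine page that the monitor maps into the
 * guest, so no trap is needed on the common path. All names are invented. */

#include <stdint.h>

#define DISCO_PRIV_BASE 0xFFFFA000UL   /* assumed location of the special page */

struct disco_priv_page {
    volatile uint32_t interrupts_enabled;  /* virtual CPU interrupt-enable bit */
    volatile uint32_t pending_interrupts;  /* interrupts queued while disabled */
};

#define DISCO_PRIV ((struct disco_priv_page *)DISCO_PRIV_BASE)

/* Illustrative trapping monitor call, used only on the slow path. */
extern void disco_monitor_call(int op);
#define DISCO_OP_DELIVER_INTERRUPTS 1

/* Disabling interrupts becomes a plain store instead of a privileged
 * instruction that would trap into the monitor. */
static inline void disco_cli(void) { DISCO_PRIV->interrupts_enabled = 0; }

/* Re-enabling interrupts stays cheap unless the monitor queued interrupts
 * while they were "off", in which case we take the trap to deliver them. */
static inline void disco_sti(void)
{
    DISCO_PRIV->interrupts_enabled = 1;
    if (DISCO_PRIV->pending_interrupts)
        disco_monitor_call(DISCO_OP_DELIVER_INTERRUPTS);
}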
They virtualize CPUs with direct execution and trap into the monitor to
emulate privileged instructions and faults. DISCO itself runs unmapped, in
physical memory mode, and uses the hardware TLB directly for guest mappings
to avoid emulating it. Because the MIPS address space identifiers are not
virtualized, the TLB must be flushed when switching to a different virtual
CPU; they reduce the cost of the resulting misses with a second-level
software TLB kept for each virtual CPU.
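My rough sketch of what the second-level software TLB lookup could look
like, with invented structure names and sizes (the paper gives no code): on
a hardware TLB miss, the monitor first checks a per-virtual-CPU cache of
recent virtual-to-machine translations before taking the slower path through
the guest's page tables.

/* Sketch of a per-virtual-CPU second-level software TLB; sizes, names and
 * the direct-mapped layout are assumptions for illustration only. */

#include <stdint.h>
#include <stdbool.h>

#define L2TLB_ENTRIES 1024   /* assumed size; a power of two */

struct l2tlb_entry {
    uint64_t vpn;   /* guest virtual page number (tagged by address space) */
    uint64_t mpn;   /* machine page number after physical-to-machine remap */
    bool     valid;
};

struct vcpu {
    struct l2tlb_entry l2tlb[L2TLB_ENTRIES];
    /* ... saved registers, pending interrupts, etc. ... */
};

/* Called from the monitor's TLB-miss handler. On a hit the entry can be
 * installed directly in the hardware TLB; on a miss the caller walks the
 * guest page table, translates physical to machine, and records it here. */
static bool l2tlb_lookup(const struct vcpu *v, uint64_t vpn, uint64_t *mpn)
{
    const struct l2tlb_entry *e = &v->l2tlb[vpn & (L2TLB_ENTRIES - 1)];
    if (e->valid && e->vpn == vpn) {
        *mpn = e->mpn;
        return true;
    }
    return false;
}

static void l2tlb_insert(struct vcpu *v, uint64_t vpn, uint64_t mpn)
{
    struct l2tlb_entry *e = &v->l2tlb[vpn & (L2TLB_ENTRIES - 1)];
    e->vpn = vpn;
    e->mpn = mpn;
    e->valid = true;
}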
DISCO measures the frequency of access to pages and migrates or
replicates hot pages to the node running the virtual CPU that accesses
them. They support virtual I/O devices by requiring a monitor call when a
driver reads or writes data and by intercepting DMA requests. They give
each virtual machine its own disk partition but really share one global
buffer cache, implemented with a B-tree keyed by disk location. Shared disk
blocks are marked copy-on-write and mapped into the requesting virtual
machine, achieving sharing very similar to that within a single operating
system.
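My reading of the shared buffer cache, as a hedged sketch: the global B-tree
keyed by disk location maps sectors to machine pages, so a virtual DMA read
of a block that some virtual machine has already read can be satisfied by
mapping the existing page copy-on-write instead of going to disk. The
interface below (btree_lookup, map_into_vm_cow, and so on) is invented for
illustration.

/* Sketch of copy-on-write sharing of disk blocks across virtual machines.
 * btree_lookup/btree_insert stand in for the global B-tree keyed by disk
 * location; map_into_vm_cow stands in for installing a copy-on-write
 * mapping in the requesting virtual machine. All names are illustrative. */

#include <stdint.h>

typedef uint64_t mpn_t;   /* machine page number; 0 means "not cached" */

extern mpn_t btree_lookup(uint64_t disk_addr);
extern void  btree_insert(uint64_t disk_addr, mpn_t page);
extern mpn_t read_block_from_disk(uint64_t disk_addr);   /* real DMA */
extern void  map_into_vm_cow(int vm_id, uint64_t guest_paddr, mpn_t page);

/* Handle a virtual DMA read issued by virtual machine vm_id into the guest
 * physical address guest_paddr. */
void virtual_dma_read(int vm_id, uint64_t disk_addr, uint64_t guest_paddr)
{
    mpn_t page = btree_lookup(disk_addr);
    if (page == 0) {
        /* First reader anywhere: do the real disk read once and remember it. */
        page = read_block_from_disk(disk_addr);
        btree_insert(disk_addr, page);
    }
    /* Every reader gets the same machine page mapped copy-on-write, so
     * identical buffer-cache contents are shared across virtual machines
     * until one of them writes to the block. */
    map_into_vm_cow(vm_id, guest_paddr, page);
}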
They define a virtual network device with no MTU limit that they
implement using memory remapping. They respect the alignment of the data and
allow any data on a read-only page to be remapped between virtual machines.
This is a very cool idea that makes the cost of a transfer close to that of
updating a mapping rather than copying the data, but it required some
alignment changes to IRIX's mbuf mechanism.
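A sketch of the remapping idea on the virtual subnet, again with invented
names: when a payload occupies a whole page and both ends are page aligned,
the monitor does not copy it but maps the sender's page read-only into the
receiver's virtual machine.

/* Sketch of zero-copy delivery on the virtual subnet: page-aligned,
 * page-sized payloads are remapped read-only into the receiver instead of
 * being copied. map_readonly_into_vm and copy_into_vm are illustrative
 * stand-ins, not the paper's interface. */

#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL   /* assumed page size */

extern void map_readonly_into_vm(int dst_vm, uint64_t dst_paddr, uint64_t src_mpn);
extern void copy_into_vm(int dst_vm, uint64_t dst_paddr, const void *src, size_t len);

/* Deliver one payload fragment from the sending virtual machine (whose data
 * lives in machine page src_mpn at address src) to the receiver's buffer. */
void vnet_deliver(int dst_vm, uint64_t dst_paddr,
                  uint64_t src_mpn, const void *src, size_t len)
{
    if (len == PAGE_SIZE &&
        dst_paddr % PAGE_SIZE == 0 &&
        (uintptr_t)src % PAGE_SIZE == 0) {
        /* Aligned full page: install a read-only mapping of the sender's
         * page in the receiver; the page is shared until someone writes. */
        map_readonly_into_vm(dst_vm, dst_paddr, src_mpn);
    } else {
        /* Unaligned or partial data falls back to an ordinary copy. */
        copy_into_vm(dst_vm, dst_paddr, src, len);
    }
}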
They made almost all of their changes at the hardware abstraction
layer (HAL) of IRIX. The HAL was adjusted to turn traps for privileged
operations into accesses to special memory, and the trap code is rewritten
to work this way dynamically. The HAL also provides hints to DISCO about
page frees and idle CPUs. I wonder what happened when they tried this on
Linux, which does not have a clear HAL.
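The hint path could look something like the sketch below; this is purely my
guess at the shape of the interface, with invented names, since the paper
describes the hints only in prose. The patched HAL tells the monitor which
pages the guest has freed, so their backing machine memory can be reused,
and tells it when a virtual CPU goes idle so it can be descheduled.

/* Sketch of HAL-level hints from the guest OS to the monitor. The hint
 * numbers and the monitor-call mechanism are assumptions for illustration;
 * the cheap path in DISCO would again be a load/store to a special address. */

#include <stdint.h>

enum disco_hint {
    DISCO_HINT_PAGE_FREED = 1,   /* guest no longer needs this physical page */
    DISCO_HINT_CPU_IDLE   = 2,   /* this virtual CPU has no runnable work */
};

/* Illustrative trapping monitor call. */
extern void disco_monitor_hint(enum disco_hint hint, uint64_t arg);

/* Patched into the HAL's page-free path: the monitor may unmap the backing
 * machine page and hand it to another virtual machine. */
void hal_page_freed(uint64_t guest_pfn)
{
    disco_monitor_hint(DISCO_HINT_PAGE_FREED, guest_pfn);
}

/* Patched into the HAL's idle loop: the monitor can deschedule this virtual
 * CPU instead of letting it spin. */
void hal_cpu_idle(void)
{
    disco_monitor_hint(DISCO_HINT_CPU_IDLE, 0);
}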
They measured DISCO on a uniprocessor IRIX machine and on a
simulator of the parallel machine. They see between 3% and 16% slowdown on
four workloads and note that restructuring IRIX could reduce some of these
overheads. They suffer slightly in allocating memory among the virtual
operating systems, but can achieve up to a 60% speedup when IRIX's virtual
memory system locks become the bottleneck.