From: Raz Mathias (razvanma_at_exchange.microsoft.com)
Date: Mon Mar 01 2004 - 17:28:23 PST
Today's paper was about a system named Disco that creates a thin
interface layer between the machine and the operating system to allow
multiple operating systems to safely run concurrently on one machine.
Disco is a Virtual Machine Monitor (VMM), meaning that it virtualizes
all the underlying physical resources of a machine.
The paper compares this side-by-side approach to running multiple
operating systems with clustering, where each individual application
need not know it is running on a cluster; in the case of Disco, it is
each individual operating system that does not know its hardware is
being virtualized. I thought this comparison placed the VMM on a
convenient continuum extending from the last batch of papers we read on
clustering. The underlying VMM provides several advantages that are
almost completely transparent to the operating system itself: more
effective sharing of data through a local virtual network, fault
tolerance through the added layer of abstraction, the ability to run
specialized operating systems that confer performance advantages side
by side with commodity operating systems, and the ability to make
effective use of new hardware for which operating system support has
not yet been built (e.g., NUMA systems). The disadvantages are the
memory overhead the extra layer requires and the accompanying loss of
performance. The performance situation is analogous to the one for
user-level thread packages: not enough information passes from the
higher level down to the underlying system for the system to schedule
the higher layer effectively on the available resources (e.g., an OS
wasting valuable CPU cycles by spinning), and the higher-level system
has no logic to account for the thrashing of resources that can occur
when other systems share the same VMM (i.e., the problem of making
effective policy decisions with respect to resource management).
All the resources are virtualized in the system. The VMM schedules the
CPU among the different virtual machines and emulates the execution of
privileged instructions. Disco creates a new layer between the machine
and "physical memory," which the virtual machine then maps into virtual
memory: Disco maintains its own physical-to-machine address translation
and flushes the TLB when switching virtual processors.
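As a minimal sketch (with invented structures; the paper's pmap and TLB
bookkeeping are richer than this), the composed translation looks
something like:

```python
# Sketch of Disco's extra level of address translation. All structures
# here are invented for illustration: the guest OS maps virtual ->
# "physical" pages, the monitor maps physical -> machine pages, and the
# hardware TLB caches the composed virtual -> machine translation.

PAGE_SIZE = 4096

class VirtualCPU:
    def __init__(self, guest_page_table, pmap):
        self.guest_page_table = guest_page_table  # vpage -> ppage (guest's view)
        self.pmap = pmap                          # ppage -> mpage (monitor's view)
        self.tlb = {}                             # cached vpage -> mpage entries

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage not in self.tlb:
            ppage = self.guest_page_table[vpage]  # guest's mapping
            self.tlb[vpage] = self.pmap[ppage]    # composed with monitor's
        return self.tlb[vpage] * PAGE_SIZE + offset

def switch_vcpu(old, new):
    # As described above, Disco flushes the hardware TLB when it
    # switches virtual processors, since cached entries belong to the
    # previously running virtual machine.
    old.tlb.clear()
    return new

vcpu = VirtualCPU({0: 7}, {7: 42})   # vpage 0 -> ppage 7 -> mpage 42
assert vcpu.translate(123) == 42 * PAGE_SIZE + 123
```

Composing the two mappings on a miss is what lets the guest keep its own
notion of contiguous physical memory while the monitor places pages
wherever it likes in machine memory.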
Flushing the TLB on these switches turns out to cause a large
degradation in system performance. On NUMA hardware, Disco abstracts
the underlying architecture to make it look like a number of SMPs. An
interesting aspect of the implementation is that pages are moved or
replicated between different CPUs' memories when they are touched by
different CPUs in the NUMA system; this results in effective use of the
architecture under operating systems that have no concept of NUMA at all.
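The migrate-versus-replicate decision can be sketched as a toy policy;
the counter names and threshold below are invented for illustration
(the real system drives this from the machine's hardware miss counts):

```python
# Toy sketch of the NUMA policy described above: a page touched heavily
# from one remote node (or written) is migrated there; a page that is
# read-shared by several nodes is replicated. Threshold and counters
# are invented for illustration.

MIGRATE_THRESHOLD = 100

class Page:
    def __init__(self, home_node):
        self.home_node = home_node
        self.remote_touches = {}     # node -> touch count
        self.replicas = set()        # nodes holding a read-only copy
        self.writable = False

def on_remote_touch(page, node, is_write):
    page.remote_touches[node] = page.remote_touches.get(node, 0) + 1
    if is_write:
        page.writable = True
    if page.remote_touches[node] < MIGRATE_THRESHOLD:
        return "leave"               # not hot enough yet
    if page.writable or len(page.remote_touches) == 1:
        # Hot from a single node, or written: move the page there.
        page.home_node = node
        page.remote_touches.clear()
        return "migrate"
    # Hot and read-shared from several nodes: give this node a copy.
    page.replicas.add(node)
    return "replicate"

p = Page(home_node=0)
for _ in range(100):
    action = on_remote_touch(p, node=1, is_write=False)
assert action == "migrate" and p.home_node == 1
```

A written page cannot safely be replicated, which is why the sketch only
replicates pages that have seen read traffic alone.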
Virtual I/O intercepts all calls to programmed I/O, thereby multiplexing
the underlying devices. A couple of interesting cases arise in which
Disco can actually improve sharing among VMs. The first is the use of
copy-on-write disks, in which pages from disk are mapped to a single
location in machine memory even though multiple VMs may be holding
references to them. This is especially useful for kernel and
application images, which do not require that each virtual machine
maintain its own copy. Upon a write, the VMM makes a private copy of
the page for that operating system. Another interesting case is that of
the virtualized network. High-speed sharing among different operating
systems happens through a virtual Ethernet that is simply a memory
transfer from one VM to another. This virtualization (together with the
page replication and migration) is the key that can help in building
highly scalable systems on NUMA machines from commodity operating
systems that are oblivious to the underlying memory architecture; the
commodity operating system only needs to know how to belong to a
cluster of machines.
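Both sharing tricks can be sketched together (all structures here are
invented for illustration): disk pages shared copy-on-write in machine
memory, and the virtual Ethernet as a monitor-mediated memory copy:

```python
# Sketch of Disco's two sharing paths (invented structures):
#  - copy-on-write disk pages: one machine-memory copy is shared by
#    every VM until some VM writes, which triggers a private copy
#  - virtual Ethernet: a "send" between co-resident VMs is just a
#    memory transfer inside the monitor

class Monitor:
    def __init__(self):
        self.disk_cache = {}    # disk block -> shared machine page

    def map_disk_block(self, block, disk):
        # Every VM reading the same block shares one machine page.
        if block not in self.disk_cache:
            self.disk_cache[block] = bytearray(disk[block])
        return {"shared": self.disk_cache[block], "private": None}

    def read(self, mapping):
        return bytes(mapping["private"] or mapping["shared"])

    def write(self, mapping, data):
        # Copy-on-write: the first write gives this VM a private copy.
        if mapping["private"] is None:
            mapping["private"] = bytearray(mapping["shared"])
        mapping["private"][:len(data)] = data

    def vnet_send(self, payload, receiver_buf):
        # Virtual Ethernet between co-resident VMs: a memory copy.
        receiver_buf[:len(payload)] = payload

disk = {0: b"kernel page "}
mon = Monitor()
m1 = mon.map_disk_block(0, disk)   # VM 1 maps the kernel image
m2 = mon.map_disk_block(0, disk)   # VM 2 shares the same machine page
assert m1["shared"] is m2["shared"]
mon.write(m1, b"patched")          # VM 1's write triggers a private copy
assert mon.read(m2) == b"kernel page "   # VM 2 still sees the original
buf = bytearray(5)
mon.vnet_send(b"hello", buf)       # VM-to-VM "network" send
```

The point in both cases is that data shared between co-resident VMs
never has to cross a real device.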
My big question for this paper is, "How can I/O multiplexing possibly
work?" In the case of multiplexing the keyboard, if the first character
goes to the first VM, the second goes to the second VM, etc., how can
this be correct (it isn't), and what am I missing from this picture?
This archive was generated by hypermail 2.1.6 : Mon Mar 01 2004 - 17:28:12 PST