From: Cem Paya 98 (Cem.Paya.98_at_Alum.Dartmouth.ORG)
Date: Mon Mar 01 2004 - 15:18:13 PST
Paper review: Disco
CSE 551P, Cem Paya
Rosenblum et al. describe an architecture codenamed Disco
that can run commodity operating systems designed for
uniprocessors on a scalable multiprocessor architecture.
The main idea is to virtualize the hardware as a collection
of virtual processors. Each such processor can run its own
OS, independently of the others. In other words, the same
machine can run Linux and Windows NT at the same time, with
each OS tricked into believing that it is executing on a
single-processor machine. The main objective is to take an
unmodified, off-the-shelf OS designed for single-processor
execution and run it transparently on a multiprocessor
system. The proposed design supports having more VMs than
there are physical processors, so one could try to
multiplex a large number of VMs, similar to the Denali
isolation kernel. But whereas Denali focused on scaling to
hundreds of VMs on the same (possibly single-processor)
hardware, the novel aspect of Disco is its approach to
designing operating systems for scalable multiprocessors.
In simplified terms, one does not design for SMP or NUMA
explicitly at all. Instead, multiple OSes (one for each CPU
or group of processors on the NUMA machine), built without
awareness of these considerations, are layered on top of a
virtual machine monitor (VMM), which becomes responsible
for scaling. This paper lays out the blueprint for that
VMM, and Disco is a prototype implemented on one particular
experimental NUMA machine, the Stanford FLASH.
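As a rough illustration of multiplexing more virtual
processors than there are physical ones, here is a minimal
round-robin sketch in Python. It is not Disco's actual
scheduler: the function name schedule, the virtual CPU
labels and the fixed time slices are all assumptions made
for this example.

```python
from collections import deque


def schedule(vcpus, n_physical, n_slices):
    """Toy round-robin time-sharing of virtual CPUs (hypothetical helper).

    In each time slice the first n_physical virtual CPUs in the queue
    run, then go to the back of the queue.
    """
    queue = deque(vcpus)
    timeline = []
    for _ in range(n_slices):
        # Pick as many runnable virtual CPUs as we have physical CPUs.
        running = [queue.popleft() for _ in range(min(n_physical, len(queue)))]
        timeline.append(running)
        # Preempted virtual CPUs wait at the back of the queue.
        queue.extend(running)
    return timeline


# Three virtual CPUs sharing two physical CPUs for three time slices.
print(schedule(["A", "B", "C"], n_physical=2, n_slices=3))
# [['A', 'B'], ['C', 'A'], ['B', 'C']]
```

Each guest gets a fair share of processor time here, but a
real monitor would also have to weigh cache and memory
locality when placing a virtual CPU.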
The authors make a good point about commodity OS support
being a constraining factor and an obstacle to innovation
in hardware design. Without widespread operating system
support, new hardware functionality and design paradigms
can’t gain traction with users. The growth of SMP on the
x86 architecture coincided with support in Windows NT and,
more recently, Linux. Windows has since added NUMA support
to the kernel and Linux will likely follow suit, enabling a
new generation of server products. From here it is a
natural conclusion that being able to add support for such
features to an existing OS without major effort is
important. (The problem is that, at least historically
speaking, NT and Linux both have SMP support today, so this
is a sunk cost. It is also not true that systems overall
become more complex or buggy as a result: users on a
uniprocessor architecture can be shielded entirely from
changes made to support more advanced architectures. For
example, NT ships two different versions of the kernel,
uniprocessor and multiprocessor, and the installation
process chooses the correct one. Linux likewise can be
built as a uniprocessor-only kernel. Also, the role of
market forces can’t be ignored: if the hardware
configuration is important (SMP and NUMA are both
performance critical), there is competitive pressure to
squeeze out every last optimization. Even if a VMM works as
a first approximation, for faster development times, the OS
will eventually migrate towards native support.)
Disco emulates the raw hardware of the MIPS R10000
processor, with a few tweaks necessary to support unmapped
kernel virtual memory. There are some extensions to the
architecture, but these are optimizations intended for
Disco-aware operating systems; in principle, no changes to
existing code are required to run on Disco. The R10000 did
cause some problems because of the memory access issue
above, which the authors point out does not apply to x86 or
Alpha. Getting IRIX 5.3 to run on Disco required changing
some header files and recompiling and relinking the kernel.
On most other architectures a “commodity” OS could run
without modification. Any optimizations to run more
efficiently on Disco could be confined to a hardware
abstraction layer (HAL), which again is a common design
feature.
Disco faced challenges similar to Denali’s, such as the
need to virtualize I/O devices and to provide “virtual”
physical memory that the VM believes is the “real” memory
of the underlying hardware. Unlike Denali, which did not
provide MMU functionality, Disco has a very sophisticated
implementation that manipulates the real TLB entries in
lock-step with the way the guest OS running on the VM
manipulates what it thinks are the TLB entries. It also
interposes on disk accesses to implement efficient reads of
shared data. Such sharing exists only at the implementation
level, using copy-on-write semantics; the VMs themselves
cannot communicate with each other except through
high-level network protocols.
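The two mechanisms in this paragraph, translating the
guest's "physical" pages to machine pages on TLB inserts
and sharing disk data copy-on-write, can be sketched
together in a toy Python model. This is only an
illustration under assumptions: the Monitor class, its
method names and the dictionary-based tables are invented
for the example and do not reflect Disco's real data
structures.

```python
class Monitor:
    """Toy VMM memory model (hypothetical; not Disco's real structures)."""

    def __init__(self):
        self.machine = {}     # machine page number -> page contents
        self.refcount = {}    # machine page number -> number of references
        self.pmap = {}        # (vm, guest physical page) -> machine page
        self.disk_cache = {}  # disk block -> machine page holding its data
        self.next_mpn = 0

    def _alloc(self, data):
        mpn = self.next_mpn
        self.next_mpn += 1
        self.machine[mpn] = bytearray(data)
        self.refcount[mpn] = 0
        return mpn

    def _map(self, vm, ppn, mpn):
        self.pmap[(vm, ppn)] = mpn
        self.refcount[mpn] += 1

    def disk_read(self, vm, block, ppn, disk):
        """Guest reads a disk block into its 'physical' page ppn."""
        if block not in self.disk_cache:
            mpn = self._alloc(disk[block])
            self.refcount[mpn] += 1  # the cache itself holds one reference
            self.disk_cache[block] = mpn
        # A block already in memory is mapped, not copied.
        self._map(vm, ppn, self.disk_cache[block])

    def tlb_insert(self, vm, vpn, ppn, writable):
        """Guest installs vpn -> ppn; return the entry really installed."""
        mpn = self.pmap[(vm, ppn)]
        shared = self.refcount[mpn] > 1
        # Shared pages are mapped read-only so that a write traps.
        return (vpn, mpn, writable and not shared)

    def write_fault(self, vm, ppn):
        """Guest wrote to a shared page: give it a private copy."""
        old = self.pmap[(vm, ppn)]
        if self.refcount[old] > 1:
            new = self._alloc(self.machine[old])
            self.refcount[old] -= 1
            self._map(vm, ppn, new)
        return self.pmap[(vm, ppn)]


m = Monitor()
disk = {7: b"kernel text"}
m.disk_read("vm1", 7, ppn=3, disk=disk)
m.disk_read("vm2", 7, ppn=9, disk=disk)
print(m.tlb_insert("vm1", vpn=100, ppn=3, writable=True))  # (100, 0, False)
private = m.write_fault("vm1", 3)
print(m.tlb_insert("vm1", vpn=100, ppn=3, writable=True))  # (100, 1, True)
```

Neither guest ever sees a machine page number, which is
what lets the monitor share or copy pages behind their
backs.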
The most remarkable aspect of Disco is its ability to
abstract away the NUMA nature of the system by exposing a
flat, uniform address space to the VMs while combating
performance degradation with intelligent memory management.
Among other tricks, it can move pages around to improve
locality. With help from the hardware, which keeps track of
reads, it chooses between migrating a page to the node that
accesses it most frequently and replicating it when there
are multiple readers.
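The choice between migration and replication can be
sketched as a small decision function. This is a guess at
the shape of such a policy, not the one in the paper: the
function name place_page, the hot_threshold value and the
written flag are assumptions introduced for illustration.

```python
def place_page(home, read_counts, written, hot_threshold=64):
    """Toy page-placement policy (hypothetical; thresholds are made up).

    home        -- node that currently holds the page
    read_counts -- node -> number of reads the hardware counted
    written     -- True if the page is also being written (assumption)
    Returns (action, nodes): action is "stay", "migrate" or "replicate".
    """
    # Nodes that touch the page often enough to matter.
    readers = [n for n, c in read_counts.items() if c >= hot_threshold]
    remote = [n for n in readers if n != home]
    if not remote:
        return ("stay", [home])           # only local (or cold) accesses
    if len(readers) == 1:
        return ("migrate", remote)        # one hot node, and it is remote
    if not written:
        return ("replicate", sorted(remote))  # several readers, read-only
    return ("stay", [home])               # write-shared: copies would diverge


print(place_page(0, {1: 100}, written=False))
# ('migrate', [1])
print(place_page(0, {0: 80, 1: 90, 2: 10}, written=False))
# ('replicate', [1])
```

Because the guest only sees its flat address space, either
action just changes which machine page backs a given guest
page.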
Cem