From: Brian Milnes (brianmilnes_at_qwest.net)
Date: Wed Jan 21 2004 - 15:06:36 PST
Improving the Reliability of Commodity Operating Systems - Swift et al
The Nooks system is designed to catch the majority of device driver errors
and allow the operating system to recover from them. It does this by placing
the device driver in a lightweight protection domain and tracking its
resources for cleanup. The authors report that 85% of Windows XP's problems
are with its device drivers. They show that seven of the eight extensions
they protected with Nooks required no changes and that one required thirteen
lines of changes.
Their philosophy is to design for fault resistance and errors, not fault
tolerance and abuse. Their approach lies between crashing the system on a
bug to prevent data corruption and building a virtual machine to encapsulate
an extension. They isolate the extension, recover when it fails, and keep
this compatible with existing drivers and extensions. They implement this in
an isolation manager that sits between the kernel and the extension. The
extension has limited write access to memory and communicates with the
kernel using an Extension RPC. An interposition manager tracks the calls to
the extension, and the data is tracked by an object manager. Bad calls to the
kernel and hardware errors are trapped and the recovery manager is called.
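The division of labor described above can be illustrated with a toy model.
This is only a sketch of the idea, not the Nooks implementation: the names
IsolationManager, FlakyDriver, track and call are hypothetical, a Python
exception stands in for a trapped fault, and nothing here models the memory
protection that the real system gets from the page table.

```python
class ExtensionFault(Exception):
    """Stands in for a fault trapped inside the extension (assumption)."""


class FlakyDriver:
    """Hypothetical extension that fails when given a bad parameter."""

    def read(self, nbytes):
        if nbytes < 0:
            raise ExtensionFault("bad parameter")
        return b"\x00" * nbytes


class IsolationManager:
    """Toy isolation manager sitting between callers and one extension."""

    def __init__(self, extension_factory):
        self._factory = extension_factory   # how to (re)load the extension
        self._extension = extension_factory()
        self.object_table = set()           # resources the extension holds
        self.recoveries = 0

    def track(self, resource):
        """Object manager role: record a resource given to the extension."""
        self.object_table.add(resource)

    def call(self, name, *args):
        """Interposition role: every call to the extension passes through."""
        try:
            return getattr(self._extension, name)(*args)
        except Exception:
            self._recover()
            return None                     # the caller sees a failed call

    def _recover(self):
        """Recovery manager role: free tracked resources, reload extension."""
        self.object_table.clear()
        self._extension = self._factory()
        self.recoveries += 1


manager = IsolationManager(FlakyDriver)
manager.track("io_buffer_0")
ok = manager.call("read", 4)        # normal call goes through
failed = manager.call("read", -1)   # fault is trapped, extension reloaded
after = manager.call("read", 2)     # the reloaded extension works again
```

The point of the sketch is that the caller survives the bad call: the fault
is contained at the wrapper, the tracked resources are released, and a fresh
copy of the extension serves later calls.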
The system required 22KLOC for Linux and only 18 months of development. The
basic interfaces are wrapped and memory protection is done using the page
table for portability. The extensions run in a sort of mini-process with
stacks, a heap, I/O buffers and an object table. The number of new wrappers
written for each additional extension declined, suggesting that it would be
easy to wrap new extensions.
They tested this using fault injection that inserted bad instructions to
generate faults such as a bad parameter. The extensions were mostly tested
under VMware to make the system easy to restart in a clean state. They
generated 400 faults, 317 of which crashed the OS. Nooks detected and
recovered from 313 of these but not from the four deadlocks.
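Taking the numbers in this summary at face value, the rates they imply can
be worked out directly. The variable names below are my own, and the
reading that the 317 crashes are the ones seen without Nooks is an
assumption based on the wording above.

```python
# Figures as quoted in the review above.
injected_faults = 400
crashes = 317        # faults that crashed the OS (assumed: without Nooks)
recovered = 313      # crashes that Nooks detected and recovered from

# Crashes Nooks did not recover from (the deadlocks mentioned above).
unrecovered = crashes - recovered

# Fraction of injected faults that led to a crash at all.
crash_rate = crashes / injected_faults

# Fraction of crashing faults that Nooks recovered from.
recovery_rate = recovered / crashes

print(f"unrecovered crashes: {unrecovered}")
print(f"crash rate: {crash_rate:.4f}")
print(f"recovery rate: {recovery_rate:.4f}")
```

So roughly four in five injected faults were severe enough to crash the
system, and Nooks recovered from nearly all of those.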
The performance on small devices, such as the soundcard tested, was
imperceptibly slower. The Ethernet device experienced 10% slowdown on a read
and 14% more CPU usage, which is significant. Performance on a highly
optimized, CPU-intensive kernel HTTP benchmark dropped by more than half,
which presents a practical problem for its use.
For the devices in my XP notebook, I'd love to have them all wrapped. The
vast majority of users' problems are reliability and nothing more. For
devices in my server system, I'd wrap everything that did not impact my
performance. It would be nice if the wrapper could be used to help find the
fault for later analysis without disturbing a production machine with the
cost of a crash dump. A very nice idea indeed. Has it been implemented in
industry?