Swift Paper

From: Brian Milnes (brianmilnes_at_qwest.net)
Date: Wed Jan 21 2004 - 15:06:36 PST

  • Next message: Chuck Reeves: "Improving the Reliability of Commodity Operating Systems Review"

    Improving the Reliability of Commodity Operating Systems - Swift et al

    The Nooks system is designed to catch the majority of device driver errors
    and allow the operating system to recover from them. It does this by placing
    the device driver in a lightweight protection domain and tracking its
    resources for cleanup. The authors report that 85% of recently reported
    Windows XP failures are caused by device drivers. Of the eight extensions
    they protected with Nooks, seven required no changes and one required
    thirteen lines of changes.

    Their philosophy is to design for fault resistance and errors, not fault
    tolerance and abuse. The approach lies between crashing the system on a bug
    to prevent data corruption and building a virtual machine to encapsulate an
    extension. They isolate the extension, recover when it fails, and do both
    while running existing drivers and extensions. They implement this in an
    isolation manager that sits between the kernel and the extension. The
    extension has limited write access to kernel memory and communicates with
    the kernel using an Extension Procedure Call (XPC). An interposition manager
    intercepts the calls to and from the extension, and an object manager tracks
    the data those calls pass. Bad calls into the kernel and hardware errors are
    trapped, and the recovery manager is called.
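
    The division of labor described above (interpose on each call, track what
    the extension holds, clean up when it faults) can be sketched as a toy
    model. This is not the Nooks code, which lives inside the Linux kernel and
    relies on hardware memory protection; the names IsolationManager,
    ObjectTracker and call_extension are hypothetical, and a Python exception
    stands in for a trapped fault.

```python
class ObjectTracker:
    """Hypothetical stand-in for the object manager: remembers what the
    extension holds so it can be released after a failure."""

    def __init__(self):
        self.live = {}  # handle -> callback that releases the resource

    def track(self, handle, release):
        self.live[handle] = release

    def release_all(self):
        released = []
        for handle, release in list(self.live.items()):
            release()
            released.append(handle)
        self.live.clear()
        return released


class IsolationManager:
    """Hypothetical stand-in for the layer between kernel and extension."""

    def __init__(self):
        self.tracker = ObjectTracker()
        self.recoveries = 0

    def call_extension(self, func, *args):
        # Interposition: every call into the extension passes through here.
        try:
            return (True, func(self.tracker, *args))
        except Exception as exc:  # assumption: an exception models a trapped fault
            self.recover()
            return (False, repr(exc))

    def recover(self):
        # Recovery: release everything the failed extension still held.
        self.recoveries += 1
        return self.tracker.release_all()


freed = []


def good_driver(tracker, size):
    return size * 2


def buggy_driver(tracker, size):
    tracker.track("buf0", lambda: freed.append("buf0"))
    return 1 // 0  # simulated driver bug


mgr = IsolationManager()
ok_result = mgr.call_extension(good_driver, 21)
bad_result = mgr.call_extension(buggy_driver, 64)
```

    The caller sees an error result from the buggy driver instead of going
    down with it, and the buffer the driver had acquired is released by the
    recovery step.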

    The system required 22KLOC for Linux and only 18 months of development. The
    basic interfaces are wrapped and memory protection is done using the page
    table for portability. The extensions run in a sort of mini-process with
    stacks, a heap, I/O buffers and an object table. The number of new wrappers
    that had to be written for each additional extension declined, suggesting
    that it would be easy to wrap new extensions.

    They tested this using fault injection that inserted bad instructions to
    generate faults such as a bad parameter. The extensions were mostly tested
    under VMware to make the system easy to restart in a clean state. They
    generated 400 faults, 317 of which crashed the OS. Nooks detected and
    recovered from 313 of these, but not from the remaining four, which were
    deadlocks.
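
    Taking the counts in the paragraph above at face value, the rates they
    imply can be worked out directly. The variable names are only for
    illustration.

```python
injected = 400        # faults generated
native_crashes = 317  # of those, how many crashed the unprotected OS
recovered = 313       # crashes that Nooks detected and recovered from

unrecovered = native_crashes - recovered
crash_rate = native_crashes / injected
recovery_rate = recovered / native_crashes

print(f"unrecovered crashes: {unrecovered}")
print(f"crash rate without Nooks: {crash_rate:.1%}")
print(f"recovery rate with Nooks: {recovery_rate:.1%}")
```

    So roughly four out of five injected faults crashed the unprotected
    system, and Nooks recovered from all but about 1% of those crashes.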

    The performance on small devices, such as the soundcard tested, was
    imperceptibly slower. The Ethernet device experienced 10% slowdown on a read
    and 14% more CPU usage, which is significant. Performance on a highly
    optimized, CPU-intensive kernel HTTP benchmark fell by more than half,
    which presents a practical problem for its use.

    For the devices in my XP notebook, I'd love to have them all wrapped. The
    vast majority of users' problems are reliability problems and nothing more. For
    devices in my server system, I'd wrap everything that did not impact my
    performance. It would be nice if the wrapper could be used to help find the
    fault for later analysis without disturbing a production machine with the
    cost of a crash dump. A very nice idea indeed; has it been implemented in
    industry?



    This archive was generated by hypermail 2.1.6 : Wed Jan 21 2004 - 15:06:42 PST