Review: improving the reliability of commodity OSes

From: Cem Paya 98 (Cem.Paya.98_at_Alum.Dartmouth.ORG)
Date: Wed Jan 21 2004 - 14:35:28 PST

  • Next message: Raz Mathias: "Review: Improving the Reliability of Commodity Operating Systems"

    Review: Improving the reliability of commodity operating
    systems
    Cem Paya, CSE 551P

    In this paper the authors describe a general mechanism for
    modifying kernels to provide fault tolerance for
    extensions. The most common type of extension is the device
    driver, and drivers are responsible for the majority of
    crashes in commodity operating systems. (The figure quoted
    in the paper is 85% for Windows XP.) In spite of all the
    certification programs and admonishments to developers
    about writing kernel-mode code, OS designers have little
    control over the quality of device drivers and remain at
    the mercy of hardware vendors. Gracefully recovering from
    buggy driver code would be a huge reliability improvement.

    Nooks is the design proposed in the paper. The authors take
    a pragmatic stand, emphasizing backwards compatibility and
    minimal impact on existing kernel extensions. This is
    crucial for making Nooks relevant to the “commodity
    operating systems” targeted in the paper, such as Windows,
    Linux and MacOS, where the large installed base makes it
    impractical to rewrite all device drivers in an entirely
    new framework. Nooks requires no type-safe languages and
    no new kernel-mode APIs for interfacing to drivers.
    Extensions are authored in C as before, the language of
    choice for commercial systems development. Instead, Nooks
    relies on memory protection to prevent faulty extensions
    from accidentally tampering with kernel state. The kernel
    itself has read/write access to an extension’s address
    space, but extensions can only read kernel data
    structures. A new procedure call mechanism dubbed XPC is
    introduced to handle control transfer between the
    extension and the kernel in either direction. Finally, the
    kernel tracks object use by extensions, a kernel-mode
    analog of a conservative garbage collector. The same
    pragmatic approach applies to the threat model: Nooks
    focuses on catching unintended errors instead of taking up
    the futile battle against malicious Trojan code executing
    in kernel mode. Swift et al. readily acknowledge that
    there are ways a driver can intentionally corrupt kernel
    state even with Nooks in place.
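
    The three mechanisms above (read-only access to kernel
    data, XPC, and object tracking) can be illustrated with a
    small user-space toy. This is only a sketch of the idea,
    not the Nooks implementation: the names Kernel, xpc_call,
    alloc and release_all are hypothetical, and Python's
    MappingProxyType stands in for hardware page protection.

```python
from types import MappingProxyType


class ExtensionFault(Exception):
    """Raised when a call into an extension fails (hypothetical name)."""


class Kernel:
    """Toy stand-in for the kernel side of a Nooks-style boundary."""

    def __init__(self):
        self._state = {"jiffies": 0}  # kernel data, writable only by the kernel
        self.tracked = {}             # extension name -> ids of objects it holds
        self._next_id = 0

    def read_only_view(self):
        # Extensions may read kernel data but cannot write through this view.
        return MappingProxyType(self._state)

    def alloc(self, ext_name):
        # Kernel service used by extensions; every allocation is recorded.
        self._next_id += 1
        self.tracked.setdefault(ext_name, set()).add(self._next_id)
        return self._next_id

    def xpc_call(self, ext_name, func, *args):
        # Kernel -> extension control transfer: any failure inside the
        # extension is contained and reported as an ExtensionFault.
        try:
            return func(self, *args)
        except Exception as exc:
            raise ExtensionFault(f"{ext_name}: {exc!r}") from exc

    def release_all(self, ext_name):
        # After a fault, reclaim everything the extension still holds.
        return self.tracked.pop(ext_name, set())


def good_driver(kernel, n):
    return [kernel.alloc("good") for _ in range(n)]


def buggy_driver(kernel):
    kernel.alloc("buggy")
    view = kernel.read_only_view()
    view["jiffies"] = 99  # accidental write to kernel data: raises TypeError


if __name__ == "__main__":
    k = Kernel()
    print(k.xpc_call("good", good_driver, 2))
    try:
        k.xpc_call("buggy", buggy_driver)
    except ExtensionFault as fault:
        print("contained:", fault)
        print("reclaimed:", k.release_all("buggy"))
```

    The stray write fails at the boundary instead of silently
    corrupting kernel data, and the tracking table tells the
    kernel what to reclaim. As in the paper, nothing here
    stops deliberately malicious code.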

    A very innovative idea introduced in Nooks is the notion
    of recovery. Defense in depth makes sense: instead of
    betting on making the extensions 100% reliable (which the
    authors admit is not achieved here, since there are edge
    cases where some bugs are not caught, including deadlocks
    that can freeze the system), there is a mechanism in place
    to recover from the inevitable. For device drivers
    especially, this is relatively easy: unloading and
    reloading the driver will often bring the hardware to a
    clean state. In other cases, such as the VFAT file system
    included in the test set, it is trickier: when the
    extension is stateful with persistent storage, crashes can
    lead to permanent data corruption.
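
    The unload-and-reload recovery described above can be
    sketched as follows. This is a minimal illustration under
    stated assumptions, not the recovery code from the paper:
    FlakyDriver and RecoveryManager are hypothetical names,
    the failure on every third call is injected purely for the
    demonstration, and retrying the failed request after the
    reload is a choice made for this sketch.

```python
class FlakyDriver:
    """Hypothetical driver whose in-memory state goes bad after a few calls."""

    def __init__(self):
        self.calls = 0  # fresh state on every (re)load

    def handle(self, request):
        self.calls += 1
        if self.calls == 3:  # injected bug: third call on an instance fails
            raise RuntimeError("driver state corrupted")
        return f"ok:{request}"


class RecoveryManager:
    """Reloads a failed driver and retries the request once."""

    def __init__(self, factory):
        self._factory = factory
        self._driver = factory()
        self.restarts = 0

    def call(self, request):
        try:
            return self._driver.handle(request)
        except Exception:
            # "Unload" the failed instance, "load" a fresh one, retry once.
            self._driver = self._factory()
            self.restarts += 1
            return self._driver.handle(request)


if __name__ == "__main__":
    mgr = RecoveryManager(FlakyDriver)
    print([mgr.call(i) for i in range(5)])
    print("restarts:", mgr.restarts)
```

    This only works because the toy driver keeps all of its
    state in memory. An extension with persistent state, like
    the VFAT example, cannot be made safe by a simple reload.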

    Nooks was implemented on Linux, which interestingly enough
    appears to be a worst-case scenario because of its lack of
    abstraction around kernel data structures and the large
    number of entry points between kernel and extensions that
    must be wrapped. Despite the complication, an experimental
    setup involving random fault injection into 8 different
    extensions showed that Nooks could reduce the number of
    system crashes by an impressive 99%. Performance
    degradation varied from none to significant (over 2x
    slowdown on a kernel-mode HTTP server). The authors point
    out that the trade-off is quite acceptable for
    low-bandwidth devices, but this is questionable. Mouse and
    keyboard drivers are relatively simple and
    straightforward. It is high-bandwidth extensions such as
    network card or graphics drivers, which try to move large
    amounts of data asynchronously, that have complicated,
    error-prone code, and this is where Nooks can impose
    considerable delay due to XPC. Some of this is due to the
    x86 architecture not coping well with TLB flushes on
    domain switches, but any kernel-mode extension that is a
    performance bottleneck is likely to have performance
    issues under Nooks. Unfortunately, that is also where
    Nooks would be most likely to do good, since such
    extensions have many hacks to speed up operations.


