Review: Swift, et al. Improving the Reliability of Commodity Operating Systems.

From: Richard Jackson (richja_at_expedia.com)
Date: Wed Jan 21 2004 - 16:07:52 PST

    This 2003 paper by Swift, Bershad and Levy describes a novel method for
    preventing device drivers from causing system failure. The system is
    called Nooks, and is implemented within the Linux operating system.
    While similar methods have been described in other papers, the method
    presented here uses backwards compatibility as a key design constraint.
    Therefore, this system is extremely interesting as a practical solution
    to current problems.
     
    The authors take an interesting approach to the huge problem of buggy
    device drivers. Instead of fixing the bad drivers, they add a software
    layer that handles driver failures gracefully and can restart a driver
    after it fails. Overall, this concept seems much more likely to succeed
    than just asking all the device driver developers to fix their buggy
    code.
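
    As a rough illustration of the contain-and-restart idea, here is a
    toy user-level sketch in Python. It is only an analogy, not how Nooks
    is implemented: the names Supervisor, FlakyDriver and DriverFault are
    made up for this example, and a Python exception stands in for a
    driver fault.

```python
class DriverFault(Exception):
    """Stand-in for a fault raised inside a driver (hypothetical)."""


class FlakyDriver:
    """Hypothetical driver that fails on its third request after each start."""

    def __init__(self):
        self.requests_seen = 0

    def handle(self, request):
        self.requests_seen += 1
        if self.requests_seen == 3:
            raise DriverFault("simulated bug")
        return f"ok:{request}"


class Supervisor:
    """Contains driver faults and restarts the driver instead of crashing."""

    def __init__(self, driver_factory):
        self.driver_factory = driver_factory
        self.driver = driver_factory()
        self.restarts = 0

    def call(self, request):
        try:
            return self.driver.handle(request)
        except DriverFault:
            # Recovery: discard the failed instance and start a fresh one.
            self.driver = self.driver_factory()
            self.restarts += 1
            return None  # the failed request is reported as lost, not retried


supervisor = Supervisor(FlakyDriver)
results = [supervisor.call(i) for i in range(5)]
print(results, supervisor.restarts)
```

    The caller loses the one request that hit the bug, but later
    requests are served by the restarted driver and the caller itself
    keeps running.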
     
    In general, this paper describes a method for protecting the kernel's
    operation from device drivers that may contain bugs or even malicious
    code. While device drivers are the main source of this problem, the
    paper observes that the same methodology can be used for other software,
    such as web servers or file systems.
     
    The paper introduces the Nooks Isolation Manager (NIM), a software
    component that isolates the kernel from kernel extensions. Extensions
    (such as device drivers) no longer call the kernel directly; they
    communicate with it via the NIM abstraction. The mechanism for these
    calls is the Extension Procedure Call (XPC), which the authors
    differentiate somewhat from LRPC or PPC.
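
    To make the call path concrete, here is a small Python sketch of a
    mediator that sits between a "kernel" and an "extension" and counts
    every crossing. All names here (IsolationManager, xpc_to_kernel,
    xpc_to_extension, kernel_alloc, driver_send) are hypothetical and
    chosen for illustration; the real XPC switches protection domains
    inside the kernel, which this sketch does not attempt to model.

```python
class IsolationManager:
    """Hypothetical mediator: every kernel/extension call crosses through here."""

    def __init__(self):
        self.kernel_services = {}
        self.extension_entry_points = {}
        self.crossings = 0  # number of mediated calls in either direction

    def export_kernel_service(self, name, fn):
        self.kernel_services[name] = fn

    def register_extension(self, name, fn):
        self.extension_entry_points[name] = fn

    def xpc_to_extension(self, name, *args):
        # Kernel -> extension call; the extension gets a handle to the manager.
        self.crossings += 1
        return self.extension_entry_points[name](self, *args)

    def xpc_to_kernel(self, name, *args):
        # Extension -> kernel call, routed through the manager.
        self.crossings += 1
        return self.kernel_services[name](*args)


def kernel_alloc(size):
    """Toy kernel service: hand out a zeroed buffer."""
    return bytearray(size)


def driver_send(nim, payload):
    """Toy driver entry point: asks the kernel for a buffer, then fills it."""
    buf = nim.xpc_to_kernel("alloc", len(payload))
    buf[:] = payload
    return len(buf)


nim = IsolationManager()
nim.export_kernel_service("alloc", kernel_alloc)
nim.register_extension("send", driver_send)
sent = nim.xpc_to_extension("send", b"hello")
print(sent, nim.crossings)
```

    One request from the kernel side results in two mediated crossings
    here, which hints at why workloads that cross the boundary often
    pay more for the isolation.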
     
    A large section of the paper covers the various implementation details.
    The primary goal of the implementation is to integrate transparently
    with the system, such that user applications do not need to change.
    This section also includes a discussion of the weaknesses of Nooks. One
    of these seemed especially bad: a malicious extension could deliberately
    corrupt system state, because extensions still run in kernel mode. Here,
    the authors argue that the system isn't perfect, but the reliability
    gains are still significant.
     
    A discussion of reliability shows that Nooks solves the main problem:
    99% of faults were handled without causing a system crash. A large
    number of non-fatal failures were also handled, and the authors suggest
    the creation of a 'nanny' service to further improve the handling of
    non-fatal errors.
     
    Regarding performance, my first impression was that performance must be
    much worse than with a traditional device driver scheme. It turns out
    that performance is generally acceptable, but is application-dependent.
    Applications that make many XPC calls may have very poor performance, up
    to 60% worse. Other applications will fare much better, with a
    performance loss of 10% or less. The paper presents performance analysis
    for various components across the spectrum of possible uses of Nooks.
     
    Overall, I found this paper to be refreshing. I think this design would
    work very well to prevent many of the errors that happen today. The
    only major concern I have is that Nooks could cause developers of device
    drivers to become lazy and create even buggier code.
     

