Review: Swift, Bershad & Levy, Improving the Reliability of Commodity Operating Systems

From: Ian King
Date: Wed Jan 21 2004 - 09:36:51 PST

  • Next message: Tarik Nesh-Nash: ""Improving the Reliability of Commodity Operating Systems" Review"

    This paper describes Nooks, an infrastructure to enhance the stability of an
    operating system by protecting the OS core from "extensions", such as device
    drivers and file systems. Nooks is designed to supplement an existing operating
    system and extensions with minimal or no modification. For this discussion,
    Nooks is implemented on the Linux operating system.

    The authors cite a well-known problem in the software industry: extensible
    operating systems are rendered vulnerable primarily through their extensions.
    For instance, the paper claims that 85% of system failures in the Windows
    operating system are caused by defective device drivers; this matches my
    understanding and experience. The authors observe that writers of device
    drivers are not typically as skilled in the programming art as those who write
    operating systems; one inference is that adding complexity to the task of
    writing device drivers is not likely to improve the resulting quality. Further,
    while some have advocated the use of type-safe languages as the solution (and
    shown it can be effective), C is the most commonly used programming language for
    system-level components; Nooks is targeted at this environment.

    Nooks sets itself modest goals: rather than "making everything perfect
    everywhere", the goal of Nooks is to provide "better" reliability. This is a
    distinct departure from many papers, which seek to attain an ideal goal, and
    identify how the resulting work product falls short of that goal. The Nooks
    approach is intended to provide protection against simple mistakes, and
    makes no effort to guard against intentional acts. This paper is quite pragmatic
    in its goals and its approach.

    An important element of the tool set employed is the concept of protection
    domains, and the implementation is similar to that of Opal; however, the
    intentions are more modest, and the feature set is likewise less comprehensive.
    Rather than expressing rich sharing semantics through protection domains (as in
    Opal), Nooks uses the principle primarily to restrain extensions from abusing
    their privileged position in kernel space.

    Nooks also employs protective entry points, which are added to the kernel and
    the extensions by means of "wrappers." At its simplest, this means that an
    extension wishing to communicate with the kernel instead calls a wrapper
    function; the wrapper validates the call and passes it to the kernel. This
    mechanism is used in both directions of communication. By validating
    parameters, such common errors as passing null pointers or dangling references
    are caught before they can lead to serious damage to kernel structures.

    Nooks policy can be decided on a per-extension basis: some extensions
    might be safely restarted after a failure (for instance, a serial port
    driver), while others may require further action before they can be safely
    used (imagine the situation where a driver defect has corrupted a
    filesystem). One of the benefits of Nooks' wrapper structure is that the
    wrapper entry points also provide an opportunity to track all resources
    passed across that boundary; upon failure of a driver, the Nooks
    infrastructure can use that data to clean up resources that might otherwise
    be used in a corrupted state or lost.
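
    The resource tracking and per-extension policy described above might be
    modeled roughly as follows. This is a sketch under assumptions, not the
    paper's design: the ResourceTracker class, the POLICY table, and the
    "restart"/"hold" action names are all hypothetical.

```python
from collections import defaultdict


class ResourceTracker:
    """Records which resources each extension currently holds."""

    def __init__(self):
        self._held = defaultdict(set)   # extension name -> resource ids

    def on_acquire(self, ext, resource):
        """Called by a wrapper when a resource crosses to the extension."""
        self._held[ext].add(resource)

    def on_release(self, ext, resource):
        """Called by a wrapper when the extension gives a resource back."""
        self._held[ext].discard(resource)

    def reclaim_all(self, ext):
        """On failure, return (and forget) everything the extension held."""
        return sorted(self._held.pop(ext, set()))


# Hypothetical per-extension recovery policy.
POLICY = {"serial": "restart", "filesystem": "hold"}


def handle_failure(tracker, ext):
    """Clean up after a failed extension and report the next action."""
    reclaimed = tracker.reclaim_all(ext)
    action = POLICY.get(ext, "hold")    # unknown extensions get the cautious action
    return reclaimed, action
```

    Because every acquire and release passes through a wrapper, the tracker
    knows exactly what a failed extension still held, and the policy table
    decides whether that extension is restarted or held for further action.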

    The authors point out that the intent of this protection is to allow the kernel
    to remain functional; there is really nothing inherent in the Nooks strategy
    that protects the objects accessed through the extensions, such as filesystems
    or network streams. However, by allowing the kernel to maintain its integrity,
    the opportunity exists for the kernel to recover the system in many cases.
    (Note that this strategy is also used in certain existing operating systems such
    as QNX, which is a microkernel architecture.)

    The test strategy was discussed in considerable detail (a pleasant surprise)
    and was quite clever. Injecting deterministic, pseudo-random faults into the
    extensions under test allowed for considerable variety in failure modes.
    Recognizing that there are in fact commonly observed classes of defects,
    the authors also introduced examples of such defects outside the automated
    environment. The results were significant, with a large number of crashes being
    avoided. There is still some question as to whether the system was left in a
    truly stable state after the recovery, but elsewhere the paper discusses this
    problem and states that Nooks as implemented cannot fully solve it.
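
    The value of making the random choices deterministic is that any failure
    can be replayed exactly. A minimal sketch of the idea, assuming a seeded
    generator and a list of strings standing in for the code under test (the
    function names and the "FAULT" marker are invented here and do not reflect
    the paper's actual injector):

```python
import random


def pick_fault_sites(num_instructions, num_faults, seed):
    """Choose which instruction indices to corrupt, reproducibly.

    The same seed always yields the same sites, so a crash seen in one
    run can be reproduced in the next.
    """
    rng = random.Random(seed)           # private generator with a fixed seed
    return sorted(rng.sample(range(num_instructions), num_faults))


def inject(code, sites):
    """Return a copy of `code` with the chosen entries replaced by a marker."""
    faulty = list(code)
    for i in sites:
        faulty[i] = "FAULT"
    return faulty
```

    Calling pick_fault_sites twice with the same seed returns the same list of
    sites, while varying the seed gives the variety of failure modes.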

    One might be concerned that all this isolation and validation would impact
    performance, and the authors set out to determine that impact with real-world
    scenarios. There was always some degradation of performance, from negligible to
    significant, with a 60% decrease in throughput for a Web server application
    being the most severe. While this scenario was probably the most contrived (a
    Web server IN the kernel?), it also demonstrated that Nooks could not make a bad
    performance scenario better; the CPU utilization in this case was already very
    high, and there is nothing inherent in Nooks that addresses performance.

    On the positive side, Nooks definitely accomplished its goal of minimal impact
    to the existing code base. Nearly all drivers were run as written, and the
    modifications to the kernel were small and well-bounded. Moreover, for those
    exceptionally inefficient scenarios, there is no reason to run them within the
    Nooks protection scheme; unprotected extensions can coexist with Nooks-wrapped
    extensions, with the Nooks modifications having no effect on the unprotected
    extensions.

    Nooks presents a set of strategies to reduce the impact of defective device and
    service drivers on an operating system. As with most protection or security
    strategies, there is a tradeoff between safety and performance.

    This archive was generated by hypermail 2.1.6 : Wed Jan 21 2004 - 09:57:13 PST