Improving the Reliability of Commodity Operating Systems

From: Manish Mittal (manishm_at_microsoft.com)
Date: Tue Jan 20 2004 - 17:17:53 PST

  • Next message: Song Xue: "Review of "Improving the Reliability of Commodity Operating Systems""

    Kernel extensions such as device drivers are a major cause of crashes
    in commodity operating systems. This paper describes Nooks, an OS
    subsystem that improves OS reliability by isolating the OS from driver
    failures. It allows existing OS extensions such as device drivers and
    loadable file systems to execute safely in commodity kernels. Nooks'
    goal is to prevent the vast majority of driver-caused crashes with
    little or no change to existing extensions and kernel code.

    To achieve this, Nooks executes such extensions in lightweight kernel
    protection domains so that the extensions cannot corrupt vital kernel
    data structures. These domains have the same processor privilege as the
    kernel but have their write access limited to a portion of the kernel's
    address space. Nooks also facilitates automatic recovery from extension
    failures.
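
    The write restriction can be pictured with a toy model. The sketch
    below is not Nooks code; the Domain class, the ProtectionFault
    exception and the address ranges are hypothetical, and it assumes
    that reads are unrestricted while writes are checked against the
    domain's own region.

```python
class ProtectionFault(Exception):
    """Raised when a domain writes outside its own writable region."""


class Domain:
    """Toy protection domain over a shared list standing in for kernel memory."""

    def __init__(self, name, memory, writable):
        self.name = name
        self.memory = memory      # shared "kernel address space"
        self.writable = writable  # range of addresses this domain may write

    def read(self, addr):
        # Assumption: reads are permitted anywhere in the shared space.
        return self.memory[addr]

    def write(self, addr, value):
        # Writes are checked against the domain's writable region.
        if addr not in self.writable:
            raise ProtectionFault(f"{self.name}: write to {addr} denied")
        self.memory[addr] = value


memory = [0] * 16
driver = Domain("driver", memory, writable=range(8, 12))
driver.write(9, 42)          # inside the driver's region: allowed
try:
    driver.write(2, 99)      # outside the region (kernel data): rejected
except ProtectionFault as exc:
    print(exc)
```

    In a real kernel this check would be enforced by the hardware page
    tables rather than by a software test on every write.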

     

    Nooks is implemented inside the Linux 2.4.18 kernel on the Intel x86
    architecture. It is designed for fault resistance rather than full
    fault tolerance, i.e. the system must recover from most, but not
    necessarily all, faults. It is also designed for mistakes, not abuse:
    it guards against buggy extensions, not malicious ones. These
    principles are realized by a new operating system reliability layer
    that is inserted between the extensions and the OS kernel. The
    reliability layer intercepts all interactions between the extensions
    and the kernel to facilitate isolation and recovery. This layer is
    called the NIM (Nooks Isolation Manager). The NIM provides four major
    architectural functions: Isolation, Interposition, Object Tracking and
    Recovery.
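
    Interposition and object tracking can be illustrated with a small
    Python model. This is only a sketch of the idea, not the paper's
    implementation: the Kernel and ReliabilityLayer classes and their
    alloc, free and release_all methods are hypothetical names. The
    extension calls the layer instead of the kernel, and the layer
    records every kernel object the extension holds so that it can be
    released if the extension fails.

```python
class Kernel:
    """Toy kernel that hands out integer handles for allocated objects."""

    def __init__(self):
        self.next_handle = 1
        self.allocated = set()

    def alloc(self):
        handle = self.next_handle
        self.next_handle += 1
        self.allocated.add(handle)
        return handle

    def free(self, handle):
        self.allocated.remove(handle)


class ReliabilityLayer:
    """Sits between one extension and the kernel and tracks its objects."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.tracked = set()   # handles currently held by the extension

    def alloc(self):
        # Interposition: forward the call, then record the result.
        handle = self.kernel.alloc()
        self.tracked.add(handle)
        return handle

    def free(self, handle):
        self.tracked.discard(handle)
        self.kernel.free(handle)

    def release_all(self):
        # Used after a failure: give back everything the extension held.
        for handle in sorted(self.tracked):
            self.kernel.free(handle)
        self.tracked.clear()


kernel = Kernel()
layer = ReliabilityLayer(kernel)
first = layer.alloc()
second = layer.alloc()
layer.free(first)
layer.release_all()   # the extension "crashed" while still holding `second`
```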

    The main novelty of this paper is recovery, which most existing
    systems for reliability and isolation do not provide. Sometimes Nooks
    has to use application-specific recovery services to make the state
    consistent after recovery. For example, if a file system crashes and
    Nooks simply restarts it, the file system will be left in an
    inconsistent state. Nooks therefore flushes the contents of the cache
    before restarting the file system to reduce the number of
    inconsistencies.
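
    The flush-before-restart step can be sketched as a recovery routine
    that accepts an optional hook. The names CachedFileSystem,
    flush_cache and recover are hypothetical, and a dict stands in for
    the disk; the sketch only shows why running the hook before the
    restart preserves cached writes.

```python
class CachedFileSystem:
    """Toy file system: writes go to a cache until they are flushed."""

    def __init__(self, disk):
        self.disk = disk    # dict standing in for persistent storage
        self.cache = {}     # dirty blocks not yet written to disk

    def write(self, block, data):
        self.cache[block] = data


def flush_cache(fs):
    # Recovery hook: push dirty blocks to disk.
    fs.disk.update(fs.cache)
    fs.cache.clear()


def recover(fs, pre_restart=None):
    # Run the optional hook, then "restart" by creating a fresh instance
    # over the same disk. Without the hook the cached writes are lost.
    if pre_restart is not None:
        pre_restart(fs)
    return CachedFileSystem(fs.disk)


disk = {}
fs = CachedFileSystem(disk)
fs.write("block7", "payload")
fs = recover(fs, pre_restart=flush_cache)
```

    After the call, the write to "block7" is on the disk and the
    restarted instance starts with an empty cache.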

    The paper also describes the tests that were carried out on the
    system. These tests show that Nooks recovered from 99% of the faults
    that caused native Linux to crash. Some of the testing techniques,
    such as fault injection, are noteworthy. The synthetic fault injection
    techniques are not described in detail, and it would be interesting to
    know how these faults are injected; some of the manual fault
    injections, by contrast, are described in great detail. The idea of
    using a VMware virtual machine to perform thousands of tests remotely
    is also very appealing, since it makes it quick and easy to return the
    system to a clean state following each set of tests. Overall, the test
    and performance statistics presented in this paper are sufficient to
    demonstrate the reliability of this system.
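
    The clean-state-per-test idea can be sketched as a small harness.
    This does not reproduce the paper's fault injector, which is not
    described here in detail: run_trials and toy_workload are
    hypothetical, and copy.deepcopy merely stands in for reverting a
    virtual machine to a snapshot.

```python
import copy


def run_trials(clean_state, faults, run_workload):
    """Run one trial per fault, each starting from a fresh copy of clean_state."""
    outcomes = {}
    for fault in faults:
        state = copy.deepcopy(clean_state)  # stand-in for a snapshot revert
        outcomes[fault] = run_workload(state, fault)
    return outcomes


def toy_workload(state, fault):
    # Hypothetical workload: the fault corrupts the trial's private state,
    # and the outcome label is chosen arbitrarily for illustration.
    state["counters"][fault] = -1
    return "crash" if fault == "bad_pointer" else "recovered"


clean = {"counters": {}}
results = run_trials(clean, ["bad_pointer", "off_by_one"], toy_workload)
```

    Because every trial works on its own copy, corruption caused by one
    injected fault cannot leak into the next trial.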

     


