From: Manish Mittal (manishm_at_microsoft.com)
Date: Tue Jan 20 2004 - 17:17:53 PST
Kernel extensions such as device drivers are a major cause of crashes
in commodity operating systems. This paper describes Nooks, an OS
subsystem that improves OS reliability by isolating the OS from driver
failures. It allows existing OS extensions such as device drivers and
loadable file systems to execute safely in commodity kernels. Nooks' goal
is to prevent the vast majority of driver-caused crashes with little or
no change to existing driver extensions and kernel code.
To achieve this, Nooks executes such extensions in lightweight kernel
protection domains so that the extensions cannot corrupt vital kernel
data structures. These domains have the same processor privilege as the
kernel but have their write access limited to a portion of the kernel's
address space. Nooks also facilitates automatic recovery from extension
failures.
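The idea of a lightweight protection domain can be illustrated with a
small user-space model. This is only a sketch under stated assumptions,
not Nooks code: the names Memory, Domain and ProtectionFault are
hypothetical, and a Python range stands in for the page-table
permissions that a real kernel would use. Every domain can read the
whole address space, but a write outside the domain's own region is
refused.

```python
class ProtectionFault(Exception):
    """Raised when a domain writes outside its writable region."""


class Memory:
    """Toy address space: a flat list of cells shared by all domains."""

    def __init__(self, size):
        self.cells = [0] * size


class Domain:
    """Hypothetical protection domain: reads anywhere, writes only in `writable`."""

    def __init__(self, name, memory, writable):
        self.name = name
        self.memory = memory
        self.writable = writable  # range of addresses this domain may write

    def read(self, addr):
        # Reads are never restricted in this model.
        return self.memory.cells[addr]

    def write(self, addr, value):
        # Writes are checked against the domain's writable region.
        if addr not in self.writable:
            raise ProtectionFault(f"{self.name}: write to {addr} denied")
        self.memory.cells[addr] = value


mem = Memory(16)
kernel = Domain("kernel", mem, range(0, 16))  # kernel may write everywhere
driver = Domain("driver", mem, range(8, 16))  # extension: upper half only

kernel.write(0, 42)    # vital kernel data
driver.write(8, 7)     # the driver's own data: allowed
seen = driver.read(0)  # reading kernel data is allowed

try:
    driver.write(0, 99)  # stray write into kernel data
    blocked = False
except ProtectionFault:
    blocked = True
```

In this model the stray write is turned into an exception that a
supervising layer can catch, instead of silently corrupting cell 0.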
Nooks is implemented inside the Linux 2.4.18 kernel on the Intel x86
architecture. Nooks is designed for fault resistance, i.e. the system must
recover from most faults. It is also designed to handle mistakes (buggy
extensions), not abuse (malicious ones). These principles are achieved by
creating a new operating system
reliability layer that is inserted between the extensions and the OS
kernel. The reliability layer intercepts all interactions between the
extensions and the kernel to facilitate isolation and recovery. This
reliability layer is called the NIM (Nooks Isolation Manager). The NIM
provides four major architectural functions: isolation, interposition,
object tracking, and recovery.
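How interposition, object tracking and recovery fit together can be
sketched roughly as follows. This is a toy model, not the NIM itself:
IsolationManager, FlakyDriver and their methods are hypothetical names,
and a Python exception stands in for an extension fault. All calls into
the extension go through one wrapper, resources handed to the extension
are recorded, and on a fault the recorded resources are released and
the extension is restarted.

```python
class IsolationManager:
    """Hypothetical stand-in for a reliability layer between kernel and extension."""

    def __init__(self, make_extension):
        self.make_extension = make_extension
        self.extension = make_extension()
        self.tracked = set()  # object tracking: resources the extension holds
        self.restarts = 0

    def alloc(self, name):
        """Kernel service wrapper: record every object handed to the extension."""
        self.tracked.add(name)
        return name

    def call(self, method, *args):
        """Interposition: every kernel-to-extension call passes through here."""
        try:
            return getattr(self.extension, method)(self, *args)
        except Exception:
            self.recover()
            return None  # the caller sees a failed request, not a crash

    def recover(self):
        """Recovery: release tracked objects, then restart the extension."""
        self.tracked.clear()
        self.extension = self.make_extension()
        self.restarts += 1


class FlakyDriver:
    """Hypothetical driver that fails on one particular input."""

    def send(self, nim, packet):
        nim.alloc(f"buffer-for-{packet}")
        if packet == "bad":
            raise RuntimeError("simulated driver bug")
        return f"sent {packet}"


nim = IsolationManager(FlakyDriver)
first = nim.call("send", "hello")
second = nim.call("send", "bad")    # the driver fails; the manager recovers
third = nim.call("send", "again")   # the restarted driver keeps working
```

The buffer allocated by the failed call does not leak, because the
manager knows about it and drops it during recovery.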
The main novel contribution of this paper is recovery, which most
existing systems for reliability and isolation do not provide. Sometimes
Nooks may have to use application-specific recovery services in order to
make the state consistent after recovery. For example, if a file system
crashes and Nooks simply restarts it, the file system will be in an
inconsistent state. Nooks therefore flushes the contents of the cache
before restarting the file system in order to reduce the number of
inconsistencies.
The paper also describes various tests that were carried out on the
system. These tests reveal that Nooks recovered from 99% of the faults
that caused native Linux to crash. Some of the techniques used to test
the system, such as fault injection, are noteworthy. The synthetic fault
injection techniques are not described in detail, and it would be
interesting to know how these faults are injected. Some of the manual
fault injections, by contrast, are described in great detail. The idea of
using a VMware virtual machine to perform thousands of tests remotely is
also very appealing, since it makes it easy to return the system to a
clean state following each set of tests. Overall, the test and
performance statistics reported in this paper are sufficient to
demonstrate the reliability of this system.
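The appeal of reverting to a clean state between tests can be shown
with a small harness. This is a sketch only and says nothing about how
the paper actually injects faults: run_workload, inject_fault and
run_trials are hypothetical names, a dictionary stands in for the
virtual machine image, and the "fault" is just a randomly set flag.

```python
import copy
import random


def run_workload(state):
    """Hypothetical workload: crashes if the injected fault corrupted the state."""
    if state["corrupted"]:
        raise RuntimeError("simulated crash")
    state["ops"] += 1


def inject_fault(state, rng):
    """Hypothetical injector: corrupts the state in about half of the trials."""
    state["corrupted"] = rng.random() < 0.5


def run_trials(n, seed=0):
    """Run n trials, each starting from a fresh copy of the clean snapshot."""
    rng = random.Random(seed)
    clean = {"ops": 0, "corrupted": False}  # stands in for a clean VM image
    crashes = 0
    for _ in range(n):
        state = copy.deepcopy(clean)  # revert to the snapshot before each trial
        inject_fault(state, rng)
        try:
            run_workload(state)
        except RuntimeError:
            crashes += 1
    return crashes, clean


crashes, clean = run_trials(100)
```

Because every trial works on a copy, damage from one injected fault
cannot leak into the next trial, which is the property the virtual
machine setup provides for the real tests.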