Review: Improving the Reliability of Commodity Operating Systems

From: Raz Mathias (razvanma_at_exchange.microsoft.com)
Date: Wed Jan 21 2004 - 14:52:19 PST

  • Next message: Brian Milnes: "Swift Paper"

    This paper introduces an interesting in-kernel mechanism for protecting against faults in device drivers. It makes the salient point that device driver crashes account for a majority of operating system failures and proposes a practical, compatible mechanism for remedying the situation. The implementation of the proposed solution is named Nooks.

     

    The proposed system is compatible with current implementations of operating systems. That is, it does not require that device drivers be programmed in a special language, that capabilities be introduced, or that hardware be modified. Instead, it relies on a "reliability layer" inserted between device drivers and the kernel and on the page-protection mechanism that current OSes already use.

     

    The paper proposes a set of principles as central to the in-kernel protection theme. The first is the principle of isolation. The first component of isolation, the protection domain, is implemented by giving the extension its own copy of the kernel's page table in which kernel memory is read-only. This prevents extensions from modifying the kernel's memory (only reads are allowed), thereby limiting the scope of corruption. Notice, though, that the relationship is asymmetric: the kernel still has the right to write all of the extension's memory. This reminds me of a limited form of capabilities, where the page table can be interpreted as a limited capability list, giving the domain access to hardware resources (pages). The second component of isolation is the XPC (extension procedure call), in which all data passing between the kernel's domain and the extension's domain must pass through the "reliability layer," a set of stubbed function calls similar to the proxy/stub mechanism of RPC. These functions, also known as wrappers, check parameters, make copies of objects for use by the extension, and use XPC to make calls from extensions to the kernel and vice versa.
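
    The asymmetry can be illustrated with a toy model. This is only a sketch under stated assumptions, not code from the paper: the Domain class, the string permissions "r" and "rw", and the page names are all hypothetical, and a real page table maps virtual addresses to physical frames rather than names to permission strings.

```python
class ProtectionFault(Exception):
    """Raised when a domain touches a page its page table does not allow."""


class Domain:
    """Toy protection domain: its own page table mapping page name -> "r" or "rw"."""

    def __init__(self, name, page_table):
        self.name = name
        self.page_table = dict(page_table)

    def read(self, memory, page):
        if "r" not in self.page_table.get(page, ""):
            raise ProtectionFault(f"{self.name} may not read {page}")
        return memory[page]

    def write(self, memory, page, value):
        if "w" not in self.page_table.get(page, ""):
            raise ProtectionFault(f"{self.name} may not write {page}")
        memory[page] = value


# Hypothetical page names standing in for kernel and extension memory.
memory = {"kernel_data": 1, "driver_data": 2}
kernel = Domain("kernel", {"kernel_data": "rw", "driver_data": "rw"})
driver = Domain("driver", {"kernel_data": "r", "driver_data": "rw"})

kernel.write(memory, "driver_data", 20)      # kernel may write extension memory
value = driver.read(memory, "kernel_data")   # extension may read kernel memory
try:
    driver.write(memory, "kernel_data", 99)  # ...but may not modify it
    blocked = False
except ProtectionFault:
    blocked = True
```

    The only difference between the two domains is the permission each page table records for the kernel's pages, which is why no special hardware or language support is needed.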

     

    The second principle, interposition, is closely tied to the concepts of wrappers and XPCs. This principle states that all calls between extensions and the kernel are wrapped, and that any objects created in the process are tracked by the object tracker.
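
    A wrapper of this kind might look roughly like the following sketch. It is an illustration, not the Nooks implementation: make_wrapper, validate, writable_fields, and the packet dictionary are hypothetical names, and an ordinary function call stands in for the XPC transfer between domains.

```python
import copy


def make_wrapper(extension_fn, validate, writable_fields):
    """Build a hypothetical wrapper around an extension entry point.

    validate(obj) -> bool checks the argument before it crosses domains;
    writable_fields names the fields the extension is allowed to change.
    """

    def wrapper(kernel_obj):
        if not validate(kernel_obj):
            raise ValueError("bad parameter rejected at the boundary")
        shadow = copy.deepcopy(kernel_obj)   # the extension only sees a copy
        result = extension_fn(shadow)        # stands in for the XPC transfer
        for field in writable_fields:        # copy back only permitted changes
            kernel_obj[field] = shadow[field]
        return result

    return wrapper


def buggy_send(packet):
    """Hypothetical driver routine that also writes a field it should not touch."""
    packet["sent"] = True
    packet["owner"] = "driver"   # stray write, never copied back
    return len(packet["payload"])


send = make_wrapper(
    buggy_send,
    validate=lambda p: isinstance(p.get("payload"), bytes),
    writable_fields=["sent"],
)

packet = {"payload": b"abc", "sent": False, "owner": "kernel"}
sent_bytes = send(packet)
```

    Because the driver works on a copy, its stray write to "owner" never reaches the kernel's object; only the "sent" field is propagated back.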

     

    The next principle, object tracking, maintains a list of the kernel data structures used by each extension. When an extension fails, this list is used to garbage-collect the extension's resources, returning them to the kernel for reuse.
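
    The bookkeeping involved can be sketched as follows. This is a minimal model under my own assumptions, not the paper's data structure: ObjectTracker, its method names, and the buffer identifiers are hypothetical.

```python
class ObjectTracker:
    """Toy tracker: remembers, per extension, how to release each kernel object it holds."""

    def __init__(self):
        self._held = {}   # extension name -> {object id: release callback}

    def track(self, extension, obj_id, release):
        self._held.setdefault(extension, {})[obj_id] = release

    def untrack(self, extension, obj_id):
        self._held.get(extension, {}).pop(obj_id, None)

    def reclaim(self, extension):
        """Release everything a failed extension still holds; return the ids released."""
        held = self._held.pop(extension, {})
        for release in held.values():
            release()
        return list(held)


free_buffers = []   # hypothetical kernel free list
tracker = ObjectTracker()
tracker.track("netdrv", "buf0", lambda: free_buffers.append("buf0"))
tracker.track("netdrv", "buf1", lambda: free_buffers.append("buf1"))
tracker.untrack("netdrv", "buf0")        # buf0 was given back normally, so stop tracking it
reclaimed = tracker.reclaim("netdrv")    # the driver has failed: reclaim what is left
```

    Only objects still recorded against the failed extension are released, so resources it had already returned are not freed twice.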

     

    The final concept, the recovery mechanism, catches hardware and software faults and calls a user-level agent to determine what steps to take next. The recovery mechanism then has the option of releasing kernel resources (by way of the object tracker), unloading the extension, and possibly reloading the extension in a new domain.
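
    The control flow described here can be sketched as a small loop. This is an assumption-laden illustration rather than the paper's recovery manager: run_with_recovery and its load, policy, and cleanup hooks are hypothetical, and a Python exception stands in for a trapped hardware or software fault.

```python
def run_with_recovery(load, policy, cleanup, max_restarts=1):
    """Call an extension; on a fault, ask a policy agent what to do.

    load() returns a fresh extension callable (a stand-in for loading it
    into a new domain), policy(exc) returns "restart" or "unload", and
    cleanup() stands in for releasing resources via the object tracker.
    All three are hypothetical hooks.
    """
    log = []
    extension = load()
    restarts = 0
    while True:
        try:
            return extension(), log
        except Exception as exc:           # stands in for a trapped fault
            log.append(f"fault: {exc}")
            cleanup()
            log.append("released resources")
            if policy(exc) == "restart" and restarts < max_restarts:
                restarts += 1
                extension = load()
                log.append("reloaded")
            else:
                log.append("unloaded")
                return None, log


attempts = {"n": 0}


def load_flaky_driver():
    """Hypothetical driver that faults on its first call only."""
    def driver():
        attempts["n"] += 1
        if attempts["n"] == 1:
            raise RuntimeError("null dereference")
        return "ok"
    return driver


result, log = run_with_recovery(load_flaky_driver, lambda exc: "restart", lambda: None)
```

    Keeping the restart-or-unload decision in a separate policy function mirrors the review's point that a user-level agent, not the kernel, chooses the next step.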

     

    The paper gives us a view of the reliability and performance implications of implementing Nooks on a modern operating system (Linux). The reliability results are impressive: the system recovered from 99% of the simulated extension faults that would otherwise have crashed the kernel. The use of Nooks as a platform for developing drivers is also interesting. The system can easily recycle the driver after a fault, thereby shortening the debug-fault-reboot cycle of driver development. The paper also reports a range of performance results. The drivers protected by Nooks did comparably well in some tests (achieving around 90% of the performance of the original unprotected drivers) and poorly in others (notably, the kernel-level HTTP server performed only 60% as well).

     

    In my opinion, Nooks is a great idea from the standpoint of reliability. It seems to me that this mechanism would be far more useful in an operating system in which it was optional (and nothing technically precludes us from implementing such a system). In some cases, it would be a good idea to default to protected mode (for drivers whose performance is not heavily impacted). In other cases, it would make a good "safe mode" for an operating system in which drivers aren't necessarily signed or are known to be of substandard quality.

     

    In conclusion, I look forward to seeing this idea implemented.

     


