Jim Shearer's review of Improving the Reliability of Commodity Operating Systems

From: shearerje_at_comcast.net
Date: Sun Jan 18 2004 - 12:41:04 PST


    I found this paper, describing the architecture, implementation, and performance of the Nooks isolation manager, to be the easiest-to-read paper yet presented. Its arguments, descriptions, conclusions, and use of English are clean. As I struggle through it in the early hours of the morning, I thank the authors for that.

    Nooks seeks to improve the reliability of existing “commodity” operating systems such as Linux or Windows XP by providing an isolation layer between the OS kernel and the myriad extensions (many of them third-party) that cause as much as 85% (more, in my personal experience) of the kernel crashes experienced on these systems. Nooks achieves this by virtualizing the interface between the kernel and the extension in a unidirectional trust relationship that relies on four architectural functions: isolation through a new Extension Procedure Call (XPC) service, interposition of wrapper stubs on all control and data transfer mechanisms between the extension and the kernel, object tracking through a kernel data copy mechanism, and an extension restart and recovery mechanism. This last function is the most distinctive feature of Nooks. It doesn’t just protect the kernel; it tries to get the extension running again in such a way that it rescues the application itself. That Nooks is successful in this at all is a major accomplishment, and the performance data suggest that it does it rather well in many cases! Nooks takes a pragmatic approach, assuming that extensions may contain errors but are not actually malicious, and handling most, but not all, extension failures. This approach provides the greatest return on investment, making Nooks viable in a commercial setting.
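
    To make the wrapper, object-tracking, and recovery functions concrete, here is a minimal user-space sketch in Python. It is only an analogy under my own assumptions, not the Nooks implementation or its API: the names IsolationManager, ExtensionFault, track, call, and FlakyDriver are all hypothetical, and a Python exception stands in for a fault caught at the XPC boundary.

```python
import copy


class ExtensionFault(Exception):
    """Reported to the caller when an isolated extension call fails."""


class IsolationManager:
    """Toy stand-in for an isolation layer between a 'kernel' and one extension."""

    def __init__(self, extension_factory):
        self._factory = extension_factory      # builds a fresh extension instance
        self._extension = extension_factory()
        self._tracked = {}                     # handle -> copy handed to the extension
        self.restarts = 0

    def track(self, handle, kernel_object):
        # Object tracking: the extension works on a copy, so a buggy
        # extension cannot corrupt the kernel's own object.
        self._tracked[handle] = copy.deepcopy(kernel_object)
        return self._tracked[handle]

    def call(self, name, *args):
        # Wrapper stub: every kernel-to-extension call is routed through here.
        try:
            return getattr(self._extension, name)(*args)
        except Exception as exc:
            self._recover()
            raise ExtensionFault(name) from exc

    def _recover(self):
        # Recovery: drop everything the failed extension held, then restart it.
        self._tracked.clear()
        self._extension = self._factory()
        self.restarts += 1


class FlakyDriver:
    """Hypothetical extension that fails on bad input."""

    def read(self, length):
        if length < 0:
            raise ValueError("bad length")
        return b"\x00" * length
```

    In this sketch a failed call surfaces as an ExtensionFault in the caller while the manager itself keeps running with a freshly started extension, which is the behavior the paper's recovery function aims for at kernel level.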

    Deployment of Nooks in a commercial setting could have a significant impact on that setting. For example, the Intel x86 architecture has a TLB flushing feature that interacts poorly with Nooks. A Nooks market could drive Intel to change this. As another example, Nooks produces two side effects: (1) a new kernel service in the form of the recovery function, which applications themselves may be designed to exploit directly, and (2) a valuable development tool, since Nooks inherently detects some kinds of faults that would otherwise be latent in a new extension. I am, however, unclear on some aspects of deploying Nooks, such as “Who is responsible for extending Nooks when new drivers are created?” Placing the burden on the already suspect driver developers may result in corruption of Nooks, but Nooks needs to be cognizant of a lot of driver detail, such as the lifetimes of interface objects.

    I would contest two of the assertions in the paper, however. First, Section 6 states “kHTTPd represents a poor application on Nooks”. However, Figure 6 shows that Nooks caught the vast majority of fatal errors in kHTTPd. I argue that preventing these crashes is well worth the measured performance hit to kHTTPd (particularly since it was already noted that building a web server as a kernel-mode extension may not be wise to begin with). Which brings me to the second assertion: Section 7 claims “Where performance matters more than reliability, isolation may not be appropriate.” This is totally wrong. Since system crashes affect all applications on the system (and some that just happen to be talking to this system), we are discussing the performance of a single driver vs. the reliability of the aggregate set of applications. The driver developer has no idea what applications these might be and so cannot reasonably make this trade in favor of performance! No reasonable trade exists in a multipurpose environment until the aggregate performance of the whole system drops below useful levels. The delusion that there is a broad range of acceptable tradeoffs of reliability for performance at any level that affects the entire platform is one of the great banes of the computer industry today!


