From: Reid Wilkes (reidwilkes_at_verizon.net)
Date: Sun Jan 25 2004 - 21:58:16 PST
This paper describes "Nooks", a technology to improve the reliability of
what the paper terms "commodity operating systems". This paper was quite
refreshing to read - mostly I think because it deals with current and
familiar technologies and also because it takes a decidedly "industry"
approach. Rather than devising a solution to the problem of reliability by
creating an entirely new system architecture or programming language, the
authors in some sense "hack" a solution into existing operating systems.
This seems to be a much more realistic approach to problems in today's
commercial computer industry, and one much more likely to gain acceptance in
industry. The idea behind Nooks is to create a protected area in which
loadable kernel modules (most commonly device drivers) can be run, so that it is less likely
that their malfunctions will bring down the entire system. Clearly the need
to help isolate driver failures from the rest of the system is huge, given
some of the data the authors present on the proportion of crashes on Windows
XP and Linux which are caused by driver malfunction. The authors also take
another refreshing stand in the paper by stating quite clearly at the
outset that the goal of the project was not to prevent drivers from ever being
able to bring down the system, but rather to help improve on the common
failure cases. This again seems to be a highly practical approach and one
which in my experience is more likely to be successful (at least in
industry). The architecture of the system seems quite straightforward, as it
essentially interposes proxies on the communication paths between the
modules and kernel. These proxies are then able to check parameters and
detect when a driver is malfunctioning. The system then allows failing
modules to be reloaded and restarted. Most interesting to me about the
architecture is the assertion made that very little kernel or driver code
has to be modified to employ the Nooks system. Again, this engineering feat is
quite important for the realities of the commodity OS market. The final
section of the paper presents quantitative results of reliability and
performance testing. The reliability results are quite impressive and
certainly show promise for the concepts behind the system. The performance
results are a little disappointing, but maybe not surprising overall. It
goes without saying that adding extra code into these paths in the kernel is
going to slow things down - although the performance slowdown in some cases
was quite dramatic. It was interesting to see that the authors had isolated
this slowdown to TLB flushing; it stands to reason that if a system like
this were integrated more tightly into the kernel (thus requiring more
substantial kernel code changes), much of this performance hit could be
removed.
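
To make the proxy idea described above a little more concrete, here is a toy
sketch in Python. It is only an illustration under stated assumptions, not
code from the paper or from Nooks itself: the names DriverProxy, FlakyDriver
and DriverFault are hypothetical, and a real implementation would live in
kernel C and rely on hardware memory protection rather than exception
handling. The sketch shows the three steps mentioned in the review: check
parameters on the way in, detect a malfunction, then reload and restart the
failing module.

```python
class DriverFault(Exception):
    """Raised by the proxy when the caller passes a bad parameter."""


class FlakyDriver:
    """Hypothetical driver that fails on its third write after each load."""

    def __init__(self):
        self.writes = 0

    def write(self, data):
        self.writes += 1
        if self.writes == 3:
            raise RuntimeError("simulated driver bug")
        return len(data)


class DriverProxy:
    """Sits between the 'kernel' and a driver: checks parameters, traps
    failures, and reloads the driver instead of letting the error spread."""

    def __init__(self, driver_factory):
        self._factory = driver_factory
        self._driver = driver_factory()
        self.restarts = 0

    def _reload(self):
        # Recovery step: discard the failed instance and start a fresh one.
        self._driver = self._factory()
        self.restarts += 1

    def write(self, data):
        # Parameter check on the kernel-to-driver path.
        if not isinstance(data, (bytes, bytearray)):
            raise DriverFault("write() expects a bytes-like buffer")
        try:
            result = self._driver.write(data)
        except Exception:
            # The driver malfunctioned: reload it and report a failed request.
            self._reload()
            return None
        # Sanity check on the driver-to-kernel path.
        if not isinstance(result, int) or not 0 <= result <= len(data):
            self._reload()
            return None
        return result


proxy = DriverProxy(FlakyDriver)
results = [proxy.write(b"abcd") for _ in range(5)]
print(results, proxy.restarts)  # prints: [4, 4, None, 4, 4] 1
```

In this toy version the third request is lost (it returns None) but the
caller keeps running and later requests succeed against the reloaded driver,
which mirrors the stated goal of improving the common failure cases rather
than preventing every possible crash.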