Review: Experience with Grapevine: The Growth of a Distributed System

From: Raz Mathias (razvanma_at_exchange.microsoft.com)
Date: Mon Feb 02 2004 - 16:33:19 PST

This paper gives us an account of experiences with the Grapevine system.
In short, Grapevine can be described as a distributed registry service
and message queuing management system (with authentication and access
control thrown in for good measure). The paper retrospectively analyzes
deployment problems, strategies for overcoming them, and the lessons
learned from operating the system over the course of a few years.

The mechanics of the basic Grapevine system are relatively
straightforward. The system is divided into a message system and a
registry system. The registry consists of two kinds of entries:
individual entries (representing users and servers) and group entries
(comprising lists of individual entries). The name of each entry is a
two-level hierarchy consisting of a <registry name, name local to the
registry> pair. Registries are distributed atomically among a number of
servers (i.e. a server contains either all the names in a registry or
none). In my opinion, restricting oneself to two levels of naming is
limiting, arbitrary, and bound to lead to problems, as indeed it has for
the authors of this paper. Their purpose in using only two levels was
to impose a structure on the topology of the network itself, forcing a
geographic node to map to a whole number of registries. This idea,
combined with the demonstrated demand for organizational partitioning,
can become problematic. As an organization's registry outgrows a single
machine, the administrator is forced to create an artificial split in
the naming system, which burdens management of the organization. A
similar problem (not directly related to the mechanics of the naming
system, but conceptually related to the scalability of hierarchies)
manifested itself in the management of large distribution lists, where
multiple levels of hierarchy would have scaled much better.
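
To make the naming scheme concrete, here is a minimal sketch in Python
of a two-level name and of registries being placed on servers as whole
units. The textual form "local.registry", the registry names, and the
server names are all assumptions made up for illustration; none of them
come from the paper.

```python
from typing import NamedTuple


class RName(NamedTuple):
    """A two-level name: <registry name, name local to the registry>."""
    registry: str
    local: str


def parse_rname(text: str) -> RName:
    """Split 'local.registry' at the LAST dot (assumed textual form).

    Any earlier dots stay inside the local part, so a third level of
    hierarchy cannot be expressed structurally.
    """
    local, sep, registry = text.rpartition(".")
    if not sep or not local or not registry:
        raise ValueError(f"expected 'local.registry', got {text!r}")
    return RName(registry=registry, local=local)


# Hypothetical placement table: a server holds a registry entirely or
# not at all, so placement is recorded per registry, never per name.
REGISTRY_REPLICAS = {
    "pa": ["server-a", "server-b"],
    "wbst": ["server-c"],
}


def servers_for(name: RName) -> list[str]:
    """Return the servers holding the whole registry that owns `name`."""
    return REGISTRY_REPLICAS.get(name.registry.lower(), [])
```

Because placement is keyed by registry alone, the only way to spread an
oversized registry across machines in this sketch is to rename part of
it into a new registry, which is the artificial split complained about
above.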

I found the discussion of mailbox distribution concerns particularly
relevant even today. The basic idea was to optimize the primary inbox
for latency, the secondary inbox for availability, and the tertiary
inbox for unforeseen failures (put it on the "other side of the [local]
internet"). I do believe that, in addition to the concerns expressed,
the paper should have placed more emphasis on scalability and less on
pure latency. I would argue that in a distributed messaging system, the
goal should be high throughput, and not necessarily low latency; a
corporation would much rather accept slower message delivery
(increasing from, say, two minutes to four minutes in the worst case)
than give up scalability (a lack of which places load on the
administrator). The concerns raised for registry replicas are also very
relevant today: reliability, availability, latency, and protection from
catastrophic failure are all fundamentally important.
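
As a rough illustration of the inbox ordering, the following Python
sketch delivers a message to the first reachable inbox site in a user's
preference list. The function name, the `is_up` callback, and the site
names are hypothetical, and trying sites strictly in order is my
assumption about how such a list would be used, not a description of
Grapevine's actual delivery code.

```python
from typing import Callable, Dict, List, Optional


def deliver(
    inbox_sites: List[str],
    is_up: Callable[[str], bool],
    message: str,
    stores: Dict[str, List[str]],
) -> Optional[str]:
    """Store `message` at the first reachable site, in preference order.

    inbox_sites is ordered primary, secondary, tertiary: the primary is
    chosen for low latency, the later entries only for survival.
    Returns the chosen site, or None if every site is down.
    """
    for site in inbox_sites:
        if is_up(site):
            stores.setdefault(site, []).append(message)
            return site
    return None


# Usage: the primary is down, so the message lands on the secondary.
stores: Dict[str, List[str]] = {}
sites = ["nearby", "same-region", "far-side"]
chosen = deliver(sites, lambda s: s != "nearby", "hello", stores)
```

Note that nothing in this ordering spreads load; it only ranks sites by
proximity and independence of failure, which is the latency-first bias
questioned above.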

The introduction of the concept of thin-client access was particularly
interesting to me. Having a rich client can greatly simplify the
server's guarantees (e.g. the server can force sequential access to
messages), whereas a thin client can force the server to support more
complex core guarantees (e.g. random access to messages). I had never
looked at the thick/thin client problem from this angle.
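
The difference in server obligations can be sketched as two inbox
interfaces in Python. Both classes and all of their method names are
hypothetical, chosen only to contrast the two kinds of access; they are
not Grapevine's API.

```python
from typing import Dict, List, Optional


class SequentialInbox:
    """What a server can offer a rich client: read front to back only.

    The client is assumed to keep its own copies, so the server needs
    no per-message addressing, only a cursor.
    """

    def __init__(self, messages: List[str]) -> None:
        self._messages = list(messages)
        self._cursor = 0

    def next_message(self) -> Optional[str]:
        if self._cursor >= len(self._messages):
            return None
        msg = self._messages[self._cursor]
        self._cursor += 1
        return msg


class RandomAccessInbox:
    """What a thin client forces: stable ids, fetch and delete by id."""

    def __init__(self, messages: List[str]) -> None:
        self._messages: Dict[int, str] = dict(enumerate(messages))

    def list_ids(self) -> List[int]:
        return sorted(self._messages)

    def fetch(self, msg_id: int) -> str:
        return self._messages[msg_id]

    def delete(self, msg_id: int) -> None:
        del self._messages[msg_id]
```

The second interface obliges the server to keep identifiers stable
across deletions, which is exactly the kind of extra core guarantee the
thin client imposes.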

The paper also goes into the administrative surprises of propagation
delays (all of which still surprise administrators twenty years
later). Despite its numerous positive points (too many to address
individually in this short summary), I believe that the paper missed a
few issues. First, although the registry system is distributed, it has
become the central repository of naming, location, authentication, and
authorization. As such, it becomes the lifeblood of a corporation
(anything but 100% uptime would be phenomenally costly). I therefore
believe the system should be aware of, and able to control, the
underlying transport mechanisms. It should be able to take priority in
switching and routing and should be able to guarantee a quality of
service regardless of whatever else is going on in the network (the
system was explicitly constructed to avoid dealing with the underlying
transport). Next, there is the issue of duplicate message detection.
I've personally run into this specific issue when trying to display a
merged view of multiple mailboxes on a client device. The ideal
solution, had it been built into the system, would have been a globally
unique identifier for each message. Instead, on this system, we would
have to do property-by-property comparisons, a dangerous game to get
into when routing servers can potentially add, modify, or delete
properties.
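
A small Python sketch shows why the two approaches diverge. The
`Message` class, its fields, and the "received_via" property are
hypothetical stand-ins for illustration; the identifier field is
precisely the thing the system lacks.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Message:
    properties: Dict[str, str]
    # Hypothetical globally unique id, assigned once when the message
    # is created and never touched by routing servers.
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def merge_by_id(*mailboxes: List[Message]) -> List[Message]:
    """Merge mailboxes, keeping the first copy seen of each id."""
    seen: Dict[str, Message] = {}
    for box in mailboxes:
        for msg in box:
            seen.setdefault(msg.message_id, msg)
    return list(seen.values())


def merge_by_properties(*mailboxes: List[Message]) -> List[Message]:
    """Merge mailboxes, treating messages as equal only if every
    property matches. Breaks as soon as a relay alters a property."""
    merged: List[Message] = []
    for box in mailboxes:
        for msg in box:
            if not any(msg.properties == m.properties for m in merged):
                merged.append(msg)
    return merged


# The same message reaches two mailboxes; one relay adds a property.
original = Message({"subject": "status", "from": "alice"})
relayed = Message(
    dict(original.properties, received_via="relay-2"),
    message_id=original.message_id,
)
```

Here `merge_by_id([original], [relayed])` yields one message, while
`merge_by_properties([original], [relayed])` yields two, because the
added property makes the copies look different.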

This paper reads like a Microsoft Active Directory/Exchange deployment
guide, except that it was written twenty years ago. The ideas on
scalability, reliability, availability, and all the other -abilities
introduced in this paper really make this a must-read for any server
applications developer. I personally enjoyed reading the concerns
raised and appreciated the authors' inclination to enumerate the
tradeoffs of various solutions rather than attempting to paint a pure
black-and-white picture of distributed systems design.
