From: Jeff Duzak (jduzak_at_exchange.microsoft.com)
Date: Mon Feb 02 2004 - 10:26:43 PST
This paper describes the Grapevine system and discusses a number of
issues that have arisen as its load has grown.
Grapevine is a distributed database, primarily used for messaging,
although it is also used for file sharing, authentication, and binding
of RPC interfaces, as we read about in the previous paper. The system
consists of 17 servers which are distributed among several physical
locations, some very remote. All configuration information is
replicated on each of the servers.
The system was designed with a maximum load in mind (30 servers and
10,000 users). This seems a very low limit compared with systems today.
Further, the idea of designing for a fixed maximum load feels somewhat
foreign, since systems today generally attempt to scale indefinitely.
The paper describes a number of problems that have arisen with the
growing load of the system. One example is the processing of
distribution lists. In the original system, in order to process a
message sent to a distribution list, a single server had to resolve the
address of each member of the list and send the message to that
address. For a distribution list with 500 members, this could tie up
the server for up to 10 minutes. The authors discuss a method by which
this particular problem could be alleviated. The solution, however,
requires manual intervention by the administrators of the system to
break up a distribution list into several smaller lists.
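To put rough numbers on this, here is a small Python sketch of the
splitting idea. It is only an illustration, not the paper's
implementation: the helper names, the member names, and the sub-list
size of 100 are all made up, and it assumes the cost per member is
uniform, so that 10 minutes for 500 members works out to about 1.2
seconds per member.

```python
# Figures taken from the review above: up to 10 minutes for a 500-member list.
WORST_CASE_SECONDS = 10 * 60
WORST_CASE_MEMBERS = 500


def split_list(members, chunk_size):
    """Break a flat member list into consecutive sub-lists of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [members[i:i + chunk_size] for i in range(0, len(members), chunk_size)]


def expansion_seconds(member_count):
    """Estimated time for one server to expand a list.

    Assumption: cost scales linearly with the number of members.
    """
    return member_count * WORST_CASE_SECONDS / WORST_CASE_MEMBERS


# Hypothetical member names, purely for illustration.
members = [f"user{i}.registry" for i in range(500)]

# An administrator splits the list by hand into sub-lists of 100 members.
sublists = split_list(members, 100)

# If each sub-list is expanded by a different server, no single server
# is tied up for longer than the largest sub-list takes.
worst_per_server = max(expansion_seconds(len(s)) for s in sublists)

print(len(sublists), "sub-lists")
print(expansion_seconds(len(members)), "seconds on one server, unsplit")
print(worst_per_server, "seconds per server, split")
```

Under these assumptions the total work is unchanged. The split only
bounds how long any one server is busy, and only if the sub-lists
actually end up being handled by different servers.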
In general, it seemed that the system required a lot of manual tweaking.
For instance, there was no automatic load balancing. Load balancing was
achieved through explicit assignment of servers as the primary or
secondary stores for each registry, or group of users. Further, manual
configuration was required to mitigate the effect of unreliable
connections, and to localize data close to users likely to access it.
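As a loose illustration of what this kind of explicit assignment looks
like, here is a Python sketch of a hand-maintained table that maps each
registry to a primary and a secondary server. The registry names, the
server names, and the pick_server helper are all hypothetical and are
not taken from the paper.

```python
# Hypothetical, hand-maintained table: registry -> [primary, secondary].
# Nothing here rebalances itself; an administrator has to edit the table.
ASSIGNMENTS = {
    "registry-a": ["server-1", "server-2"],
    "registry-b": ["server-2", "server-3"],
}


def pick_server(registry, reachable):
    """Return the first assigned server that is currently reachable.

    The primary is tried first, then the secondary. Returns None if
    neither assigned server can be reached.
    """
    for server in ASSIGNMENTS[registry]:
        if server in reachable:
            return server
    return None


# Normal case: the primary is reachable, so it is used.
print(pick_server("registry-a", {"server-1", "server-2", "server-3"}))

# Primary unreachable: fall back to the secondary.
print(pick_server("registry-a", {"server-2", "server-3"}))
```

The point of the sketch is that the load on each server is whatever the
table says it is. If one registry grows much faster than the others,
nothing shifts until someone changes the assignments by hand.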
The paper is quite different from the other papers we have read, in
that it spends a lot of time on the usage of the system, both from the
point of view of an end-user and that of an administrator. It also
covers social aspects of the system. For instance, it points out the
benefit of informing a user of an error, even if the user can't do
anything about it. Such human considerations are, of course, very
important, but don't appear very frequently in published papers.
One final interesting aspect of this paper is that it briefly discusses
the costs of making changes to a system already in widespread use. The
paper notes that, while there are a number of potential improvements
that could be made to the system, very few of these will be implemented
because of the potential disruption to the existing system. This is a
vital issue for all production systems.