Review of "Experience with Grapevine: The Growth of a Distributed System"

From: Jeff Duzak (jduzak_at_exchange.microsoft.com)
Date: Mon Feb 02 2004 - 10:26:43 PST


    This paper describes the Grapevine system and discusses a number of
    issues that arose as its load grew.
     
    Grapevine is a distributed database used primarily for messaging,
    although it is also used for file sharing, authentication, and binding
    of RPC interfaces, as we read about in the previous paper. The system
    consists of 17 servers distributed among several physical locations,
    some very remote. All configuration information is replicated on each
    of the servers.
     
    The system was designed with a maximum load in mind (30 servers and
    10,000 users). This is a very low limit compared with systems today.
    Further, the idea of designing for a fixed maximum load seems somewhat
    foreign, since systems today attempt to scale indefinitely.
     
    The paper discusses several problems that arose as the load on the
    system grew. One example is the processing of distribution lists. In
    the original system, to process a message sent to a distribution list,
    a single server had to resolve the address of each member of the list
    and send the message to that address. For a distribution list with 500
    members, this could tie up that server for up to 10 minutes. The
    authors discuss a method for alleviating this particular problem. The
    solution, however, requires the system's administrators to manually
    break up a distribution list into several smaller lists.
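
    As a rough illustration of that kind of fix, the sketch below splits
    one large list into fixed-size sublists and makes the parent list
    point at them, so that each sublist could in principle be expanded by
    a different server. The function name, the "-partN" naming scheme, and
    the size threshold are all assumptions made for illustration; the
    paper does not prescribe them.

```python
from typing import Dict, List


def split_distribution_list(
    name: str, members: List[str], max_size: int
) -> Dict[str, List[str]]:
    """Break one large list into sublists of at most max_size members.

    Returns a mapping of list name -> members. If a split is needed, the
    parent list's members become the names of the sublists.
    (Hypothetical helper; not Grapevine's actual interface.)
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")

    # Small lists are left alone.
    if len(members) <= max_size:
        return {name: list(members)}

    lists: Dict[str, List[str]] = {}
    sub_names: List[str] = []
    for start in range(0, len(members), max_size):
        # Assumed naming scheme: <parent>-part1, <parent>-part2, ...
        sub_name = f"{name}-part{start // max_size + 1}"
        lists[sub_name] = members[start:start + max_size]
        sub_names.append(sub_name)

    # The parent list now refers only to the sublists.
    lists[name] = sub_names
    return lists


if __name__ == "__main__":
    everyone = [f"user{i}" for i in range(500)]
    result = split_distribution_list("AllStaff", everyone, 100)
    for list_name, entries in result.items():
        print(list_name, len(entries))
```

    With 500 members and a limit of 100, this yields five sublists plus a
    parent list that names them. The point of the review stands: someone
    still has to decide to run such a split by hand.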
     
    In general, it seemed that the system required a lot of manual tweaking.
    For instance, there was no automatic load balancing. Load was balanced
    by explicitly assigning servers as the primary or secondary stores for
    each registry (a group of users). Further, manual configuration was
    required to mitigate the effect of unreliable connections and to
    place data close to the users likely to access it.
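
    To make the idea of explicit assignment concrete, here is a minimal
    sketch of a hand-maintained table that lists, for each registry, its
    servers in preference order (primary first, then secondary). The
    registry names, server names, and the pick_server function are all
    hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Set

# Hypothetical, hand-maintained assignment table:
# registry name -> servers in preference order (primary, then secondary).
REGISTRY_SERVERS: Dict[str, List[str]] = {
    "registry-west": ["server-a", "server-b"],
    "registry-east": ["server-c", "server-a"],
}


def pick_server(registry: str, reachable: Set[str]) -> Optional[str]:
    """Return the first assigned server for the registry that is reachable.

    Returns None if the registry is unknown or none of its assigned
    servers can be reached. Nothing here rebalances load automatically;
    an administrator has to edit the table.
    """
    for server in REGISTRY_SERVERS.get(registry, []):
        if server in reachable:
            return server
    return None


if __name__ == "__main__":
    # Primary is up, so it is chosen.
    print(pick_server("registry-west", {"server-a", "server-b"}))
    # Primary is down, so the secondary is chosen.
    print(pick_server("registry-west", {"server-b"}))
```

    Any change in load or connectivity means editing the table by hand,
    which is the kind of manual tweaking described above.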
     
    The paper is quite different from the other papers we have read, in
    that it spends a lot of time on how the system is used, from the point
    of view of both an end user and an administrator. It also covers
    social aspects of the system. For instance, it describes the benefit
    of informing a user of an error even if the user can't do anything
    about it. Such human considerations are, of course, very important,
    but they don't appear very often in published papers.
     
    One final interesting aspect of this paper is its brief discussion of
    the costs of making changes to a system already in widespread use. The
    paper notes that, while a number of potential improvements could be
    made to the system, very few of them will be implemented because of
    the potential disruption to the existing system. This issue is
    pertinent to all production systems.

