Review of Grapevine paper

From: Praveen Rao (psrao_at_windows.microsoft.com)
Date: Mon Feb 02 2004 - 15:52:21 PST

  • Next message: Cliff Schmidt: "Review of Schroeder et al. "Experience with Grapevine: The Growth of a Distributed System""

    This paper discusses Grapevine, a distributed, replicated system that
    provides message delivery, naming, authentication, resource location,
    and access control services in an internet of computers. Grapevine is
    implemented as a Mesa program that runs on a set of dedicated server
    computers. Communication among Grapevine servers, and between clients
    and servers, is built on internet protocols.

    At the time the paper was written, Grapevine was already in use, and
    the paper discusses the successes and problems encountered with it.
    The issues identified are relevant to most distributed systems and
    highlight what distributed systems designers should watch out for.

    In terms of scalability, the system generally worked well under the
    assumed conditions (organizational e-mail) but ran into problems with
    other usage patterns, e.g. mailing lists. The suggested solution was
    another level of indirection. Another scaling problem could arise
    from the size of the internet. Grapevine used a direct connection
    between the accepting server and the preferred mailbox to deliver
    messages. This would become impractical as the internet grows and the
    reliability of links decreases. Multi-step forwarding should be used
    instead.
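
    To make the "level of indirection" idea concrete, here is a minimal
    sketch, assuming the large list is split into per-registry sub-lists so
    that the accepting server only handles a few sub-list names instead of
    every individual. The list names, the LISTS table, and both helper
    functions are hypothetical and are not taken from the paper.

```python
# Hypothetical data: a top-level list whose members are sub-lists,
# each of which holds the individual recipients.
LISTS = {
    "all-staff": ["all-staff.pa", "all-staff.wbst"],
    "all-staff.pa": ["alice.pa", "bob.pa"],
    "all-staff.wbst": ["carol.wbst"],
}


def expand_one_level(name):
    """Members the accepting server must handle itself for this name."""
    return list(LISTS.get(name, [name]))


def expand_fully(name, seen=None):
    """All individual recipients, following nested lists and guarding
    against cycles with the `seen` set."""
    seen = set() if seen is None else seen
    if name in seen:
        return set()
    seen.add(name)
    if name not in LISTS:
        return {name}  # not a list, so it is an individual recipient
    recipients = set()
    for member in LISTS[name]:
        recipients |= expand_fully(member, seen)
    return recipients
```

    With this layout the accepting server forwards one copy per sub-list,
    and each sub-list is expanded closer to its recipients.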

    In terms of configurability, it was hard to develop guidelines for
    configuring the system. IMO, as systems become more complex, the
    architecture needs to allow adaptive configuration.

    Another problem that arises in distributed systems is keeping the
    various servers in sync, along with the load that the sync process
    generates. Grapevine tolerates inconsistencies for 24 hours. Grapevine
    faced problems with distributed updates and merge load. One factor was
    that when a registration entry was merged, the entire entry was sent
    to the other servers rather than just the change. This generated
    massive load at times. It was later changed to propagate just the
    change.
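
    The difference between shipping the whole entry and shipping only the
    change can be sketched as follows. This is an illustration only: it
    assumes each item in an entry carries an integer timestamp and that the
    newer timestamp wins on merge. The function names and the entry layout
    are hypothetical.

```python
def merge_item(entry, key, value, stamp):
    """Apply one change to a replica's entry if it is newer than what the
    replica already holds. Returns True if the change was applied."""
    current = entry.get(key)
    if current is None or stamp > current[1]:
        entry[key] = (value, stamp)
        return True
    return False


def full_entry_message(entry):
    """Old behaviour in this sketch: send every item of the entry."""
    return dict(entry)


def delta_message(key, value, stamp):
    """New behaviour in this sketch: send only the item that changed."""
    return {key: (value, stamp)}


def apply_message(entry, message):
    """Merge a received message (full entry or delta) into a replica."""
    for key, (value, stamp) in message.items():
        merge_item(entry, key, value, stamp)
```

    Both message shapes leave the replicas in the same state; the delta
    simply carries one item instead of the whole entry, which matters when
    the entry is a large group.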

    A design goal of Grapevine was to become the source of authentication
    and access control information on the internet. For example, file
    servers relied on Grapevine for access checks. The pitfalls: 1) both
    the file server and Grapevine have to be available, 2) each access
    check takes more time, and 3) these checks load the Grapevine
    servers. To overcome these, the file server caches access check
    results, which expire after 12 hours. Despite this change, access
    checks were still slow. The problem turned out to be nested groups.
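
    A rough sketch of the two pieces mentioned above: a membership check
    that has to walk nested groups, and a cache in front of it whose
    entries expire after 12 hours. The group table, the AccessCache class,
    and the injectable clock are assumptions made for illustration, not the
    file server's actual design.

```python
import time

TTL_SECONDS = 12 * 60 * 60  # 12-hour expiry, as described in the review

# Hypothetical nested groups: a group may list users or other groups.
GROUPS = {
    "project-readers": ["alice", "team-a"],
    "team-a": ["bob", "team-b"],
    "team-b": ["carol"],
}


def is_member(user, group, seen=None):
    """Walk nested groups; each level stands in for another lookup."""
    seen = set() if seen is None else seen
    if group in seen:
        return False  # already visited, avoids looping on cyclic groups
    seen.add(group)
    for member in GROUPS.get(group, []):
        if member == user:
            return True
        if member in GROUPS and is_member(user, member, seen):
            return True
    return False


class AccessCache:
    """Caches (user, group) results until their expiry time."""

    def __init__(self, lookup, ttl=TTL_SECONDS, clock=time.time):
        self._lookup = lookup
        self._ttl = ttl
        self._clock = clock
        self._entries = {}
        self.lookups = 0  # number of times the slow lookup actually ran

    def check(self, user, group):
        now = self._clock()
        hit = self._entries.get((user, group))
        if hit is not None and now < hit[1]:
            return hit[0]
        self.lookups += 1
        result = self._lookup(user, group)
        self._entries[(user, group)] = (result, now + self._ttl)
        return result
```

    The cache only helps on repeat checks; the first check for a user
    still pays for every level of nesting, which is consistent with the
    slowness described above.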

    Whenever usage deviated from the design assumptions, the system ran
    into problems. For example, allowing remote reading of mail without
    automatic deletion created problems (the system anticipated that
    messages would be read sequentially and then deleted).

    In terms of manageability, distributed systems need to provide tools for
    operators (who are not necessarily highly knowledgeable about the
    system) to be able to perform administrative tasks. Grapevine allowed
    for this in most cases, and only required expert intervention in fringe
    cases.

    In terms of reliability, the authors state that lack of resources
    (e.g. CPU cycles, memory, disk space, network bandwidth) is a primary
    cause of failures (other than controllable factors such as
    software/hardware bugs, and communication failures). Grapevine
    attempts to control resource usage by limiting the number of
    connections a server can accept. It is not clear to me whether this
    was a static (preset) limit or a dynamic one (depending on actual
    resource usage). Computations already in progress are not aborted,
    though, even if resources fall below the threshold. Grapevine uses
    redundancy to ensure there is no single point of failure. Redundancy
    also helps in coping with load spikes and reduces the chance of
    system failure due to lack of resources.
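
    The connection limit described above can be sketched as simple
    admission control. Since it is unclear whether the real limit was
    static or dynamic, this sketch assumes a preset limit that can also be
    changed at run time; the ConnectionLimiter class and its method names
    are hypothetical.

```python
class ConnectionLimiter:
    """Refuses new connections once the limit is reached; connections
    that were already accepted are never aborted."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.active = 0

    def try_accept(self):
        """Return True and count the connection if there is room."""
        if self.active >= self.max_connections:
            return False
        self.active += 1
        return True

    def release(self):
        """Call when an accepted connection finishes."""
        if self.active > 0:
            self.active -= 1
```

    Lowering max_connections while connections are active only affects
    later try_accept calls, which mirrors the point that work in progress
    is allowed to finish.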

    The authors are wary of trying out the proposed changes since the
    system is already in use. This highlights the difficulty of changing a
    production system, especially a distributed one. Of course, the other
    factor cited, their having forgotten the implementation, is probably
    not as excusable.

