From: Greg Green (ggreen_at_cs.washington.edu)
Date: Tue Feb 24 2004 - 22:19:25 PST
This paper describes the implementation of a highly available and
scalable Internet mail service on a cluster of commodity PCs. The
design requirements were: easy management through self-configuration
and self-healing; linear scaling, so that nodes or disks can be added
to improve throughput; high availability, so that good service is
provided to all users even if part of the cluster is down; and
finally, performance comparable to existing systems while scaling to
hundreds of machines.
The system consists of several services that run on each machine in
the cluster: a membership manager, a user profile manager, a delivery
proxy, a retrieval proxy, and a replication manager. The mail for each
user is stored in mailbox fragments distributed across the
cluster. Each node has a user map that maps users to the nodes
responsible for their profiles and mailbox fragments. A user profile
database keeps per-user client information and is relatively
static. There is also a cluster membership list on each node.
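To make the user map concrete for myself, here is a rough sketch in
Python of how a hash-based map like this could assign users to manager
nodes. The class name, bucket count, and node names are all invented
for illustration; this is not code from the paper.

    import hashlib

    class UserMap:
        """Maps a user name to the node that manages that user's profile
        and mailbox fragment list. Every node holds an identical copy of
        the bucket-to-node table."""

        def __init__(self, nodes, num_buckets=256):
            self.num_buckets = num_buckets
            # Assign hash buckets to live nodes round-robin; the real
            # system recomputes this table whenever membership changes.
            self.buckets = [nodes[i % len(nodes)] for i in range(num_buckets)]

        def manager_for(self, user):
            # Hash the user name into a bucket, then look up its node.
            h = int(hashlib.md5(user.encode()).hexdigest(), 16)
            return self.buckets[h % self.num_buckets]

    umap = UserMap(["node-a", "node-b", "node-c"])
    print(umap.manager_for("greg"))    # prints one of the three node names

Because every node has the same table, any node can answer "who manages
this user" without contacting a central server.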
Mail coming into the cluster is handled by a delivery proxy. The proxy
retrieves the mailbox fragment list from the user's manager node and
then chooses a node to deliver the mail to. Mail retrieval works in a
similar way.
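Sketching the delivery path the same way (the load table, fragment
table, and node names below are my own stand-ins; the real proxy
speaks SMTP over RPCs and applies more policy than this):

    import hashlib

    NODES = ["node-a", "node-b", "node-c"]
    NODE_LOAD = {"node-a": 3, "node-b": 1, "node-c": 2}   # pretend load figures
    FRAGMENTS = {"greg": ["node-a", "node-c"]}            # user -> nodes holding a fragment

    def manager_for(user):
        # Same idea as the user map above: hash the name to a node.
        h = int(hashlib.md5(user.encode()).hexdigest(), 16)
        return NODES[h % len(NODES)]

    def deliver(user, message):
        manager = manager_for(user)        # real system: RPC to the manager node
        fragments = FRAGMENTS.get(user, [])
        if fragments:
            # Prefer a node already holding one of this user's fragments,
            # breaking ties by load, so mail does not spread too widely.
            target = min(fragments, key=lambda n: NODE_LOAD[n])
        else:
            # First message for this user: pick the least loaded node.
            target = min(NODES, key=lambda n: NODE_LOAD[n])
        print("delivering mail for %s via manager %s to %s" % (user, manager, target))

    deliver("greg", "hello world")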
Cluster management is done by a membership protocol. When a node
discovers that another node is down, or that a new node has joined, it
initiates a protocol in which all current cluster members reply. The
initiating node then broadcasts the new membership list and an epoch
id to all nodes. In this step, a new user map is also computed that
assigns user management to nodes. Each node then examines its state
for changes and sends the affected data to the proper nodes.
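My rough understanding of the coordinator's side of such a round, as a
sketch only; it ignores timeouts, competing coordinators, and the exact
round structure of the real protocol, and all the names are mine:

    import time

    def run_membership_round(live_nodes, old_epoch):
        # 1. Propose a new epoch; in the real protocol every live node
        #    must reply before the round completes.
        new_epoch = (old_epoch + 1, time.time())
        replies = {n: "ack" for n in live_nodes}       # stand-in for collecting replies

        # 2. Broadcast the agreed membership list and the epoch id.
        membership = sorted(replies.keys())

        # 3. Recompute the user map so every bucket is owned by a live node.
        num_buckets = 16
        user_map = {b: membership[b % len(membership)] for b in range(num_buckets)}

        # 4. Each node would now compare the new map against its local
        #    state and push the profiles and mailbox fragments it holds
        #    for buckets it no longer manages to the new owners.
        return new_epoch, membership, user_map

    epoch, members, umap = run_membership_round(["node-a", "node-c"], old_epoch=7)
    print(epoch, members)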
Mail messages can be replicated by the replication manager. Replication
allows a message to be updated at any replica, so no primary mailbox
manager is needed. The algorithm uses BASE rather than ACID semantics,
so a user could see inconsistent state across the different nodes
until the replicas stabilize.
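A minimal sketch of update-anywhere replication with last-writer-wins
timestamps, which is roughly the flavor of consistency described; the
Replica class and its methods are my invention, not the paper's API:

    class Replica:
        """Each replica keeps an update log entry per object and pushes
        it to peers; the newest timestamp wins, so replicas converge
        once the pushes get through."""
        def __init__(self, name):
            self.name = name
            self.store = {}              # object id -> (timestamp, value)

        def local_update(self, obj_id, value, ts):
            self.store[obj_id] = (ts, value)
            return (obj_id, ts, value)   # log entry to push to peers

        def apply_remote(self, obj_id, ts, value):
            current = self.store.get(obj_id, (0, None))
            if ts > current[0]:          # last writer wins
                self.store[obj_id] = (ts, value)

    a, b = Replica("node-a"), Replica("node-b")
    entry = a.local_update("msg-1", "delivered", ts=100)
    # Until this push arrives, node-b shows stale (inconsistent) state.
    b.apply_remote(*entry)
    print(a.store == b.store)            # True once updates are exchanged

The window where the two replicas disagree is exactly the
inconsistency the paper accepts in exchange for availability.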
The paper has a large section on measured scaling and performance.
Performance was comparable to a similar cluster running existing mail
services with static assignment of users to nodes. With replication
enabled, performance was not quite as good. The system was shown to be
highly scalable, with throughput increasing linearly with the number
of nodes, and it also improved just from upgrading some of the nodes
with faster disks.
I really liked this paper. It was described in sufficient detail that
I had a good idea of how the system worked. Automatic management and
load balancing are obviously highly desirable traits, and this system
shows a way to deliver them, at least for Internet mail. Since I
suffer from numerous Exchange mail server outages and slowdowns, I
wish Microsoft would adopt something like this. It was interesting how
even upgrading disks on individual machines improved performance; that
seems to me a difficult thing to accomplish. I kept looking for the
downside, as papers tend to show their systems in the best possible
light and gloss over the difficulties. I was unable to spot the
drawbacks here, though I would like to know what they are, as there is
bound to be something.
-- Greg Green