From: Raz Mathias (razvanma_at_exchange.microsoft.com)
Date: Wed Feb 25 2004 - 17:39:15 PST
This paper presents several design goals and algorithms employed in
implementing an Internet-scalable system that manages "lukewarm"
volatile data (e.g., mail systems, in contrast with stock-trading
systems, which have hot, constantly changing data, and proxies or
search engines, which serve read-only data). Its most valuable
contribution is its ability to almost completely decouple system
management from storage (separating system management from
availability/performance) at the expense of somewhat weakened
consistency in the presence of failures. This decoupling allows the
system itself to maintain performance under increased load and to
"heal" itself in response to hardware failures.
The paper interestingly begins by defining scalability as a function of
management (!) as well as availability and performance. Scalability is
often defined as the function mapping from load to latency. I think
that the inclusion of management in the definition is critical to the
system's ability to scale; performance results naturally from real-time
self-reconfiguration. As we've seen in the "Cluster-Based Scalable
Network Services" paper, centralized management creates unnecessary
sharing in the system, which limits scalability. The obvious problem,
then, is how to distribute management without compromising
performance. The solution presented in the paper is to define the
shared management state as "soft," calculated data that is maintained
by all nodes in the cluster (soft state was introduced in the last
paper we read). Any node brought into the system can easily
recalculate the full management state. The only hard state in the
system is the set of mail messages and the user profile database
(user names, passwords, etc.), which are replicated and fragmented
across the system to provide graceful degradation in the presence of
failures.
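To make the soft/hard split concrete, here is a minimal Python sketch
of my own (the class, the on-disk layout, and the file-naming scheme
are invented for illustration and are not taken from the paper). It
shows how a node can rebuild its soft, per-user index purely by
scanning its durable local mail store:

    import os

    class Node:
        """One cluster node: hard state on disk, soft state derived."""

        def __init__(self, node_id, spool_dir):
            self.node_id = node_id
            self.spool_dir = spool_dir  # hard state: mail fragments on disk
            self.fragment_index = {}    # soft state: user -> fragment files

        def rebuild_soft_state(self):
            """Recompute the per-user fragment index from local disk.

            Because the index is derivable from the durable mail store,
            a freshly (re)booted node reconstructs it without consulting
            any central authority.
            """
            self.fragment_index.clear()
            for fname in os.listdir(self.spool_dir):
                user = fname.split('.', 1)[0]  # "alice.frag3" -> "alice"
                self.fragment_index.setdefault(user, []).append(fname)
            return self.fragment_index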
Self-management allows the system to reconfigure itself in response to
node failures, the detection of additional nodes, and the presence of
poor load-balancing. The central piece of management data in the system
is the "user map", a hash mapping between the user name and machine
responsible for managing that user's mail store fragments. When a new
node is introduced, the system can evenly redistribute the load of
management by negotiating and converging to a new user map. When a
node detects the presence of new machines, it becomes the coordinator
in the Three Round Membership Protocol, which is used to obtain the
complete set of nodes in the cluster. A new user map is computed on
every node, and the deltas between the old and new maps are derived.
These deltas are
then used by the nodes to direct the information transfer of message
store fragments in the system. To be effective, the delta between the
old map and the new one needs to be large enough to distribute load
evenly, yet small enough not to overburden the underlying network
(i.e., some amount of node affinity must be maintained for
scalability). This property of the hashing function (more a policy
issue than a mechanism) is one that, I believe, needs to be studied
further and in greater detail. It would be nice to come up with a
hash function that accounts for changes in factors such as network
bandwidth, latency, machine availability, and disk storage before and
after a new node is added.
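Here is a small Python sketch of my own illustrating that tradeoff.
The bucketed map, the hash, and the rebalance policy are assumptions
made for illustration rather than the paper's exact algorithm; the
point is that only the reassigned buckets (the delta) force fragment
transfers, while every untouched bucket preserves node affinity:

    import hashlib

    NUM_BUCKETS = 16

    def bucket_for(user):
        """Hash a user to a fixed bucket; only bucket ownership moves."""
        return hashlib.md5(user.encode()).digest()[0] % NUM_BUCKETS

    def rebalance(old_map, nodes):
        """Move just enough buckets that each node owns about
        NUM_BUCKETS / len(nodes); everything else stays put."""
        target = NUM_BUCKETS // len(nodes)
        owned = {n: [b for b, o in old_map.items() if o == n]
                 for n in nodes}
        spare = [b for n in nodes for b in owned[n][target:]]  # over quota
        new_map = dict(old_map)
        for n in nodes:
            while len(owned[n]) < target and spare:
                b = spare.pop()
                new_map[b] = n      # reassigned: this bucket's fragments move
                owned[n].append(b)
        return new_map

    old = {b: ["n0", "n1", "n2"][b % 3] for b in range(NUM_BUCKETS)}
    new = rebalance(old, ["n0", "n1", "n2", "n3"])  # node n3 joins
    delta = [b for b in old if old[b] != new[b]]
    print(f"moved {len(delta)} of {NUM_BUCKETS} buckets")  # -> 4 of 16
    print("alice is handled by", new[bucket_for("alice")])

With a fourth node joining, only a quarter of the buckets move, which
is the minimum needed to even out the load.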
In my opinion, the randomized method of fragmentation employed in
this system is a great idea. Instead of requiring that "some of the
users
get none of the data some of the time," we have "more of the users don't
get all their data some of the time." I believe that this is exactly
the right kind of thinking. This critical tradeoff allows the system to
degrade gracefully without leaving any particular user without service.
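A toy Python simulation of my own (the node count, fragment count,
and placement policy are all invented for illustration) shows how
random spread changes who pays for a failure:

    import random

    NODES = [f"n{i}" for i in range(8)]
    USERS = [f"user{i}" for i in range(1000)]

    # Deterministic partitioning: all of a user's mail on one node.
    deterministic = {u: {NODES[hash(u) % len(NODES)]} for u in USERS}

    # Randomized fragmentation: each user's mail spread over 4 nodes.
    randomized = {u: set(random.Random(u).sample(NODES, 4))
                  for u in USERS}

    failed = "n3"  # simulate a single node failure
    det_all = sum(1 for ns in deterministic.values() if ns == {failed})
    rnd_some = sum(1 for ns in randomized.values() if failed in ns)
    rnd_all = sum(1 for ns in randomized.values() if ns == {failed})

    print(f"deterministic: {det_all} users lose ALL their mail")
    print(f"randomized: {rnd_some} users lose some, {rnd_all} lose all")

Under deterministic placement, the failure wipes out roughly an
eighth of the users entirely; under randomized fragmentation, about
half the users lose a little mail and nobody loses everything.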
I personally really enjoyed reading this paper, as it presented some
convincing scenarios and tradeoffs (in addition to core algorithms) that
would be valuable in today's mail systems. The innovative use of soft
state in bringing up management information makes this system very
appealing.