From: Ian King (iking_at_killthewabbit.org)
Date: Wed Feb 25 2004 - 17:14:39 PST
The authors describe a clustered system designed to provide Internet mail
service to a large user population. Primary goals were scalability, fault
tolerance, and cost effectiveness; other goals, such as strong consistency and
uninterrupted availability of every message, were subordinate, though not
ignored. The authors
point out that the system described here can be utilized for other types of
Internet services such as netnews, but distinguish their implementation from
web-server clusters by the pattern of data usage: web service implies primarily
read access, while email systems are write-intensive.
A distinguishing factor of Porcupine is that any machine (with minimal
restrictions) can serve any function in the overall cluster at any given time.
(The chief limitation is that a node receiving SMTP traffic must be identified
to Internet clients.) Load balancing is fully distributed: any node with work
to hand off consults a periodically updated (and therefore possibly stale)
table of every node's current load, and chooses a recipient accordingly. A
clever feature reins in otherwise unconstrained distribution of "mailbox
fragments": spreading a user's fragments too widely would degrade the system,
since assembling the user's mail from all of them would generate too much
network traffic. Instead, a "loose" upper limit is placed on the number of
fragments per user; that limit can be violated in extreme cases, such as a
full disk or saturated CPU on every candidate node.
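To make the selection concrete, here is a minimal Python sketch of how a
delivery agent might pick a target from a stale load table while honoring the
loose spread limit. The names and threshold are mine, not the paper's, and
Porcupine's real scheduler considers more than this:

    SPREAD_LIMIT = 2  # illustrative bound on mailbox fragments per user

    def choose_delivery_node(load_table, fragment_nodes):
        """load_table maps node -> load estimate (periodically exchanged,
        so possibly stale); fragment_nodes are the nodes already storing
        one of this user's mailbox fragments."""
        usable = [n for n in fragment_nodes if not overloaded(n, load_table)]
        if usable:
            # Prefer an existing fragment holder to avoid further spread.
            return min(usable, key=lambda n: load_table[n])
        # Every holder is full or saturated: fall back to the least-loaded
        # node in the cluster, violating the "loose" limit if the user is
        # already at SPREAD_LIMIT fragments.
        return min(load_table, key=load_table.get)

    def overloaded(node, load_table):
        return load_table.get(node, float("inf")) > 0.9  # arbitrary cutoff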
Many large mail systems make use of static partitioning of the user population
to distribute the load. Two problems with this are that usage is not uniform
across a user population (they'd hate to maintain my inbox), and that scaling
the system means subdividing a partition, i.e. migrating some portion of users'
data between machines. Porcupine's dynamic balancing obviates both by
distributing mail in relatively small, nearly arbitrary units across the
system's nodes.
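The contrast is easy to sketch. Under static partitioning, the user-to-node
assignment is fixed and growth means migration; Porcupine's replicated,
soft-state user map keeps the assignment fine-grained and recomputable.
Roughly (constants and names are mine):

    import hashlib

    NUM_BUCKETS = 256  # illustrative; not the paper's figure

    def bucket(user):
        # Stable hash, so every node computes the same bucket for a user.
        digest = hashlib.md5(user.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

    def manager_for(user, user_map):
        # user_map[bucket] names the node currently managing that slice of
        # users; it is soft state, replicated everywhere, and simply
        # recomputed on membership change -- no mailbox data has to move
        # just because the mapping did.
        return user_map[bucket(user)]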
Porcupine makes use of both soft state and relaxed consistency requirements to
simplify structure and speed common behaviors. It further allows for certain
benign "errors", such as duplicate messages or temporary unavailability of some
subset of a user's messages. Replication serves two purposes: to ensure, with
reasonable confidence, that no message is truly lost until the user deletes
it, and to balance the load of mailbox reads. As in other systems, loss of
soft state is
corrected either by replication from another node or examination of relevant
"hard state". One particularly important feature is that any replica can serve
as a replication source; generational issues are resolved through use of
timestamps. This avoids placing the onus of replication on any single node,
along with the risk that the one designated node is itself the one that has
failed and is therefore unable to drive its own recovery.
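The timestamp rule is simple enough to sketch; I'm assuming each replicated
object carries a totally ordered stamp, with the type and names below my own:

    from dataclasses import dataclass

    @dataclass
    class Replica:
        data: bytes
        stamp: tuple  # (wall_clock, node_id); node_id breaks clock ties

    def reconcile(local, remote):
        # Any peer can serve as the replication source: whichever copy
        # carries the newer stamp wins, so no single node has to be alive
        # to drive another's recovery.
        return remote if remote.stamp > local.stamp else local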
The paper goes into considerable detail regarding the consistency protocols for
messages and system state. Messages and user information are viewed as "hard
state", which cannot be lost, while most other state is soft. Certain
structures are replicated for fault tolerance, others for performance (such as
user maps).
In simulated-load studies, the researchers demonstrated that Porcupine performs
at least as well as a statically partitioned system and, under certain
assumptions, outperforms it. Performance also scales linearly with additional
nodes, up to the limits of local network bandwidth. Nodes become disk-bound
before becoming CPU-bound; the theoretical maximum performance was
demonstrated by assuming "infinitely fast" disks, simulated by discarding
message bodies and retaining only their digests in memory.
Adding a machine, whether for recovery or for scaling, is handled by the same
mechanism: repopulating the soft state of a new machine running the Porcupine
software; no administrator interaction is required. Notably, the latency of
these events was in the tens of seconds.
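A sketch of what that repopulation might look like, with hypothetical helpers
standing in for the paper's membership protocol:

    def on_membership_change(live_nodes, node):
        # One path covers both node addition and recovery: recompute the
        # soft-state user map from the new membership, then tell each
        # bucket's new manager which users this node holds fragments for.
        # Hard state on disk is never touched.
        user_map = assign_buckets(live_nodes)          # hypothetical helper
        for user in users_with_local_fragments(node):  # scans on-disk fragments
            manager = user_map[bucket(user)]           # bucket() as sketched above
            notify(manager, user, node.id)             # hypothetical RPC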