Review: Saito et al., Manageability, Availability and Performance in Porcupine: A Highly Scalable, Cluster-based Mail Service

From: Richard Jackson (richja_at_expedia.com)
Date: Wed Mar 10 2004 - 13:35:48 PST


This 1999 SOSP paper by Saito, Bershad and Levy discusses Porcupine,
which is a cluster-based mail service that runs on commodity PC nodes.
 
The paper is divided into these sections: 1) overview of Porcupine, 2)
self-management capabilities, 3) replication/availability, 4) evaluation
and testing.
 
Porcupine is a scheme for building a cluster of servers that work
together to provide a service. These servers are just commodity PCs
running Linux. The main service discussed is an Internet mail service;
however, the system could be used for various other things.
Some other examples might include: a file server, a calendar service,
and a bulletin-board service. A key property of this system is that it
supports a high frequency of writes. The key design goal is
scalability, which is broken down into three requirements: 1)
manageability - the system must be easy to operate and self-healing, 2)
availability - partial failures should be barely noticeable, 3)
performance - the system should support millions of messages per day.

Regarding the architecture, every Porcupine node runs the same
(homogeneous) software; that is, each node can serve any purpose within
the system. This differs from the SNS
paper, which allocated certain nodes for certain duties. Within each
Porcupine node, there are two kinds of state: hard state and soft state.
The hard state is state that must be strongly persisted, while soft
state can be reconstructed on an as-needed basis from hard state.
Examples of hard state include: mailbox fragments, user profile
database. Examples of soft state include: mailbox fragment list, user
profile soft state, user map, cluster membership list. Some of these
data structures are also replicated among all nodes. These include:
user profile database, user map, cluster membership list. Various
manager services run inside Porcupine to maintain these data
structures. Overall, the large list of data structures
makes this system seem very complex. However, the basic concept is that
each node keeps only a portion of a user's email data (mailbox
fragments), and the data must be collected from multiple nodes
(indicated by the mailbox fragment list) to view the entire mailbox.

For mail delivery and retrieval, a single Porcupine node acts as a
proxy for the user and routes requests to the appropriate other
Porcupine nodes, based on the aforementioned data structures. The paper
assumes that some load-balancing scheme would exist above this proxy
layer, ensuring that no single node is overloaded by proxy duties. This
seems like a weakness of the system, as such a scheme may not balance
the proxy load evenly among all Porcupine nodes.
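
To make this path more concrete, here is a minimal sketch, in Python,
of how a node acting as the proxy might route delivery and retrieval
using the user map and the mailbox fragment list. All of the names
(user_map, fragment_list, mail_store, deliver, retrieve) are invented
for this example rather than taken from the paper, and the real system
stores fragments on disk and makes a load-aware choice of target node.

    import hashlib

    NODES = ["node0", "node1", "node2", "node3"]  # cluster membership (soft state)
    fragment_list = {}  # user -> set of nodes holding the user's mailbox fragments (soft state)
    mail_store = {}     # (node, user) -> list of messages; stands in for on-disk hard state

    def user_map(user):
        """Soft state: hash each user to the node that manages that user's
        profile soft state and mailbox fragment list."""
        bucket = int(hashlib.md5(user.encode()).hexdigest(), 16)
        return NODES[bucket % len(NODES)]

    def deliver(user, message):
        """Delivery path: the proxy consults the user map, learns which nodes
        already hold fragments for this user, picks one, and appends the
        message there (a hard-state write on the chosen node)."""
        manager = user_map(user)   # node that would be asked for the fragment list (unused in this toy)
        candidates = fragment_list.get(user) or set(NODES)
        target = min(candidates)   # placeholder for a load-aware choice
        fragment_list.setdefault(user, set()).add(target)
        mail_store.setdefault((target, user), []).append(message)

    def retrieve(user):
        """Retrieval path: collect fragments from every node named in the
        user's mailbox fragment list to rebuild the whole mailbox."""
        nodes = fragment_list.get(user, set())
        return [m for n in sorted(nodes) for m in mail_store.get((n, user), [])]

    deliver("alice", "first message")
    deliver("alice", "second message")
    print(retrieve("alice"))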
 
Regarding self-management, Porcupine makes various efforts to do this.
An extensive protocol exists for maintaining membership state: a
three-round membership protocol that relies on broadcast
communication. Any node can act as a leader, and consensus is
determined based on replies from peers. A timeout is used to prevent
infinite waiting. This seemed to be a variant of the Paxos consensus
protocol. The protocol also requires synchronized clocks, which seemed
like a weak point in this part of the design. The user map is also
maintained and distributed using the last message in this three-round
protocol.
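
To illustrate the shape of this protocol, here is a rough sketch in
Python. The names (run_membership_round, ACK_TIMEOUT, the send/recv
hooks and the stub transport) are invented for this example, and
details such as concurrent coordinators and node rejoin are omitted.

    import time

    ACK_TIMEOUT = 2.0   # bound the wait so a crashed node cannot stall the protocol

    def run_membership_round(coordinator, peers, send, recv):
        """Sketch of a three-round membership protocol.  Any node may act as
        the coordinator; send(node, msg) and recv(timeout) are transport hooks."""
        # Round 1: propose a new epoch, identified by the coordinator's clock
        # and name, via broadcast to all known peers.
        epoch = (time.time(), coordinator)
        for p in peers:
            send(p, ("NEW_EPOCH", epoch))

        # Round 2: collect acknowledgements from live peers until the timeout.
        alive = {coordinator}
        deadline = time.time() + ACK_TIMEOUT
        while time.time() < deadline:
            reply = recv(timeout=deadline - time.time())
            if reply is None:
                break
            sender, msg = reply
            if msg == ("ACK", epoch):
                alive.add(sender)

        # Round 3: commit the membership list and, in the same final message,
        # distribute the recomputed user map.
        user_map = {b: sorted(alive)[b % len(alive)] for b in range(8)}
        for p in alive - {coordinator}:
            send(p, ("MEMBERSHIP", epoch, sorted(alive), user_map))
        return sorted(alive), user_map

    # Stub transport in which every peer immediately acknowledges.
    outbox = []
    def send(node, msg):
        outbox.append((node, msg))
    def recv(timeout):
        for node, msg in list(outbox):
            if msg[0] == "NEW_EPOCH":
                outbox.remove((node, msg))
                return (node, ("ACK", msg[1]))
        return None

    print(run_membership_round("node0", ["node1", "node2"], send, recv))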
 
Replication is used to ensure that a node failure does not destroy all
copies of a user's stored data. By replicating to other nodes,
Porcupine achieves higher availability. The semantics for replication
are weak, with the following properties: 1) update anywhere, 2)
eventual consistency, 3) total update, 4) lock free, 5) ordering by
loosely synchronized clocks. These semantics are similar to the BASE
semantics in the SNS paper. A key property of the replication is called
"spread," which defines how many nodes receive the replicated data. It
is a good idea to minimize spread, as too much replication adds cost.
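
As a rough illustration of these semantics, here is a small Python
sketch of "update anywhere" with ordering by loosely synchronized
clocks. The class and method names are invented for this example; the
real system logs each update and retries pushes to unreachable peers
until they acknowledge, which is omitted here.

    import time

    SPREAD = 2   # how many nodes hold a copy of each object

    class Replica:
        def __init__(self, name):
            self.name = name
            self.peers = []    # other replicas in this object's replica set
            self.store = {}    # key -> (timestamp, value)

        def local_update(self, key, value):
            """Update anywhere: any replica may accept a write.  The timestamp
            pairs a loosely synchronized clock with the node name to break ties."""
            stamp = (time.time(), self.name)
            self.apply_update(key, stamp, value)
            for peer in self.peers[:SPREAD - 1]:   # push to the rest of the replica set
                peer.apply_update(key, stamp, value)

        def apply_update(self, key, stamp, value):
            """Total update, lock free: the whole object is replaced if and only
            if the incoming stamp is newer, so replicas converge eventually."""
            current = self.store.get(key)
            if current is None or stamp > current[0]:
                self.store[key] = (stamp, value)

    a, b = Replica("node-a"), Replica("node-b")
    a.peers, b.peers = [b], [a]
    a.local_update("alice/msg1", "hello")      # accepted at a, pushed to b
    b.local_update("alice/msg1", "hello v2")   # the later write wins everywhere
    print(a.store["alice/msg1"][1], b.store["alice/msg1"][1])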
 
Finally, the system is evaluated. Various comparisons are made to show
how the system performs, both with and without replication. It is also
compared to a naive implementation of a standard sendmail+popd setup.
The results are fairly intuitive: 1) replication adds a
significant cost to the system, 2) Porcupine with replication scales
about as well as sendmail+popd, 3) Porcupine without replication scales
much better than sendmail+popd, 4) Porcupine can adapt well to the
addition of a new node or even a single disk, 5) the reconfiguration
properties work well.
 
Overall, the Porcupine system seems like a great idea. The important
properties of this system include: differentiation between soft/hard
state, automatic load balancing based on outstanding IO requests, and
ease of manageability. The only parameter that needs manual tuning is
the spread (a sketch of how spread and load balancing interact appears
after this paragraph). Compared to sendmail+popd, this system seems
much better.
A key limitation of Porcupine is that every node must be able to
communicate with every other node, resulting in communication
inefficiencies as the system
scales. This limitation does not prevent the system from performing
well. The evaluations show that 62 nodes would support about 562
million messages per day, before the network becomes the bottleneck.
This is impressive.
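
The load-balancing idea mentioned above is simple enough to sketch. In
the Python sketch below (the pending_io counters and
choose_delivery_node are invented names, and the policy is only an
approximation of the paper's), the proxy prefers nodes that already
hold fragments for the user, widens the candidate set only up to the
spread limit, and then picks the candidate with the fewest outstanding
I/O requests.

    SPREAD = 2

    # Hypothetical per-node count of outstanding disk and network I/O requests.
    pending_io = {"node0": 14, "node1": 3, "node2": 9, "node3": 5}

    def choose_delivery_node(user_fragment_nodes, all_nodes, spread=SPREAD):
        """Prefer nodes already storing fragments for this user (to limit
        spread); add the least-loaded other nodes only while the candidate
        set is smaller than `spread`, then pick the least-loaded candidate."""
        candidates = set(user_fragment_nodes)
        for node in sorted(all_nodes, key=pending_io.get):
            if len(candidates) >= spread:
                break
            candidates.add(node)
        return min(candidates, key=pending_io.get)

    # alice has a fragment only on node2; node1 joins the candidate set as
    # the least-loaded other node, and wins with only 3 pending requests.
    print(choose_delivery_node({"node2"}, pending_io.keys()))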
 


