From: Richard Jackson (richja_at_expedia.com)
Date: Wed Mar 03 2004 - 12:57:28 PST
This 2001 SOSP paper by Rowstron and Druschel describes PAST, a
peer-to-peer (P2P) storage utility built as an overlay network that can
reliably distribute files across a network of unreliable nodes. The
paper focuses on the storage management and caching issues, but some
general background is provided as well.
The paper is organized into these sections: 1) overview of PAST, 2)
Pastry routing mechanism, 3) storage management, 4) caching, 5)
experimental results.
The first section describes PAST. Simply put, it is a P2P network of
nodes that allows files to be stored in a reliable, persistent way,
even though the underlying nodes may enter and leave the network at any
time. The paper describes this as an "overlay network", meaning that it
is built on top of existing Internet infrastructure and protocols. A
client of PAST deals with only a single node, and can issue three
commands: Insert, Lookup, and Reclaim. The Insert command adds a file
to the PAST network. The Lookup command finds a file on the PAST
network and returns it to the client. The Reclaim command attempts to
delete/invalidate copies of the file stored in PAST, but it is
unreliable by design: it frees the storage without guaranteeing that
the file is no longer retrievable. All operations are performed at file
granularity, unlike CFS, which deals with file chunks or blocks. Each
file is identified by a unique fileID, computed as the SHA-1 hash of
the file name, the owner's public key, and a random salt.
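To make the fileID scheme concrete, here is a minimal Python sketch.
The input fields (file name, owner's public key, random salt) come from
the paper; the field order and the 20-byte salt size are my
assumptions:

    import hashlib
    import os

    def make_file_id(file_name: bytes, owner_public_key: bytes) -> bytes:
        # The random salt gives the client a fresh fileID on
        # re-insertion, which matters later for file diversion.
        salt = os.urandom(20)  # salt size is my assumption
        h = hashlib.sha1()
        h.update(file_name)
        h.update(owner_public_key)
        h.update(salt)
        return h.digest()  # 160-bit fileID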
Pastry is the routing infrastructure that PAST uses to locate and
insert files. The concept is that each node and each file has a unique
key based on a hash, which is assumed to be uniformly distributed. A
file is stored on the nodes whose 128-bit nodeIDs are numerically
closest to the 128 msbs of its fileID, and Pastry routes a request, hop
by hop, to a known node whose key shares a longer prefix with (or is
numerically closer to) the destination key. A fairly complex routing
table is maintained on each Pastry node to support this. The routing
state consists of 1) a routing table - multiple levels of metadata
about node locations, organized by shared key prefix, 2) a leaf set - a
list of the closest nodes according to numeric key comparisons, 3) a
neighborhood set - a list of the closest nodes according to some
proximity metric such as latency. The first two sets of data are used
for routing; the third is used for node insertions/deletions. The
concept of a proximity metric seemed weak to me.
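To illustrate one forwarding step, here is a simplified sketch. It
assumes hex-digit keys (i.e., b=4) and a routing table keyed by (shared
prefix length, next digit); these are my simplifications, and the real
Pastry fallback rules are more involved:

    def shared_prefix_len(a: str, b: str) -> int:
        # Count the leading hex digits two IDs have in common.
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def next_hop(key: str, node_id: str,
                 routing_table: dict, leaf_set: list) -> str:
        if key == node_id:
            return node_id  # we are the destination
        ids = leaf_set + [node_id]
        lo = min(int(n, 16) for n in ids)
        hi = max(int(n, 16) for n in ids)
        if lo <= int(key, 16) <= hi:
            # Key falls in the leaf set's range: deliver to the
            # numerically closest node we know there.
            return min(ids, key=lambda n: abs(int(n, 16) - int(key, 16)))
        # Otherwise use the routing table: jump to a node whose ID
        # shares at least one more prefix digit with the key.
        p = shared_prefix_len(key, node_id)
        entry = routing_table.get((p, key[p]))
        if entry is not None:
            return entry
        # Fallback (rare): forward to any known node numerically
        # closer to the key than we are; omitted in this sketch.
        return node_id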
Next, storage management is described. The main design goal in this
area is to balance storage among nodes as evenly as possible. To do
this, they created two strategies: 1) replica diversion, 2) file
diversion. #1 means that, when a file cannot be stored on the
appropriate node, it is stored instead on another node within that
node's leaf set, and the original node keeps a pointer to this new
node. This way, a leaf set can keep its storage balanced fairly well.
When a leaf set is entirely full, #2 is used to divert the file: the
client must reinsert the file with a new salt, producing a new fileID,
in hopes of hitting a destination node (and leaf set) that can
accommodate the file. The process becomes somewhat trial-and-error once
most nodes are full. Another key factor is that the k nodes with
nodeIDs closest to the fileID must each be able to store a copy of the
file, or else the insert fails. This value k is the key invariant that
PAST uses to maintain file replicas in case of node failure.
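Putting the two strategies together, the insert-side flow might look
roughly like the sketch below. Node, insert_replicas, and the leaf-set
scan are invented names for illustration, not the paper's interfaces:

    class Node:
        # Toy stand-in for a PAST node; the attribute names are mine.
        def __init__(self, node_id: str, capacity: int):
            self.node_id = node_id
            self.capacity = capacity
            self.files = {}      # fileID -> contents held locally
            self.pointers = {}   # fileID -> Node holding a diverted replica
            self.leaf_set = []   # nodes with numerically close nodeIDs

        def free_space(self) -> int:
            return self.capacity - sum(len(c) for c in self.files.values())

    def insert_replicas(file_id: str, contents: bytes,
                        replica_set: list, k: int) -> bool:
        # Try to place k replicas on the k closest nodes, falling back
        # to replica diversion when an individual node is full.
        stored = 0
        for node in replica_set[:k]:
            if node.free_space() >= len(contents):
                node.files[file_id] = contents
                stored += 1
                continue
            # Replica diversion: a full node asks a leaf-set neighbor
            # to hold the replica, and keeps only a pointer to it.
            target = next((n for n in node.leaf_set
                           if n.free_space() >= len(contents)), None)
            if target is not None:
                target.files[file_id] = contents
                node.pointers[file_id] = target
                stored += 1
        # If fewer than k replicas fit, the caller falls back to file
        # diversion: pick a new salt, recompute the fileID, and retry.
        return stored == k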
In the next section, caching is briefly discussed. The main point is
that once a node helps route a file during a lookup or insert, it may
also save that file locally, in otherwise unused disk space, for future
use. In this way, future queries can be satisfied from the cache
instead of being forwarded to other nodes. Two cache replacement
policies are discussed - 1) GreedyDual-Size (GD-S), 2) Least Recently
Used (LRU). The results show that GD-S performs slightly better, but
the algorithm seems more complex.
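For reference, here is a minimal sketch of the GreedyDual-Size idea,
assuming a uniform cost of 1 per file (my simplification; other cost
functions can be plugged in). Each cached file gets a value
H = L + cost/size, the file with the smallest H is evicted, and the
inflation value L is raised to that H so rarely used entries age out:

    class GreedyDualSizeCache:
        def __init__(self, capacity: int):
            self.capacity = capacity
            self.used = 0
            self.entries = {}     # file_id -> (H value, size)
            self.inflation = 0.0  # L: rises as victims are evicted

        def _h(self, size: int, cost: float = 1.0) -> float:
            return self.inflation + cost / size

        def access(self, file_id) -> bool:
            if file_id in self.entries:
                _, size = self.entries[file_id]
                # On a hit, refresh H to the current inflation level.
                self.entries[file_id] = (self._h(size), size)
                return True
            return False

        def add(self, file_id, size: int) -> None:
            if size > self.capacity:
                return  # too large to cache at all
            while self.used + size > self.capacity:
                # Evict the entry with the smallest H and raise L.
                victim = min(self.entries,
                             key=lambda f: self.entries[f][0])
                h, vsize = self.entries.pop(victim)
                self.inflation = h
                self.used -= vsize
            self.entries[file_id] = (self._h(size), size)
            self.used += size

With a uniform cost, GD-S preferentially keeps small, frequently
accessed files, which plausibly explains its edge over LRU when cached
files vary widely in size.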
The last section gives experimental results. Each portion of the system
was tested to demonstrate performance under load. A simulator was used
to model 2250 nodes, and two separate, large datasets were used as
input. The tests focused on replication, caching, and file insertion
failures. Overall, each case performed as expected and the results were
meaningful and intuitive. They showed that a well-tuned PAST system can
achieve global storage utilization of approximately 98% while failing
fewer than 5% of inserts.
Overall, this system seems to be well designed, and the designers have
thought through many difficult cases to make the system as effective as
possible. However, the system is quite complicated with respect to the
utilization policies and the required diversion operations. I wonder if
the system could be simplified while still retaining its strong
utilization properties. Also, their assumptions about machine growth
rates (two orders of magnitude) seemed naive. I don't think their
system needed this assumption, and they should not have relied on it.