From: Praveen Rao (psrao_at_windows.microsoft.com)
Date: Wed Mar 03 2004 - 17:21:06 PST
CFS is a peer-to-peer read-only storage system. It provides guarantees
for efficiency, robustness, and load-balancing of file storage and
retrieval.
While a read-only system sounds like a big limitation, if the files need
not be authored in a distributed fashion, one can easily discard the
old file and insert a changed file into the system, albeit with weak
semantics of change propagation.
The authors list the challenges for a P2P architecture:
* it should be symmetric and decentralized
* it should operate well with unmanaged volunteer participants
* finding desired data in a large system must be fast
* servers must be able to join and leave without affecting the system's
efficiency/robustness
* the system should be load balanced.
The distinguishing feature of CFS is that it distributes blocks over the
nodes as opposed to distributing whole files. CFS servers provide a
distributed hash table (DHash) for block storage. This fine granularity
gives CFS better load balancing. DHash supports pre-fetching of
blocks to decrease latency.
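To make the block granularity concrete, here is a minimal sketch of
splitting a file into blocks before handing them to a DHash-like table.
The 8 KB block size, the use of the SHA-1 of each block's contents as
its key, and the function name split_into_blocks are assumptions made
for illustration; the summary above does not specify them.

```python
import hashlib

BLOCK_SIZE = 8192  # assumed block size, chosen only for illustration


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks keyed by a hash of their contents.

    Returns (order, blocks): the list of keys in file order, and a dict
    mapping each key to its block. Keying by SHA-1 of the contents is an
    assumption of this sketch.
    """
    order = []
    blocks = {}
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        key = hashlib.sha1(chunk).hexdigest()
        blocks[key] = chunk
        order.append(key)
    return order, blocks


if __name__ == "__main__":
    order, blocks = split_into_blocks(b"x" * 20000)
    print(len(order), "blocks,", len(blocks), "distinct")
```

Because each block gets its own key, the blocks of one large file can
land on many different servers instead of a single one.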
At the heart of the system is this distributed hash table, which
determines which node a block should go to.
The core consists of two layers - DHash and Chord. The DHash layer
performs block fetches for the client, distributes the blocks among
servers, and maintains caches and replicated copies.
Chord implements a hash-like lookup that maps block identifiers to
servers. Chord assigns each server an identifier drawn from the same
160-bit identifier space as block identifiers. The mapping takes a
block's ID and yields the block's successor: the server whose ID most
closely follows the block's ID.
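The successor mapping can be sketched as follows. This is not Chord's
actual routing, which is distributed across the nodes; it assumes a
single sorted list of all server IDs is available and only shows which
server a given block ID maps to. The helper names sha1_id and successor
and the server names are hypothetical.

```python
import bisect
import hashlib


def sha1_id(data: bytes) -> int:
    """Map bytes to a point in the 160-bit identifier space."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def successor(sorted_server_ids, block_id):
    """Return the first server ID at or after block_id on the ring.

    sorted_server_ids must be sorted ascending. If block_id is larger
    than every server ID, the lookup wraps around to the smallest one.
    """
    i = bisect.bisect_left(sorted_server_ids, block_id)
    return sorted_server_ids[i % len(sorted_server_ids)]


if __name__ == "__main__":
    names = ["server-a", "server-b", "server-c"]  # hypothetical servers
    servers = sorted(sha1_id(n.encode()) for n in names)
    block = sha1_id(b"example block contents")
    print(hex(successor(servers, block)))
```

Since block IDs and server IDs share one identifier space, no separate
directory is needed to decide where a block lives.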
CFS adopts naming, authentication and file system ideas from SFSRO.
CFS does not try to provide anonymity, and the authors argue that it can
be layered on top of CFS using encryption and secret sharing. Another
alternative is to use anonymizing proxies. The open question is what the
system's performance and robustness look like once these layers are
added, since the system should really be evaluated in its entirety.
CFS stores data for an agreed-upon finite interval, which sounds like a
lease-based system. This simplifies distributed garbage collection.
There is no explicit delete operation.
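A lease-style store of this kind might look like the sketch below. The
class name LeasedStore, its methods, and the injectable clock are all
hypothetical; the point is only that expiry replaces an explicit delete,
so each server can reclaim space locally without coordinating with
anyone.

```python
import time


class LeasedStore:
    """Block store in which every block expires after its lease."""

    def __init__(self, clock=time.time):
        self._clock = clock          # injectable so expiry can be tested
        self._blocks = {}            # block_id -> (expiry_time, data)

    def put(self, block_id, data, lease_seconds):
        """Store a block until now + lease_seconds."""
        self._blocks[block_id] = (self._clock() + lease_seconds, data)

    def get(self, block_id):
        """Return the block, or None if it is missing or its lease ran out."""
        entry = self._blocks.get(block_id)
        if entry is None or entry[0] <= self._clock():
            return None
        return entry[1]

    def collect_garbage(self):
        """Drop every expired block and return how many were removed."""
        now = self._clock()
        expired = [k for k, (exp, _) in self._blocks.items() if exp <= now]
        for k in expired:
            del self._blocks[k]
        return len(expired)
```

Note that there is deliberately no delete method: data disappears only
when its interval ends.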
If Chord nodes could use arbitrary IDs, a malicious node could insert
itself just after a block's ID and discard the data. To avoid this, a
Chord node ID is the SHA-1 hash of its IP address concatenated with a
virtual node index. The virtual node index is between 0 and a small
maximum. Thus a node can't easily control its ID.
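The ID derivation and the check another node could perform can be
sketched like this. The exact byte encoding (packed IP address followed
by a 4-byte big-endian index), the inclusive max_index bound, and the
function names are assumptions of this sketch, not details taken from
the paper.

```python
import hashlib
import ipaddress


def chord_node_id(ip: str, virtual_index: int) -> int:
    """SHA-1 of the IP address concatenated with a virtual node index.

    Assumed encoding: packed address bytes, then the index as 4 bytes.
    A fixed-width encoding avoids two (ip, index) pairs producing the
    same input bytes.
    """
    data = ipaddress.ip_address(ip).packed + virtual_index.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def is_plausible_id(claimed_id: int, ip: str, max_index: int) -> bool:
    """Check that claimed_id matches one of the few IDs this IP may use."""
    return any(
        chord_node_id(ip, i) == claimed_id for i in range(max_index + 1)
    )


if __name__ == "__main__":
    node_id = chord_node_id("192.0.2.1", 2)
    print(is_plausible_id(node_id, "192.0.2.1", max_index=3))
```

A node at one IP address can thus pick from only a handful of IDs, none
of which it can steer toward a particular block.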
DHash caches blocks to avoid overloading servers that hold popular data.
Each server's DHash layer sets aside a fixed amount of storage for the
cache. The servers on a lookup path cache the block (the client sends a
copy to them).
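A fixed-size cache and the copy-along-the-path step can be sketched as
below. Least-recently-used eviction, measuring capacity in number of
blocks rather than bytes, and the names BlockCache and cache_along_path
are assumptions made for this illustration.

```python
from collections import OrderedDict


class BlockCache:
    """Fixed-capacity block cache with least-recently-used eviction."""

    def __init__(self, capacity: int):
        self.capacity = capacity     # capacity in blocks (an assumption)
        self._items = OrderedDict()  # oldest entry first

    def get(self, block_id):
        """Return the cached block and mark it recently used, or None."""
        if block_id not in self._items:
            return None
        self._items.move_to_end(block_id)
        return self._items[block_id]

    def put(self, block_id, data):
        """Insert a block, evicting the least recently used if full."""
        self._items[block_id] = data
        self._items.move_to_end(block_id)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)


def cache_along_path(path_caches, block_id, data):
    """After a successful fetch, give each server on the path a copy."""
    for cache in path_caches:
        cache.put(block_id, data)
```

Later lookups for the same popular block are then likely to hit one of
these cached copies before reaching the server that stores the block.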
CFS allows updates, but only the publisher of a file system can make
them. The root block is protected by checking the publisher's signature,
and the root block is the only sensitive block. I am not clear on how
the root block is the only sensitive block.