CFS review

From: Praveen Rao (psrao_at_windows.microsoft.com)
Date: Wed Mar 03 2004 - 17:21:06 PST

  • Next message: Slavik Krassovsky: "F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, I. Stoica. Wide Area Cooperative Storage with CFS."

    CFS is a peer-to-peer read-only storage system. It provides guarantees
    for efficiency, robustness, and load-balancing of file storage and
    retrieval.

    While read-only system sounds like a big limitation, if the files need
    not be authored in a distributed fashion, once can easily discard the
    old file and insert a changed file into the system, albeit with weak
    semantics of change propagation.

    Authors list the challenges for P2P architecture:
    * it should be symmetric and de-centralized
    * should operate well with unmanaged volunteer participants
    * finding desired data in a large system must be fast
    * servers must be able to join and leave without affecting systems
    efficiency/robustness
    * system should be load balanced.

    The distinguishing feature of CFS is that it distributes blocks over the
    nodes as opposed to distributing whole file. CFS servers provide a
    distributed hash table (DHash) for block storage. This fine granularity
    provides CFS better load balancing. DHash supports pre-fetching of
    blocks to decrease latency.

    At the heart of the system is this distributed hash which determines
    which node a block should go to.

    The core consists of two layers - DHash and Chord. DHash layer performs
    block fetches for the client, distributes the blocks among servers and
    maintains caches and replicated copies.

    Chord implements hash like lookup that maps block identifiers to
    servers. Chord assigns each server identifier drawn from same 160-bit
    identifier space as block identifiers. The mapping takes a block's ID
    and yields the block's successor; the server whose ID most closely
    follows the block's ID.

    CFS adopts naming, authentication and file system ideas from SFSRO.

    CFS does not try to provide anonymity and authors argue that that can be
    layered on top of CFS using encryption and secret sharing. Another
    alternative is to use anonymizing proxies. The question would be - after
    these layers are added, what is the system performance/robustness to
    really be able to look at it in its entirety.

    CFS stores data for an agreed-upon finite interval, which sounds like a
    leased based system. This simplifies distributed garbage collection.
    There is no explicit delete operation.

    If Chord nodes could use arbitrary IDs a malicious node could insert
    itself and discard data. To avoid this, a Chord node id is an SHA-1 hash
    function of its ip-address concatenated with a virtual node index. The
    virtual node index is between 0 and a small maximum. Thus a node can't
    easily control its ID.

    DHash caches blocks to avoid overloading servers that hold popular data.
    Each DHash layer sets aside a fixed amount of storage for the cache. The
    servers on a lookup path cache the block (client sends a copy to them).

    CFS allows updates but in a way that allows only a publisher to update
    it. Root block is protected by hash checking and root block is the only
    sensitive block. I am not clear on how root block is the only sensitive
    block.


  • Next message: Slavik Krassovsky: "F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, I. Stoica. Wide Area Cooperative Storage with CFS."

    This archive was generated by hypermail 2.1.6 : Wed Mar 03 2004 - 17:20:54 PST