Review: F. Dabek, et al. Wide Area Cooperative Storage with CFS.

From: Richard Jackson (richja_at_expedia.com)
Date: Wed Mar 03 2004 - 16:02:23 PST


    This 2001 SOSP paper by Dabek et al. describes CFS, a peer-to-peer
    (P2P) application that exposes a distributed file system. CFS stands
    for Cooperative File System. The name has two implied meanings: 1)
    nodes must cooperate to route block lookups and insertions, 2) a
    subset of the network must cooperate to assemble a single file for a
    given client, as a single file is composed of many distributed blocks.
     
    This paper is divided into the following sections: 1) CFS overview, 2)
    Chord, 3) DHash, 4) Performance results.
     
    CFS is a P2P system that acts as a distributed file system. The key
    components of CFS are: 1) the client, 2) DHash, 3) Chord. In CFS,
    files and directories are represented as blocks, similar to how UNIX
    handles files. The client is responsible for converting a logical file
    system into the appropriate block structure, and then communicating this
    block-level data to DHash. DHash takes this data and distributes it to
    other DHash nodes according to the Chord location facility, which is
    discussed later. A client can insert blocks into CFS, and read blocks
    from CFS. All of these operations are done in terms of blocks and their
    associated keys. The entry point for a client is the key of the root
    file system. From there, it is easy to see how a directory structure
    can be traversed and individual files can be reconstructed based on
    their component blocks.
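
    To make the block-level traversal concrete, here is a minimal sketch,
    in Python, of walking from a root block key down a path and
    reassembling a file from its data blocks. This is my own illustration,
    not CFS's actual API: dhash_get and the toy JSON block layout are
    assumptions.

        import json

        def dhash_get(key):
            """Placeholder (assumed): fetch the raw bytes stored under key from DHash."""
            raise NotImplementedError

        def read_file(root_key, path):
            """Walk directory blocks from the root key, then join the file's data blocks."""
            key = root_key
            for name in path.strip("/").split("/"):
                directory = json.loads(dhash_get(key))   # toy layout: name -> child block key
                key = directory[name]
            inode = json.loads(dhash_get(key))           # toy layout: {"blocks": [block keys]}
            return b"".join(dhash_get(k) for k in inode["blocks"])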
     
    Chord is the location facility that CFS uses for storing blocks. The
    Chord algorithm is the key to most of the success of CFS, as it provides
    the necessary mechanisms for reliability and distribution. The Chord
    logic is based on consistent hashing: nodes and blocks are hashed into
    the same circular identifier space, so a given block always maps to
    the same key, and only a small fraction of keys must move when nodes
    join or leave. This namespace is called the Chord ring, and is
    essentially a sorted list of all node keys, with wrap-around. To
    find the location of a DHash block, Chord finds the node that is the
    successor of that block's ID. This node is responsible for storing that
    block. Chord also maintains two key data structures: 1) successor list,
    2) finger table. The successor list holds the R immediate successors
    of this node; R is generally chosen to be 2·log2(N), where N is the
    total number of nodes. The finger table is a list of pointers to nodes
    spaced throughout the ring, so that lookups do not have to crawl
    successor-by-successor around the ring. The finger table is optional,
    however; the successor list alone is enough for lookups to succeed.
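
    As a rough illustration of the successor rule (the helper names and
    toy setup below are mine, not taken from the paper's code), the
    following Python sketch hashes node names and a block into the same
    ID space and finds the node responsible for the block, with
    wrap-around:

        import bisect
        import hashlib

        def chord_id(data):
            """Map bytes onto the 160-bit Chord identifier circle via SHA-1."""
            return int.from_bytes(hashlib.sha1(data).digest(), "big")

        def successor(node_ids, block_id):
            """node_ids: sorted node identifiers; return the node that stores block_id."""
            i = bisect.bisect_left(node_ids, block_id)
            return node_ids[i % len(node_ids)]   # wrap past the largest ID back to the smallest

        nodes = sorted(chord_id(name.encode()) for name in ["node-a", "node-b", "node-c"])
        block = chord_id(b"some block contents")
        print(hex(successor(nodes, block)))      # ID of the node responsible for the block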
     
    DHash is a distributed hash table built on top of the Chord
    location infrastructure. DHash communicates with the client, and routes
    block inserts or reads to the DHash node that contains the block. DHash
    is also responsible for replication of blocks, ensuring that each of
    the node's successors has a copy of the block. This
    allows the CFS network to operate even if some nodes fail. In order for
    a block to be lost, all of the successors would need to fail
    simultaneously, or within the small recovery window for restoring block
    replication. Another DHash responsibility is caching. This means that
    a DHash node will cache certain blocks that are not explicitly owned by
    that node. These blocks are discovered when routing insert/read
    requests to other nodes. If DHash does not have a copy of this block,
    it will make a copy and use it to speed up future lookups.
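
    A hedged sketch of those two responsibilities, with parameter names
    and callbacks of my own invention (store, fetch, and the cache
    dictionaries are assumptions, not DHash's real interfaces):

        def replicate(block_id, data, successor_list, k, store):
            """Push copies of a block to the k nodes that follow the responsible node."""
            for node in successor_list[:k]:
                store(node, block_id, data)      # store(node, key, bytes) is assumed

        def fetch_with_caching(block_id, route, fetch, caches):
            """Fetch a block, then leave a copy on each node the lookup passed through."""
            data = fetch(block_id)
            for node in route:                   # nodes visited while routing the request
                node_cache = caches.setdefault(node, {})
                if block_id not in node_cache:   # only cache blocks the node does not hold
                    node_cache[block_id] = data
            return data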
     
    Performance results are discussed, and most of the key design
    constraints are challenged. Overall, CFS is shown to scale well with
    the total number of nodes. The advertised lookup cost is O(log N)
    hops, and the measured results are roughly (1/2)·log2(N). Other
    results claim that CFS
    performance is comparable to that of FTP.
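
    For a sense of what (1/2)·log2(N) hops means in practice, a quick
    back-of-the-envelope calculation in Python:

        import math

        for n in (100, 1_000, 10_000, 100_000):
            # log2(N) is the advertised bound; roughly half of it is the measured average
            print(f"N = {n:>7}: log2(N) = {math.log2(n):5.1f}, "
                  f"observed ~ {0.5 * math.log2(n):4.1f} hops")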
     
    Overall, this system seems to be a simpler P2P system than PAST. Both
    systems use consistent hashing as a key design mechanism. PAST has many
    well-engineered features to allow the network to scale to near-100%
    utilization. CFS does not have any of these mechanisms, aside from a
    simple quota system. Overall, I think CFS is weak in this aspect, but
    perhaps PAST is too aggressive and complex. I think a better system
    might be a compromise between the two. The key difference between PAST
    and CFS is that PAST stores entire files while CFS stores blocks.
    Because CFS blocks can contain directory metadata, the logical namespace
    is much more robust than the strictly flat namespace that PAST supports.
    To some extent, this is just semantics, as one could build CFS-like
    behavior on top of PAST. A key question I have about both systems is how
    each should be used. PAST seems to have stronger persistence semantics,
    which make the system useful as a general archival resource. CFS does
    not have these strong properties, as blocks will eventually be expired
    unless the owner refreshes them. It seems that CFS is best used for
    short-term sharing of files only.
     
    There are various other aspects to CFS that were not discussed here,
    such as security, Chord speedups, the lookup algorithms, Chord's
    protocols, etc. The details are well-explained in this paper and the
    referenced papers.
     

