Review of Distributed File Systems paper

From: Praveen Rao (psrao_at_windows.microsoft.com)
Date: Mon Feb 23 2004 - 08:47:32 PST

  • Next message: Tarik Nesh-Nash: ""Scale and Performance in a Distributed File System" Review"

    This paper discusses a location-transparent distributed file system named
    Andrew. Its salient feature is whole-file caching.

    Andrew presents a single shared name space to all nodes. The files are
    stored on a set of distributed servers collectively called Vice. On each
    workstation the OS intercepts file system calls and forwards them to a
    user-level process called Venus. I wasn't clear on why this was done in
    user mode; toward the end of the paper the authors cite further performance
    improvements possible by moving the system into kernel mode, which suggests
    the user-mode approach was chosen for simplicity to start with. The system
    is built on top of BSD Unix.

    Venus caches whole files from Vice. Vice is contacted only on open and
    close; reads and writes of individual bytes are performed directly on the
    cached copy without involving Vice. For scalability, the design makes Venus
    do as much of the work as possible and Vice do very little on behalf of any
    particular node.
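
    To make the division of labor concrete, here is a minimal C sketch of the
    open/close protocol as I understand it. The names (vice_fetch, vice_store,
    cache_entry, andrew_open, andrew_close) are my own placeholders, not the
    paper's actual Venus/Vice interface:

        #include <stdio.h>
        #include <stdbool.h>

        /* Stub RPCs standing in for the Venus/Vice interaction; in the real
           system these transfer entire files over the network. */
        static void vice_fetch(const char *vice_name, const char *local_path)
        {
            (void)vice_name; (void)local_path;
        }
        static void vice_store(const char *vice_name, const char *local_path)
        {
            (void)vice_name; (void)local_path;
        }

        typedef struct {
            char local_path[256];   /* whole-file copy in the workstation cache */
            bool valid;             /* cached copy believed to be current */
            bool dirty;             /* set when the application writes locally */
        } cache_entry;

        /* open(): Vice is involved only if the cached copy is missing or stale. */
        FILE *andrew_open(cache_entry *e, const char *vice_name, const char *mode)
        {
            if (!e->valid) {
                vice_fetch(vice_name, e->local_path);   /* fetch the WHOLE file */
                e->valid = true;
            }
            /* Reads and writes of individual bytes now hit the local copy only. */
            return fopen(e->local_path, mode);
        }

        /* close(): only now are modifications shipped back to Vice. */
        void andrew_close(cache_entry *e, FILE *f, const char *vice_name)
        {
            fclose(f);
            if (e->dirty) {
                vice_store(vice_name, e->local_path);   /* store the WHOLE file */
                e->dirty = false;
            }
        }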

    The authors first discuss a prototype implementation and its shortcomings,
    and then propose and implement improvements. In the prototype the cache was
    always treated as suspect, so on every file open the timestamp of the
    cached copy was checked against the server to ensure it was current. The
    servers also used a dedicated process per client. This screamed bottleneck
    to me, and the authors discovered it too: a dedicated process per client
    limits the number of clients, and the context-switching overhead and high
    paging demands make the approach impractical.
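
    A tiny sketch, again with made-up names, of the prototype's
    validate-on-every-open policy; this is exactly the per-open server traffic
    that the revised design later eliminates with callbacks:

        #include <stdbool.h>
        #include <time.h>

        /* Stub for the per-open status check the prototype issued to Vice. */
        static time_t vice_get_mtime(const char *vice_name)
        {
            (void)vice_name;
            return 0;
        }

        typedef struct {
            time_t cached_mtime;    /* timestamp of the copy in the local cache */
        } proto_cache_entry;

        /* Every open pays one server round trip, even when the cache is warm. */
        bool prototype_cache_current(const proto_cache_entry *e, const char *vice_name)
        {
            return vice_get_mtime(vice_name) == e->cached_mtime;
        }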

    The client maintains two caches, a file cache and a status cache. The cache
    hit ratios were quite good, roughly 80% for both. Server load peaked at
    about 5-10 clients. The servers showed high CPU utilization but low disk
    utilization, meaning the system was compute bound even though it was a
    distributed file system. The high CPU utilization was caused by context
    switching and full path name traversals. Experimentation with the prototype
    revealed the need for the following improvements:
    1) Reduce cache validity checks
    2) Reduce the number of server processes
    3) Require the workstation to do path name traversals
    4) Balance server usage by reassigning users

    The authors then set out to design a full-fledged system incorporating
    these improvements. They remained convinced of the validity of the basic
    architecture, though, and kept the same fundamental design in which
    workstations cache whole files from servers. They made the following
    performance changes:
    - Cache management: changed to an invalidation-style protocol in which the
    server records a callback promise for each caching client and notifies
    (breaks the callback of) those clients when the file changes, so the cache
    no longer needs to be validated on every open (see the sketch after this
    list)
    - Name resolution: in Unix, namei maps a path name to an inode and is one
    of the most heavily used functions in the kernel. In the Andrew prototype
    all of this mapping was done on the server, causing high server CPU
    utilization; it was moved to the workstations, and servers now address
    files by fids
    - Communication and server process structure: a shared pool of lightweight
    processes (LWPs) is used instead of a dedicated process per client, which
    dramatically reduces switching cost
    - Low-level storage representation: efficient access from a fid to the
    underlying vnode (analogous to an inode in the kernel)
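
    Here is a small sketch of the callback idea from the cache-management item
    above. The names (venus_entry, callback_break, venus_can_use_cache) are
    mine and do not reflect the paper's actual RPC interface:

        #include <stdbool.h>

        typedef struct {
            bool cached;          /* whole file present in the local cache */
            bool has_callback;    /* server has promised to notify us of changes */
        } venus_entry;

        /* Invoked via an RPC from the server when another client updates the file. */
        void callback_break(venus_entry *e)
        {
            e->has_callback = false;    /* cached copy may now be stale */
        }

        /* On open, no server traffic is needed while the callback is intact. */
        bool venus_can_use_cache(const venus_entry *e)
        {
            return e->cached && e->has_callback;
        }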

    The authors then discuss the consistency guarantees such a system should
    provide. The choices range from database-like serializability to the
    laissez-faire attitude of Sun NFS, where changes made on one node may not
    be seen by other nodes for up to 30 seconds. The authors settle on the
    following guarantees:
    - Writes by one process are immediately visible to all other processes on
    the same workstation
    - Once a file is closed, the changes are visible to new opens anywhere on
    the network; already-open instances do not reflect the changes
    - Other changes (e.g. protection changes) are immediately visible everywhere
    - Multiple workstations can perform the same operation on a file concurrently
    For example, if workstation A closes a modified file, an open issued
    afterwards on workstation B sees the new contents, while a copy already
    open on B continues to see the old ones.

    An important observation was that, even with caching, fetches from the
    servers dominate stores.

    Limitations of the Andrew file system:
    - Workstation disks must be reasonably fast, and a file cannot be larger
    than the workstation's local disk
    - Strict BSD emulation of read/write semantics is not possible, since
    individual reads and writes are not intercepted
    - A distributed database cannot be built on top of such a file system

    At loads above about 3, Andrew handily outperforms Sun NFS. Latency when a
    file is not in the cache is higher for Andrew, however.

    Next, the authors discuss the operability of the system. The prototype had
    several shortcomings:
    1) Vice was constructed out of collections of files glued together by
    Mount, so the only unit that could be moved or administered was an entire
    disk partition
    2) Embedding file location information in the storage structure tied a file
    to a particular server
    3) No quota system
    4) Mechanisms for file location and replication were cumbersome
    5) No special operational tools
    6) Backup was difficult

    To alleviate these problems, Andrew introduces the volume: a collection of
    files forming a partial subtree of the Andrew name space. Volumes are glued
    together at mount points to form the complete name space. Quotas are
    applied per volume, and each user is assigned a volume. Read-only
    replication of volumes is supported.

    The authors state that volumes are flat namespaces; I was initially not
    clear how.
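
    My best guess is that "flat" refers to how files are addressed inside a
    volume: Venus maps path names to fids, and within a volume a file is then
    identified by an index (the vnode number) rather than by a path. A rough
    sketch of the three-part fid, with struct and field names of my own
    choosing:

        #include <stdint.h>

        typedef struct {
            uint32_t volume;        /* which volume the file lives in */
            uint32_t vnode;         /* index into the volume's flat array of files */
            uint32_t uniquifier;    /* lets a vnode slot be reused safely */
        } andrew_fid;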

    Performance of this design was significantly better than that of commercial
    systems like Sun NFS. The major shortcoming, IMO, is the inability to
    provide exact BSD file semantics.

