Review: Scale and Performance in a Distributed File System

From: Raz Mathias (razvanma_at_exchange.microsoft.com)
Date: Mon Feb 23 2004 - 16:50:43 PST


    This paper described improvements in the Andrew File System, a distributed file system created at Carnegie Mellon University.

     

    The paper begins by presenting the original prototype and quantitatively demonstrating its weaknesses. The original AFS consisted of two basic components: Venus, the set of client processes, and Vice, the set of server processes. A dedicated process on each Vice server listened on a port and spawned a new process for each incoming client. Stub directories were used to represent portions of the name space stored on other servers. Because location information was embedded in these stub directories, moving files or directories between servers was an extremely expensive operation: every stub pointing to them had to be updated. There was also no separation of a file's name from the underlying resource (pathname vs. inode), so every open triggered an inefficient over-the-network traversal of the full pathname.
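
    To make that cost concrete, here is a minimal, purely illustrative sketch (in Python, with invented names and data; the real Vice interface differed) of the kind of per-open, server-side full-pathname resolution the prototype performed:

    # Hypothetical sketch of the prototype's open path: Venus ships the full
    # pathname to Vice on every open, and Vice resolves every component
    # server-side before returning the whole file. Namespace is invented.

    DIRS = {                      # toy server-side directory tree
        "/": {"usr"},
        "/usr": {"alice"},
        "/usr/alice": {"thesis.tex"},
    }
    FILES = {"/usr/alice/thesis.tex": b"\\documentclass{article} ..."}

    def vice_fetch(pathname):
        """One directory lookup per path component, all performed on the server."""
        current = "/"
        for name in [p for p in pathname.split("/") if p]:
            if name not in DIRS.get(current, set()):
                raise FileNotFoundError(pathname)
            current = current.rstrip("/") + "/" + name
        return FILES[current]

    print(vice_fetch("/usr/alice/thesis.tex"))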

     

    To guide the redesign, a number of observations were drawn from data gathered in a series of benchmarks. First, clients issued far too many validity checks when opening and closing files, wasting CPU cycles on the server. Next, context-switch time dominated server execution. Pathname traversal had become a real performance bottleneck because of the network traffic it generated. Finally, load could not easily be balanced across servers because of the stub architecture. The resulting optimizations fell into four basic areas: cache management, name resolution, communication/process structure, and low-level storage representation.

     

    The major innovation in cache management is the notion of callbacks. Callbacks place the burden of identifying when files change on the server rather than on the clients, paradoxically lessening the load on the servers. The reason is the steep tradeoff involved when that responsibility rests with the clients: either they suffer a loss of transparency (polling the server too rarely and therefore lacking "live" information on whether their cached data is current), or they overburden the server with polling calls in exchange for fresher information. By shifting the responsibility from Venus to Vice, an effective tradeoff is struck: better server CPU and network utilization without paying the price of reduced transparency. Client caching is done at whole-file granularity, so reads and writes are served from the local copy and do not propagate to the server until the file is closed; an LRU policy drives cache replacement. Compared with the page-level caching in Sun's NFS implementation, whole-file caching was found to perform better and with fewer complexities.
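
    The following is a minimal sketch of the callback idea, not AFS's actual RPC interface; class names, method names, and the single-file example are invented. The point it illustrates is that a cached file can be reused on open with no server traffic as long as the server's callback promise still stands:

    # Illustrative-only sketch of callback-based cache validation.

    class Vice:
        """Server: tracks which clients hold callbacks on which files."""
        def __init__(self):
            self.files = {}       # name -> contents
            self.callbacks = {}   # name -> clients with a valid cached copy

        def fetch(self, client, name):
            # Handing out the file establishes a callback promise.
            self.callbacks.setdefault(name, set()).add(client)
            return self.files[name]

        def store(self, client, name, data):
            self.files[name] = data
            # Break callbacks: notify every other client its copy is stale.
            for other in self.callbacks.get(name, set()) - {client}:
                other.break_callback(name)
            self.callbacks[name] = {client}

    class Venus:
        """Client cache manager: opens hit the local cache while the callback holds."""
        def __init__(self, server):
            self.server = server
            self.cache = {}       # name -> contents
            self.valid = set()    # names whose callback promise is intact

        def open(self, name):
            if name not in self.valid:           # no server traffic on a hit
                self.cache[name] = self.server.fetch(self, name)
                self.valid.add(name)
            return self.cache[name]

        def close(self, name, data):
            self.cache[name] = data
            self.server.store(self, name, data)  # writes go back only on close

        def break_callback(self, name):
            self.valid.discard(name)             # next open re-fetches from Vice

    # Usage: two clients sharing one file.
    server = Vice(); server.files["notes.txt"] = "v1"
    a, b = Venus(server), Venus(server)
    print(a.open("notes.txt"))   # fetched, callback established
    print(b.open("notes.txt"))
    b.close("notes.txt", "v2")   # server breaks a's callback
    print(a.open("notes.txt"))   # re-fetches and sees "v2"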

     

    In the area of name resolution, a low-level identifier analogous to an inode (the fid) was introduced to separate the name of a resource from its location. A unit of data movement called the "volume" (intended to be roughly the size of a single user's data) was created so that moving data between servers became a simple administrative operation. This allowed better balancing of both processing load and disk space. As an administrative structure, volumes also simplified the implementation of disk quotas and read-only replication.
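
    A small sketch of the two-level naming idea follows, under the simplifying assumption that a fid is roughly (volume, vnode, uniquifier) and that a volume location database maps each volume to the server currently holding it; the names and numbers below are invented for illustration:

    # Illustrative sketch: a fid carries no location information, so moving a
    # volume only updates the location database, never the fids themselves.

    from typing import NamedTuple

    class Fid(NamedTuple):
        volume: int      # which volume the file lives in
        vnode: int       # index of the file within that volume
        uniquifier: int  # distinguishes reuse of the same vnode slot

    # Volume location database: volume -> server holding it.
    VLDB = {7: "server-a", 9: "server-b"}

    def locate(fid):
        """Find the server for a fid via its volume."""
        return VLDB[fid.volume]

    thesis = Fid(volume=7, vnode=42, uniquifier=1)
    print(locate(thesis))        # -> server-a
    VLDB[7] = "server-b"         # administrator moves volume 7 for load balancing
    print(locate(thesis))        # -> server-b, same fid still works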

     

    In terms of process structure, the servers were rebuilt around Lightweight Processes (LWPs, essentially threads) instead of one dedicated process per client, cutting context-switch time and thereby the processing burden on the server.
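
    As a toy sketch of that process-structure change (Python threads standing in for LWPs; everything here is illustrative), a small fixed pool of workers can serve many clients from a shared queue, so switching between clients does not require switching heavyweight processes:

    import queue
    import threading

    requests = queue.Queue()

    def worker():
        while True:
            client, op = requests.get()
            if op is None:                 # shutdown signal
                requests.task_done()
                return
            print(f"{threading.current_thread().name} handles {op} for {client}")
            requests.task_done()

    pool = [threading.Thread(target=worker, name=f"lwp-{i}") for i in range(3)]
    for t in pool:
        t.start()

    for i in range(6):                      # many clients share three workers
        requests.put((f"client-{i}", "fetch"))
    for _ in pool:
        requests.put((None, None))          # one shutdown signal per worker
    for t in pool:
        t.join()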

     

    Finally, the improvements in low-level storage, accessing files directly by inode rather than by pathname, allowed location- and path-independent file representations: the resource itself was separated from its name and location. Callbacks, combined with client-side caching of naming information, reduced the path-traversal load on both the network and the servers.

     

    AFS's caching and replication granularity can be contrasted with the distributed virtual-memory paging mechanism we read about earlier; this paper argues that coarse-grained, whole-file transfer is the way to go for files. The paper can also be seen as an extension of the RPC mechanism discussed with Grapevine, since it illustrates a useful system built on top of RPC. Finally, the system can be contrasted with Opal or Emerald, which take a single-level approach to distribution, whereas AFS distributes and replicates at the level of files, kept separate from memory.

     

    In summary, this paper was a nice overview of the critical issues involved in building distributed file systems.

     

