From: Raz Mathias (razvanma_at_exchange.microsoft.com)
Date: Wed Feb 18 2004 - 17:17:26 PST
This paper introduced the concept of a distributed virtual memory system. It presented several algorithms and refinements, resulting in a fairly well-optimized, usable system that exploits the power of the distributed architecture and, in some cases, achieved better-than-linear scale-out.
The main problem to be solved in this system is that of memory coherence: the value returned by any read operation must be the value written by the most recent write operation. The authors began by choosing the size of the replicated unit. They decided that, because of existing hardware support and the low contention resulting from its size, an operating system page would be used. The hardware support allows the algorithm writer to set the access rights on a page so that a fault occurs whenever coherence would otherwise be violated, which makes it possible to implement the distributed memory system inside the page-fault handler. Next, the authors had to decide between a writeback and an invalidation algorithm. They chose invalidation because of the large number of writes a writeback scheme would incur. Finally, the space of algorithms was subdivided along two axes: static versus dynamic page location, and centralized versus distributed management.
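To make the page-fault framing concrete, here is a minimal sketch of the per-page state such a handler manipulates. It is my illustration, not the paper's code: Access, PageEntry, and faults are hypothetical names, and the MMU protection check is simulated in software.

    from enum import Enum

    class Access(Enum):
        NIL = 0     # no resident copy: any access faults
        READ = 1    # read-only copy: writes fault
        WRITE = 2   # owned, writable copy: no faults

    class PageEntry:
        def __init__(self):
            self.access = Access.NIL
            self.copy_set = set()   # readers; meaningful only where tracked
            self.owner = None       # meaningful only where ownership is tracked

    def faults(entry, op):
        """Would this memory operation trap into the fault handler?
        A real system gets this check for free from the MMU's protections."""
        if op == "read":
            return entry.access == Access.NIL
        return entry.access != Access.WRITE   # op == "write"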
The first and most basic algorithm presented was the central manager. A single node in the network maintains an ownership table and a copy_set table, which record who "owns" each page for writing and who is using the page for reading. When a write fault occurs, the faulting process notifies the central server, which invalidates the page on all readers, has the most recent owner send a copy of the page to the new writer, and makes the new writer the owner. On a read fault, the server adds the new reader to the copy_set and asks the current owner to send the reader the page. For serialization to work correctly, the page must remain locked until the server receives confirmation that these activities completed successfully. The first optimization presented therefore eliminates the confirmation message by moving the locking responsibility from the central server to the owner of the page, which now maintains its own copy_set field. The major remaining problem with these algorithms is the bottleneck introduced by the central server's shared ownership table.
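A rough sketch of the central manager's two fault handlers under a toy message-passing model. This is my reconstruction, not the paper's code: send is a hypothetical messaging primitive, and the locking/confirmation step described above is reduced to a comment.

    class CentralManager:
        """The single node holding the ownership and copy_set tables;
        every fault in the network arrives here as a message."""
        def __init__(self, initial_owners):
            self.owner = dict(initial_owners)  # page -> node owning it for writing
            self.copy_set = {p: set() for p in self.owner}  # page -> read copies

        def write_fault(self, page, faulter, send):
            # The paper's version keeps the page locked until every node confirms.
            for reader in self.copy_set[page]:
                send(reader, ("invalidate", page))   # revoke all read copies
            self.copy_set[page].clear()
            send(self.owner[page], ("send_page", page, faulter))
            self.owner[page] = faulter               # the writer becomes the owner

        def read_fault(self, page, faulter, send):
            self.copy_set[page].add(faulter)         # remember the new reader
            send(self.owner[page], ("send_page_readonly", page, faulter))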
The first solution to the problem of locating the owner is a fixed distribution method. This can be implemented via hashing, or a more complex (and configurable) mapping could be used. The disadvantage is that a significant amount of domain knowledge is needed to partition the pages effectively across the nodes. Furthermore, adding a node requires a certain amount of complex reconfiguration (a disadvantage the paper does not list). I believe that on a fixed set of nodes, with a well-thought-out mapping function, this solution would be the most efficient, although it places the largest burden on the application implementer, which to some extent defeats the whole purpose of a transparent virtual memory system.
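The fixed mapping can be as simple as a mod-hash. This two-line sketch is my illustration rather than the paper's function; the optional mapping parameter stands in for the more configurable scheme:

    def manager_for(page_number, nodes, mapping=None):
        """Return the fixed manager node for a page. `mapping`, if given,
        stands in for the paper's configurable distribution function."""
        if mapping is not None:
            return mapping(page_number)
        return nodes[page_number % len(nodes)]   # simple hashing default

Note that adding a node changes len(nodes) and silently remaps most pages, which is exactly the reconfiguration cost mentioned above.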
The next solution maintains the ownership information in a distributed set of prob_owner fields. When the owner is needed, the node named by the local prob_owner hint is contacted; if it is not the actual owner, it forwards the request to its own prob_owner, and so on. The prob_owner fields are opportunistically updated on invalidation requests, page relinquishment, and fault requests. The paper also introduces a simple broadcast mechanism to periodically update all nodes with correct ownership information. Finally, a refinement is made in which the copy_set itself is maintained in a distributed fashion, thereby lightening the request load and page-invalidation responsibilities of the owning node.
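The forwarding chase is easy to model. In this sketch (hypothetical names; the hint tables are held in one dict for brevity), every node consulted along the chain has its prob_owner hint updated to the located owner, one instance of the opportunistic updating described above. It assumes the paper's invariant that the hints always form a chain ending at the true owner.

    def locate_owner(page, start, prob_owner):
        """Follow prob_owner hints until a node names itself as owner.
        `prob_owner[node][page]` is node's current hint for the page."""
        chain, node = [], start
        while prob_owner[node][page] != node:   # the true owner points to itself
            chain.append(node)
            node = prob_owner[node][page]
        for hop in chain:                       # opportunistic update: shorten
            prob_owner[hop][page] = node        # every future chase from here
        return node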
I really enjoyed the methodical approach to creating a distributed algorithm by identifying the shared resources and distributing them one by one. The synchronization mechanism described in this paper puts into context the different possible approaches to keeping distributed state. It takes the two approaches seen in Grapevine (centralized replication) and Emerald (a forwarding mechanism for finding an object's owner) and places them both on a clean spectrum. Additionally, the distributed single-level store presented by this paper (each process's page tables selectively replicated, with central or distributed algorithms) can be contrasted with the approaches in Opal (a single replicated address space), Grapevine (distributed objects, central server), and Emerald (distributed objects, located via forwarding).
This archive was generated by hypermail 2.1.6 : Wed Feb 18 2004 - 17:17:27 PST