This course concerns the theory and practice of building secure, robust, efficient and evolvable distributed systems. Distributed systems are appearing at all granularities, from planetary scale web services such as Akamai, Ebay and Google, to distributed databases for managing multibillion dollar businesses, to massively parallel multiplayer games, to large scale sensor networks. In each case, there is a need for a deep understanding of fundamental principles if we are to achieve the desired system-level properties.
Project: The best way to learn how to build distributed systems is by practice, and so the core of the course is a substantial distributed system design and implementation project. More details regarding the project.
Collaboration/Cheating: An observation guiding the design of this course is that students typically learn more from each other than they do from the faculty. Thus, I encourage you to collaborate with your classmates, including those from other groups, in all aspects of the course, with the exception of homework and the final, which are to be done individually. The grading in the class is emphatically not curved; I would like nothing better than for all of you to pass.
Readings: The reading list is a mixture of theory and systems papers. It is a requirement that you do the reading before class, as we will take the research papers as a starting point, not the end point of the class discussion. A goal for each class meeting will be to identify the limits of the research community’s knowledge. What are the open research problems for that topic? What are the limits of the applicability of the techniques involved?
Discussion and Blogs: Some of the papers are foundational or surveys; these are marked “no blog”. For the others, where there is some potential for controversy, I'll appoint two people: an advocate and a skeptic, each to speak for 2-3 minutes about the paper. The advocate gives the elevator pitch for the paper: what is the topic of the paper, what are its main results, why the system or idea is better than its competition, why does the result matter and who are the target users. Note that the answers to these questions may be different from the ones the authors gave in the paper! The skeptic speaks on the limits of the paper: why you should be careful in applying the results. Everyone else (except these two) are to read paper and blog a unique comment about the paper to the course web site no later than noon on the day of the class. (This is to give time for everyone to read the blog entries before class.) Note that the earlier you post, the easier it is to be unique. Please keep blog entries short: they can be anything that provides insight into the paper, e.g., a summary, the broader context for the work, a question about some aspect of the paper, an answer to someone else's question, a methodogical flaw, or a pointer to related work not described by the paper, etc.
Final: A take home final will be handed out at 5pm on Friday, June 1, and will be due back at 5pm on June 8. So that students don’t take the entire final exam period working on it, the exam is designed to take a day to think about and no more than two hours to write. Students may take any contiguous 48 hour period during that period of the 1st 8th to work on the final, on the honor system.
Grading: Blogs/Class Discussion: 20%; Homework: 10%; Project: 40%; Final: 30%.
Mailing list: Please join the course mailing list as soon as possible.
Papers
|
Date |
Topic |
Reading |
Discussion leaders |
1 |
3/27 |
Motivation |
Gray, Distributed Computing Economics, Microsoft TR, 2003. O’Reilly, What is Web 2.0?, Sept 2005 |
|
2 |
3/30 |
Basics |
Birman, Reliable Distributed Systems, Chapter 4, RPC (no blog) W3 Schools, Introduction to WSDL (no blog) J. Ousterhout, The Role of Distributed State, 1991. |
A: Nick S: Ivan |
3 |
4/3 |
Clocks and Snapshots (slides) |
Lamport, Time, Clocks and the Ordering of Events in a Distributed System, CACM 1978. Chandy and Lamport, Distributed Snapshots: Determining the Global States of a Distributed System, TOCS 1985. |
A: Tanya S: Ivan A: John S: Bartosz |
4 |
4/6 |
Global States |
Baboaglu and Marzullo, Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms, 1993. (no blog) Geels, Altekar, Maniatis, Roscoe, Stoica. Friday: Global Comprehension for Distributed Replay, NSDI 2007. |
A: Ivan S: Colin |
5 |
4/10 |
Causal Ordering (Steve Gribble) |
Cheriton and Skeen, Understanding the Limitations of Causally and Totally Ordered Multicast, SOSP 1993. (one blog for both papers) Birman, A Response to Cheriton and Skeen’s Criticism of Causal and Totally Ordered Communication, October 1993. |
A: Bartosz S: Dan |
6 |
4/13 |
Transactions: 2PC and 3PC (Magda) (slides) |
Bernstein, Hadzilacos, and Goodman. Distributed Recovery. Chapter 7 in Concurrency Control and Recovery in Database Systems. (no blog) |
|
7 |
4/17 |
Rollback Recovery |
Elnozahy et al., A Survey of Rollback Recovery Protocols in Message Passing Systems, ACM Computing Surveys, 2002. |
A: Ivan S: Ethan |
8 |
4/20 |
State Machines |
Schneider. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys, 1990. |
A: Colin S: John |
9 |
4/24 |
Impossibility of Consensus |
Fisher, Lynch, and Paterson. Impossibility of Distributed Consensus with One Faulty Process, Journal of the ACM, 1985. |
A: Ethan S: Stefan |
10 |
4/27 |
Paxos (slides) |
Lamport, The Part Time Parliament, TOCS, 1998. Lamport, Paxos Made Simple, ACM SIGACT News, 2001. (no blog) |
A: Harsha S: John |
11 |
5/1 |
Byzantine Fault Tolerance |
Lamport, Shostak, and Pease, The Byzantine Generals Problem, ACM TOPLAS, 1982. (no blog) Castro and Liskov, Practical Byzantine Fault Tolerance, OSDI 1998. |
A: Tomas S: Nick |
12 |
5/4 |
Weak consistency |
Adve and Gharacharloo, Shared Memory Consistency Models: A Tutorial, DEC WRL TR, 1995. Demers et al., Epidemic algorithms for replicated database maintenance, PODC 1987. |
A: Stefan S: Roxana A: Dan S: Bartosz |
13 |
5/8 |
Replica Management |
Petersen et al., Flexible Update Propagation for Weakly Consistent Replication, SOSP 1997. Belaramani et al., PRACTI Replication, NSDI 2006. |
A: Roxana S: Alex A: Harsha S: Mike |
14 |
5/11 |
Scalable Services |
Ghemawat et al., The Google File System, SOSP 2003. Birrell et al., Experience with Grapevine: The growth of a distributed system. TOCS 1994. |
A: YongChul S: Roxana A: Tomas S: Colin |
15 |
5/15 |
DHTs |
Maymoukov and Mazieres, Kademlia: A P2P Information System Based on the XOR Metric, IPTPS 2002. Dabek et al., Designing a DHT for Low Latency and High Throughput, NSDI 2004. |
A: Stefan S: Alex A: YongChul S: Mike |
16 |
5/18 |
|
Ed Felten talk, 1:30 PM @ CSE 403 |
|
17 |
5/22 |
DHTs + Storage |
Dabek et al., Wide Area Cooperative Storage with CFS, SOSP 2001. Yu et al., Availability of Multi-Object Operations, NSDI 2006. |
A: Alex S: Harsha A: John S: Ethan |
18 |
5/25 |
Incentives |
Walsh and Sirer. Experience with an Object Reputation System for Peer-to-Peer Filesharing. NSDI 06. Aiyer et al., BAR Fault Tolerance for Cooperative Services, SOSP 2005. |
A: Mike S: Dan A: Ethan S: Mike |
19 |
5/29 |
Distributed Security |
Lampson et al., Authentication in Distributed Systems: Theory and Practice, TOCS 1992. |
A: Tanya S: Roxana |
20 |
6/1 |
Distributed Security |
Appel and Felten, Proof Carrying Authentication, CCS 1999. Lampson, Computer Security in the Real World, 2001. |
A: Colin S: Stefan A: Alex S: Tanya |