Lecture 1: Intro; Fault Models — Notes
These are notes from the lecture on March 30, 2026. See also the whiteboard descriptions and the whiteboard PDF.
These materials were drafted by AI based on the live whiteboard PDF and audio transcript from the corresponding lecture and then reviewed and edited by course staff. They may contain errors. Please let us know if you spot any.
Course Overview
Course structure
- The best part of this course is the labs. You will build distributed systems.
- The labs use a model checker (based on formal methods) for fairly exhaustive testing of your implementations. You have access to the model checker while developing, and your lab grades are based on how many tests you pass.
- There is a course website with a syllabus. Office hours will be added to the weekly calendar. There's also a quarter-long schedule under the "Schedule" tab.
Grading
- Additive points system: 2000 points available in the quarter. There is a formula on the syllabus that maps points to a 4.0 scale. You know how you're doing throughout the quarter.
- Over half the points are from the labs.
- In a sense, all assignments are optional — not doing one just means you don't get those points, with no additional penalty.
Design documents
- For each of the four labs, there is an associated design document.
- For lab 1, the design doc and lab are due at the same time. Lab 1 is mostly a tutorial and fairly straightforward; the design doc is mainly to get you in the habit.
- For labs 2, 3, and 4, design documents are due a week before the labs. You will want to write them — they help you think through the protocol before implementing.
- There is a significant amount of writing involved.
Other assignments
- Problem sets: point out interesting edge cases you may not have considered in the labs, and cover additional lecture material.
- At the end of the quarter, we read some papers together and do some writing about those.
W credit
- You can get W (writing) credit for this course by doing all the required writing plus turning in revised copies of your design documents that incorporate feedback.
- This is a 4-credit class, so you'd get 4 W credits.
Partner work and collaboration
- Labs 2, 3, and 4 can be done with a partner (optional). If you're taking the master's version, you must do the labs alone.
- Outside of partner work, sharing code or solutions is an academic honesty violation.
What is a Distributed System?
A distributed system involves:
- Multiple machines — and the machines are faulty
- Connected by a network — and the network is faulty
- Concurrency — machines operate independently, leading to concurrency bugs
Leslie Lamport's definition
Leslie Lamport created many good ideas in distributed systems. One of the things we'll do this quarter is implement Paxos, an algorithm he invented.
His (joke) definition: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
Source: email from 1987
For example, when Amazon's cloud has a problem, thousands of websites go down because so many people use Amazon's cloud.
Why Build a Distributed System?
If we could avoid distributed systems, life would be much easier. So why build a distributed system?
1. Harness the power of multiple machines (horizontal scaling)
If one machine can do X, ten machines can do 10X. This is called horizontal scaling (as opposed to vertical scaling, which is buying a bigger computer).
Vertical scaling is actually a great option when available — it doesn't involve distributed systems. And modern machines are very large: hundreds of cores, terabytes of DRAM. It takes a long time to reach the point where you genuinely need a distributed system purely for compute. But there is an upper bound.
To use horizontal scaling, the problem you are trying to solve must be partitionable into smaller, mostly independent subproblems.
2. Redundancy / replication
Store the same data on multiple machines so that if one fails, the data isn't lost. For example, Google stores data on multiple hard drives — when one fails (and they do fail), the data is still available on another.
Another word for redundancy in distributed systems is replication. We will spend about 70% of the quarter talking about replication.
Note that these two reasons are in some sense opposite: horizontal scaling means having multiple machines do different things; replication means having multiple machines do the same thing.
3. Placing data near users
The speed of light is about one foot per nanosecond, so light takes on the order of 100 milliseconds to circumnavigate the globe. At computer speeds, 100ms is a long time: a video game frame at 60 frames per second is about 16ms. If you have only one copy of your data, some users will be far away from it. Placing copies around the world reduces latency.
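The latency figure above can be sanity-checked with round numbers. The values here are assumptions for illustration (a 40,000 km circumference and vacuum light speed); signals in fiber travel roughly 1.5x slower, so real-world numbers are worse:

```python
# Order-of-magnitude latency check, using assumed round numbers.
circumference_m = 40_000_000    # Earth's circumference, ~40,000 km
speed_of_light = 3.0e8          # m/s in vacuum; fiber is ~1.5x slower
seconds_around = circumference_m / speed_of_light
print(f"{seconds_around * 1000:.0f} ms")  # ~133 ms, i.e. on the order of 100 ms
```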
How Hard Is It to Build a Distributed System?
Two fundamental reasons distributed programming is harder than normal programming:
Partial failure
Normally, we think of a computer as either working or not working. In large distributed systems, some component is failing essentially all the time.
Back-of-the-envelope: if a single hard drive lasts about 10 years, and you have 1000 machines (roughly the size of early Google), you expect a hard drive failure about every 3.65 days, roughly twice a week. Today Google has far more machines, so failures happen on the order of every few minutes.
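The back-of-the-envelope arithmetic above works out as follows (all figures are the lecture's rough assumptions, not measured data):

```python
# One drive failing every 10 years, spread across a 1000-machine fleet.
drive_lifetime_days = 10 * 365          # ~3650 days per drive
fleet_size = 1000
days_between_failures = drive_lifetime_days / fleet_size
print(days_between_failures)            # 3.65 days: roughly twice a week
```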
You cannot think like a normal programmer and say "if the hard drive fails, a human has to intervene." Humans can't intervene every few minutes, around the clock. Failure must be treated as normal, and the system must handle it automatically.
Concurrency
Multiple machines operating independently means concurrency is inherent. If you've taken operating systems, you know about concurrency bugs. Distributed systems have those too.
Fault Model
Because failures are normal in distributed systems, it's useful to write down exactly which failures you consider normal. A fault model (also called a failure model) is a list of failures we plan to tolerate automatically — the code will handle them without human intervention.
Normal programmers have the empty fault model: any failure is considered abnormal. We are going to have a non-empty fault model.
What failures are possible?
Machine failures:
- Power goes out — machines crash (shut down ungracefully, in the middle of whatever they were doing). Sometimes it's one machine, sometimes a whole rack or data center.
Network failures:
- Reordering — send message A then B, but B arrives first
- Dropped packets — the most common cause is a router running out of buffer space
- Duplicates — when you try to tolerate drops by retransmitting, the original might just have been slow. Now the message arrives twice. Networks don't duplicate packets; computers duplicate packets because they're trying to tolerate drops.
- Corruption — bit flips from radiation or electrical variance. Most networks have low-level checksums that catch this.
- Message injection — a malicious actor injects packets that were never sent, or redirects packets to the wrong machine
- Delay — packets take unpredictable amounts of time. We won't assume anything about how long delivery takes, beyond the speed of light.
- Unplugging the cable — equivalent to dropping all packets between those nodes
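The duplication point above can be made concrete with a small sketch. This is a hypothetical at-least-once sender, not code from the labs: when an ack doesn't arrive in time, the sender retransmits, and if the original message was merely slow rather than dropped, the receiver sees it twice.

```python
import queue

network = queue.Queue()  # stand-in for an unreliable network link

def send(msg):
    network.put(msg)

def at_least_once_send(msg, ack_arrived_in_time):
    send(msg)                    # original transmission
    if not ack_arrived_in_time:  # timeout: can't tell "dropped" from "slow"
        send(msg)                # retransmit; duplicates if the original survived

# The ack was slow, so the sender retransmits even though the
# original message was eventually delivered.
at_least_once_send("deposit $10", ack_arrived_in_time=False)

received = []
while not network.empty():
    received.append(network.get())
print(received)  # ['deposit $10', 'deposit $10'] -- processed twice
```

This is why protocols that retransmit typically pair it with deduplication on the receiver, for example by tagging messages with sequence numbers.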
The standard fault model
Not all of these are in our fault model. The more failures you tolerate, the harder it gets. The standard fault model that we'll use for most of the course includes:
- Machine crashes (some, not all — if every machine crashes, there's nowhere to run recovery code)
- Message reordering
- Message drops
- Message delays
- Message duplication
We will not tolerate corruption or malicious intent. Tolerating malicious actors is possible (there are research systems that can function correctly if less than one third of the nodes are compromised) but we won't cover it in this course. Corruption is similar — checksums mostly handle it, and we'll group it with malicious intent for our purposes.