Lecture 1: Intro; Fault Models — Whiteboard Descriptions
These are text descriptions of the whiteboard PDF from this lecture.
These materials were drafted by AI based on the live whiteboard PDF and audio transcript from the corresponding lecture and then reviewed and edited by course staff. They may contain errors. Please let us know if you spot any.
What is a Distributed System?
- Multiple machines
  - machines are faulty
  - concurrency!
- Connected by a network
  - the network is faulty
- Leslie Lamport
Why?
- Harness the power of multiple machines
- Horizontal scaling
- Redundancy / replication
- Fault tolerance
- Placing data near users
How hard is it to build a Distributed System?
- Hard to maintain "coherence"
  - replication makes updates hard
- Multiple machines are working, and any one of them could fail
  - partial failure
- Concurrency
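To make "replication makes updates hard" and "partial failure" concrete, here is a minimal sketch of a naive replicated write. The `Replica` class and `replicated_write` function are hypothetical names invented for this illustration, not anything from the lecture; the point is only that when one machine is down during an update, the replicas can end up disagreeing.

```python
class Replica:
    """A toy replica holding a key-value map (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.crashed = False

    def write(self, key, value):
        # A crashed machine cannot apply the update.
        if self.crashed:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value


def replicated_write(replicas, key, value):
    """Naively write to each replica in turn; return the names that acknowledged."""
    acked = []
    for replica in replicas:
        try:
            replica.write(key, value)
            acked.append(replica.name)
        except ConnectionError:
            # Partial failure: some replicas have the update, this one does not.
            pass
    return acked


a, b = Replica("a"), Replica("b")
replicated_write([a, b], "x", 1)          # both replicas store x = 1
b.crashed = True
acked = replicated_write([a, b], "x", 2)  # only "a" applies x = 2
b.crashed = False                         # "b" comes back with stale state
```

After this run, `a` holds `x = 2` while `b` still holds `x = 1`. Keeping replicas coherent requires some extra mechanism on top of this naive loop.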
Fault Model
A fault model is a list of the failures we plan to tolerate, and to tolerate automatically.
What failures are possible?
- Power goes out — machines crash
- Network faults:
  - reordering (standard fault model)
  - dropped (standard fault model)
  - duplicate (standard fault model)
  - delay (standard fault model)
  - corruption
  - message injection
  - unplugged cable (messages are dropped)
- Machine crashes (standard fault model)
Items marked as "standard fault model" are the failures included in the standard fault model used throughout this course.
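The network part of the standard fault model can be sketched as a small simulator. This is an illustrative toy, not course code: the class name `FaultyNetwork`, the probabilities, and the abstract "tick" clock are all assumptions made for the example. It may drop, duplicate, delay, and reorder messages, but it never corrupts or injects them, so everything delivered is something that was actually sent. Machine crashes are not modeled here.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FaultyNetwork:
    """Toy network that drops, duplicates, delays, and reorders messages."""

    seed: int = 0
    drop_prob: float = 0.2       # assumed value, chosen for illustration
    duplicate_prob: float = 0.2  # assumed value, chosen for illustration
    max_delay: int = 3           # maximum delay in abstract ticks
    in_flight: list = field(default_factory=list)

    def __post_init__(self):
        # Seeded generator so a run can be reproduced.
        self.rng = random.Random(self.seed)

    def send(self, now, msg):
        if self.rng.random() < self.drop_prob:
            return  # dropped: the message is never delivered
        copies = 2 if self.rng.random() < self.duplicate_prob else 1
        for _ in range(copies):
            # Each copy gets its own random delay.
            deliver_at = now + self.rng.randint(0, self.max_delay)
            self.in_flight.append((deliver_at, msg))

    def deliver(self, now):
        """Return every message due by tick `now`, in arbitrary order."""
        ready = [m for t, m in self.in_flight if t <= now]
        self.in_flight = [(t, m) for t, m in self.in_flight if t > now]
        self.rng.shuffle(ready)  # reordering among messages due together
        return ready


def run_demo(seed=0):
    """Send messages 0..9 at tick 0 and collect whatever arrives."""
    net = FaultyNetwork(seed=seed)
    sent = list(range(10))
    for msg in sent:
        net.send(0, msg)
    delivered = []
    for tick in range(net.max_delay + 1):
        delivered.extend(net.deliver(tick))
    return net, sent, delivered


net, sent, delivered = run_demo()
```

In a typical run, `delivered` is missing some messages, contains some twice, and is out of order relative to `sent`. A protocol designed for the standard fault model has to behave correctly under all of those outcomes.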