Lecture 1: Intro; Mechanics

Welcome to CSE 452!

Intro to Distributed Systems

Definitions

What do we mean by the phrase "distributed system"?

We want to make a set of multiple computers work together with these properties:
- reliably (the system produces correct results in a timely manner)
- efficiently (using resources well, and using a reasonable amount of resources)
- at huge scale (serving billions of users: there are 6 billion smartphone users globally! Ideally we can "just add more computers" to serve more users)
- with high availability (almost never "down", "lots of nines")
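
To make "lots of nines" concrete, here is a small sketch that converts a number of nines into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap year

def max_downtime_seconds(nines: int) -> float:
    """Allowed downtime per year at `nines` nines of availability (3 -> 99.9%)."""
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability

for n in range(1, 6):
    minutes = max_downtime_seconds(n) / 60
    print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the downtime budget by a factor of 10; five nines leaves only a few minutes per year.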

Why might we want this?

Scale. Early websites started with one server doing everything.
- As you get more users, you eventually run out of resources on the one server.
- How to scale further? Distributed systems. (sharding)
Availability. In many domains, if the system is unavailable, we lose money (or worse).
- If we run our system on one machine, and it needs to be upgraded or repaired, we have to take it down temporarily.
- How to get more availability? Distributed systems. (replication)
Locality. We may want to do computation near users to decrease latency, or our computation may be inherently geographic.

Lamport's definition of a distributed system (half joking)

"A distributed system is one where you can't get your work done because some machine you've never heard of is broken."
(Leslie Lamport is a Turing Award winner for his groundbreaking work in distributed systems. Most of the first half of the class is stuff invented by him.)

We've made some progress since Lamport's joke.

Today, we think of a distributed system as one where you can get your work done (almost always):

wherever you are
whenever you want
even if some components in the system's implementation are broken
no matter how many other people are using the system
as if it was a single dedicated system just for you

How is that possible? We will learn :)

Another definition

Distributed systems are characterized by concurrency and partial failure
- Concurrency: doing more than one thing at a time
- Partial failure: some parts are not working even though others are

Concurrency

Concurrency is fundamental to building systems used by multiple people to work on multiple tasks.

This is a tautology, but it has important consequences for many types of systems
Operating Systems (CSE 451)
- A single computer is used by multiple users, and each user runs multiple processes.
- If one program has a null pointer exception, the entire computer does not crash.
Networked systems (CSE 461)
- Connect multiple computers together, sharing network resources
- A network is a kind of distributed system
Database Systems (CSE 444)
- How to manage (big) data reliably and efficiently, accessed by multiple users (transactions)
- Lots to worry about (including concurrency!) in the single-node setting, but there are also distributed databases.

Partial failure

If we're Google and we have a million machines, some of them are going to be broken at any given time.
But we still want to get our work done.

How to serve a billion clients?

Straw proposal

(A straw proposal is one intended to illustrate its disadvantages.)

Just pay Intel/Apple/IBM/DEC/whoever to build a single really big computer that can serve all of our clients.
This is pretty much how we did it for the first 60 years of computing up to the late 90s!
- Had some great help from our friends the computer architects. We just wait a few years and:
  - Our system would get faster because CPUs get more sophisticated (smaller transistors -> can use more of them to implement CPU)
  - Our system would get faster because clock frequencies go up (smaller transistors -> smaller chips -> less speed-of-light delay across chip -> can increase clock)
  - Our system would get more power efficient (smaller transistors require less power, it turns out)
But this doesn't work any more, for a few reasons:
- First, many of the architectural free lunches are over.
  - Clock speeds are constant, power savings have run out, and transistors are getting smaller much more slowly.
  - (In fact, lots of exciting recent computer architecture is about how to scale chips further by making them more like networked systems!)
- Second, and more importantly, to serve billions of clients, there is no single computer anywhere close to the horizon that can handle this.
  - The biggest "single" machines are incredibly expensive (e.g., national lab supercomputers), and if the machine crashes everybody is out of luck, so they have to be manufactured to higher-than-normal standards.
  - So the cheapest way to build these big "single" machines is usually to treat them more like distributed systems of lots of smaller components.

So this straw proposal doesn't really work. What should we do instead to serve billions of clients?

Using more than one machine

We simply have too much work to be done on a single machine. Let's think through using more than one machine.

If we had 10 servers, we could have each of them be responsible for a tenth of the clients.

If we add more servers, we can handle even more clients.
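
One simple way to split clients across servers is to hash each client's id and take the result modulo the number of servers. The sketch below assumes string client ids; the name server_for and the choice of SHA-256 are illustrative, not something this course prescribes.

```python
import hashlib

def server_for(client_id: str, num_servers: int) -> int:
    """Map a client to a server index using a stable hash of its id."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers

# Spread 10,000 hypothetical clients over 10 servers and count the load on each.
clients = [f"client-{i}" for i in range(10_000)]
load = [0] * 10
for c in clients:
    load[server_for(c, 10)] += 1
print(load)
```

Every client lands on exactly one server, and each server ends up with roughly a tenth of the clients.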

-> To serve a huge number of clients requires many components working together.

Suppose we decide we need 1000 servers to serve our clients (small by today's standards).

And suppose each of our servers experiences a hard drive failure about once every 3 years.
Then across our fleet, we expect about 1 hard drive failure per day (1000 servers / roughly 1100 days ≈ 0.9).
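
The fleet-wide failure rate is simple arithmetic, sketched here so you can plug in other fleet sizes. The function name is made up for illustration, and the calculation assumes failures are independent and spread evenly over time.

```python
def expected_failures_per_day(num_servers: int, years_between_failures: float) -> float:
    """Expected fleet-wide failures per day, assuming independent, evenly spread failures."""
    failures_per_server_per_day = 1.0 / (years_between_failures * 365)
    return num_servers * failures_per_server_per_day

print(expected_failures_per_day(1000, 3))       # a bit under one per day
print(expected_failures_per_day(1_000_000, 3))  # hundreds per day at a million machines
```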

-> Individual component failures are common in large-scale systems.

If each of these failures had to be handled manually, we'd be working around the clock just to keep the system up.

-> Large-scale systems must tolerate failures automatically.

The primary technique we will study for tolerating failures is replication.

The idea is that you run two or more copies of your system at the same time
That way, if some of the copies fail, the other copies are still around and clients can make progress.

-> Failures can be handled automatically through replication.

While replication is great for fault tolerance, it has a couple of negative consequences.

First, we need a lot more machines! (At least a factor of 2 more.)
- And they should probably be in different cities so that all the replicas don't get taken out at once
  - (And that means it's going to take a lot longer for messages to get between replicas)
Second, and more importantly, replication introduces a new obligation on us, the system designers:
- We must keep the different replicas in sync with each other. (We call this consistency.)
- Otherwise, clients could observe contradictory or wrong results when interacting with different replicas.

-> Any system that uses replication must have a story for how the replicas are kept consistent, so that clients are blissfully unaware that our service is replicated internally.

But, this consistency strategy has other consequences:

-> Consistency protocols introduce significant performance overhead (additional messages, delays, etc.)

In fact, we may find that the performance of our consistent replicated system (which uses 2000 or more servers) is so much worse than that of our original unreplicated (not fault-tolerant) 1000-server plan that we need to increase the number of servers again, say by another factor of 2, to get back that performance.

The fundamental tension

To summarize our discussion above, here is the fundamental tension in distributed systems:

To serve many clients, need many servers.
With many servers, failures are common.
To tolerate failures, replicate.
Replication introduces possibility of inconsistency.
Implement consistency protocol.
Consistency protocol is slow; can't serve all our clients.
Add more servers...
Even rarer failures become common...
Increase replication factor...
Consistency is even more expensive to achieve...
Performance is still unacceptable...
Add even more servers...

In other words:

-> There is a tradeoff between fault-tolerance, consistency, and performance.

We will study several ways of navigating this tradeoff throughout the quarter.

Give up on fault tolerance: use sharding to get high performance, no replication means no consistency worries.
Give up on consistency: can use a cheaper consistency protocol to get better performance while retaining fault tolerance.
Give up on performance: use expensive consistency protocol and assume workload will be relatively low (often makes sense for "administrative" parts of a system).

Challenges

Why is this any different from "normal" programming?

All of us already know how to program. Isn't it just a matter of running a program on multiple machines?
Yes and no, but mostly no :)

Remember: concurrency and partial failure.

Concurrency is hard in any scenario.

Partial failure is also hard.

machines crash
machines reboot (losing contents of memory)
disks fail (losing their contents)
network packets are delayed and lost
network itself goes down in part or whole
machines misbehave (have bugs, get hacked)
- send messages they aren't supposed to
networks misbehave (have bugs, get hacked)
- corrupt packet contents in transit
- inject packets that were never sent
people make mistakes
- misconfigure machines
- misconfigure network

It's super important when designing a system to be really clear about what failures you are trying to handle automatically, and what failures you are willing to handle manually.

Both parts are important! What you leave out means a lot.

We will call this choice a "failure model" or "fault model".

Common failure models
- Asynchronous unreliable message delivery with fail-stop nodes:
  - The network can arbitrarily delay, reorder, drop, and duplicate messages.
  - Nodes can crash, but if they do, they never come back.
  - This is the model for the labs
- Asynchronous reliable in-order message delivery with fail-stop nodes:
  - The network can delay messages arbitrarily, but they are delivered in order and eventually exactly once.
  - Nodes same as above
  - This is the model for some important distributed algorithms we will study but not implement.
  - Can be implemented on top of the previous model with sequence numbers and retransmission.
- "Byzantine" models refer to models that allow misbehavior
  - For example:
    - The network delivers a message to N2 as if it was sent by node N1, but N1 never sent that message.
    - The network corrupts a message.
    - A node does not follow the protocol (it got hacked, or was hit by a cosmic ray, or something)
  - We will not study these models much, but fun fact:
    - Surprisingly, you can actually build systems that tolerate some number of Byzantine failures
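
As noted for the second model above, reliable in-order delivery can be built on an unreliable network with sequence numbers and retransmission. Here is a minimal sketch of that idea. It is a stop-and-wait simulation in a single process, and the names Receiver, unreliable, and send_all are hypothetical. The simulated network only drops and duplicates messages; delay and reordering are not modeled, although the receiver's sequence-number check would reject out-of-order messages too.

```python
import random

class Receiver:
    def __init__(self):
        self.next_expected = 0
        self.delivered = []

    def on_message(self, seq, payload):
        """Deliver each payload exactly once, in order; return the ack to send (or None)."""
        if seq == self.next_expected:
            self.delivered.append(payload)
            self.next_expected += 1
        # Ack anything already delivered, so a sender whose ack was lost can move on.
        return seq if seq < self.next_expected else None

def unreliable(rng, message):
    """Simulated network: returns the copies that arrive (0 = dropped, 2 = duplicated)."""
    return [message] * rng.choice([0, 1, 1, 2])

def send_all(payloads, receiver, rng):
    """Send payloads one at a time, retransmitting each until its ack gets through."""
    attempts = 0
    for seq, payload in enumerate(payloads):
        acked = False
        while not acked:
            attempts += 1
            for s, p in unreliable(rng, (seq, payload)):
                ack = receiver.on_message(s, p)
                # The ack travels over the same unreliable network and may be dropped.
                if ack == seq and unreliable(rng, ack):
                    acked = True
    return attempts

rng = random.Random(452)  # fixed seed so the simulation is repeatable
receiver = Receiver()
payloads = ["a", "b", "c", "d"]
attempts = send_all(payloads, receiver, rng)
print(receiver.delivered, "after", attempts, "transmissions")
```

Duplicates are filtered by the sequence-number check, and drops are masked by retransmission, so the application above the receiver sees each message exactly once and in order.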

We want to be correct under all possible behaviors allowed by our fault model.

Distributed systems are usually very challenging to test because there are many combinations of failures to consider.
- (This course does a particularly good job at testing!)

Course Mechanics

Intros

I'm James
We have TAs

Teaching philosophy

To understand X, build an X
Systems <-> Theory

Syllabus items

Read the course website.

Big picture course calendar:

distributed systems fundamentals: how to think about global state (4 weeks of lecture, labs 1 and 2, problem sets 1-3)
real-world distributed systems problems: replication, sharding, caching, and atomic transactions (4 weeks, labs 3 and 4, problem sets 4-6)
research papers in distributed systems: how people have built practical systems (2 weeks, reading 6 research papers and writing 6 blog posts)

Remote Procedure Call (RPC)

Executive Summary

RPCs allow nodes to call functions that execute on other nodes using convenient syntax
The key difference from a local function call is what happens when things fail
To tolerate failures, use sequence numbers and retransmissions
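
A rough sketch of the sequence-number idea: the client tags each request with a sequence number and resends it until it hears back, and the server remembers the latest request and reply per client so that a retransmission does not run the procedure twice. The class DedupServer and its fields are hypothetical, and the sketch assumes each client has at most one outstanding request and uses increasing sequence numbers.

```python
class DedupServer:
    """Hypothetical server wrapper that executes each (client, seq) request at most once."""

    def __init__(self, handler):
        self.handler = handler
        self.last = {}  # client_id -> (seq of latest request, its cached result)

    def on_request(self, client_id, seq, arg):
        prev = self.last.get(client_id)
        if prev is not None and seq == prev[0]:
            return prev[1]   # retransmission: resend the old reply, do not re-execute
        if prev is not None and seq < prev[0]:
            return None      # stale request the client has already moved past
        result = self.handler(arg)
        self.last[client_id] = (seq, result)
        return result

counter = {"value": 0}

def increment(amount):
    counter["value"] += amount
    return counter["value"]

server = DedupServer(increment)
first = server.on_request("client-1", seq=1, arg=5)
retry = server.on_request("client-1", seq=1, arg=5)  # client timed out and resent
print(first, retry, counter["value"])
```

Without the cached reply, the retransmitted request would increment the counter a second time.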

Intro to RPC

What is it?

It's a programming model for distributed computation
"Like a procedure call, but remote"
- The client wants to invoke some procedure (function/method), but wants it to run on the server
To the client, it's going to look just like calling a function
To the server, it's going to look just like implementing a function that gets called
Whatever RPC framework we use will handle the work of actually calling the server's implementation when the client asks
For context, Google does about \(10^{10}\) RPCs per second

Local procedure call recap

Remember roughly how local function calls work:
- In the high-level language (e.g., C 😂), we write the name of the function and some expressions to pass as arguments
- The compiler orchestrates the assembly code to:
  - in the caller, evaluate the arguments to values
  - arrange the arguments into registers/the stack according to the calling convention
  - use a call instruction of some kind to jump to the label of the callee function
    - caller instruction pointer (return address) is saved somewhere, typically on the stack
  - callee executes, can get at its arguments according to calling convention
    - callee might call other functions
  - callee eventually returns by jumping back to return address, passing returned values according to calling convention
- C programmers rarely have to think about these details (good!)

RPC workflow

Key difference: function to call is on a different machine
Goal: make calling remote functions as easy as local functions
- Want the application programmer to be able to just say f(10) or whatever, and have that automatically invoke f on a remote machine, if that's where f lives.
Mechanism for achieving this: an RPC framework that will send request messages to the server that provides f and response messages back to the caller
- Most of the same details from the local case apply
- Need to handle arguments, which function to call (the label), where to return to, and returned values.
- Instead of jumping, we're going to send network messages.
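
Here is a minimal sketch of that workflow. It is not any particular RPC framework: the JSON message format, the HANDLERS table, and the ClientStub class are all assumptions made for illustration, and the "network" is just a direct function call inside one process.

```python
import json

# Server side: the functions this server provides (hypothetical registry).
def f(x):
    return x * 2

HANDLERS = {"f": f}

def server_handle(request_bytes: bytes) -> bytes:
    """Decode a request message, call the named function, encode the response."""
    request = json.loads(request_bytes)
    result = HANDLERS[request["function"]](*request["args"])
    return json.dumps({"id": request["id"], "result": result}).encode("utf-8")

class ClientStub:
    """Client side: turns a call into a request message and a response into a return value."""

    def __init__(self, transport):
        self.transport = transport  # callable: request bytes -> response bytes
        self.next_id = 0

    def call(self, function, *args):
        self.next_id += 1
        request = {"id": self.next_id, "function": function, "args": list(args)}
        reply_bytes = self.transport(json.dumps(request).encode("utf-8"))
        response = json.loads(reply_bytes)
        # The id plays the role of the return address: it matches a reply to its call.
        if response["id"] != request["id"]:
            raise RuntimeError("response does not match request")
        return response["result"]

stub = ClientStub(server_handle)  # stand-in for a real network connection
print(stub.call("f", 10))  # prints 20
```

The pieces line up with the local case: the function name in the message is the label, the encoded args are the calling convention, and the request id tells the client which call a response belongs to. Failure handling is left out of this sketch.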