# Lecture 1: Intro; Mechanics

Welcome to CSE 452!

## Intro to Distributed Systems

### Definitions

#### What do we mean by the phrase "distributed system"?

- We want to make a set of multiple computers work together with these properties:
  - **reliably** (the system produces correct results in a timely manner)
  - **efficiently** (using resources well, and using a reasonable amount of resources)
  - at huge **scale** (serving billions of users (6 billion smartphone users globally!), ideally "just add more computers" to serve more users)
  - with high **availability** (almost never "down", "lots of nines")

#### Why might we want this?

- Scale. Early websites started with one server doing everything.
  - As you get more users, you eventually run out of resources on the one server.
  - How to scale further? Distributed systems. (sharding)
- Availability. In many domains, if the system is unavailable, we lose money (or worse).
  - If we run our system on one machine, and it needs to be upgraded or repaired, we have to take it down temporarily.
  - How to get more availability? Distributed systems. (replication)
- Locality: do computation near users to decrease latency
  - Or because our computation is inherently geographic

#### Lamport's definition of a distributed system (half joking)

- "A distributed system is one where you can't get your work done because some machine you've never heard of is broken."
- (Leslie Lamport is a Turing Award winner for his groundbreaking work in distributed systems. Most of the first half of the class is stuff invented by him.)

#### We've made some progress since Lamport's joke

Today, we think of a distributed system as one where you can get your work done (almost always):

- wherever you are
- whenever you want
- even if some components in the system's implementation are broken
- no matter how many other people are using the system
- as if it were a single dedicated system just for you

How is that possible? We will learn :)

#### Another definition

- Distributed systems are characterized by concurrency and partial failure
  - Concurrency: doing more than one thing
  - Partial failure: some parts are not working even though others are

#### Concurrency

Concurrency is fundamental to building systems used by multiple people to work on multiple tasks.

- This is a tautology, but it has important consequences for many types of systems
  - Operating Systems (CSE 451)
    - A single computer is used by multiple users, and each user runs multiple processes.
    - If one program has a null pointer exception, the entire computer does not crash.
  - Networked systems (CSE 461)
    - Connect multiple computers together, sharing network resources
    - A network is a kind of distributed system
  - Database Systems (CSE 444)
    - How to manage (big) data reliably and efficiently, accessed by multiple users (transactions)
    - Lots to worry about (including concurrency!) in the single-node setting, but there are also distributed databases.

#### Partial failure

- If we're Google and we have a million machines, some of them are going to be broken at any given time.
- But we still want to get our work done.

### How to serve a billion clients?

#### Straw proposal

(A [straw proposal](https://en.wikipedia.org/wiki/Straw_man_proposal) is one intended to illustrate its *dis*advantages.)

- Just pay Intel/Apple/IBM/DEC/whoever to build a single really big computer that can serve all of our clients.
- This is pretty much how we did it for the first 60 years of computing, up to the late 90s!
- Had some great help from our friends the computer architects. We just wait a few years and:
  - Our system would get faster because CPUs get more sophisticated (smaller transistors -> can use more of them to implement the CPU)
  - Our system would get faster because clock frequencies go up (smaller transistors -> smaller chips -> less speed-of-light delay across the chip -> can increase the clock)
  - Our system would get more power efficient (smaller transistors require less power, it turns out)
- But this doesn't work any more, for a few reasons:
  - First, many of the architectural free lunches are over.
    - Clock speeds are constant, power savings have run out, and transistors are getting smaller much more slowly.
    - (In fact, lots of exciting recent computer architecture is about how to scale chips further by making them more like networked systems!)
  - Second, and more importantly, to serve billions of clients, there is no single computer anywhere close to the horizon that can handle this.
    - The biggest "single" machines are incredibly expensive (e.g., national labs' supercomputers), and if the machine crashes everybody is out of luck, so they have to be manufactured to higher-than-normal standards.
    - So the cheapest way to build these big "single" machines is usually to treat them more like distributed systems of lots of smaller components.

So this straw proposal doesn't really work. What should we do instead to serve billions of clients?

#### Using more than one machine

We simply have too much work to be done on a single machine. Let's think through using more than one machine.

If we had 10 servers, we could have each of them be responsible for a tenth of the clients.

- If we add more servers, we can handle even more clients.
- (This idea of splitting work across machines is called "sharding".)

-> To serve a huge number of clients requires many components working together.

Suppose we decide we need 1000 servers to serve our clients (small by today's standards).

- And suppose each of our servers experiences a hard drive failure about once a year.
- Then across our fleet, we expect about 3 hard drive failures per day (1000 failures per year / 365 days per year ≈ 2.7).

-> Individual component failures are common in large-scale systems.

If each of these failures had to be handled manually, we'd be working around the clock just to keep the system up.

-> Large-scale systems must tolerate failures automatically.

The primary technique we will study for tolerating failures is replication.

- The idea is that you run two or more copies of your system at the same time.
- That way, if some of the copies fail, the other copies are still around and clients can make progress.

-> Failures can be handled automatically through replication.

While replication is great for fault tolerance, it has a couple of negative consequences.

- First, we need a lot more machines! (At least a factor of 2 more.)
  - And they should probably be in different cities so that all the replicas don't get taken out at once.
  - (And that means it's going to take a lot longer for messages to get between replicas.)
- Second, and more importantly, replication introduces a new obligation on us, the system designers:
  - We must keep the different replicas in sync with each other. (We call this consistency.)
  - Otherwise, clients could observe contradictory or wrong results when interacting with different replicas. (See the sketch below.)

-> Any system that uses replication must have a story for how the replicas are kept consistent, so that clients are blissfully unaware that our service is replicated internally.
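To make the consistency obligation concrete, here is a tiny illustration (the scenario and class name are made up; Java is used since that is the language of the labs) of two replicas of a key-value map that receive the same two writes in different orders and end up disagreeing:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: two replicas of a key-value map apply the same
// two writes in different orders and end up with different contents.
public class ReplicaDivergence {
    public static void main(String[] args) {
        Map<String, String> replicaA = new HashMap<>();
        Map<String, String> replicaB = new HashMap<>();

        // Two clients concurrently write to the same key, and the network
        // delivers the writes to the two replicas in different orders.
        replicaA.put("x", "1"); // replica A sees client 1's write first...
        replicaA.put("x", "2"); // ...then client 2's write.

        replicaB.put("x", "2"); // replica B sees them in the opposite order.
        replicaB.put("x", "1");

        // A read of "x" now returns "2" at replica A but "1" at replica B:
        // clients observe contradictory results depending on which replica
        // they happen to talk to.
        System.out.println("replica A: x = " + replicaA.get("x"));
        System.out.println("replica B: x = " + replicaB.get("x"));
    }
}
```

A consistency protocol's job is to rule out exactly this kind of divergence, for example by forcing all replicas to apply writes in the same order.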
But this consistency strategy has other consequences:

-> Consistency protocols introduce significant performance overhead (additional messages, delays, etc.)

In fact, we may find that the performance of our consistent replicated system (which uses 2000 or more servers) is so much worse than our original unreplicated (not fault-tolerant) 1000-server plan that we need to increase the number of servers again, say by another factor of 2, to get back that performance.

#### The fundamental tension

To summarize our discussion above, here is the fundamental tension in distributed systems:

- To serve many clients, need many servers.
- With many servers, failures are common.
- To tolerate failures, replicate.
- Replication introduces possibility of inconsistency.
- Implement consistency protocol.
- Consistency protocol is slow; can't serve all our clients.
- Add more servers...
- Even rarer failures become common...
- Increase replication factor...
- Consistency is even more expensive to achieve...
- Performance is still unacceptable...
- Add even more servers...

In other words:
-> There is a tradeoff between fault tolerance, consistency, and performance.

We will study several ways of navigating this tradeoff throughout the quarter.

- Give up on fault tolerance: use sharding to get high performance; no replication means no consistency worries.
- Give up on consistency: can use a cheaper consistency protocol to get better performance while retaining fault tolerance.
- Give up on performance: use an expensive consistency protocol and assume the workload will be relatively low (often makes sense for "administrative" parts of a system).

### Challenges

Why is this any different from "normal" programming?

- All of us already know how to program. Isn't it just a matter of running a program on multiple machines?
- Yes and no, but mostly no :)

Remember: concurrency and partial failure. Concurrency is hard in any scenario. Partial failure is also hard:

- machines crash
- machines reboot (losing the contents of memory)
- disks fail (losing their contents)
- network packets are delayed and lost (see the sketch after this list)
- the network itself goes down, in part or in whole
- machines misbehave (have bugs, get hacked)
  - send messages they aren't supposed to
- networks misbehave (have bugs, get hacked)
  - corrupt packet contents in transit
  - inject packets that were never sent
- people make mistakes
  - misconfigure machines
  - misconfigure the network
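As promised above, here is a minimal sketch of one standard way to cope with a network that delays, drops, and duplicates packets: the sender keeps retransmitting a request until it hears a reply, and the receiver uses per-sender sequence numbers to avoid executing the same request twice. (The class and method names are made up for illustration; this is roughly the "sequence numbers and retransmission" idea mentioned under failure models below, and the flavor of the exactly-once RPC you will build in Lab 1.)

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the receiver side of retransmission + deduplication.
// Assumption: each sender retransmits one request at a time until it gets a
// reply, so the receiver only needs to remember the latest result per sender.
public class DedupReceiver {
    private final Map<String, Integer> lastExecuted = new HashMap<>(); // per-sender seq num
    private final Map<String, String> lastResult = new HashMap<>();    // cached reply

    public String handle(String sender, int seqNum, String request) {
        Integer last = lastExecuted.get(sender);
        if (last != null && seqNum <= last) {
            // Duplicate or retransmission: do NOT execute again; resend the cached reply.
            return lastResult.get(sender);
        }
        String result = execute(request); // first time we've seen this request: execute it
        lastExecuted.put(sender, seqNum);
        lastResult.put(sender, result);
        return result;
    }

    private String execute(String request) {
        return "done(" + request + ")";
    }

    public static void main(String[] args) {
        DedupReceiver server = new DedupReceiver();
        // The client retransmits request #1 because its first reply was lost.
        System.out.println(server.handle("client1", 1, "increment")); // executed
        System.out.println(server.handle("client1", 1, "increment")); // duplicate: cached reply
        System.out.println(server.handle("client1", 2, "read"));      // next request: executed
    }
}
```

The sender side (not shown) simply tags each request with the next sequence number and resends it until a reply arrives; together, retransmission plus deduplication means each request is executed at most once, and eventually exactly once as long as the receiver stays up.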

- It's **super important** when designing a system to be really clear about what failures you are trying to handle, and what failures you are willing to ignore or treat as "impossible".
  - Both parts are important! What you leave out means a lot.
  - We will call this choice a "failure model" or "fault model".
- Common failure models
  - Asynchronous unreliable message delivery with fail-stop nodes:
    - The network can arbitrarily delay, reorder, drop, and duplicate messages.
    - Nodes can crash, but if they do, they never come back.
    - This is the model for the labs
  - Asynchronous reliable in-order message delivery with fail-stop nodes:
    - The network can delay messages arbitrarily, but they are delivered in order and eventually exactly once.
    - Nodes same as above
    - This is the model for some important distributed algorithms we will study but not implement.
    - Can be implemented on top of the previous model with sequence numbers and retransmission.
  - "Byzantine" models refer to models that allow misbehavior
    - For example:
      - The network delivers a message to N2 as if it was sent by node N1, but N1 never sent that message.
      - The network corrupts a message.
      - A node does not follow the protocol (it got hacked, or cosmic ray, or something)
    - We will not study these models much, but fun fact:
      - Surprisingly, you *can* actually build systems that tolerate some number of Byzantine failures

Want to be correct under *all* possible behaviors allowed by our fault model.

- Distributed systems are usually very challenging to test because there are many combinations of failures to consider.
- (This course does a particularly good job at testing!)

### The reading for today

- Lots of good discussion, if a bit dated.
- Check out our sweet animations
- In general, readings for first 7 weeks supplement lectures and are good reference material.
- Ok to read fast and come back to it later as needed. (Good skill.)

#### The eight fallacies

1. The network is reliable.
1. Latency is zero.
1. Bandwidth is infinite.
1. The network is secure.
1. Topology doesn't change.
1. There is one administrator.
1. Transport cost is zero.
1. The network is homogeneous.

## Course Mechanics

#### Intros

- I'm James
- We have TAs

#### Teaching philosophy

- To understand X, build an X
- Systems <-> Theory

### Syllabus items

#### Big picture course calendar:

- distributed systems fundamentals: how to think about global state (3 weeks of lecture, labs 1 and 2, problem sets 1-4)
- real-world distributed systems problems: replication, sharding, caching, and atomic transactions (4 weeks, labs 3 and 4, problem sets 5-8)
- research papers in distributed systems: how people have built practical systems (3 weeks, reading 8 research papers and writing 8 blog posts)

#### Work

- Three kinds of assignments: labs, problem sets, and readings/blogs
- Labs:
  - Core of the course
  - Extremely challenging -- alumni say one of the hardest things they did as an undergrad, but also very rewarding
  - Remember our teaching philosophy: to understand X, build an X
- Problem sets:
  - One or two problems about every lecture, collected the following Friday
  - Recommend you do them soon after the lecture
  - Some problems will ask you things about your lab design.
- Readings
  - No good textbook for this course.
  - For the first 7 weeks, we will assign classic papers and tutorials corresponding to most lectures.
    - Skim these before class, then come to class, then read more thoroughly as needed after class to do the problem sets and labs.
  - For the last 3 weeks, we will read "modern" research papers.
    - (Rare to get to do this in an undergrad class!)
    - Different flavor from the first 7 weeks. Much more practical.
    - Read these *deeply* before class and come prepared to discuss.
    - It's ok if you don't understand everything!
    - One of our goals is to teach you how to read a paper.
    - For each paper, you will write a post on the discussion board.
    - When the time comes, see the homepage for information on the required structure of each post.

#### Collaboration rules

- On your own: Lab 1, problem sets unless otherwise noted, blogs
- With a partner: Labs 2-4, specially marked problem set questions
- See course homepage for details.

#### Grading

- No exams or quizzes or anything like that.
- The class is not curved.
- You earn points by doing work. See the course webpage for exact point amounts.
- Your grade out of 4.0 is computed by a formula from your total number of points. (Again, see webpage.)
- Important consequence: all assignments are "optional"
  - Feel free to just not turn stuff in if you don't need the points for your target grade.
- Late policy
  - Problem sets: (manually graded) 48 hours for free on every problem set, after that no credit.
  - Labs: (autograded) no penalty for late work, but other parts of the class depend on timely completion of labs.
  - Blogs: within 48 hours for half credit, after that no credit.

#### Course organization

- Lecture
  - Mostly (virtual) whiteboard
  - James will post notes
- Section: mostly about the labs
- Remote:
  - University seems sure we are going back in person next week
  - Do not come to class if you are sick, even if it is not covid.
  - If you get sick, let us know what you need. We can usually be flexible.
  - We should be prepared to have to return to remote temporarily at some point.

#### Course resources

- Zoom for all meetings until further notice (links on the webpage)
- Gitlab for labs (turn in via git tag)
- Ed for announcements, questions, and blog posts
  - Preferred mechanism for communication with staff
- Discord for informal chitchat (not required, and you can be anonymous if you want)
- Gradescope for delivering feedback

### More project info

#### Overview

- The course has a large quarter-long programming project.
- You will build a sharded, linearizable, fault-tolerant key-value store with dynamic load balancing and atomic multi-key transactions.
- You are not expected to know what any of those words mean yet! You will by the end of the quarter.
- Some definitions to get us started:
  - key-value store: persistent hash map (persistent meaning clients can store some data, go away, and come back later and still find the data they stored earlier)
  - linearizable: from the client's perspective, the service "looks like" a single server
  - fault-tolerant: system continues to work despite some of its components failing
  - sharded: split the key space across multiple nodes to increase performance
  - dynamic load balancing: keys can move between nodes, e.g. to spread work more evenly
  - atomic multi-key transactions: support linearizable operations over multiple keys, even if those keys are stored on different shards
- It's still ok if this doesn't make much sense yet :)

#### Project Mechanics

- Lab 0: introduction to our framework and tools
  - Read through lab 0 and set up tools before section this week
  - Nothing to turn in
- Lab 1: exactly-once RPC, key-value store
  - Due Friday, January 14
  - Completed individually
- Labs 2-4: completed with a partner
- Lab 2: primary/backup
  - Intro to replication for fault tolerance
  - Has some important limitations
- Lab 3: Paxos
  - Realistic fault tolerance algorithm
  - Addresses many limitations of Lab 2
- Lab 4: sharding, load balancing, multi-key cross-shard transactions

#### Project Tools

- Automated testing and grading
  - All tests and grading data are available to you while you complete the project
  - Two kinds of tests
    - "Run tests": specific scenarios constructed by staff to exercise important parts of your code
    - "Search tests": exhaustively search the space of executions using a model checker, making sure your code works under all possible message delivery orderings and node failures
- Visual debugger
  - Control and replay over message delivery and node failures
- Implement in Java
  - Several restrictions to support model checking
  - Model checker needs to:
    - Detect all possible "next steps" from a given state
    - Select and execute each such step, in an order of its choosing
    - Collapse equivalent states for efficiency
  - Several consequences:
    - Our systems must be deterministic (given the same inputs, perform exactly the same actions)
    - Must not "hide" any state from the model checker (static mutable fields, "real" I/O, random number generators, current date/time, ...)
- A great example of engineering a system *to be tested* (see the sketch at the end of these notes)
  - In this class, we take this to the extreme: testing is the *primary* concern, because correctness is so challenging in distributed systems and is essential to understanding. We won't run our systems except in the test framework, so it becomes a kind of simulator.
  - In the real world, we would also need to take other concerns into account (performance on real networks and machines, etc.)
    - In fact, our test framework can be adapted to run on real networks (see the Slabs paper)
  - In the real world, we recommend you don't place any *less* emphasis on correctness than we do in this class, but instead simply add further design constraints on performance, etc.
  - Systems engineered in this way tend to look quite different from the "naive" approach of "just write some code that calls the network API in various places".
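As a taste of what engineering a system *to be tested* looks like, here is a minimal sketch (the class and method names are hypothetical, not the lab framework's actual API) of a node written as a deterministic message handler: its behavior depends only on its explicit state and the message it is handed, with no wall-clock reads, random numbers, or real I/O hidden from the model checker.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a node in the style the model checker needs: a
// deterministic state machine. Given the same state and the same message,
// handleMessage always makes the same state change and returns the same reply.
public class KVServerSketch {
    private final Map<String, String> store = new HashMap<>(); // all state is explicit

    // Deliberately absent from this class, because they would hide state from
    // the model checker or break reproducibility: System.currentTimeMillis(),
    // new Random(), static mutable fields, threads, and "real" network/disk I/O.

    public String handleMessage(String op, String key, String value) {
        switch (op) {
            case "PUT":
                store.put(key, value);
                return "PutOk";
            case "GET":
                String v = store.get(key);
                return v == null ? "KeyNotFound" : v;
            default:
                return "UnknownOp";
        }
    }

    public static void main(String[] args) {
        KVServerSketch node = new KVServerSketch();
        System.out.println(node.handleMessage("PUT", "x", "1"));  // PutOk
        System.out.println(node.handleMessage("GET", "x", null)); // 1
    }
}
```

Because every transition is explicit and reproducible, the model checker can enumerate the possible message deliveries and node failures from any state and replay any execution it finds exactly.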