Agenda:
- Intro & administrivia
- Intro to distributed systems
- RPC
- MapReduce & Lab 1

----------

Administrivia

- This is a class about distributed systems
- I'm excited about this class
- Distributed systems are some of the most powerful things we build in CS, but also some of the hardest -- lots of subtleties
- Distributed systems are incredibly relevant today - web, Google, Facebook, Amazon, Netflix, anything mobile, ...

This course:
- key ideas, abstractions, and techniques for building distributed systems
- assumes you've taken an undergrad OS or networking class or have equivalent experience, e.g., you should know about client-server apps, the TCP/IP stack, processes and threads, etc.

This course:
- readings and discussions of research papers - no textbook
  - how to read a paper?
  - blog posts
- a project based around designing and building a scalable, consistent key-value store

Course staff:
- instructor: Dan Ports, drkp@cs.washington.edu, CSE 468, office hours Mon 5pm or by appt
- TA: Haichen Shen, haichen@cs.washington.edu
- TA: Adriana Szekeres, aaasz@cs.washington.edu

----------

Distributed systems

What is a distributed system?
- multiple interconnected computers that cooperate to provide some service
- examples?

Our model of computation used to be a single computer - mainframe, PC, etc.
- now it's a datacenter with many thousands of servers
- with clients on phones and browsers communicating over the wide area

Why should we build (or care about) distributed systems?
- Higher capacity - today's workloads don't fit on one machine
  - aggregate CPU cycles, memory, disks, network bandwidth
- Geographic separation
- Build reliable systems - even out of unreliable components!

What are the challenges in distributed system design?
- system design: what goes on the client vs. the server, what are the right protocols?
- reasoning about state in a distributed environment
  - locating data: what gets stored on which server
  - consistency: multiple copies of shared data
  - concurrency: multiple readers and writers to shared data
- failures:
  - hardware failures
  - communication failures
  - can't tell the difference?
  - security (Byzantine faults)
- performance
  - latency of coordinating between multiple machines
  - bandwidth as a scarce resource
- testing
  - often too many possible cases (failures, etc.) to test exhaustively

We want more scalable, more reliable distributed systems. But it's easy to make a distributed system *less* scalable and *less* reliable than a centralized system.

Major challenge: keep the system doing useful work in the presence of partial failures.

A data center
- Facebook, Prineville OR
- 10x the size of this building, $1B cost, 30 MW power
- 200K+ servers
- 500K+ disks
- 10K network switches
- 300K+ network cables
- What is the likelihood that all components are correctly functioning at any given time?
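A back-of-envelope answer (with made-up but plausible numbers, not Facebook's): if each of 200,000 servers is independently healthy 99.9% of the time, the chance that every one of them is healthy at a given instant is about
    0.999^200000 ≈ e^-200 ≈ 10^-87
which is effectively zero. At this scale, something is always broken.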
Typical first year for a new cluster: [Jeff Dean, Google]
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packet loss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external VIPs for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for DNS
~1000 individual machine failures
~10000 hard drive failures

Part of the system is *constantly* failed.

Lamport's definition, c. 1990: a distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.

And yet today: distributed systems work most of the time
- wherever you are
- whenever you want
- even though parts of the system have failed
- even though thousands of other people are using it

Another challenge: managing state
- concurrency and consistency
- How do we keep all of this state consistent?

Thinking about distributed state involves a lot of subtleties.

Thought experiment 1
- suppose there's a group of people, two of whom have green dots on their foreheads
- without using a mirror or directly asking, can anyone tell if they have a green dot themselves?
- what if I say to everyone: someone has a green dot
  - everyone already knew this!
- difference between individual knowledge and common knowledge
  - everyone knows that someone has a green dot, but doesn't know that everyone knows it, that everyone knows that everyone knows it, and so on

Thought experiment 2 - the two generals problem
- two generals commanding two armies on two hills surrounding the enemy
- both need to attack at the same time to succeed
- in other words, we need common knowledge about the time of attack: each general knows that the other knows the time, and so on ad infinitum
- they can only coordinate using messengers, and messengers can be lost
- what protocol could we use?
- no protocol works!
  - suppose we had a protocol that could give us this common knowledge
  - the last message could have been dropped
  - so it obviously wasn't necessary, and we can remove it
  - but we could repeat this until we get rid of all the messages, and a protocol with no messages clearly can't work

Distributed systems are hard
- we just saw that something pretty simple was impossible
- we will see that other things you would want to do are impossible
  - consensus: get all nodes to always agree on a single value, e.g., the order of operations or the value of my bank account
  - build distributed systems that are always consistent and always available (the CAP theorem)
- and yet we will do them anyway
  - many things are possible in practice if we make some assumptions
  - or failure cases can be made vanishingly rare

Topics we will cover

----------

RPC

How should the parts of a distributed system communicate?
- could send messages manually
- CS is about finding useful abstractions to simplify programming
- this is one for network communication

Common pattern: client/server communication
- client sends request to server, server sends reply

This paper is a bit hard to appreciate
- 1984, Xerox PARC, Xerox Dorado, 3Mbit ethernet prototype, ...
- What was a distributed system back then?
- The idea seems obvious in retrospect

Suppose I wanted to write a simple banking system... only on one host for now.

float balance(int accountID) {
    return balance[accountID];
}

void deposit(int accountID, float amount) {
    balance[accountID] += amount;
}

client() {
    deposit(42, $50.00);
    print balance(42);
}

No problem calling deposit() and balance() -- these are just standard function calls.

Of course, we probably want our client to be able to run on a different machine, for all the usual reasons. So what would we have to do to call the balance/deposit functions on a remote machine?

Define a protocol:

request "balance" = 1 {
    arguments { int accountID (4 bytes) }
    response  { float balance (8 bytes) }
}

request "deposit" = 2 {
    arguments { int accountID (4 bytes); float amount (8 bytes) }
    response  { }
}

client() {
    s = socket(UDP)
    msg = {2, 42, 50.00}              // marshalling
    send(s -> server_address, msg)
    response = receive(s)
    check response == "OK"
    msg = {1, 42}
    send(s -> server_address, msg)
    response = receive(s)
    print "balance is" + response
}

server() {
    s = socket(UDP)
    bind s to port 1024
    while (1) {
        msg, client_addr = receive(s)
        type = byte 0 of msg
        if (type == 1) {
            account = bytes 1-4 of msg    // unmarshalling
            result = balance(account)
            send(s -> client_addr, result)
        } else if (type == 2) {
            account = bytes 1-4 of msg
            amount = bytes 5-12 of msg
            deposit(account, amount)
            send(s -> client_addr, "OK")
        }
    }
}

Some details along the way:
- a socket is our portal for communicating with the outside world
- What is the server address? IP address + port, e.g. 128.208.5.65:1024
- How do we know the IP address? Probably from DNS
- Why do we need the port? To identify a program on the particular machine; there might be more than one.
- How do we know the port number? Servers usually operate on standardized, well-known port numbers, e.g. 25 for email, 80 for HTTP

OK, so we wrote a client/server application. But:
- it was tedious and we are lazy
- we had to get the marshalling and unmarshalling exactly right on both ends; it won't work if we're off by even one byte (and this is a simple example, and I waved my hands a bit; what about byte ordering, requests too large for one message, etc.?)
- this client looks nothing like the "client" we had before

So you should be thinking: can we automate this process?

Yes: if we wrote the protocol in a machine-readable way, we could generate a lot of this code.

    protocol ---RPC compiler---> client stubs, server stubs

Stubs are glue code that handle the communication details.

Client stub:

deposit_stub(int account, float amount) {
    // marshall request type + arguments into buffer
    // send request to server
    // wait for reply
    // decode response
    // return result
}

To the client, this looks like calling the deposit function we started with!

Server stub:

loop:
    wait for command
    decode and unpack request parameters
    call procedure
    build reply message containing results
    send reply

Pretty much exactly the code we just wrote for the server!
To the server code (deposit, balance), it looks just like the client is calling the procedures directly!

Client code -> stub -> network <- stub <- server code

The point of all this: transparency!
- try to make remote procedure calls look just like local procedure calls
- note that we could reuse the entire single-machine code we started with, and add the stubs to make it run on a distributed system. That seems great!
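As a concrete (if anachronistic) illustration of the stub idea, here is a minimal sketch in Go using the standard library's net/rpc package, which plays the role of the stub generator and marshalling layer; the Bank type and method names just mirror the toy example above and are not from the paper.

package main

import (
    "fmt"
    "log"
    "net"
    "net/rpc"
)

// Bank is the single-machine code, unchanged except for RPC-friendly
// signatures (args in, pointer to result out, error returned).
type Bank struct {
    balance map[int]float64 // a real server would guard this with a mutex for concurrent requests
}

type DepositArgs struct {
    AccountID int
    Amount    float64
}

func (b *Bank) Deposit(args DepositArgs, reply *string) error {
    b.balance[args.AccountID] += args.Amount
    *reply = "OK"
    return nil
}

func (b *Bank) Balance(accountID int, reply *float64) error {
    *reply = b.balance[accountID]
    return nil
}

func main() {
    // Server side: register the object; net/rpc handles marshalling and dispatch.
    bank := &Bank{balance: make(map[int]float64)}
    rpc.Register(bank)
    ln, err := net.Listen("tcp", ":1024")
    if err != nil {
        log.Fatal(err)
    }
    go rpc.Accept(ln)

    // Client side: Call() is the "stub" -- it looks almost like a local call.
    client, err := rpc.Dial("tcp", "localhost:1024")
    if err != nil {
        log.Fatal(err)
    }
    var ok string
    if err := client.Call("Bank.Deposit", DepositArgs{AccountID: 42, Amount: 50.00}, &ok); err != nil {
        log.Fatal(err)
    }
    var bal float64
    if err := client.Call("Bank.Balance", 42, &bal); err != nil {
        log.Fatal(err)
    }
    fmt.Println("balance is", bal)
}

Note that even here the transparency is imperfect: every call returns an error the caller has to think about, which is exactly where the discussion goes next.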
Examples of RPC systems that do this sort of stub generation:
- Sun RPC (used in the NFS paper)
- XML-RPC / SOAP
- Google Protocol Buffers
- Java RMI

How do the stubs find the server address?
- Could hardcode it -- inflexible
- Could ask a name service that maps service name to host, port

Is RPC really transparent: can we really just treat remote procedure calls as ordinary procedure calls? Not quite.
- performance
  - local call: maybe 10 cycles = ~3 ns
  - RPC: 0.1-1 ms on a LAN => ~10K-100K times slower
  - in the wide area: can easily be millions of times slower
- failures
  - what happens if messages get dropped, the client or server crashes, etc.?
- also security, concurrent requests, etc.

What kinds of failures are there?
- communication failures (messages may be delayed, take a variable time to arrive, or never arrive at all)
- machine failures -- either client or server
- sometimes we can't tell whether it was a dropped request message or a dropped reply message, or whether it was a communication failure or a machine failure, or, if it was a machine failure, whether the machine crashed before or after processing the request

What semantics do we get with our RPC implementation above?
- It just hangs if there's a failure. Not good.
- Slightly better: time out and tell the application we failed.
- Also: might execute a request twice due to a duplicate packet.

Alternative: at least once
- retry until we get a successful response

Alternative: at most once
- give each request an ID and have the server keep track of whether it's been seen before (a sketch follows below)
- dealing with server failures is also a problem

Which do you think is best? At-least-once versus at-most-once?

What is right? It depends on where RPC is used. For example: file systems.
- Applications running on the file system don't know whether files are local or remote, so they can't be expected to handle remote failures themselves.
- What then?
  - NFS: at least once, and just deal with the occasional blip
  - other network file systems: at most once, but inside the file system, mask the problem from the user
- more sophisticated applications: need an application-level plan in both cases; it's not clear at-most-once gives you a leg up

=> Handling machine failures makes RPC different than procedure calls
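Before moving on, a minimal sketch in Go of the at-most-once bookkeeping described above; the request-ID scheme, type names, and in-memory table are illustrative assumptions, not the protocol of any particular RPC system.

package main

import (
    "fmt"
    "sync"
)

// Reply is whatever the server would normally return for a request.
type Reply struct {
    Value string
}

// AtMostOnceServer remembers which request IDs it has already executed,
// so a retransmitted request gets the saved reply instead of running twice.
type AtMostOnceServer struct {
    mu   sync.Mutex
    seen map[int64]Reply // request ID -> reply we already sent
}

func NewAtMostOnceServer() *AtMostOnceServer {
    return &AtMostOnceServer{seen: make(map[int64]Reply)}
}

// Handle runs execute (the actual operation, e.g. deposit) only if this
// request ID has not been seen before; otherwise it returns the saved reply.
func (s *AtMostOnceServer) Handle(reqID int64, execute func() Reply) Reply {
    s.mu.Lock()
    defer s.mu.Unlock()
    if r, ok := s.seen[reqID]; ok {
        return r // duplicate: resend the old reply, do not re-execute
    }
    r := execute()
    s.seen[reqID] = r
    return r
}

func main() {
    s := NewAtMostOnceServer()
    deposit := func() Reply { return Reply{Value: "OK"} }
    fmt.Println(s.Handle(7, deposit)) // executes the operation
    fmt.Println(s.Handle(7, deposit)) // duplicate request: returns saved reply only
}

What the sketch leaves out is exactly the hard part noted above: the seen table grows without bound unless old entries are garbage-collected, and if the server crashes and loses the table, a retransmitted request can execute twice unless the table is kept on stable storage.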
----------

MapReduce

One of the first "big data" systems
- hugely influential: at Google, Hadoop, lots of intellectual children

Motivation: huge data sets
- from web crawls, logs, databases
- too big to store and process on one machine!
- petabyte scale: reading a TB from disk takes around an hour, so about a month just to read the data on one disk, and there may be multiple phases
- what are the challenges?
  - which nodes should be involved?
  - what transport protocol should be used?
  - how do you manage threads / events / connections?
  - remote execution of your processing code?
  - fault tolerance and fault detection
  - load balancing / partitioning of data
    - heterogeneity of nodes
    - skew in data
  - network topology and bisection bandwidth issues
- would be really nice to solve this just once

The map-reduce computation model

    input splits ---map---> intermediate ---reduce---> output

Input and output are collections of key-value pairs.

First question: can we express useful things in this model?

Inverted index example - maps word -> list of documents containing it
- input: [document name -> list of words]
- output: [word -> list of documents]
- What's the intermediate key-value pair? [word -> document]
- What is the map function?
    for docname, text in input:
        for word in text:
            emit(key=word, value=docname)    // & maybe location, relevancy...
- What does the reduce function get run on? All intermediate values with the same key
- What is the reduce function?
    reduce(key, values):
        sort(values)
        emit(key, values)
  (not much of a reduce phase; the grouping by key did most of the work!)

Word count example (a Go sketch of the two functions appears at the end of these notes):
- input: [document name -> text]
- output: [word -> count]
- What's the intermediate key-value pair? [word -> 1]
- What is the map function?
    for docname, text in input:
        for word in text:
            emit(key=word, value=1)
- What does the reduce function get run on? All intermediate values with the same key
- What is the reduce function?
    reduce(key, values):
        emit(key, sum(values))
  (here the reduce phase does real work: it adds up the counts)

Is this a good model?
- Inverted indexes are useful. Are there other useful things we can do?
  - counting words, finding all documents with a word, sorting, finding reverse web links
- What can't we do?
  - Anything that changes data: map and reduce need to be non-mutating
    - online transaction processing (bank), document editing
  - Anything that worries about latency (vs. batch processing)
    - don't want to start a MR query on a user's Google search
    - and in fact Google now uses incremental updates (Percolator) instead of MapReduce for generating the index
  - Anything that requires multiple passes over the same data, e.g.:
    1. score web pages by the words they contain
    2. score web pages by # of incoming links
    3. combine the two scores
    4. sort by combined score
- These restrictions simplify the problem!

How does this get implemented?

    input splits -> map workers -> intermediate files -> reduce workers -> output files

- GFS underlying all of this -- think of it as a single shared disk, although really it's a replicated distributed file system; we'll look at it later
- and a master coordinating it all

How does word count get implemented? (see slides)

Why might MR have good performance?
- Map and Reduce functions run in parallel on different workers
- N workers -> divide run-time by N
- But rarely quite that good:
  - moving map output to the reduce workers
  - stragglers
  - reads/writes go through a network file system

Fault tolerance
- What happens if a worker fails?
  The master detects it and reschedules its work on another worker.
- What happens if a worker is just slow?
  Same thing. It's OK if a task gets executed twice, because inputs are immutable and tasks have no side effects.
- What happens if the master fails?
  Failure. The user has to deal with it. Could probably replicate or checkpoint the master.
- What if an input record causes the worker to crash?
  After two tries that crash, skip that record. Is this OK? Apparently for web search.

Lab 1 has three parts:
- Part I: just Map() and Reduce() for word count
- Part II: we give you most of a distributed multi-server framework; you fill in the master code that hands out the work to a set of worker threads
- Part III: make the master cope with crashed workers by re-trying

Discussion
- How does the restricted model (the simplifying assumptions made by the authors) help simplify the implementation?
  - map & reduce phases are easy to parallelize
  - no shared address space; programmers are made aware of the partitioning rather than having it hidden from them
  - side-effect-free, no interactions between workers: don't have to worry about race conditions, etc.
  - side-effect-free: failures, stragglers, and re-execution are not a problem
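To make the word-count example above concrete, here is a minimal sketch in Go of just the two user-supplied functions; the KeyValue type and the function signatures are illustrative assumptions, not necessarily the interfaces Lab 1 uses.

package main

import (
    "fmt"
    "strconv"
    "strings"
    "unicode"
)

// KeyValue is the intermediate pair type (illustrative).
type KeyValue struct {
    Key   string
    Value string
}

// Map is called once per input document and emits one ("word", "1") pair
// per word occurrence.
func Map(docname string, contents string) []KeyValue {
    words := strings.FieldsFunc(contents, func(r rune) bool {
        return !unicode.IsLetter(r)
    })
    var kvs []KeyValue
    for _, w := range words {
        kvs = append(kvs, KeyValue{Key: w, Value: "1"})
    }
    return kvs
}

// Reduce is called once per distinct word with every value emitted for that
// word, and returns the total count.
func Reduce(key string, values []string) string {
    return strconv.Itoa(len(values)) // each value is "1", so the count is len(values)
}

func main() {
    // Stand-in for the framework: run Map, group by key, run Reduce.
    kvs := Map("doc1", "the quick brown fox jumps over the lazy dog the end")
    groups := map[string][]string{}
    for _, kv := range kvs {
        groups[kv.Key] = append(groups[kv.Key], kv.Value)
    }
    fmt.Println("the ->", Reduce("the", groups["the"])) // prints: the -> 3
}

Everything else -- splitting the input, running Map on many workers, grouping intermediate values by key, and running Reduce -- is the framework's job; the grouping step is what guarantees Reduce sees every "1" emitted for a given word.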