Lecture 2: MapReduce

1. Workload
   The workload question came up. This is among the most difficult
   material in CS -- the core algorithm we're going to have you implement
   is probably the hardest-to-understand algorithm in widespread use. As
   with other courses, there's a difference between using distributed
   systems (easy) and building distributed systems (hard).
   I hope the workload will be 10-15% less than fall quarter 451, which
   was 10-15% less than last spring's 451. For the masters students it's
   hard to tell -- easier since there's less coordination, harder because
   there's less of a safety net.
   The key to surviving: we're going to provide a lot of help. But you
   need to think and plan very carefully before writing code. And there
   are outs: skip an assignment if you have to. Even skip two -- it will
   have only a minimal effect on your grade. No exams; there are a few
   problem sets, but they will be short.

2. Common knowledge vs. knowledge
   Distributed systems are conceptually hard. An example: a group of
   people who can all see each other. Two have green dots. If you are one
   of them, can you tell whether you have a green dot? (No mirrors.)
   Obviously not! Now add knowledge everyone already has: someone has a
   green dot. Yet with that, one of the people with a green dot can now
   declare: yes. How?

3. Failures are endemic
   Initially, planes were designed with one engine -- which kind of made
   the whole endeavor risky! So: two engines. But on-time arrivals got
   much worse with two engines than with one. Why? (Twice as many engines
   means twice the chance that some engine fails and grounds the flight.)
   A typical data center has several hundred thousand servers, millions
   of disks, and tens of thousands of network switches. Many will be
   broken, or working badly, at any given point in time. We need to build
   systems that don't just avoid doing the wrong thing under a failure,
   but continue to operate even while partially broken.
   FB data center in Oregon: ~$1B, the size of a football stadium. No AC,
   just evaporative water cooling (bad because they are in a desert --
   water is scarce! -- but good because they are in a desert -- dry air
   makes evaporation effective!).
   Inside, it's kind of like being in a wind tunnel. In winter the air
   circulates, so no heating is needed. In summer it exits. Draw a
   picture of the airflow; hot and cold aisles.

4. Labs
   focus: fault tolerance and consistency -- central to distributed
   systems
   lab 1: MapReduce
   labs 2/3/4: storage servers
     progressively more sophisticated (tolerate more kinds of faults)
     progressively harder too!
     patterned after real systems, e.g. NoSQL storage systems
     lab 4 has the core of a real-world design for 1000s of servers
   what you'll learn from the labs
     it's easy to listen to a lecture or read a paper and think you
     understand it; building forces you to really understand
       "I hear and I forget, I see and I remember, I do and I understand"
       (attributed to Confucius)
     you'll have to do some design yourself
       we supply a skeleton, requirements, and tests
       but we leave you substantial scope to solve problems your own way
     you'll get experience debugging distributed systems
       test cases simulate failure scenarios
       tricky to debug: concurrency and failures
         many clients and servers operating in parallel
         test cases make servers fail at the "most" inopportune times
       [NOTE: not the same thing as an exhaustive test suite! What would
       that be? Mention Verdi -- an ongoing project to prove the
       correctness of core distributed-systems software.]
     think first before starting to code!
       otherwise your solution will be a mess, and/or it will take you a
       lot of time
     code review
       learn from others; judge other solutions
   we've tried to ensure that the hard problems have to do w/ distributed
   systems, not e.g. fighting the language, libraries, &c
     thus Go (type-safe, garbage-collected, slick RPC library)
     thus fairly simple services (MapReduce, key/value store)

5. Lab 1: MapReduce
   A programming model that lets most people using the data center avoid
   thinking about failures and distribution.
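To make the model concrete, here is a hedged Go sketch of the two functions a user writes for word count, plus the hash-partition step. The `KeyValue` type, the `ihash` helper, and the single-split driver in `main` are illustrative assumptions modeled on common MapReduce implementations, not the lab's exact API.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
	"unicode"
)

// KeyValue is an illustrative intermediate-pair type, not the lab's exact one.
type KeyValue struct {
	Key, Value string
}

// Map for word count: (document name, contents) -> list of (word, "1") pairs.
func Map(document, contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kvs := []KeyValue{}
	for _, w := range words {
		kvs = append(kvs, KeyValue{w, "1"})
	}
	return kvs
}

// Reduce for word count: (word, list of "1"s) -> count, as a string.
func Reduce(key string, values []string) string {
	return fmt.Sprintf("%d", len(values))
}

// ihash picks the reduce partition for a key (the "group by hash(key)" step).
func ihash(key string, R int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(R))
}

func main() {
	const R = 4
	// One "split"; a real run maps over M splits in parallel on many workers.
	kvs := Map("doc1", "the quick fox and the lazy dog and the end")
	// Group by key within each of the R reduce partitions.
	partitions := make([]map[string][]string, R)
	for i := range partitions {
		partitions[i] = map[string][]string{}
	}
	for _, kv := range kvs {
		p := partitions[ihash(kv.Key, R)]
		p[kv.Key] = append(p[kv.Key], kv.Value)
	}
	// Each reducer runs Reduce on every key in its partition.
	counts := map[string]string{}
	for _, p := range partitions {
		for k, vs := range p {
			counts[k] = Reduce(k, vs)
		}
	}
	fmt.Println(counts["the"]) // prints 3
}
```

For grep instead of word count, Map would emit (line number, line) only for matching lines and Reduce would be the identity.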
   why MapReduce for lab 1?
     helps you get up to speed on Go and distributed programming
     first exposure to some fault tolerance
       motivation for better fault tolerance in later labs
     motivating app for many papers
     popular distributed programming framework, with many descendant
     frameworks

   Computational model (aimed at document processing):
     map    (k1, v1)       -> list(k2, v2)
     reduce (k2, list(v2)) -> list(v2)
     split the document into a set of (k1, v1) pairs
     run Map(k1, v1) on each element of each split -> set of (k2, v2)
     pairs
     coalesce the results from all splits into one list per key
     run Reduce(k2, list(v2)) -> list(v2) for each key
     merge the results
     the user writes a Map function and a Reduce function; the framework
     takes care of parallelism, distribution, and fault tolerance
     what computations are not targeted? anything that updates a document

   Draw picture:
     input: partitioned into M (nearly equal, but not necessarily equal)
     chunks
     1. run the mapper on each record in each chunk
     2. this produces an intermediate list of key/value pairs
     3. group by hash(key) (on each mapper) -> R chunks on each mapper
     4. each of the R reducers reads the files intended for it
     5. output a list of values

   Example: grep (find which lines contain a pattern of text)
     master splits the file into M almost-equal chunks at line boundaries
     and calls Map on each partition
     map phase
       for each partition, call Map on each line of text
         search the line for the word
         output (line number, line of text) if the word shows up, nil if
         not
       partition results among the R reducers
         Map writes each output record into a file, hashed on key
     reduce phase
       each Reduce job collects 1/R of the output of each Map job
         (all Map jobs must have completed!)
       reduce(k1, v1) -> v2
         here the identity function: v1 in, v1 out
     merge phase: master merges the R outputs

   Performance
     number of tasks: M map + R reduce; typically M, R > N (the number of
     servers)
     how many intermediate files? M x R, since each map task writes one
     file per reducer
       example: M = 200K, R = 5K, N = 2K -> 10^9 intermediate files

   Questions:
     What exactly would happen if one block of one hard drive got erased
     during a map/reduce computation? What parts of the system would fix
     the error (if any), and what parts would be oblivious (if any)?
     How do "stragglers" impact performance?
     How much speedup do we get on N machines?
       ideal: Nx
       bottlenecks:
         stragglers
         network traffic to collect a Reduce partition
         network traffic to interact with the file system
         disk I/O
     Why a roughly fixed size (64MB) for the initial split files?

   Fault tolerance model
     the master is not fault tolerant
       assumption: this single machine won't fail while running a
       MapReduce app
     but there are many workers, so we have to handle their failures
       assumption: workers are fail-stop
         they fail and stop (e.g., they don't send garbled packets after
         a failure)
         they may reboot

   What kinds of faults might we want to tolerate?
     network:
       lost packets
       duplicated packets
       temporary network failure
       server disconnected
       network partitioned
     server:
       server crash+restart (master versus worker?)
       server fails permanently (master versus worker?)
       all servers fail simultaneously -- power/earthquake
       bad case: crash mid-way through a complex operation
         what happens on a failure in the middle of Map or Reduce?
     bugs -- but not in this course
       what happens when there's a bug in Map or Reduce?
       the same bug in Map over and over? management software kills the
       app
     malice -- but not in this course

   Tools for dealing with faults?
     retry -- e.g. if a packet is lost, or a server crashes and restarts
       packets (TCP) and MR jobs
       may execute an MR job twice!
     replicate -- e.g. if one server or part of the net has failed
       the next labs
     replace -- for long-term health
       e.g., replace a worker

   Retrying jobs
     network failure: oops, we execute the job twice
       ok for MapReduce, because Map/Reduce produce the same output
         Map and Reduce are "functional" or "deterministic"
       how about intermediate files? atomic rename
     worker failure: it may have executed the job or not
       so we may execute a job more than once!
       but ok for MapReduce, as long as the Map and Reduce functions are
       deterministic
       what would make Map or Reduce non-deterministic?
     is executing a request twice ok in general? no. in fact, often not.
       an unhappy customer if you execute one credit card transaction
       several times
     adding servers
       easy in MapReduce -- just tell the master
       hard in general
         a server may have lost state (and need to fetch new state)
         a server may have rebooted quickly; others may need to recognize
         that in order to bring it up to date
         a server may have a new role after reboot (e.g., no longer the
         primary)
       these harder issues are what you would have to deal with to make
       the master fault tolerant -- the topic of later labs

   lab 1 simplifications:
     no key in Map
     assume a global file system
     no partial failures (files are either completely written or not
     created; if we restart some failed operation, it's ok to write to
     the same filename)

   the lab 1 app (see main/wc.go)
     stubs for Map and Reduce
     you fill them out to implement word count (wc)
     how would you write grep?

   the lab 1 sequential implementation (see mapreduce/mapreduce.go)
     demo: run wc.go
     code walkthrough: start with RunSingle()

   the lab 1 worker (see mapreduce/worker.go)
     the remote procedure calls (RPCs)
       arguments and replies (see mapreduce/common.go)
     server side of RPC
       RPC handlers have a particular signature
         DoJob
         Shutdown
       RunWorker
         rpcs.Register: register named handlers -- so Call() can find
         them
         Listen: create a socket on which to listen for RPC requests
           for a distributed implementation, replace "unix" w. "tcp" and
           replace "me" with a host:port name
         ServeConn: runs in a separate thread (why? to serve RPCs
         concurrently -- an RPC may block)
     client side of RPC
       Register()
       call() (see common.go) makes an RPC
         the lab code dials for each request
           typical code uses one network connection for several
           requests, but real code must be prepared to redial anyway:
           a network connection failure doesn't imply a server failure!
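Both sides of the RPC machinery above can be sketched with Go's standard net/rpc package. `DoJobArgs`, `DoJobReply`, and the `Worker` type stand in for the lab's real structs in mapreduce/common.go; this sketch dials TCP on localhost rather than the lab's unix sockets.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// Illustrative stand-ins for the lab's RPC argument and reply structs.
type DoJobArgs struct {
	File   string
	JobNum int
}

type DoJobReply struct {
	OK bool
}

type Worker struct{}

// RPC handlers must have this shape: an exported method on a registered
// type, taking pointers to args and reply, and returning error.
func (w *Worker) DoJob(args *DoJobArgs, reply *DoJobReply) error {
	reply.OK = true
	return nil
}

func main() {
	// Server side: register named handlers, create a listening socket,
	// and serve each connection in its own goroutine so one blocked RPC
	// can't stall the others.
	rpcs := rpc.NewServer()
	rpcs.Register(new(Worker))
	l, err := net.Listen("tcp", "127.0.0.1:0") // the lab listens on "unix" sockets
	if err != nil {
		panic(err)
	}
	go func() {
		for {
			conn, err := l.Accept()
			if err != nil {
				return
			}
			go rpcs.ServeConn(conn)
		}
	}()

	// Client side: dial for this request, make the call, check the error.
	client, err := rpc.Dial("tcp", l.Addr().String())
	if err != nil {
		panic(err)
	}
	defer client.Close()
	var reply DoJobReply
	err = client.Call("Worker.DoJob", &DoJobArgs{File: "split-0", JobNum: 0}, &reply)
	if err != nil {
		// NB: a connection error does not prove the server has failed!
		panic(err)
	}
	fmt.Println(reply.OK) // prints true
}
```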
         we also dial per request to introduce failure scenarios easily
           intermittent network failures
           losing just the reply, but not the request

   the lab 1 master (see mapreduce/master.go)
     you write it
     you will have to deal with distributing jobs
     you will have to deal with worker failures

---
DeWitt's criticisms of MapReduce:
  "A giant step backward in the programming paradigm for large-scale
  data-intensive applications
  A sub-optimal implementation, in that it uses brute force instead of
  indexing
  Not novel at all -- it represents a specific implementation of
  well-known techniques developed nearly 25 years ago
  Missing most of the features that are routinely included in current
  DBMS
  Incompatible with all of the tools DBMS users have come to depend on"