Today:
  - review and discussion of MapReduce
  - graph processing: GraphLab
  - interactive queries: Spark

Quick review of MapReduce (from lec 1)
  * The map/reduce computation model
      input split ---map---> intermediate ---reduce---> output
      input & output are collections of key-value pairs
  * How does this get implemented?
      input splits -> map workers -> intermediate files -> reduce workers -> output files
      GFS underlying all of this

MapReduce opinion article
  - who are these people?
  - Stonebraker: built the ~first relational DBMS (Ingres)
    - Postgres, Mariposa, Aurora, C-Store, H-Store, ...
    - startups: Illustra, Streambase, Vertica, VoltDB, ...
    - won the most recent Turing Award (2014)
  - DeWitt: UW Madison / Microsoft
    - big name in parallel databases

Discussion
  - is MapReduce a major step backwards?
  - are database people bitter jerks? :-)
    or are systems people ignorant of years of work? :-)

Systems vs. databases
  - operating systems and databases research are generally separate streams
  - distributed systems research *mostly* follows from the OS background
    - not always true, but MapReduce is an example
  - (my perspective: have worked on both)

The database tradition
  - top-down design
  - define the right semantics first
    - relational model and abstract language (e.g., SQL)
    - concurrency properties (serializability)
  - then figure out how to implement them
    - usually in a general-purpose system
  - then figure out how to make them fast
  - provide general interfaces for users

The OS tradition
  - bottom-up design
  - most important is engineering elegance
    - simple, narrow interfaces
    - clean, efficient implementations
  - performance and scalability are first-class concerns
  - figure out the semantics later
  - provide tools for programmers to build systems

Where does MapReduce fit into this? Does it help explain the critique?

Stonebraker and DeWitt's critiques
  - MR is a bad DB interface
    - no schema
    - imperative language vs. declarative
  - poor implementation: no indexes, can't scale
  - not novel
  - missing DB features
    - bulk loading, indexing, transactions, constraints, etc.
  - incompatible with DBMS tools

Is MapReduce even a database?
  - or is this an apples-to-oranges comparison?
  - are they solving the right problem?

Lessons from MapReduce
  - specializing the system to a particular type of processing is useful
  - the map/reduce functional model supports writing parallel code
    (though so does a relational DB!)
  - fault tolerance is easy: operations are idempotent, so failed tasks can just be rerun

Non-lesson: the map/reduce phases are not fundamental
  - easy to imagine that the only way to get these benefits is a system that follows
    the input -> map -> shuffle -> reduce -> output pattern
  - this makes it difficult to express some computations
  - but we can build a more general data-flow processing system

Example that's difficult in MapReduce
  1. score web pages by the words they contain
  2. score web pages by # of incoming links
  3. combine the two scores
  4. sort by combined score
  - requires multiple MR runs, probably one per step
  - step 3 has two inputs -- can MR handle this? (see the sketch below)
  - will this be efficient?
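  A rough sketch of step 3 (my illustration, not a real MapReduce API): both score
  files are fed to one job as tagged inputs, the map function re-keys every record by
  URL, and the reduce function adds the two scores. run_mapreduce here is just a tiny
  in-process stand-in for the map -> shuffle -> reduce pipeline; all names and the
  sample scores are made up.

    from collections import defaultdict

    def map_combine(source, record):
        # re-key by URL so the two scores for a page meet in the same reduce call
        url, score = record
        yield url, (source, score)

    def reduce_combine(url, tagged_scores):
        scores = dict(tagged_scores)          # e.g. {"words": 0.7, "links": 0.1}
        yield url, scores.get("words", 0.0) + scores.get("links", 0.0)

    def run_mapreduce(mapper, reducer, inputs):
        # minimal local simulation of the shuffle between map and reduce
        groups = defaultdict(list)
        for source, records in inputs.items():
            for record in records:
                for k, v in mapper(source, record):
                    groups[k].append(v)
        out = []
        for k, vs in groups.items():
            out.extend(reducer(k, vs))
        return out

    word_scores = [("a.com", 0.7), ("b.com", 0.2)]   # pretend output of step 1
    link_scores = [("a.com", 0.1), ("b.com", 0.5)]   # pretend output of step 2
    print(run_mapreduce(map_combine, reduce_combine,
                        {"words": word_scores, "links": link_scores}))
    # step 4 (global sort by combined score) would be yet another MR pass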
Dryad: MSR system that generalizes MapReduce
  - observation: a MapReduce computation can be visualized as a DAG
    (inputs, map workers, reduce workers, outputs)
  - Dryad supports arbitrary programmer-specified DAGs
    - inputs and outputs are typed items
    - edges are channels carrying a sequence of typed items
      (TCP connection, pipe, or on-disk temp file)
    - intermediate processing vertices can have several inputs and outputs

Very similar scheduling and fault tolerance story as MapReduce
  - vertices are stateless, deterministic computations
  - no cycles means that after a failure, we can just rerun a vertex
  - if the vertex's inputs are lost, rerun upstream vertices transitively

How to program this abstraction?
  - don't want programmers to have to write graphs directly
  - DryadLINQ is an API that integrates with programming languages (e.g., C#)

Example: word frequency
  - count occurrences of each word, return the top 3
    (so requires a final sort by frequency)

    public static IQueryable<Pair> Histogram(IQueryable<string> input, int k)
    {
        var words = input.SelectMany(x => x.Split(' '));
        var groups = words.GroupBy(x => x);
        var counts = groups.Select(x => new Pair(x.Key, x.Count()));
        var ordered = counts.OrderByDescending(x => x.Count);
        var top = ordered.Take(k);
        return top;
    }

  What does each statement do?
    input:             "A line of words of wisdom"
    SelectMany:        ["A", "line", "of", "words", "of", "wisdom"]
    GroupBy:           [["A"], ["line"], ["of", "of"], ["words"], ["wisdom"]]
    Select:            [{"A", 1}, {"line", 1}, {"of", 2}, {"words", 1}, {"wisdom", 1}]
    OrderByDescending: [{"of", 2}, {"A", 1}, {"line", 1}, {"words", 1}, {"wisdom", 1}]
    Take(3):           [{"of", 2}, {"A", 1}, {"line", 1}]

  How is this executed? (as a Dryad graph; see figure 7)
    - original input was stored in four partitions
    - computation proceeds in three stages
    - stage 1: four machines, each reading one partition
      - split lines into words
      - hash(w) determines which GroupBy vertex each word is sent to over the network
    - stage 2: each GroupBy works on a (new) partition of the words
      - e.g., a-g, h-m, n-s, t-z
      - counts # of occurrences of each word it's responsible for
      - sorts by # of occurrences
        - this is only a partial sort, to reduce the cost of the final sort
    - stage 3: look at the top few words from each partition's computation, pick the top 3

Optimizations can be applied on the DAG
  - without understanding program semantics!
  - e.g., remove redundancy, pipelining, do aggregation more eagerly

----------

GraphLab and machine learning

Machine learning
  - ML and data mining are hugely popular areas now!
    - clustering, modeling, classification, prediction
  - need to run these algorithms on huge data sets
  - which means we need to run them on distributed systems!

Example: PageRank (see the sketch below)
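  A minimal single-machine PageRank sketch (my own illustration, not from the lecture;
  the toy link graph, the damping factor 0.85, and the iteration count are made up).
  The point is the structure: each page's rank depends on the ranks of the pages that
  link to it, and the whole computation iterates until it converges -- awkward to
  express as a single map/reduce pass.

    # toy link graph: page -> pages it links to
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks = {page: 1.0 for page in links}

    for _ in range(20):                      # iterate toward convergence
        contribs = {page: 0.0 for page in links}
        for page, outs in links.items():
            for dst in outs:
                contribs[dst] += ranks[page] / len(outs)
        # standard damping: 0.15 base + 0.85 * sum of incoming contributions
        ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}

    print(ranks)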
Still working out what the right frameworks are for this
  - message passing & threads (MPI/pthreads/etc.)
    - leaves all the hard distributed systems challenges to the programmer
      - serialization, load balancing, locking, deadlock, race conditions, fault tolerance, ...
    - but can be very efficient with a lot of work!
  - MapReduce
    - great for non-iterative workloads
    - fails when there are computational dependencies in the data
    - fails when there is an iterative structure
      - need to rerun MapReduce until it converges; the programmer has to deal with this
      - implementation forces output to disk between runs
  - Dryad
    - supports more workloads w/ dependencies, but no iteration

Why graphs?
  - most ML/DM applications are amenable to graph structuring
  - ML/DM is often about dependencies between data
    - represent each piece of data as a vertex
    - represent each dependency between two pieces of data as an edge

Graph representation
  - graph = vertices + edges, each with data
  - graph structure is static, data is mutable
  - update function for a given vertex: f(v, S_v) -> (S_v, T)
    - S_v is the scope of vertex v: the data stored in v, and on all adjacent edges and vertices
    - can update any data in scope
    - then output a new list T of vertices that need to be rerun
      (see the sketch after this section)

Synchronous vs. asynchronous computation
  - synchronous = all parameters are updated simultaneously, using parameter values
    from the previous time step
    - requires a barrier before starting the next round
    - stragglers can limit performance
    - iterative MR works like this
  - asynchronous = continuously update parameters, always using the most recent
    parameter values as input
    - adapts to differences in execution speed coming from heterogeneity in
      hardware, network, data
  - asynchronous supports dynamic computation
    - PageRank: some nodes converge quickly, others take a long time
    - can stop recomputing parameters that have already converged

Correctness for graph processing
  - is asynchronous processing OK? depends on the ML algorithm
    - sometimes we need to run each step exactly once w/ a barrier
    - usually it's OK to compute asynchronously
    - sometimes it's even OK to run without locks at all!
  - serializability: same results as though we picked a sequential order of vertices
    and ran their update functions one at a time

3 versions of GraphLab
  - shared-memory, multicore version (one machine)
  - Distributed GraphLab
  - PowerGraph (distributed, optimized for power-law graphs)

Shared-memory version
  - maintains a queue of vertices to be updated, and iteratively and in parallel runs
    update functions on them
  - ensuring serializability involves locking
    - basic idea: lock the entire scope of a vertex update function
    - weaker consistency models as optimizations
      - serializability OK if full consistency is used
      - or if edge consistency is used and update functions don't modify data in adjacent vertices
      - or if vertex consistency is used and update functions access only the vertex's own data

Making GraphLab distributed
  - partition the workload across different machines
  - edge cutting: the partition boundary is a set of edges => each vertex is on exactly one machine
    - ...except we need to maintain "ghost" vertices at the boundary that cache data from remote machines
    - consistency problem: keep the ghost vertices up to date
  - partitioning controls load balancing
    - want the same number of vertices per partition (=> computation)
    - want the same number of ghosts (=> network load for cache updates)

Locking in GraphLab
  - same general idea as in the single-machine version, but now it's distributed!
  - enforcing the consistency model requires acquiring a set of locks
  - if we need to grab a lock on an edge or vertex on the boundary, we need to do it
    on both partitions involved
  - what about deadlock?
    - usual answer is to detect deadlocks and roll back transactions
    - GraphLab instead has a canonical ordering of lock acquisition
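  A rough sketch (my own illustration, not GraphLab code) of the two ideas above: an
  update function of the form f(v, S_v) -> (S_v, T) that reschedules its neighbors when
  its value changes, and a runner that locks the entire scope while avoiding deadlock by
  acquiring locks in a canonical (ascending vertex-id) order. The toy graph, values, and
  threshold are made up.

    import threading

    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1]}      # vertex -> neighbors (toy graph)
    value = {0: 1.0, 1: 0.0, 2: 0.5}               # mutable per-vertex data
    locks = {v: threading.Lock() for v in graph}   # one lock per vertex

    def smooth_update(v):
        # f(v, S_v) -> (S_v, T): read neighbor data in v's scope, write v's data,
        # and return the list of vertices that should be rescheduled.
        new = 0.5 * value[v] + 0.5 * sum(value[u] for u in graph[v]) / len(graph[v])
        changed = abs(new - value[v]) > 1e-3
        value[v] = new
        return graph[v] if changed else []

    def run_update(v):
        scope = sorted([v] + graph[v])             # canonical (ascending id) lock order
        for u in scope:
            locks[u].acquire()
        try:
            return smooth_update(v)                # entire scope is locked while we run
        finally:
            for u in reversed(scope):
                locks[u].release()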
Fault tolerance
  - the MapReduce answer isn't good enough -- workers have state, so we can't just reassign their tasks
  - instead, take periodic, globally consistent snapshots
    - the Chandy-Lamport snapshot algorithm!

Power-law graphs
  - challenge: many graphs are not uniform
    - power-law: a few popular vertices with *many* edges, many unpopular vertices with a few edges
  - problem for GraphLab: edge cuts are hugely imbalanced
    - some server is responsible for the popular high-degree vertex
      - so it has to run more computation
    - and that vertex will be cached everywhere, so lots of network bandwidth

PowerGraph
  - subsequent version of GraphLab designed for power-law graphs
  - first innovation: partition by cutting vertices instead of edges
    - each edge is in exactly one partition; vertices may be in multiple!
    - high-degree vertices are split over many partitions
  - second innovation: parallelize the update function
    - requires a change to the semantics of the update function
    - each server computes its "local" change to a split vertex
      - e.g., the PageRank contribution from the pages on that server
    - then accumulate and apply the partial updates
  - third: better partitioning algorithms (won't discuss)

----------

Spark
  - framework for large-scale distributed computation
  - designed for high-performance interactive applications
  - relatively recent (2012) but used widely: IBM, Yahoo, Baidu, Groupon, etc.
  - 1000+ contributors

Motivation
  - want a general framework for distributed computations
  - MapReduce isn't enough
    - too inflexible; can't handle iteration, etc.
    - can't do interactive queries, only batch processing
  - argument: MapReduce can't handle complex interactive queries because the only way
    to share data across jobs is to store it in stable storage

Spark challenge: store intermediate data in a way that's both fault-tolerant and efficient
  - want it in memory because that's 10-100x faster than writing to disk
    - and enables data sharing between different computations
  - but data can be lost on failure!

Resilient Distributed Datasets (RDDs)
  - an immutable collection of records, partitioned over nodes
  - only two ways to create an RDD
    - accessing some data set on stable storage
    - transformation of an existing RDD (map, join, etc.)
  - creation is lazy; a transformation just specifies a plan for creating the data
  - actions (e.g., store or return a result) cause the RDD to actually be materialized

PageRank example (see the sketch below)

How is an RDD represented?
  - list of parent RDDs
  - function to compute on them
  - partitioning scheme
  - computation placement hint
  - list of partitions for this RDD

Why do we need the list of parents?
  - if there's a failure, we can recompute the RDD from its parents

Why do we need the partitioning information?
  - so a transformation that depends on multiple RDDs knows whether it needs to
    shuffle data (wide dependency) or not (narrow dependency)
  - narrow dependencies can be recomputed using just one parent partition (or a few)
  - wide dependencies require a shuffle, i.e., the entire parent RDD
  - for narrow dependencies, parent and child partitions can be co-located on the
    same machine -> no network traffic required
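  A minimal sketch of what the PageRank example might look like in the PySpark API
  (my illustration; the toy graph, iteration count, and damping constants are made up).
  The comments point out which steps create narrow vs. wide dependencies.

    from pyspark import SparkContext

    sc = SparkContext("local", "pagerank-sketch")

    # links: (page, [pages it links to]); cached because it's reused every iteration
    links = sc.parallelize([("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]).cache()
    ranks = links.mapValues(lambda _: 1.0)           # narrow dependency on links

    for i in range(10):
        # join and reduceByKey are wide dependencies: they shuffle data by key
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
        ranks = contribs.reduceByKey(lambda a, b: a + b) \
                        .mapValues(lambda s: 0.15 + 0.85 * s)   # narrow

    # everything above is lazy; this action materializes the whole lineage
    print(ranks.collect())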
Failure recovery summary
  - Spark only makes one in-memory copy of a newly computed RDD partition (by default)!
    - if it's lost, the data is gone
  - the scheduler detects the machine failure and schedules recomputation
    - may need to recompute all the partitions the lost one depends on, transitively,
      until it reaches partitions that are still available (or stable storage)
  - optional checkpointing
    - the user can ask the Spark scheduler to make some RDD in the lineage persistent
      (see the sketch below)
    - expensive, but means a failure won't have to recompute everything
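  A small sketch of asking Spark to keep and checkpoint an RDD (my illustration; the
  checkpoint directory and the toy data are made up):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local", "checkpoint-sketch")
    sc.setCheckpointDir("/tmp/spark-checkpoints")   # assumed directory

    rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
    rdd.persist(StorageLevel.MEMORY_AND_DISK)       # keep a materialized copy around
    rdd.checkpoint()                                # also write it to stable storage,
                                                    # so recovery doesn't replay the lineage
    print(rdd.count())                              # action: materializes, persists,
                                                    # and checkpoints the RDD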