Today:
  - review and discussion of MapReduce
  - graph processing: GraphLab
  - interactive queries: Spark

Quick review of MapReduce (from lec 1)
  * The map/reduce computation model
      input split ---map---> intermediate ---reduce---> output
      input & output are collections of key-value pairs
  * How does this get implemented?
      input splits -> map workers -> intermediate files -> reduce workers -> output files
      GFS underlying all of this

MapReduce opinion article
  - who are these people?
  - Stonebraker: built the ~first relational DBMS (Ingres)
    - Postgres, Mariposa, Aurora, C-Store, H-Store, ...
    - startups: Illustra, Streambase, Vertica, VoltDB, ...
    - won the most recent Turing Award (2014)
  - DeWitt: UW Madison / Microsoft
    - big name in parallel databases

Discussion
  - is MapReduce a major step backwards?
  - are database people bitter jerks? :-)
    or are systems people ignorant of years of work? :-)

Systems vs. databases
  - operating systems and databases research are generally separate streams
  - distributed systems research *mostly* follows from the OS background
    - not always true, but MapReduce is an example
  - (my perspective: have worked on both)

The database tradition
  - top-down design
  - define the right semantics first
    - relational model and abstract language (e.g., SQL)
    - concurrency properties (serializability)
  - then figure out how to implement them
    - usually in a general-purpose system
  - then figure out how to make them fast
  - provide general interfaces for users

The OS tradition
  - bottom-up design
  - most important is engineering elegance
    - simple, narrow interfaces
    - clean, efficient implementations
  - performance and scalability are first-class concerns
  - figure out the semantics later
  - provide tools for programmers to build systems

Where does MapReduce fit into this? Does it help explain the critique?

Stonebraker and DeWitt's critiques
  - MR is a bad DB interface
    - no schema
    - imperative language vs. declarative
  - poor implementation: no indexes, can't scale
  - not novel
  - missing DB features
    - bulk loading, indexing, transactions, constraints, etc.
  - incompatible with DBMS tools

Is MapReduce even a database?
  - or is this an apples-to-oranges comparison?
  - are they solving the right problem?

Lessons from MapReduce
  - specializing the system to a particular type of processing is useful
  - the map/reduce functional model supports writing parallel code
    (though so does a relational DB!)
  - fault tolerance is easy: operations are idempotent, so failed tasks can just be rerun

Non-lesson: the map/reduce phases are not fundamental
  - easy to imagine that the only way to get these benefits is a system that follows
    the input -> map -> shuffle -> reduce -> output pattern
  - this makes it difficult to express some computations
  - but we can build a more general data-flow processing system

Example that's difficult in MapReduce
  1. score web pages by the words they contain
  2. score web pages by # of incoming links
  3. combine the two scores
  4. sort by combined score
  - requires multiple MR runs, probably one per step
  - step 3 has two inputs -- can MR handle this? (see the sketch below)
  - will this be efficient?
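  A rough sketch of step 3 (my illustration, not a real MapReduce API): both score
  files are fed to one job as tagged inputs, the map function re-keys every record by
  URL, and the reduce function adds the two scores. run_mapreduce here is just a tiny
  in-process stand-in for the map -> shuffle -> reduce pipeline; all names and the
  sample scores are made up.

    from collections import defaultdict

    def map_combine(source, record):
        # re-key by URL so the two scores for a page meet in the same reduce call
        url, score = record
        yield url, (source, score)

    def reduce_combine(url, tagged_scores):
        scores = dict(tagged_scores)          # e.g. {"words": 0.7, "links": 0.1}
        yield url, scores.get("words", 0.0) + scores.get("links", 0.0)

    def run_mapreduce(mapper, reducer, inputs):
        # minimal local simulation of the shuffle between map and reduce
        groups = defaultdict(list)
        for source, records in inputs.items():
            for record in records:
                for k, v in mapper(source, record):
                    groups[k].append(v)
        out = []
        for k, vs in groups.items():
            out.extend(reducer(k, vs))
        return out

    word_scores = [("a.com", 0.7), ("b.com", 0.2)]   # pretend output of step 1
    link_scores = [("a.com", 0.1), ("b.com", 0.5)]   # pretend output of step 2
    print(run_mapreduce(map_combine, reduce_combine,
                        {"words": word_scores, "links": link_scores}))
    # step 4 (global sort by combined score) would be yet another MR pass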
Dryad: MSR system that generalizes MapReduce
  - observation: a MapReduce computation can be visualized as a DAG
    (inputs, map workers, reduce workers, outputs)
  - Dryad supports arbitrary programmer-specified DAGs
    - inputs and outputs are typed items
    - edges are channels carrying a sequence of typed items
      (TCP connection, pipe, or on-disk temp file)
    - intermediate processing vertices can have several inputs and outputs

Very similar scheduling and fault tolerance story as MapReduce
  - vertices are stateless, deterministic computations
  - no cycles means that after a failure, we can just rerun a vertex
  - if the vertex's inputs are lost, rerun upstream vertices transitively

How to program this abstraction?
  - don't want programmers to have to write graphs directly
  - DryadLINQ is an API that integrates with programming languages (e.g., C#)

Example: word frequency
  - count occurrences of each word, return the top 3
    (so requires a final sort by frequency)

    public static IQueryable<Pair> Histogram(IQueryable<string> input, int k)
    {
        var words = input.SelectMany(x => x.Split(' '));
        var groups = words.GroupBy(x => x);
        var counts = groups.Select(x => new Pair(x.Key, x.Count()));
        var ordered = counts.OrderByDescending(x => x.Count);
        var top = ordered.Take(k);
        return top;
    }

  What does each statement do?
    input:             "A line of words of wisdom"
    SelectMany:        ["A", "line", "of", "words", "of", "wisdom"]
    GroupBy:           [["A"], ["line"], ["of", "of"], ["words"], ["wisdom"]]
    Select:            [{"A", 1}, {"line", 1}, {"of", 2}, {"words", 1}, {"wisdom", 1}]
    OrderByDescending: [{"of", 2}, {"A", 1}, {"line", 1}, {"words", 1}, {"wisdom", 1}]
    Take(3):           [{"of", 2}, {"A", 1}, {"line", 1}]

  How is this executed? (as a Dryad graph; see figure 7)
    - original input was stored in four partitions
    - computation proceeds in three stages
    - stage 1: four machines, each reading one partition
      - split lines into words
      - hash(w) determines which GroupBy vertex each word is sent to over the network
    - stage 2: each GroupBy works on a (new) partition of the words
      - e.g., a-g, h-m, n-s, t-z
      - counts # of occurrences of each word it's responsible for
      - sorts by # of occurrences
        - this is only a partial sort, to reduce the cost of the final sort
    - stage 3: look at the top few words from each partition's computation, pick the top 3

Optimizations can be applied on the DAG
  - without understanding program semantics!
  - e.g., remove redundancy, pipelining, do aggregation more eagerly

----------

GraphLab and machine learning

Machine learning
  - ML and data mining are hugely popular areas now!
    - clustering, modeling, classification, prediction
  - need to run these algorithms on huge data sets
  - which means we need to run them on distributed systems!

Example: PageRank (see the sketch below)
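  A minimal single-machine PageRank sketch (my own illustration, not from the lecture;
  the toy link graph, the damping factor 0.85, and the iteration count are made up).
  The point is the structure: each page's rank depends on the ranks of the pages that
  link to it, and the whole computation iterates until it converges -- awkward to
  express as a single map/reduce pass.

    # toy link graph: page -> pages it links to
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks = {page: 1.0 for page in links}

    for _ in range(20):                      # iterate toward convergence
        contribs = {page: 0.0 for page in links}
        for page, outs in links.items():
            for dst in outs:
                contribs[dst] += ranks[page] / len(outs)
        # standard damping: 0.15 base + 0.85 * sum of incoming contributions
        ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}

    print(ranks)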
Still working out what the right frameworks are for this
  - message passing & threads (MPI/pthreads/etc.)
    - leaves all the hard distributed systems challenges to the programmer
      - serialization, load balancing, locking, deadlock, race conditions, fault tolerance, ...
    - but can be very efficient with a lot of work!
  - MapReduce
    - great for non-iterative workloads
    - fails when there are computational dependencies in the data
    - fails when there is an iterative structure
      - need to rerun MapReduce until it converges; the programmer has to deal with this
      - implementation forces output to disk between runs
  - Dryad
    - supports more workloads w/ dependencies, but no iteration

Why graphs?
  - most ML/DM applications are amenable to graph structuring
  - ML/DM is often about dependencies between data
    - represent each piece of data as a vertex
    - represent each dependency between two pieces of data as an edge

Graph representation
  - graph = vertices + edges, each with data
  - graph structure is static, data is mutable
  - update function for a given vertex: f(v, S_v) -> (S_v, T)
    - S_v is the scope of vertex v: the data stored in v, and on all adjacent edges and vertices
    - can update any data in scope
    - then output a new list T of vertices that need to be rerun
      (see the sketch after this section)

Synchronous vs. asynchronous computation
  - synchronous = all parameters are updated simultaneously, using parameter values
    from the previous time step
    - requires a barrier before starting the next round
    - stragglers can limit performance
    - iterative MR works like this
  - asynchronous = continuously update parameters, always using the most recent
    parameter values as input
    - adapts to differences in execution speed coming from heterogeneity in
      hardware, network, data
  - asynchronous supports dynamic computation
    - PageRank: some nodes converge quickly, others take a long time
    - can stop recomputing parameters that have already converged

Correctness for graph processing
  - is asynchronous processing OK? depends on the ML algorithm
    - sometimes we need to run each step exactly once w/ a barrier
    - usually it's OK to compute asynchronously
    - sometimes it's even OK to run without locks at all!
  - serializability: same results as though we picked a sequential order of vertices
    and ran their update functions one at a time

3 versions of GraphLab
  - shared-memory, multicore version (one machine)
  - Distributed GraphLab
  - PowerGraph (distributed, optimized for power-law graphs)

Shared-memory version
  - maintains a queue of vertices to be updated, and iteratively and in parallel runs
    update functions on them
  - ensuring serializability involves locking
    - basic idea: lock the entire scope of a vertex update function
    - weaker consistency models as optimizations
      - serializability OK if full consistency is used
      - or if edge consistency is used and update functions don't modify data in adjacent vertices
      - or if vertex consistency is used and update functions access only the vertex's own data

Making GraphLab distributed
  - partition the workload across different machines
  - edge cutting: the partition boundary is a set of edges => each vertex is on exactly one machine
    - ...except we need to maintain "ghost" vertices at the boundary that cache data from remote machines
    - consistency problem: keep the ghost vertices up to date
  - partitioning controls load balancing
    - want the same number of vertices per partition (=> computation)
    - want the same number of ghosts (=> network load for cache updates)

Locking in GraphLab
  - same general idea as in the single-machine version, but now it's distributed!
  - enforcing the consistency model requires acquiring a set of locks
  - if we need to grab a lock on an edge or vertex on the boundary, we need to do it
    on both partitions involved
  - what about deadlock?
    - usual answer is to detect deadlocks and roll back transactions
    - GraphLab instead has a canonical ordering of lock acquisition
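  A rough sketch (my own illustration, not GraphLab code) of the two ideas above: an
  update function of the form f(v, S_v) -> (S_v, T) that reschedules its neighbors when
  its value changes, and a runner that locks the entire scope while avoiding deadlock by
  acquiring locks in a canonical (ascending vertex-id) order. The toy graph, values, and
  threshold are made up.

    import threading

    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1]}      # vertex -> neighbors (toy graph)
    value = {0: 1.0, 1: 0.0, 2: 0.5}               # mutable per-vertex data
    locks = {v: threading.Lock() for v in graph}   # one lock per vertex

    def smooth_update(v):
        # f(v, S_v) -> (S_v, T): read neighbor data in v's scope, write v's data,
        # and return the list of vertices that should be rescheduled.
        new = 0.5 * value[v] + 0.5 * sum(value[u] for u in graph[v]) / len(graph[v])
        changed = abs(new - value[v]) > 1e-3
        value[v] = new
        return graph[v] if changed else []

    def run_update(v):
        scope = sorted([v] + graph[v])             # canonical (ascending id) lock order
        for u in scope:
            locks[u].acquire()
        try:
            return smooth_update(v)                # entire scope is locked while we run
        finally:
            for u in reversed(scope):
                locks[u].release()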
Fault tolerance
  - the MapReduce answer isn't good enough -- workers have state, so we can't just reassign their tasks
  - instead, take periodic, globally consistent snapshots
    - the Chandy-Lamport snapshot algorithm!

Power-law graphs
  - challenge: many graphs are not uniform
    - power-law: a few popular vertices with *many* edges, many unpopular vertices with a few edges
  - problem for GraphLab: edge cuts are hugely imbalanced
    - some server is responsible for the popular high-degree vertex
      - so it has to run more computation
    - and that vertex will be cached everywhere, so lots of network bandwidth

PowerGraph
  - subsequent version of GraphLab designed for power-law graphs
  - first innovation: partition by cutting vertices instead of edges
    - each edge is in exactly one partition; vertices may be in multiple!
    - high-degree vertices are split over many partitions
  - second innovation: parallelize the update function
    - requires a change to the semantics of the update function
    - each server computes its "local" change to a split vertex
      - e.g., the PageRank contribution from the pages on that server
    - then accumulate and apply the partial updates
  - third: better partitioning algorithms (won't discuss)

----------

Spark
  - framework for large-scale distributed computation
  - designed for high-performance interactive applications
  - relatively recent (2012) but used widely: IBM, Yahoo, Baidu, Groupon, etc.
  - 1000+ contributors

Motivation
  - want a general framework for distributed computations
  - MapReduce isn't enough
    - too inflexible; can't handle iteration, etc.
    - can't do interactive queries, only batch processing
  - argument: MapReduce can't handle complex interactive queries because the only way
    to share data across jobs is to store it in stable storage

Spark challenge: store intermediate data in a way that's both fault-tolerant and efficient
  - want it in memory because that's 10-100x faster than writing to disk
    - and enables data sharing between different computations
  - but data can be lost on failure!

Resilient Distributed Datasets (RDDs)
  - an immutable collection of records, partitioned over nodes
  - only two ways to create an RDD
    - accessing some data set on stable storage
    - transformation of an existing RDD (map, join, etc.)
  - creation is lazy; a transformation just specifies a plan for creating the data
  - actions (e.g., store or return a result) cause the RDD to actually be materialized

PageRank example (see the sketch below)

How is an RDD represented?
  - list of parent RDDs
  - function to compute on them
  - partitioning scheme
  - computation placement hint
  - list of partitions for this RDD

Why do we need the list of parents?
  - if there's a failure, we can recompute the RDD from its parents

Why do we need the partitioning information?
  - so a transformation that depends on multiple RDDs knows whether it needs to
    shuffle data (wide dependency) or not (narrow dependency)
  - narrow dependencies can be recomputed using just one parent partition (or a few)
  - wide dependencies require a shuffle, i.e., the entire parent RDD
  - for narrow dependencies, parent and child partitions can be co-located on the
    same machine -> no network traffic required
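  A minimal sketch of what the PageRank example might look like in the PySpark API
  (my illustration; the toy graph, iteration count, and damping constants are made up).
  The comments point out which steps create narrow vs. wide dependencies.

    from pyspark import SparkContext

    sc = SparkContext("local", "pagerank-sketch")

    # links: (page, [pages it links to]); cached because it's reused every iteration
    links = sc.parallelize([("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]).cache()
    ranks = links.mapValues(lambda _: 1.0)           # narrow dependency on links

    for i in range(10):
        # join and reduceByKey are wide dependencies: they shuffle data by key
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
        ranks = contribs.reduceByKey(lambda a, b: a + b) \
                        .mapValues(lambda s: 0.15 + 0.85 * s)   # narrow

    # everything above is lazy; this action materializes the whole lineage
    print(ranks.collect())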
Failure recovery summary
  - Spark only makes one in-memory copy of a newly computed RDD partition (by default)!
    - if it's lost, the data is gone
  - the scheduler detects the machine failure and schedules recomputation
    - may need to recompute all the partitions the lost one depends on, transitively,
      until it reaches partitions that are still available (or stable storage)
  - optional checkpointing
    - the user can ask the Spark scheduler to make some RDD in the lineage persistent
      (see the sketch below)
    - expensive, but means a failure won't have to recompute everything
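  A small sketch of asking Spark to keep and checkpoint an RDD (my illustration; the
  checkpoint directory and the toy data are made up):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local", "checkpoint-sketch")
    sc.setCheckpointDir("/tmp/spark-checkpoints")   # assumed directory

    rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
    rdd.persist(StorageLevel.MEMORY_AND_DISK)       # keep a materialized copy around
    rdd.checkpoint()                                # also write it to stable storage,
                                                    # so recovery doesn't replay the lineage
    print(rdd.count())                              # action: materializes, persists,
                                                    # and checkpoints the RDD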