Agenda:
- Intro & administrivia
- Intro to distributed systems
- RPC
- MapReduce & Lab 1

----------

Administrivia

- This is a class about distributed systems
- I'm excited about this class
- Distributed systems are some of the most powerful things we build in CS, but also some of the hardest -- lots of subtleties
- Distributed systems are incredibly relevant today - web, Google, Facebook, Amazon, Netflix, anything mobile, ...

This course:
- key ideas, abstractions, and techniques for building distributed systems
- assumes you've taken an undergrad OS or networking class or have equivalent experience, e.g., you should know about client-server apps, the TCP/IP stack, processes and threads, etc.

This course:
- readings and discussions of research papers - no textbook
  - how to read a paper?
  - blog posts
- a project based around designing and building a scalable, consistent key-value store

Course staff:
- instructor: Dan Ports, drkp@cs.washington.edu, CSE 468, office hours Mon 5pm or by appt
- TA: Haichen Shen, haichen@cs.washington.edu
- TA: Adriana Szekeres, aaasz@cs.washington.edu

----------

Distributed systems

What is a distributed system?
- multiple interconnected computers that cooperate to provide some service
- examples?

Our model of computation used to be a single computer - mainframe, PC, etc.
- now it's a datacenter with many thousands of servers
- with clients on phones and browsers communicating over the wide area

Why should we build (or care about) distributed systems?
- Higher capacity - today's workloads don't fit on one machine
  - aggregate CPU cycles, memory, disks, network bandwidth
- Geographic separation
- Build reliable systems - even out of unreliable components!

What are the challenges in distributed system design?
- system design: what goes on the client vs. the server, what are the right protocols?
- reasoning about state in a distributed environment
  - locating data: what gets stored on which server
  - consistency: multiple copies of shared data
  - concurrency: multiple readers and writers to shared data
- failures:
  - hardware failures
  - communication failures
  - can't tell the difference?
  - security (Byzantine faults)
- performance
  - latency of coordinating between multiple machines
  - bandwidth as a scarce resource
- testing
  - often too many possible cases (failures, etc.) to test exhaustively

We want more scalable, more reliable distributed systems. But it's easy to make a distributed system *less* scalable and *less* reliable than a centralized system.

Major challenge: keep the system doing useful work in the presence of partial failures.

A data center
- Facebook, Prineville OR
- 10x the size of this building, $1B cost, 30 MW power
- 200K+ servers
- 500K+ disks
- 10K network switches
- 300K+ network cables
- What is the likelihood that all components are correctly functioning at any given time?
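A back-of-envelope answer (with made-up but plausible numbers, not Facebook's): if each of 200,000 servers is independently healthy 99.9% of the time, the chance that every one of them is healthy at a given instant is about
    0.999^200000 ≈ e^-200 ≈ 10^-87
which is effectively zero. At this scale, something is always broken.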
Typical first year for a new cluster: [Jeff Dean, Google]
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packet loss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external VIPs for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for DNS
~1000 individual machine failures
~10000 hard drive failures

Part of the system is *constantly* failed.

Lamport's definition, c. 1990: a distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.

And yet today: distributed systems work most of the time
- wherever you are
- whenever you want
- even though parts of the system have failed
- even though thousands of other people are using it

Another challenge: managing state
- concurrency and consistency
- How do we keep all of this state consistent?

Thinking about distributed state involves a lot of subtleties.

Thought experiment 1
- suppose there's a group of people, two of whom have green dots on their foreheads
- without using a mirror or directly asking, can anyone tell if they have a green dot themselves?
- what if I say to everyone: someone has a green dot
  - everyone already knew this!
- difference between individual knowledge and common knowledge
  - everyone knows that someone has a green dot, but doesn't know that everyone knows it, that everyone knows that everyone knows it, and so on

Thought experiment 2 - the two generals problem
- two generals commanding two armies on two hills surrounding the enemy
- both need to attack at the same time to succeed
- in other words, we need common knowledge about the time of attack: each general knows that the other knows the time, and so on ad infinitum
- they can only coordinate using messengers, and messengers can be lost
- what protocol could we use?
- no protocol works!
  - suppose we had a protocol that could give us this common knowledge
  - the last message could have been dropped
  - so it obviously wasn't necessary, and we can remove it
  - but we could repeat this until we get rid of all the messages, and a protocol with no messages clearly can't work

Distributed systems are hard
- we just saw that something pretty simple was impossible
- we will see that other things you would want to do are impossible
  - consensus: get all nodes to always agree on a single value, e.g., the order of operations or the value of my bank account
  - build distributed systems that are always consistent and always available (the CAP theorem)
- and yet we will do them anyway
  - many things are possible in practice if we make some assumptions
  - or failure cases can be made vanishingly rare

Topics we will cover

----------

RPC

How should the parts of a distributed system communicate?
- could send messages manually
- CS is about finding useful abstractions to simplify programming
- this is one for network communication

Common pattern: client/server communication
- client sends request to server, server sends reply

This paper is a bit hard to appreciate
- 1984, Xerox PARC, Xerox Dorado, 3Mbit ethernet prototype, ...
- What was a distributed system back then?
- The idea seems obvious in retrospect

Suppose I wanted to write a simple banking system... only on one host for now.

float balance(int accountID) {
    return balance[accountID];
}

void deposit(int accountID, float amount) {
    balance[accountID] += amount;
}

client() {
    deposit(42, $50.00);
    print balance(42);
}

No problem calling deposit() and balance() -- these are just standard function calls.

Of course, we probably want our client to be able to run on a different machine, for all the usual reasons. So what would we have to do to call the balance/deposit functions on a remote machine?

Define a protocol:

request "balance" = 1 {
    arguments { int accountID (4 bytes) }
    response  { float balance (8 bytes) }
}

request "deposit" = 2 {
    arguments { int accountID (4 bytes); float amount (8 bytes) }
    response  { }
}

client() {
    s = socket(UDP)
    msg = {2, 42, 50.00}              // marshalling
    send(s -> server_address, msg)
    response = receive(s)
    check response == "OK"
    msg = {1, 42}
    send(s -> server_address, msg)
    response = receive(s)
    print "balance is" + response
}

server() {
    s = socket(UDP)
    bind s to port 1024
    while (1) {
        msg, client_addr = receive(s)
        type = byte 0 of msg
        if (type == 1) {
            account = bytes 1-4 of msg    // unmarshalling
            result = balance(account)
            send(s -> client_addr, result)
        } else if (type == 2) {
            account = bytes 1-4 of msg
            amount = bytes 5-12 of msg
            deposit(account, amount)
            send(s -> client_addr, "OK")
        }
    }
}

Some details along the way:
- a socket is our portal for communicating with the outside world
- What is the server address? IP address + port, e.g. 128.208.5.65:1024
- How do we know the IP address? Probably from DNS
- Why do we need the port? To identify a program on the particular machine; there might be more than one.
- How do we know the port number? Servers usually operate on standardized, well-known port numbers, e.g. 25 for email, 80 for HTTP

OK, so we wrote a client/server application. But:
- it was tedious and we are lazy
- we had to get the marshalling and unmarshalling exactly right on both ends; it won't work if we're off by even one byte (and this is a simple example, and I waved my hands a bit; what about byte ordering, requests too large for one message, etc.?)
- this client looks nothing like the "client" we had before

So you should be thinking: can we automate this process?

Yes: if we wrote the protocol in a machine-readable way, we could generate a lot of this code.

    protocol ---RPC compiler---> client stubs, server stubs

Stubs are glue code that handle the communication details.

Client stub:

deposit_stub(int account, float amount) {
    // marshall request type + arguments into buffer
    // send request to server
    // wait for reply
    // decode response
    // return result
}

To the client, this looks like calling the deposit function we started with!

Server stub:

loop:
    wait for command
    decode and unpack request parameters
    call procedure
    build reply message containing results
    send reply

Pretty much exactly the code we just wrote for the server!
To the server code (deposit, balance), it looks just like the client is calling the procedures directly!

Client code -> stub -> network <- stub <- server code

The point of all this: transparency!
- try to make remote procedure calls look just like local procedure calls
- note that we could reuse the entire single-machine code we started with, and add the stubs to make it run on a distributed system. That seems great!
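As a concrete (if anachronistic) illustration of the stub idea, here is a minimal sketch in Go using the standard library's net/rpc package, which plays the role of the stub generator and marshalling layer; the Bank type and method names just mirror the toy example above and are not from the paper.

package main

import (
    "fmt"
    "log"
    "net"
    "net/rpc"
)

// Bank is the single-machine code, unchanged except for RPC-friendly
// signatures (args in, pointer to result out, error returned).
type Bank struct {
    balance map[int]float64 // a real server would guard this with a mutex for concurrent requests
}

type DepositArgs struct {
    AccountID int
    Amount    float64
}

func (b *Bank) Deposit(args DepositArgs, reply *string) error {
    b.balance[args.AccountID] += args.Amount
    *reply = "OK"
    return nil
}

func (b *Bank) Balance(accountID int, reply *float64) error {
    *reply = b.balance[accountID]
    return nil
}

func main() {
    // Server side: register the object; net/rpc handles marshalling and dispatch.
    bank := &Bank{balance: make(map[int]float64)}
    rpc.Register(bank)
    ln, err := net.Listen("tcp", ":1024")
    if err != nil {
        log.Fatal(err)
    }
    go rpc.Accept(ln)

    // Client side: Call() is the "stub" -- it looks almost like a local call.
    client, err := rpc.Dial("tcp", "localhost:1024")
    if err != nil {
        log.Fatal(err)
    }
    var ok string
    if err := client.Call("Bank.Deposit", DepositArgs{AccountID: 42, Amount: 50.00}, &ok); err != nil {
        log.Fatal(err)
    }
    var bal float64
    if err := client.Call("Bank.Balance", 42, &bal); err != nil {
        log.Fatal(err)
    }
    fmt.Println("balance is", bal)
}

Note that even here the transparency is imperfect: every call returns an error the caller has to think about, which is exactly where the discussion goes next.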
Examples of RPC systems that do this sort of stub generation:
- Sun RPC (used in the NFS paper)
- XML-RPC / SOAP
- Google Protocol Buffers
- Java RMI

How do the stubs find the server address?
- Could hardcode it -- inflexible
- Could ask a name service that maps service name to host, port

Is RPC really transparent: can we really just treat remote procedure calls as ordinary procedure calls? Not quite.
- performance
  - local call: maybe 10 cycles = ~3 ns
  - RPC: 0.1-1 ms on a LAN => ~10K-100K times slower
  - in the wide area: can easily be millions of times slower
- failures
  - what happens if messages get dropped, the client or server crashes, etc.?
- also security, concurrent requests, etc.

What kinds of failures are there?
- communication failures (messages may be delayed, take a variable time to arrive, or never arrive at all)
- machine failures -- either client or server
- sometimes we can't tell whether it was a dropped request message or a dropped reply message, or whether it was a communication failure or a machine failure, or, if it was a machine failure, whether the machine crashed before or after processing the request

What semantics do we get with our RPC implementation above?
- It just hangs if there's a failure. Not good.
- Slightly better: time out and tell the application we failed.
- Also: might execute a request twice due to a duplicate packet.

Alternative: at least once
- retry until we get a successful response

Alternative: at most once
- give each request an ID and have the server keep track of whether it's been seen before (a sketch follows below)
- dealing with server failures is also a problem

Which do you think is best? At-least-once versus at-most-once?

What is right? It depends on where RPC is used. For example: file systems.
- Applications running on the file system don't know whether files are local or remote, so they can't be expected to handle remote failures themselves.
- What then?
  - NFS: at least once, and just deal with the occasional blip
  - other network file systems: at most once, but inside the file system, mask the problem from the user
- more sophisticated applications: need an application-level plan in both cases; it's not clear at-most-once gives you a leg up

=> Handling machine failures makes RPC different than procedure calls
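Before moving on, a minimal sketch in Go of the at-most-once bookkeeping described above; the request-ID scheme, type names, and in-memory table are illustrative assumptions, not the protocol of any particular RPC system.

package main

import (
    "fmt"
    "sync"
)

// Reply is whatever the server would normally return for a request.
type Reply struct {
    Value string
}

// AtMostOnceServer remembers which request IDs it has already executed,
// so a retransmitted request gets the saved reply instead of running twice.
type AtMostOnceServer struct {
    mu   sync.Mutex
    seen map[int64]Reply // request ID -> reply we already sent
}

func NewAtMostOnceServer() *AtMostOnceServer {
    return &AtMostOnceServer{seen: make(map[int64]Reply)}
}

// Handle runs execute (the actual operation, e.g. deposit) only if this
// request ID has not been seen before; otherwise it returns the saved reply.
func (s *AtMostOnceServer) Handle(reqID int64, execute func() Reply) Reply {
    s.mu.Lock()
    defer s.mu.Unlock()
    if r, ok := s.seen[reqID]; ok {
        return r // duplicate: resend the old reply, do not re-execute
    }
    r := execute()
    s.seen[reqID] = r
    return r
}

func main() {
    s := NewAtMostOnceServer()
    deposit := func() Reply { return Reply{Value: "OK"} }
    fmt.Println(s.Handle(7, deposit)) // executes the operation
    fmt.Println(s.Handle(7, deposit)) // duplicate request: returns saved reply only
}

What the sketch leaves out is exactly the hard part noted above: the seen table grows without bound unless old entries are garbage-collected, and if the server crashes and loses the table, a retransmitted request can execute twice unless the table is kept on stable storage.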
----------

MapReduce

One of the first "big data" systems
- hugely influential: at Google, Hadoop, lots of intellectual children

Motivation: huge data sets
- from web crawls, logs, databases
- too big to store and process on one machine!
- petabyte scale: reading a TB from disk takes around an hour, so about a month just to read the data on one disk, and there may be multiple phases
- what are the challenges?
  - which nodes should be involved?
  - what transport protocol should be used?
  - how do you manage threads / events / connections?
  - remote execution of your processing code?
  - fault tolerance and fault detection
  - load balancing / partitioning of data
    - heterogeneity of nodes
    - skew in data
  - network topology and bisection bandwidth issues
- would be really nice to solve this just once

The map-reduce computation model

    input splits ---map---> intermediate ---reduce---> output

Input and output are collections of key-value pairs.

First question: can we express useful things in this model?

Inverted index example - maps word -> list of documents containing it
- input: [document name -> list of words]
- output: [word -> list of documents]
- What's the intermediate key-value pair? [word -> document]
- What is the map function?
    for docname, text in input:
        for word in text:
            emit(key=word, value=docname)    // & maybe location, relevancy...
- What does the reduce function get run on? All intermediate values with the same key
- What is the reduce function?
    reduce(key, values):
        sort(values)
        emit(key, values)
  (not much of a reduce phase; the grouping by key did most of the work!)

Word count example (a Go sketch of the two functions appears at the end of these notes):
- input: [document name -> text]
- output: [word -> count]
- What's the intermediate key-value pair? [word -> 1]
- What is the map function?
    for docname, text in input:
        for word in text:
            emit(key=word, value=1)
- What does the reduce function get run on? All intermediate values with the same key
- What is the reduce function?
    reduce(key, values):
        emit(key, sum(values))
  (here the reduce phase does real work: it adds up the counts)

Is this a good model?
- Inverted indexes are useful. Are there other useful things we can do?
  - counting words, finding all documents with a word, sorting, finding reverse web links
- What can't we do?
  - Anything that changes data: map and reduce need to be non-mutating
    - online transaction processing (bank), document editing
  - Anything that worries about latency (vs. batch processing)
    - don't want to start a MR query on a user's Google search
    - and in fact Google now uses incremental updates (Percolator) instead of MapReduce for generating the index
  - Anything that requires multiple passes over the same data, e.g.:
    1. score web pages by the words they contain
    2. score web pages by # of incoming links
    3. combine the two scores
    4. sort by combined score
- These restrictions simplify the problem!

How does this get implemented?

    input splits -> map workers -> intermediate files -> reduce workers -> output files

- GFS underlying all of this -- think of it as a single shared disk, although really it's a replicated distributed file system; we'll look at it later
- and a master coordinating it all

How does word count get implemented? (see slides)

Why might MR have good performance?
- Map and Reduce functions run in parallel on different workers
- N workers -> divide run-time by N
- But rarely quite that good:
  - moving map output to the reduce workers
  - stragglers
  - reads/writes go through a network file system

Fault tolerance
- What happens if a worker fails?
  The master detects it and reschedules its work on another worker.
- What happens if a worker is just slow?
  Same thing. It's OK if a task gets executed twice, because inputs are immutable and tasks have no side effects.
- What happens if the master fails?
  Failure. The user has to deal with it. Could probably replicate or checkpoint the master.
- What if an input record causes the worker to crash?
  After two tries that crash, skip that record. Is this OK? Apparently for web search.

Lab 1 has three parts:
- Part I: just Map() and Reduce() for word count
- Part II: we give you most of a distributed multi-server framework; you fill in the master code that hands out the work to a set of worker threads
- Part III: make the master cope with crashed workers by re-trying

Discussion
- How does the restricted model (the simplifying assumptions made by the authors) help simplify the implementation?
  - map & reduce phases are easy to parallelize
  - no shared address space; programmers are made aware of the partitioning rather than having it hidden from them
  - side-effect-free, no interactions between workers: don't have to worry about race conditions, etc.
  - side-effect-free: failures, stragglers, and re-execution are not a problem
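To make the word-count example above concrete, here is a minimal sketch in Go of just the two user-supplied functions; the KeyValue type and the function signatures are illustrative assumptions, not necessarily the interfaces Lab 1 uses.

package main

import (
    "fmt"
    "strconv"
    "strings"
    "unicode"
)

// KeyValue is the intermediate pair type (illustrative).
type KeyValue struct {
    Key   string
    Value string
}

// Map is called once per input document and emits one ("word", "1") pair
// per word occurrence.
func Map(docname string, contents string) []KeyValue {
    words := strings.FieldsFunc(contents, func(r rune) bool {
        return !unicode.IsLetter(r)
    })
    var kvs []KeyValue
    for _, w := range words {
        kvs = append(kvs, KeyValue{Key: w, Value: "1"})
    }
    return kvs
}

// Reduce is called once per distinct word with every value emitted for that
// word, and returns the total count.
func Reduce(key string, values []string) string {
    return strconv.Itoa(len(values)) // each value is "1", so the count is len(values)
}

func main() {
    // Stand-in for the framework: run Map, group by key, run Reduce.
    kvs := Map("doc1", "the quick brown fox jumps over the lazy dog the end")
    groups := map[string][]string{}
    for _, kv := range kvs {
        groups[kv.Key] = append(groups[kv.Key], kv.Value)
    }
    fmt.Println("the ->", Reduce("the", groups["the"])) // prints: the -> 3
}

Everything else -- splitting the input, running Map on many workers, grouping intermediate values by key, and running Reduce -- is the framework's job; the grouping step is what guarantees Reduce sees every "1" emitted for a given word.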