Why do we want to order events?

Suppose I wanted to build a distributed make server
- central file server that holds source and object files
- each file has a timestamp
- clients specify the timestamps when they write a file
- use timestamps to determine dependencies: rebuild object file O if it depends on some source file X that is newer
- so we ought to be able to rebuild just what we need

What could go wrong?
- client clocks aren't synchronized, so a client could write a file with an old timestamp; we won't rebuild files that depend on it
- we could use the server's clock... if there's only one server
- how synchronized would we need to be? granularity = close enough that an edit made after a build reliably gets a newer timestamp than the build's outputs

Another example: posting to Facebook
- remove boss as friend
- post "My boss is the worst, I need a new job"
- really don't want to get these in the wrong order!

Could we wind up doing these things in the wrong order?
- the data is not stored in just one server - there are O(1M) servers
- each request actually requires talking to O(100) servers!
- there are lots of layers of caching: memcache in the data center, cross-datacenter replication, edge caches
- how do we update all of these things consistently?
- can we just use the wall clock? no, same reason

Is this really a problem?
- how well are clocks synchronized in practice?
  - (how would we measure this?)
  - Amazon EC2 measurements range from ~50us within the same datacenter to 200+ms for wide-area links - and highly variable
- is this bad? you can do an RPC in ~3us (Linux) or even ~0.4us (Arrakis), so conceivably your clock could be off by anywhere from 125 to 625,000 RPCs!
- put another way, 200ms is a "human-scale" (noticeable) number

Two approaches:
- physical clocks
- logical clocks

Physical clock synchronization

Straw-man approach:
- pick a server as a master and have it broadcast the time
- every client receives this message and sets its clock to the master's
- how do we set the master's clock?
- why doesn't this work?

Latency distribution
- not uniform
- why doesn't it go to zero? speed-of-light delays; transmission delays; routing delays
- what causes the variance? queuing delays in the network, processing delays on the client/server
- can it go arbitrarily high? packet loss; retries; routing loops

Can we build a better clock synchronization algorithm?

Interrogation-based approach
- client sends an RPC to the server; the server replies with its time
- client keeps track of the send and receive times
- assume symmetric latency
  - is this true? not really, but there's not much we can do about it
- set the clock to the server's time + 1/2 the round-trip latency

How precise can we get this?
- depends on the variance in latency
- what does the accuracy depend on?

Doing better:
- divide latency into minimum + variance
- average over multiple readings and/or multiple servers
- try to manage the clock rate skew
- use hardware support to track added delays (PTP)

Logical clocks (Lamport clocks)
- an alternative way to keep track of time
- based on the idea that events that aren't causally related are concurrent
- doesn't require any form of physical clock or synchronization

Definitions
- process: abstractly, a sequence of events that are ordered
  - instructions in a thread of execution
  - operations being processed on a server (assuming there's no concurrency within the server!)
  - etc.
- events: either local computation, or send/receive(message)
- messages: any communication between two processes

Happens-before relationship:
- within a process, event a before event b means a -> b
- if a = send(M) and b = receive(M), then a -> b
- transitivity: a -> b & b -> c => a -> c (why?)

Examples (referring to the figure from Lamport's paper)
- does p1 happen-before p3? yes, same process
- does p1 happen-before r1? no
- does p1 happen-before q2? yes, message
- does p1 happen-before r4? yes, chain

What does all this mean?
- a -> b means "a could have influenced b"
- if a --/--> b, then must b -> a? no!
  - what does this mean? these events are concurrent!
- what does it mean for events to be concurrent?
  - if we had to put the events in a single global order, we could put either a or b first
  - it would not affect the system, i.e., nobody could tell the difference!

Abstract logical clock
- if a -> b then C(a) < C(b), meaning:
  - if a, b are on the same process P_i, C_i(a) < C_i(b)
  - if a = send(m) on P_i and b = recv(m) on P_j, C_i(a) < C_j(b)
  - ...and this requires that C_i and C_j be comparable

How to maintain these timestamps?
- note that there are multiple possible implementations; here's one (see the sketch below):
  - each process i increments a counter C_i between any two events
  - each process i sends T_m (= C_i at the time of send) with each message
  - on receiving message m, process j sets C_j to max(C_j+1, T_m+1)
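As a concrete illustration, here is a minimal sketch of that implementation in Go. It is one possible realization of the rules above; the names (Clock, Tick, Send, Recv) are ours, not from any particular library.

    // One possible Lamport clock implementation.
    package lamport

    import "sync"

    type Clock struct {
        mu sync.Mutex
        c  uint64 // C_i: this process's counter
    }

    // Tick advances C_i; call it between any two local events.
    func (lc *Clock) Tick() uint64 {
        lc.mu.Lock()
        defer lc.mu.Unlock()
        lc.c++
        return lc.c
    }

    // Send returns the timestamp T_m to attach to an outgoing message.
    func (lc *Clock) Send() uint64 {
        return lc.Tick()
    }

    // Recv folds in the timestamp T_m of an incoming message:
    // C_j = max(C_j+1, T_m+1).
    func (lc *Clock) Recv(tm uint64) uint64 {
        lc.mu.Lock()
        defer lc.mu.Unlock()
        if tm >= lc.c {
            lc.c = tm + 1
        } else {
            lc.c++
        }
        return lc.c
    }

Note how Recv guarantees the send/receive rule: the receiver's clock always ends up strictly greater than T_m.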
What does this give us?
- if a -> b then C(a) < C(b)
- is the converse true: if C(a) < C(b), then a -> b?
  - no: consider p3 and q3 in the figure - they could also be concurrent
  - if we were to use the Lamport clock as a global order, then we would be creating some unnecessary constraints

Could we build a logical clock where the converse *is* true?
- yes
- note that this clock will still tell us some events are concurrent!
- strawman: the clock is actually a dependency list, e.g. a list of all previous events; whenever process i receives a message from process j, i merges j's dependency list into its own
- more efficient strawman: vector clocks (later this quarter!)

Suppose I only want to avoid causality anomalies: no message that depends on a past event is delivered before that event.
- with Lamport clocks, you need to know whether anyone will send you an earlier message before you can deliver a message

----------

Distributed snapshots

Suppose we're running a large ML computation, e.g. PageRank
- thousands of servers
- each holds some subset of web pages
- each page starts out with some reputation
- each iteration: some of that page's reputation gets transferred to the pages it links to (which might be on different servers!)

What if a server crashes?
- how did MapReduce deal with this? [reassign its jobs to someone else]
- why can't we do that here? [the servers have state]

If we wanted to take checkpoints, how often would we need to do that?
- how would we synchronize them?
- what would go wrong?
- could we just discard messages that we sent more than once?

----------

How do we build a system that transparently tolerates server failures?
- replication!
- use (at least) f+1 replicas to tolerate f failures [what kind of failures?]
- act just like a single copy to the clients
- the challenge is coordinating operations so that they are applied with the same results at all replicas, even if messages drop and replicas fail

State machine replication
- hardly anyone realizes this, but the earliest mention is in Lamport's clocks paper!
- model the system as a state machine
  - maintains some amount of state
  - transition function: (input, state) -> new state
  - output function: (input, state) -> output
- in other words, the system's state and output are entirely determined by its inputs

Can everything be modeled as a state machine? (see the sketch below)
- needs to be deterministic
  - what about clocks, randomness, etc.? parallelism on multicores?
- state needs to include everything
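To make the model concrete, here is a minimal sketch in Go of a key/value store as a deterministic state machine, assuming Put and Get are the only inputs; the names are illustrative.

    // A key/value store modeled as a deterministic state machine.
    package smr

    type Input struct {
        Op    string // "Put" or "Get"
        Key   string
        Value string // used only by Put
    }

    type StateMachine struct {
        state map[string]string // the entire replicated state
    }

    func New() *StateMachine {
        return &StateMachine{state: make(map[string]string)}
    }

    // Apply combines the transition function ((input, state) -> new state)
    // and the output function ((input, state) -> output). Because Apply is
    // deterministic, replicas that apply the same inputs in the same order
    // reach the same state and produce the same outputs.
    func (sm *StateMachine) Apply(in Input) (output string) {
        switch in.Op {
        case "Put":
            sm.state[in.Key] = in.Value
            return ""
        case "Get":
            return sm.state[in.Key]
        }
        return "" // unknown ops are no-ops in this sketch
    }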
Goal: achieve a consistent order of operations
- what does consistent mean here?
- is causal enough?
- serializable: there is a single global order of operations consistent with what each client sees
- actually strict serializable / linearizable: a notion of time is included too; if a finishes before b starts, then a comes before b in the serial order

There are many ways to achieve this goal
- today: chain replication
- Lab 2: simpler primary/backup replication (simplified chain replication)
- two weeks: Paxos / Viewstamped Replication

Why are we doing chain replication?
- used occasionally, e.g., HyperDex
- an instantiation of primary-backup, which is used widely (database replication)
- fairly straightforward - similar to Lab 2
- will compare to other approaches (Paxos) later

Primary copy approach
- some form of this is very common in replication
- key idea: have a primary sequence all requests
- when a primary is active, all other replicas execute requests in the primary's assigned order
- the client sees results consistent with that order
- the client doesn't get results back / declare success until the request is processed by enough replicas [here, f+1 of f+1; elsewhere, f+1 of 2f+1]
- when the primary fails, we replace it and make sure that the new primary knows the order of all successful operations *this is the really hard part!*

What assumptions does CR make about the environment?
- operations are read/write
- f+1 nodes to tolerate f failures
- nodes fail only by crashing, and crashes are detected
- a fault-tolerant master service keeps track of the system membership
How realistic are these assumptions?

Chain replication approach (a sketch of the update path appears after the failure discussion below)
- organize the nodes into a chain
- updates are sent to the head and propagated down the chain; the tail responds
- key invariant: each node has a superset of the state of all subsequent nodes in the chain
- what is the commit point?
- we can do the same with reads, or send them directly to the tail

Failures
- what happens if the tail fails?
  - remove it and make its predecessor the new tail
  - is this OK? the tail didn't have any operations that everyone else didn't have too
  - the new tail might have processed some operations the old tail hadn't, but that's strictly better
- what happens if the head fails?
  - just remove it and tell everyone that the next node is the head
  - can we lose operations? yes, if the head was the only one to process them
  - is this OK? yes - either some other replica processed the operation, in which case all the others will too, or it's lost entirely, which is OK too
- what happens if a node in the middle fails?
  - cut it out of the chain
  - but we need to ensure that we didn't drop any requests on the floor
  - use acknowledgments so that everyone knows how far the tail has gotten
  - have the predecessor resend any messages that the tail hasn't processed yet
- what happens when we add a node?
  - have the tail send it all operations
  - once it has the state, it is ready to start serving as the tail
  - have the old tail forward all new operations to the new tail
  - notify the master and clients about the new tail
  - important to get this order right!
- what if the master fails? game over!
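Here is a minimal sketch of a chain node's update path in Go, assuming reliable FIFO links between neighbors; the message and field names are illustrative.

    // A chain replication node: apply locally, then pass the update down.
    package chainrep

    type Update struct {
        Key, Value string
    }

    type Node struct {
        store map[string]string
        next  func(Update) // forward to the successor; nil at the tail
        reply func(Update) // the tail uses this to answer the client
    }

    // HandleUpdate applies an update and propagates it. Because a node
    // applies before forwarding, its state is always a superset of the
    // state of every node after it - the key invariant.
    func (n *Node) HandleUpdate(u Update) {
        n.store[u.Key] = u.Value
        if n.next != nil {
            n.next(u) // head or middle: keep propagating
        } else {
            n.reply(u) // tail: the update is committed; answer the client
        }
    }

    // HandleRead runs at the tail, so it sees only committed updates.
    func (n *Node) HandleRead(key string) string {
        return n.store[key]
    }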
Lab 2 overview

agreement: a "view server" decides who primary (P) and backup (B) are
  clients and servers ask the view server; they don't make independent decisions
  there is only one view server, which avoids multiple machines independently deciding who is P

repair: the view server can co-opt an "idle" server as B after the old B becomes P
  the primary initializes the new backup's state

the tricky part:
  1. only one primary!
  2. the primary must have state!
we will work out some rules to ensure these

the view server maintains a sequence of "views"
  view #, primary, backup
  1: S1 S2
  2: S2 --
  3: S2 S3

it monitors server liveness
  each server periodically sends a Ping RPC
  "dead" if it missed N Pings in a row
  "live" after a single Ping

there can be more than two servers Pinging the view server
  if more than two, the extras are "idle" servers

if the primary is dead
  new view, with the previous backup as primary
if the backup is dead, or there is no backup
  new view, with a previously idle server as backup
it's OK to have a view with just a primary and no backup
  but -- if an idle server is available, make it the backup

how do we ensure the new primary has an up-to-date replica of the state?
  only promote the previous backup
  i.e. don't make an idle server the primary
but what if the backup hasn't had time to acquire the state?
how do we avoid promoting a state-less backup?
example:
  1: S1 S2
     S1 stops pinging the view server
  2: S2 S3
     S2 *immediately* stops pinging
  3: S3 --
  potential mistake: maybe S3 never got the state from S2
  better to stay with view 2: S2/S3 - maybe S2 will revive

how can the view server know it's OK to change views? (see the sketch below)
  lab 2 answer: the primary in each view must acknowledge that view to the view server
  the view server must stay with the current view until it is acknowledged,
    even if the primary seems to have failed
  there is no point in proceeding, since not acked == the backup may not be initialized

Q: can more than one server think it is primary?
  1: S1, S2
     the network breaks, so the view server thinks S1 is dead
     but it's actually alive
  2: S2, --
  now S1 is alive and unaware of view #2, so S1 still thinks it is primary
  AND S2 is alive and thinks it is primary
  => split brain, no good
  [in particular, a client C1 that didn't see the new view could contact S1
   while another client C2 contacts S2; if they updated the same key, that
   would result in an inconsistency]

how do we ensure only one server *acts as* primary?
  even though more than one may *think* it is primary
  "acts as" == executes and responds to client requests
  the basic idea:
    1: S1 S2
    2: S2 --
    S1 still thinks it is primary
    S1 must forward ops to S2
    S2 thinks S2 is primary
    so S2 can reject S1's forwarded ops

the rules (a sketch of the primary's side appears at the end of these notes):
  1. the primary in view i must have been the primary or backup in view i-1
  2. the primary must wait for the backup to accept each request (incl. Get)
  3. the primary must wait to respond to the client until the backup has
     acked that it executed the request
  4. a non-backup must reject forwarded requests
  5. a non-primary must reject direct client requests
  6. every operation must be before or after state transfer (see below)

example:
  1: S1, S2
     the view server stops hearing Pings from S1
  2: S2, --
     it may be a while before S2 hears about view #2
  before S2 hears about view #2:
    S1 can process ops from clients, and S2 will accept forwarded requests
    S2 will reject ops from clients who have heard about view #2
      [and trigger a thread to fetch the new view]
  after S2 hears:
    if S1 receives a client request, it will forward it, and S2 will reject it
    so S1 can no longer act as primary
    S1 will send an error to the client; the client will ask the view server
      for the new view, and then re-send the request to S2
  the true moment of switch-over occurs when S2 hears about view #2

how can the new backup get the state? e.g. the key/value state
  if S2 is the backup in view i but was not in view i-1,
    S2 should ask the primary to transfer the complete state
rule for state transfer:
  every operation (Put/Get) must be either before or after the state transfer
    if before, the transferred state must reflect the operation
    if after, the primary must forward the operation after the transfer finishes
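Here is a minimal sketch, in Go, of how the view server's tick might apply these rules, assuming liveness has already been computed from recent Pings. The field and function names are ours, not the lab's required API.

    // View server view-change logic: never move on until the current
    // primary has acked, and only ever promote the backup.
    package viewservice

    type View struct {
        Num     uint
        Primary string
        Backup  string
    }

    type ViewServer struct {
        current View
        acked   bool            // has current.Primary acked current.Num?
        live    map[string]bool // computed from recent Pings
        idle    []string        // live servers that are neither P nor B
    }

    func (vs *ViewServer) tick() {
        if !vs.acked {
            return // stay with this view, even if the primary seems dead
        }
        v := vs.current
        switch {
        case !vs.live[v.Primary]:
            if v.Backup == "" {
                return // nobody eligible to promote: stuck (see lab 3)
            }
            v.Primary, v.Backup = v.Backup, vs.takeIdle()
            vs.advance(v)
        case v.Backup != "" && !vs.live[v.Backup]:
            v.Backup = vs.takeIdle() // may be "": a view with no backup
            vs.advance(v)
        case v.Backup == "":
            if s := vs.takeIdle(); s != "" {
                v.Backup = s
                vs.advance(v)
            }
        }
    }

    func (vs *ViewServer) takeIdle() string {
        if len(vs.idle) == 0 {
            return ""
        }
        s := vs.idle[0]
        vs.idle = vs.idle[1:]
        return s
    }

    func (vs *ViewServer) advance(v View) {
        v.Num = vs.current.Num + 1
        vs.current, vs.acked = v, false
    }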
Q: does the primary need to forward Get()s to the backup?
   after all, Get() doesn't change anything, so why does the backup need to
   know? and the extra RPC costs time

A: Remember that Get/Put/PutHash must provide exactly-once semantics.
   Assume the primary doesn't forward a Get. In the example above, with C1
   and C2 in contact with the old P and the old B / new P, where the old P
   and C1 haven't seen the new view, reads would be answered by the old P --
   nothing in a read would force the old P to discover the view change.
   Suppose that, concurrently, C2 updates the key at the new P. C1 will
   read the key but not see the new result.

Q: is it ever OK to read only at the primary?
   YES! If the view has no backup, then there can be no new P without
   notifying the old primary first, so there's no problem.
   Puts, too, can be done at the primary alone if there is no backup. In
   fact they must be, if the system is to be available: you need to be able
   to do operations with only a primary and no backup.

Q: at MIT, the example they used to show that you can't read only at the
   primary was this one. But it is actually not a problem! Why?
     C1 sends Get to P, but not to B
     C1 doesn't receive a response
     C2 sends Put, which is processed by P and B
     P fails
     B takes over
     C1 resends Get, and observes the Put, even though it (supposedly)
     shouldn't

Q: how could we make primary-only Get()s work?

Q: are there cases when the lab 2 protocol cannot make forward progress?
   Yes. Many.
     the view service fails
     the primary fails before it acknowledges the view
     the backup fails before it gets the new state
     ....
   We will start fixing those in lab 3
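Finally, to tie rules 2-5 together, a minimal sketch of the primary's Put path in Go. The call helper stands in for the lab's RPC layer, and all names are illustrative, not the lab's required structure.

    // The primary's side of rules 2, 3, and 5; the backup's rejection
    // (rule 4) surfaces as an error from the forward call.
    package pbservice

    import "errors"

    var ErrWrongServer = errors.New("not the primary for this view")

    type View struct {
        Num     uint
        Primary string
        Backup  string
    }

    type Server struct {
        me    string
        view  View // the current view as this server knows it
        store map[string]string
    }

    // call stands in for an RPC helper; here it trivially succeeds.
    var call = func(srv, method, key, value string) error { return nil }

    func (s *Server) Put(key, value string) error {
        if s.view.Primary != s.me {
            return ErrWrongServer // rule 5: reject direct client requests
        }
        if s.view.Backup != "" {
            // rules 2 and 3: don't apply or reply until the backup accepts
            err := call(s.view.Backup, "Server.ApplyForwarded", key, value)
            if err != nil {
                // the backup rejected us (rule 4): we are probably a stale
                // primary; the client will fetch the new view and retry
                return ErrWrongServer
            }
        }
        s.store[key] = value
        return nil
    }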