Today: two primary-backup systems
  wrap up Lab 2 discussion: high availability; deals with network partition
  hypervisor-based fault tolerance: deals with non-determinism, but doesn't tolerate network partitions

Lab 2 goals:
  availability and correctness despite server and network failures
  tolerate network problems, including partition:
    either keep going, correctly,
    or suspend operations until the network is repaired
  replacement of failed servers

Lab 2 overview:
  agreement: a "view server" decides who primary (p) and backup (b) are
    clients and servers ask the view server; they don't make independent decisions
    only one VS, which avoids multiple machines independently deciding who is primary
  repair: the view server can co-opt an "idle" server as backup after the old backup becomes primary
    the primary initializes the new backup's state

The view server maintains a sequence of "views"
  it monitors server liveness: each server periodically sends a Ping RPC
    a server is "dead" if it missed N Pings in a row, "live" again after a single Ping
  if the primary is dead: new view with the previous backup as primary
  if the backup is dead, or there is no backup: new view with a previously idle server as backup
  OK to have a view with just a primary and no backup
  the VS ignores server failures if the backup is not up to date:
    the VS does not change views unless the primary has ack'ed the current view,
    and the primary only acks once the backup is up to date

Q: what if the primary fails and there is no backup?
  wait for it to recover

Q: under what circumstances is the system unavailable?
  if the view server fails
  if the network fails
  if the network partitions so that a client can't speak to the primary selected by the VS
  if there is only one server (the primary), and it fails
  if there are two servers, but the primary fails before ack'ing the view change
    [the backup could in fact be up to date, but the VS hasn't been told yet]

Q: can more than one server think it is primary?
  view 1: S1, S2
    net broken, so the view server thinks S1 is dead, but it's actually alive
  view 2: S2, --
  now S1 is alive and not aware of view #2, so S1 still thinks it is primary,
  AND S2 is alive and thinks it is primary
  => split brain, no good
  [In particular, a client C1 that didn't see the new view could contact S1 while another
   client C2 contacts S2; if they updated the same key, that would result in an inconsistency.]

How to ensure only one server acts as primary?
  even though more than one may *think* it is primary
  "acts as" == executes and responds to client requests
  the basic idea:
    view 1: S1, S2
    view 2: S2, --
    S1 still thinks it is primary
    S1 must forward ops to S2
    S2 thinks S2 is primary, so S2 can reject S1's forwarded ops

The rules (a code sketch follows the example below):
  1. the primary in view i must have been the primary or backup in view i-1
  2. the primary must wait for the backup to accept each request (incl. Get)
  3. the primary must not respond to the client until the backup has acked that it executed the request
  4. a non-backup must reject forwarded requests
  5. a non-primary must reject direct client requests
  6. every operation must be before or after state transfer (see below)

Example:
  view 1: S1, S2
    the view server stops hearing Pings from S1
  view 2: S2, --
    it may be a while before S2 hears about view #2
  before S2 hears about view #2:
    S1 can process ops from clients, and S2 will accept forwarded requests
    S2 will reject ops from clients who have heard about view #2
      [and the rejection triggers an RPC to fetch the new view]
  after S2 hears about view #2:
    if S1 receives a client request, it will forward it, and S2 will reject it,
      so S1 can no longer act as primary
    S1 will send an error to the client; the client will ask the VS for the new view
      and re-send to S2
  the true moment of switch-over occurs when S2 hears about view #2
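
Roughly what rules 2-5 look like in code -- a minimal Go sketch in the spirit of the lab's
key/value server, but every name here (PBServer, PutArgs, ErrWrongServer, call, Forward) is
an invented placeholder, not the lab's actual skeleton:

  package pbservice

  import "sync"

  // Illustrative types only -- names are assumptions, not the lab's interface.
  type View struct {
      Viewnum         uint
      Primary, Backup string
  }
  type PutArgs struct{ Key, Value string }
  type PutReply struct{ Err string }
  type ForwardArgs struct{ Op PutArgs }
  type ForwardReply struct{ Err string }

  type PBServer struct {
      mu    sync.Mutex
      me    string            // my own network name
      view  View              // most recent view learned from the viewservice
      store map[string]string // the key/value data
  }

  // call stands in for an RPC helper: send an RPC to srv, report success.
  func call(srv, rpcname string, args interface{}, reply interface{}) bool {
      // ... a real implementation would dial srv and invoke rpcname ...
      return false
  }

  // Put: rules 2, 3, and 5.
  func (pb *PBServer) Put(args *PutArgs, reply *PutReply) error {
      pb.mu.Lock()
      defer pb.mu.Unlock()
      if pb.view.Primary != pb.me { // rule 5: non-primary rejects direct client requests
          reply.Err = "ErrWrongServer"
          return nil
      }
      if pb.view.Backup != "" { // rules 2+3: forward and wait for the backup's ack first
          var fr ForwardReply
          if !call(pb.view.Backup, "PBServer.Forward", &ForwardArgs{Op: *args}, &fr) || fr.Err != "" {
              // backup rejected or unreachable: don't act as primary
              reply.Err = "ErrWrongServer"
              return nil
          }
      }
      pb.store[args.Key] = args.Value
      reply.Err = "OK"
      return nil
  }

  // Forward: rule 4 -- a server that doesn't believe it is the backup rejects
  // forwarded requests; this is what stops a stale primary from acting.
  func (pb *PBServer) Forward(args *ForwardArgs, reply *ForwardReply) error {
      pb.mu.Lock()
      defer pb.mu.Unlock()
      if pb.view.Backup != pb.me {
          reply.Err = "ErrWrongServer"
          return nil
      }
      pb.store[args.Op.Key] = args.Op.Value
      return nil
  }

The point of the sketch: Forward is the only path a stale primary has for getting an op
executed, and a server that currently thinks it is primary refuses it (rule 4), which is what
forces S1 in the example above to stop acting as primary.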
How can the new backup get its state? e.g. the key/value state
  if S2 is backup in view i, but was not in view i-1,
    S2 should ask the primary to transfer the complete state
  rule for state transfer:
    every operation (Put/Get) must be either before or after the state transfer
    if before, the transferred state must reflect the operation
    if after, the primary must forward the operation after the transfer finishes

Q: does the primary need to forward Get()s to the backup?
  after all, Get() doesn't change anything, so why does the backup need to know?
  and the extra RPC costs time
A: Remember Get/Put/PutHash must behave as if they occur in some sequential order on a server
   that never fails -- that is, exactly-once semantics.
   Assume the primary doesn't forward a Get. In the example above, with C1 and C2 in contact
   with the old P and the old B/new P, where the old P and C1 haven't seen the new view, reads
   would be returned from the old P -- nothing in a read would force the old P to discover the
   view change. Suppose that concurrently, C2 updates the key at the new P. C1 will read the
   key but not see the new result.
   [For those who took 444, this means the system is not linearizable -- a Get that starts
    after the Put returns should always return the result of the Put. A Get that starts before
    the Put returns is allowed to return the data either before or after the Put.]

Q: Is it ever OK to read only at the primary?
  YES! If the view has no backup, then there can be no new P without notifying the old primary
  first (by rule 1 plus the VS's ack requirement, another server can become primary only via
  later views that the old primary has acked), so there's no problem.
  Puts can also be done at the primary alone if there is no backup -- in fact they must be:
  for the system to be available, you need to be able to do operations with only a primary
  and no backup.

Q: At MIT, this is the example they used to show that you can't read only at the primary.
   But it is actually not a problem! Why?
  C1 sends Get to P, but it is not forwarded to B
  C1 doesn't receive a response
  C2 sends Put, which is processed at P and B
  P fails; B takes over
  C1 resends the Get, and observes the Put, even though it shouldn't

Q: how could we make primary-only Get()s work?

Q: are there cases when the lab 2 protocol cannot make forward progress?
  Yes. Many.
    the view service fails
    the primary fails before it acknowledges the view
    the backup fails before it gets the new state
    ....
  We will start fixing those in lab 3

An implementation hint: two ways you can think of implementing the state transfer for Lab 2b.
  1. Maintain the current map; transfer the current map (sketched in code after this list).
     Between operations! All RPCs appear to occur either before or after the transfer. But
     recall that "at most once" means you need to save the response that you previously gave
     to each RPC: if the backup takes over, it needs to act as if it had been the primary.
     There is only one RPC outstanding per client at a time, so the state = the map + the
     result of the last RPC from each client.
  2. All of that state can be derived from the operation log: the sequence of RPCs processed
     at the primary. So a potentially simpler, less efficient solution is just to record the
     operation log and transfer that. The current value of a key is just the last value
     written in the log for that key.
  3. In practice, most systems would do a hybrid: take a checkpoint of the operation log so
     that it doesn't grow without bound, and then transfer the checkpoint plus all operations
     that occurred since the checkpoint.
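
One way to picture hint 1: the state a new backup needs is just the key/value map plus the
last reply sent to each client, and the transfer is a single RPC that installs it between
operations. A minimal Go sketch; the names (LastReply, TransferArgs, PBServer.Transfer) are
assumptions for illustration, not the lab's interface:

  package pbservice

  import "sync"

  // At-most-once bookkeeping: remember the reply last sent to each client, so a
  // backup that takes over can answer a retransmitted RPC the same way.
  type LastReply struct {
      Seq   int64  // sequence number of that client's most recent RPC
      Value string // the reply we gave; returned again if the RPC is retransmitted
  }

  type TransferArgs struct {
      Store   map[string]string    // the current key/value map
      Replies map[string]LastReply // last reply, keyed by client id
  }
  type TransferReply struct{ Err string }

  type PBServer struct {
      mu      sync.Mutex
      store   map[string]string
      replies map[string]LastReply
  }

  // Transfer installs the complete state on a new backup. Per rule 6, the primary
  // sends this between operations: every Put/Get is either already reflected in the
  // transferred state, or forwarded after the transfer completes.
  func (pb *PBServer) Transfer(args *TransferArgs, reply *TransferReply) error {
      pb.mu.Lock()
      defer pb.mu.Unlock()
      pb.store = args.Store
      pb.replies = args.Replies
      reply.Err = "OK"
      return nil
  }

Hint 2 would replace Store/Replies with a slice of logged operations and have the new backup
replay them; the visible state is the same either way.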
Another implementation hint: for Lab 2a, Irene sketched the state transition diagram for the
VS. What is the state transition diagram for Lab 2b? You can describe the state transition
diagram for the servers and clients, but is that enough to convince you that the system as a
whole behaves the way you want it to?

Instead, the behavior of the distributed system is the cross-product of the states of the
individual nodes -- the state at the clients, the state at the servers, the state at the VS.
Transitions are message sends/receives and local events -- e.g., if the state is that the
primary has the old view and the VS has a new view, an event would be the timer expiration
leading (eventually) to a new state where the primary has the new view. Of course many paths
are allowable, depending on which messages are delivered in which order and which nodes take
which actions. We can analyze the graph to determine whether the program has the properties
we want -- e.g., it can always perform a client request; it always responds with the latest
value of a key; etc.

This is where the snapshots paper comes in: a consistent state of a distributed system is one
that is reachable through a set of legal transitions. In any state there are likely many
possible transitions: maybe one client sends a message, maybe another, maybe a server fails.
So if we want to know whether the system is deadlocked, we need to take a consistent snapshot
of the system -- recording only local states, such that if a local state depends on (could
have been caused by) some event on a different processor, the recorded state on that
processor must include that event.
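
To make the cross-product idea concrete, here is a deliberately tiny toy model in Go: the
global state is just (view number at the VS, view number the primary knows about), the
transitions are a timer expiry at the VS or a Ping reply delivering the current view, and we
breadth-first search the reachable graph while checking one property. Everything here is
invented for illustration and is far smaller than the lab's real state space:

  package main

  import "fmt"

  // Global state as a cross-product of per-node states (toy version).
  type State struct {
      VSView      int // view number at the viewservice
      PrimaryView int // view number the primary has heard about
  }

  // next returns every global state reachable in one transition.
  func next(s State, maxView int) []State {
      var out []State
      if s.VSView < maxView {
          // local event: the VS's timer expires and it moves to a new view
          out = append(out, State{s.VSView + 1, s.PrimaryView})
      }
      if s.PrimaryView < s.VSView {
          // message delivery: a Ping reply brings the primary up to date
          out = append(out, State{s.VSView, s.VSView})
      }
      return out
  }

  func main() {
      start := State{1, 1}
      seen := map[State]bool{start: true}
      frontier := []State{start}
      for len(frontier) > 0 {
          s := frontier[0]
          frontier = frontier[1:]
          // a property checked over every reachable state:
          // the primary can lag the VS, but never gets ahead of it
          if s.PrimaryView > s.VSView {
              fmt.Println("property violated in", s)
          }
          for _, t := range next(s, 3) {
              if !seen[t] {
                  seen[t] = true
                  frontier = append(frontier, t)
              }
          }
      }
      fmt.Println("explored", len(seen), "states")
  }

The same shape of analysis -- a much richer State struct and many more transitions -- is what
it takes to argue about the system as a whole rather than about each node separately.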
CASE STUDY: HYPERVISOR-BASED FAULT-TOLERANCE
Bressoud and Schneider (SOSP 1995)
  a detailed description of a primary/backup system
  very general -- not application-specific like the labs

Motivation
  Goal: fault tolerance / availability for any existing s/w,
    by running the same instructions &c on two computers
  Transparent: any app, any O/S
  Would be magic if it worked well!
  Failure model: computers fail due to independent h/w faults

Plan 1: [simple diagram]
  two machines
  identical start state: registers, program, memory contents
  maybe they will stay identical
  if one fails, the other just keeps going; no time/data/work is lost

What will go wrong with Plan 1?
  external outputs will be duplicated
  external inputs must be sent to both
    inputs must arrive at the same time at both
  interrupts must arrive at the same time at both
  CPUs are unlikely to run at exactly the same speeds

The basic plan:
  ignore the backup's I/O instructions
  hide interrupts from the backup
  the primary forwards I/O results and interrupts to the backup
  inject those interrupts into the backup
Q: do we have to be exact about the timing of interrupts?
   can the backup just fake an interrupt when a msg arrives from the primary?

How are they able to control I/O and interrupt timing?
  they slip a virtual machine / hypervisor under the O/S
What is a hypervisor?
  a piece of software that emulates a real machine precisely,
    so an O/S &c can run on the hypervisor as a "guest"
  as if the hypervisor simulated each instruction and kept simulated registers, memory, &c
  BUT as much as possible it runs the guest on the real h/w!
    can be much faster than interpreting instructions
  many fascinating details; see the VMware papers for the x86 story
The hypervisor gives us control
  suppress output from the backup, detect interrupts/input at the primary, &c

Still need a scheme for delivering I/O inputs and interrupts
  from the primary to the backup, at the right times
  but: it is hard to know exactly when an interrupt happened at the primary,
    so that it can be repeated at the same point on the backup
Answer: the HP PA-RISC recovery counter
  it forces an interrupt every N instructions
  this allows the primary and backup hypervisors to get control at exactly the same point
    in the code

Epochs
  the primary alternates between "epoch" and "end of epoch"; the backup does too
  every epoch is the same number of instructions, e.g. 4000 instructions

Primary during epoch E:
  deliver the interrupts buffered during E-1
    interrupts deliver input and acknowledge output
  then continue executing
  I/O instructions start input/output (but it isn't done until the interrupt)
  the hypervisor hides new interrupts; it just buffers them, along with the I/O input

Primary at end of epoch E:
  send the buffered interrupt info, including I/O input, to the backup
  wait for the backup to ACK
  send "done with E" to the backup

Backup during epoch E:
  deliver the interrupts, and I/O input, the primary told us about during E-1
  then continue executing
  I/O instructions do nothing
  ignore interrupts from the backup's own devices

Backup at end of epoch E:
  receive and buffer interrupt/input msgs from the primary
  wait for "done with E+1" from the primary
  start epoch E+1
Note the backup lags the primary by one epoch:
  the backup doesn't start epoch 7 until the primary is done with epoch 7
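
A toy Go simulation of the end-of-epoch handshake described above -- the primary ships the
interrupts it buffered, waits for an ACK, then says "done"; the backup consumes epochs one
behind. All names and the message shape are invented for illustration; the real mechanism
lives inside the hypervisor:

  package main

  import "fmt"

  // epochMsg is the per-epoch bundle the primary ships to the backup.
  type epochMsg struct {
      epoch      int
      interrupts []string // buffered interrupts + I/O input for this epoch
  }

  func main() {
      toBackup := make(chan epochMsg)
      ack := make(chan int)
      done := make(chan int)
      finished := make(chan bool)

      // Backup: replay the primary's epochs, always one behind.
      go func() {
          for msg := range toBackup {
              ack <- msg.epoch // tell the primary we buffered its interrupt info
              <-done           // wait for "done with E" before delivering E's interrupts
              fmt.Printf("backup: epoch %d delivers %v\n", msg.epoch, msg.interrupts)
          }
          finished <- true
      }()

      // Primary: run three epochs, buffering "interrupts" during each.
      for e := 1; e <= 3; e++ {
          buffered := []string{fmt.Sprintf("disk-intr@%d", e)}
          fmt.Printf("primary: finished epoch %d\n", e)
          toBackup <- epochMsg{e, buffered} // send buffered interrupt info + input
          <-ack                             // wait for the backup to ACK
          done <- e                         // "done with E"
      }
      close(toBackup)
      <-finished
  }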
What if the primary fails during epoch E?
  the backup finishes E-1
  the backup times out waiting for "done with E"
  Can the backup just switch to being primary for epoch E?
    No: the primary might have crashed in the middle of writing device registers,
    and it may not be correct for the backup to repeat I/O register writes
    Example: maybe for a disk read the driver is required to write the disk controller
      registers in a specific order, e.g. track #, sector #, length, DMA address
      the primary did *some* of those writes before crashing
      the backup might then write the track # when the h/w is expecting the sector #
  Good news: many h/w devices can be reset to a known state
    and I/O operations restarted
    existing drivers already know how to handle "please reset me" interrupts
    at any rate, that's true for the HP-UX disk driver

Failover strategy:
  the backup times out waiting for "done with E"
  the backup executes epoch E as backup (no output; I/O suppressed)
  it switches to primary for E+1
  the hypervisor generates "uncertain interrupts" for I/O started <= E,
    including the I/O suppressed during failover epoch E
  during E+1: drivers respond to uncertain interrupts with a reset and a repeat of the
    waiting I/O ops
  (all this is based on footnote 5 on page 4, and IO1 and P8 on page 5)

Potential problem: repeated output
  maybe the primary crashed just as it finished E,
  and it completed some output, e.g. a disk write or a network pkt send
  the hypervisor will generate uncertain interrupts for those I/O ops
  the backup will repeat them in E+1
  Will the outside world tolerate repeated output?
    in general, maybe not -- e.g. controlling an instrument in a scientific experiment
    network protocols: no problem; they must handle duplicated packets anyway
    shared disk: no problem; it's a repeated write of the same data

Do the O/S drivers on the primary talk directly to the h/w?
  I suspect the O/S talks to devices simulated/mediated by the hypervisor
  the hypervisor needs to be able to collect input read from devices and send it to the backup
  so the hypervisor must intervene in (and understand) access to the device h/w
  the hypervisor must also initialize the device h/w when the backup takes over

Figure 1 shows a shared disk
  rather than a separate disk per machine with identical content
  only the primary writes the disk
Why does the shared disk make the design easier?
  disk writes are non-deterministic:
    one disk may have a bad sector and its write fails,
    so (unlike RAM) the backup's disk won't naturally stay identical to the primary's
  it simplifies recovery of a failed machine: no need to copy the whole disk from the survivor
  Won't disk failures ruin the fault-tolerance story?
    that can be fixed separately, with RAID
    and my memory of 1990 is that disks failed much less often than computers crashed

Today's question: what if the system had private disks instead of one dual-ported disk?
  Would that make the system more fault-tolerant?
  should the backup suppress disk I/O instructions?
  should the backup suppress disk interrupts?
  should the backup directly read its own disk? or should the primary forward read data?

What if the ethernet cable breaks?
  the primary is still running
  the backup will try to promote itself
  that's part of what the fail-stop assumption assumes away...
  fail-stop is reasonable for a tightly coupled system:
    they could probably make a 10-foot ethernet pretty reliable,
    or have two separate ethernets

What if we want to re-start a failed+repaired primary?
What if the computers have multiple cores + threads?
  Would this scheme work w/o modification?

What can we expect for performance?
  When will performance be bad?
    frequent interrupts -- they may be delayed by the epoch scheme
    lots of input -- it must be sent over the ethernet
    many privileged instructions -- many hypervisor traps
  Should epochs be short or long?
    short means many end-of-epoch pauses + chatter
    long means less overhead, but I/O interrupts are delayed longer

What performance do they see?
  CPU-bound: Figure 2
  disk-bound (really seek-bound): Figure 3
  Why do writes do better than reads?
  Why is the relative performance much better for disk-bound than for CPU-bound?
  Why does the write performance in Figure 3 level off?
    Why isn't it heading down towards 1.0, as in Figure 2?
  What is the limiting factor in performance?
    the 442-microsecond epoch overhead
    442 us is 22100 instructions (at 50 MHz)
    so 22000-instruction epochs -> 2x slowdown for CPU-bound work
    plus the time to emulate privileged instructions
  Is the 442 microseconds CPU time, or network time?
    the Figure 4 ATM experiment suggests CPU time, but it's not clear

Does anyone use these ideas today?
  VMware has a fault-tolerant VM system
  same basic idea, but more complete and sophisticated:
    no epochs
      the primary has no restrictions on where interrupts occur;
      the backup can cause them to occur at the same place
    the primary holds each output until the backup ACKs,
      to ensure the backup will produce the same output if the primary fails,
      but the primary can continue executing while it waits
    fault-tolerant network disks
    copes with partition by a test-and-set on the disk:
      at most one of the primary and backup will win
      no progress if the network disk is not reachable
    automatic creation of a new backup after a failure,
      on some spare VM host; no need to repair the same hardware
    much faster: only a 10% slowdown, not the paper's 2x

PRIMARY-BACKUP REPLICATION IN LAB 2
What wider classes of failure would we like to handle?
  temporary or permanent loss of connectivity
  network partitions == we can't know whether a server is crashed or just not reachable
What is the risk of a network failure between primary and backup?
  clients, primary, and backup might not agree about who is primary,
    or about whether the backup is alive (and thus whether it is up to date)
  result: two primaries
    e.g. some clients switch to the backup, others don't,
    or one client switches back and forth
  they don't see each other's updates!
  "split brain" does *not* achieve the goal of acting like a single server
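
The VMware FT design above resolves exactly this split-brain risk by arbitrating through one
atomic test-and-set (theirs lives on the shared network disk). A toy Go sketch of the idea,
with invented names and a shared integer standing in for the disk:

  package main

  import (
      "fmt"
      "sync/atomic"
  )

  // primaryClaim stands in for the test-and-set location on the shared disk.
  var primaryClaim int32

  // tryBecomePrimary succeeds for at most one caller: whoever flips the flag
  // from 0 to 1 wins; the loser must stay quiet instead of serving clients
  // with a possibly divergent copy of the data.
  func tryBecomePrimary() bool {
      return atomic.CompareAndSwapInt32(&primaryClaim, 0, 1)
  }

  func main() {
      fmt.Println("S1 wins?", tryBecomePrimary()) // true
      fmt.Println("S2 wins?", tryBecomePrimary()) // false: S2 must not act as primary
  }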