Weak consistency: cloud storage

Summary from last time: weak consistency for disconnected operation and source code control
  allow writes while disconnected
  merge writes into a consistent state
  ask user for help if needed
  client-server: log updates
  p2p: version vectors and merged logs

1. Cloud Storage Motivation
  Write availability
    Example: shopping cart -- always allow customer to buy, and merge later
    Alternative: build storage system so that it is super-reliable
  Multi-data center updates
    Network partitions: what if a data center is partitioned?
  There's also a performance issue:
    Update throughput: if (logically) a single copy of the key, need either
      control of the copy to be ping'ed around the network
      or all updates to be streamed to the single copy and all replicas
    Multiple-key operations make life harder, since need to coordinate updates
  As an aside: there are equivalent issues in parallel computing.
    On a multiprocessor, locking allows atomic updates of multiple variables,
      but that means pulling copies of those values locally.
    Parallel programming across multiple machines has similar issues.

2. Three general approaches
  Snapshot reads
    Perform operations on old but consistent versions of data
    Decreases dependencies if reads and writes don't need to be on same version
    Example: GFS -- run data analysis on some consistent prefix of the data
  Post-hoc (ad hoc?) resolution, using logs or version vectors to detect
      when reconciliation is needed
    Example: Dynamo, Ficus, git
  Commutativity: if operations can be (redesigned to be) composed
    Example: file descriptor allocation in UNIX
      The POSIX standard says: must be next in sequence
      What if the constraint is only that fd's must be unique?
      (then the allocation can be partitioned and done in parallel
       without communication)

3. Dynamo
  [Designed for replication within and across a small number of data centers]

  Their obsessions
    SLA, 99.9th percentile of delay
    constant failures
    "data centers being destroyed by tornadoes"
    "always writeable"

  Where does that take us?
    available => replicas
    always writeable => allowed to write just one replica if partitioned
      (or primary is down)
    no paxos, no primary or master, no agreed-on "view"
    always writeable + "replicas" + partitions = conflicting versions

  Thus: eventual consistency among versions
    accept writes at any replica
    allow divergent replicas
    allow reads to see stale data
    resolve conflicts when failures go away
      reader must merge and then write
    like Ficus -- but in a key/value store

  Unhappy consequences of eventual consistency
    Can be several "latest version"s
    Read can yield any version
    Application must merge and resolve conflicts
    No atomic operations (e.g. no PNUTS test-and-set-write)
  Dynamo is like a standard DB when all goes well
    Like Ficus when there are failures

  API: Labs
    simple k/v
    hash, not ordered, no range scans
  Query model
    get(k) -> set of values and "context"
      context is version info
    put(k, v, context)
      context indicates which versions this put supersedes

  Where is data placed?
    load balance, including as servers join/leave
    replicating
    finding keys, including if failures
    encourage puts and gets to see each other
      avoid conflicting versions
    spread over many servers

  Consistent hashing
    [ring, and physical view of servers]
    node ID = random
    key ID = hash(key)
    coordinator: successor of key
    clients send puts/gets to coordinator
    join/leave only affects neighbors
    replicas at successors
      "preference list"
    coordinator forwards puts (and gets...) to nodes on preference list
    (see the sketch below)
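
  A minimal sketch (in Go, not Dynamo's actual code) of the ring placement just
  described: node IDs and keys hash onto the same ring, the coordinator is the
  successor of hash(key), and the preference list is the coordinator plus the
  next N-1 successors. The Ring type and the node names are hypothetical; real
  Dynamo also uses virtual nodes and skips duplicate physical servers.

    // Sketch: map a key onto a consistent-hashing ring and pick its
    // coordinator plus the next N-1 successors as the preference list.
    package main

    import (
      "crypto/sha1"
      "encoding/binary"
      "fmt"
      "sort"
    )

    type Ring struct {
      ids   []uint64          // sorted node positions on the ring
      nodes map[uint64]string // position -> node name (hypothetical)
    }

    func hash(s string) uint64 {
      h := sha1.Sum([]byte(s))
      return binary.BigEndian.Uint64(h[:8])
    }

    func NewRing(names []string) *Ring {
      r := &Ring{nodes: map[uint64]string{}}
      for _, n := range names {
        id := hash(n) // "node ID = random"; derived from the name here for determinism
        r.ids = append(r.ids, id)
        r.nodes[id] = n
      }
      sort.Slice(r.ids, func(i, j int) bool { return r.ids[i] < r.ids[j] })
      return r
    }

    // PreferenceList returns the coordinator (successor of hash(key))
    // followed by the next n-1 nodes clockwise around the ring.
    func (r *Ring) PreferenceList(key string, n int) []string {
      k := hash(key)
      // first node whose ID >= hash(key); wraps to 0 if none
      i := sort.Search(len(r.ids), func(i int) bool { return r.ids[i] >= k })
      var out []string
      for j := 0; j < n && j < len(r.ids); j++ {
        out = append(out, r.nodes[r.ids[(i+j)%len(r.ids)]])
      }
      return out
    }

    func main() {
      r := NewRing([]string{"a", "b", "c", "d", "e"})
      fmt.Println(r.PreferenceList("cart:alice", 3)) // coordinator + 2 replicas
    }

  Note that adding or removing one node only changes the successor relation in
  its neighborhood, which is the "join/leave only affects neighbors" point.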
  Why consistent hashing?
    rather than per-item placement info, or FDS's TLT
    Pro:
      naturally somewhat balanced
      no central coordination needed for add/delete
      load placement is implied just by the node list, not e.g. per-item info
    Con (section 6.2):
      not really balanced (why not?), need virtual nodes
      hard to control who serves what (e.g. some keys very popular)
      add/del of a node changes the partition, requires data to shift

  Failures -- two levels, w/ different techniques
    Temporary failures
    Permanent failures
    Tension: node unreachable -- what to do?
      if really dead, need to make new copies to maintain fault-tolerance
      if really dead, want to avoid repeatedly waiting for it
      if just temporary, hugely wasteful to make new copies
    Dynamo itself treats all failures as temporary

  Temporary failure handling: quorum
    goal: do not block waiting for unreachable nodes
    goal: get should have high probability of seeing most recent put(s)
    quorum: R + W > N
      never wait for all N
      but R and W will overlap
    N is the first N *reachable* nodes in the preference list
      each node pings to keep a rough estimate of up/down
      "sloppy" quorum, since nodes may disagree on what's reachable
    coordinator handling of put/get:
      sends put/get to first N reachable nodes, in parallel
      put: waits for W replies
      get: waits for R replies
    if failures aren't too crazy, get will see all recent put versions

  When might this quorum scheme *not* provide R/W intersection?

  What if a put() leaves data far down the ring?
    i.e. after failures are repaired, the new data sits beyond the first N
    that server remembers a "hint" about where the data really belongs
      forwards it once the real home is reachable
    also -- periodic "Merkle tree" sync of the whole DB

  How can multiple versions arise?
    Maybe a node missed the latest write due to a network problem
    So it has old data, which should be superseded

  How can *conflicting* versions arise?
    Network partition, different updates
    Example: shopping basket with item X
      Partition 1 removes X, yielding ""
      Partition 2 adds Y, yielding "X Y"
      Neither copy is newer than the other -- they conflict
    After the partition heals, a client read will yield both versions
      b/c a quorum read may fetch both

  Why not resolve conflicts on a write?
    [That is, the write fails if it detects a conflict, and the app must retry]
    Two potential reasons:
      - increases latency for write operations (to resolve + write again)
      - requires waiting for more servers during the write?

  How should clients resolve conflicts on read?
    Depends on the application
    Shopping basket: merge by taking the union?
      Would un-delete item X
      Weaker than Bayou (which gets deletion right), but simpler
    Some apps probably can use latest wall-clock time
      e.g. if I'm updating my password
      Simpler for apps than merging
    Write the merged result back to Dynamo

  How to detect whether two versions conflict?
    If they are not bit-wise identical, must the client always merge+write?
    We have seen this problem before...

  Version vectors
    Example tree of versions:
      [a:1] -> [a:1,b:2]
    The VVs indicate that [a:1,b:2] supersedes [a:1]
    Dynamo nodes automatically drop [a:1] in favor of [a:1,b:2]
    Example:
      [a:1] -> [a:1,b:2]
      [a:1] -> [a:2]
      neither supersedes the other, so the client must merge

  What happens if two clients concurrently write?
    e.g. to increment a counter
    Each does read-modify-write
    So they both see the same initial version
    Will the two versions have conflicting VVs? (no!)

  What if a client resolves a conflict, but another conflict is created?
    VVs work fine if the conflicting write is on a different server
    If the conflicting writes are on the same server, then it's an application
      problem: a race condition writing to the same key
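
  To make the supersedes-vs-conflict rule concrete, here is a minimal sketch
  (hypothetical types, not Dynamo's wire format): one VV descends from another
  iff it is at least as large in every entry; if neither descends from the
  other, the versions conflict and the client must merge.

    // Sketch: compare two version vectors to decide supersedes / superseded / conflict.
    package vv

    // VV maps a coordinator node name to the count of updates it has applied.
    type VV map[string]uint64

    // Descends reports whether version a includes all updates summarized by b,
    // i.e. a[node] >= b[node] for every entry of b.
    func Descends(a, b VV) bool {
      for node, n := range b {
        if a[node] < n {
          return false
        }
      }
      return true
    }

    // Compare returns "supersedes" if a descends from b, "superseded" if b
    // descends from a, "equal", or "conflict" (the client must merge).
    func Compare(a, b VV) string {
      ab, ba := Descends(a, b), Descends(b, a)
      switch {
      case ab && ba:
        return "equal"
      case ab:
        return "supersedes" // Dynamo nodes drop the superseded version automatically
      case ba:
        return "superseded"
      default:
        return "conflict" // e.g. [a:1,b:2] vs [a:2] from the example above
      }
    }

  On the examples above: [a:1,b:2] descends from [a:1], so it supersedes; but
  [a:1,b:2] and [a:2] do not descend from each other, so a get() hands both
  versions (plus context) back to the client for merging.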
  Won't the VVs get big?
    Dynamo deletes the least-recently-updated entry if a VV has > 10 elements
    Impact of deleting a VV entry?
      won't realize one version subsumes another, will merge when not needed:
        put@b: [b:4]
        put@a: [a:3, b:4]
        forget b:4: [a:3]
        now, if you sync w/ [b:4], it looks like a merge is required
      forgetting the oldest is clever
        since that's the element most likely to be present in other branches
        so if it's missing, that forces a merge

  Is client merge of conflicting versions always possible?
    Suppose we're keeping a counter, x
    x=10, then a partition; incremented by 5 to x=15 in both partitions
    After the heal, the client sees two versions, both x=15
    What's the correct merge result? Can the client figure it out?

  Permanent server failures / additions?
    Admin manually modifies the list of servers
    System shuffles data around -- this takes a long time!
    There is no lab2-like view server that removes/adds servers
    Left to itself, the system treats all failures as temporary

  Is the design inherently low delay?
    No: client may be forced to contact a distant coordinator
    No: some of the N nodes may be distant
    No: coordinator has to wait for W or R responses

  What parts of the design are likely to help limit 99.9th-percentile delay?
    This is a question about variance, not mean
    Bad news: consulting multiple nodes for get/put is a lose
      time = max(servers); if you have to talk to lots, at least one will be slow
    Good news: Dynamo only waits for W or R out of N
      cuts off the tail of the delay distribution
      e.g. if nodes have a 1% chance of being busy with something else
      or if a few nodes are broken, the network overloaded, &c

  No real Eval section, only Experience

  How does Amazon use Dynamo?
    shopping cart (merge)
    session info (maybe Recently Visited &c?) (most recent timestamp)
    product list (mostly r/o, replication for high read throughput)

  They claim the main advantage of Dynamo is flexible N, R, W
    What do you get by varying them?
    N-R-W
    3-2-2 : default, reasonably fast R/W, reasonable durability
    3-3-1 : fast W, slow R, not very durable, not useful?
    3-1-3 : fast R, slow W, durable
    3-3-3 : ??? reduce chance of R missing W?
    3-1-1 : not useful?

  They had to fiddle with the partitioning / placement / load balance (6.2)
    Old scheme:
      Random choice of node ID meant a new node had to split old nodes' ranges
      Which required expensive scans of on-disk DBs
    New scheme:
      Pre-determined set of Q evenly divided ranges
      Each node is coordinator for a few of them
      A new node takes over a few entire ranges
      Store each range in a file, so can transfer the whole file

  How useful is the ability to have multiple versions? (6.3)
    I.e. how useful is eventual consistency
    This is a Big Question for them
    6.3 claims 0.00113% of reads see multiple versions
    Is that a lot, or a little?
      [seems like 0.05887% of requests returned no version?]
    So perhaps 0.00113% of writes benefited from always-writeable?
      I.e. would have blocked in a primary/backup scheme?
    But maybe not!
      They seem to say divergent versions were caused by concurrent writes
      Not by e.g. disconnected data centers
      Concurrent writes maybe better solved w/ test-and-set-write

  Performance / throughput (Figure 4, 6.1)
    Figure 4 says average 10 ms reads, 20 ms writes
      the 20 ms must include a disk write
      10 ms probably includes waiting for R/W of N
      so nodes are all in the same datacenter? all in the same city? the paper doesn't say
    Figure 4 says the 99.9th pctile is about 100 or 200 ms
      Why? "request load, object sizes, locality patterns"
      does this mean sometimes they had to wait for a coast-to-coast msg?
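
  Before wrapping up, a minimal sketch of the "only wait for W of N" point
  above, which is where the tail-latency win comes from; the sendPut RPC stub,
  node names, and timeout handling here are hypothetical, not Dynamo's code.

    // Sketch: a coordinator sends a put to all N reachable preference-list
    // nodes in parallel and returns as soon as W of them acknowledge, so one
    // slow or dead replica does not push the request into the 99.9th-pctile tail.
    package coord

    import (
      "errors"
      "time"
    )

    // sendPut is a stand-in for the real replica RPC (hypothetical).
    type sendPut func(node, key, val string) error

    func QuorumPut(nodes []string, key, val string, w int, rpc sendPut, timeout time.Duration) error {
      acks := make(chan error, len(nodes))
      for _, n := range nodes {
        go func(n string) { acks <- rpc(n, key, val) }(n) // fan out in parallel
      }
      got := 0
      deadline := time.After(timeout)
      for i := 0; i < len(nodes); i++ {
        select {
        case err := <-acks:
          if err == nil {
            got++
            if got >= w { // W replies are enough; ignore stragglers
              return nil
            }
          }
        case <-deadline:
          return errors.New("put timed out before reaching W acks")
        }
      }
      return errors.New("fewer than W replicas acknowledged the put")
    }

  A get coordinator is symmetric: fan out, collect the first R replies, and
  return the set of versions (compared by VV) to the client.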
Wrap-up

  Big ideas:
    eventual consistency
    partitioned operation
    allow conflicting writes, client merges
  Maybe the only way to get high availability + no blocking on a WAN
    But no evidence that entire sites get partitioned
    PNUTS design implies Yahoo thinks it's not a problem
      (but a later PNUTS follow-on said they added a Dynamo-like mode)
  Awkward model for some applications (stale reads, merges)
    This is hard for us to tell from the paper
  No agreement on whether it's good for storage systems
  Unclear what's happened to Dynamo at Amazon in the meantime
    Almost certainly significant changes (2007->2014)

How to think about VTs for file synchronization?
  They detect whether there was a serial order of versions
    I.e. when I modified the file, had I already seen your modification?
    If yes, no conflict
    If no, conflict
  Or: a VT summarizes a file's complete version history
    There's no conflict if your version history is a prefix of mine

What about file deletion?
  Can H1 just forget a file's VT if it deletes the file?
    No: when H1 syncs w/ H2, it will look like H2 has a new file
    H1 must remember deleted files' VTs
  Treat delete like a file modification
    H1: f=1  ->H2
    H2:            del  ->H1
    the second sync sees H1:<1,0> vs H2:<1,1>, so the delete wins at H1
  There can be delete/write conflicts
    H1: f=1  ->H2  f=2
    H2:            del  ->H1
    H1:<2,0> vs H2:<1,1> -- conflict
    Is it OK to delete at H1?

How to delete the VTs of deleted files?
  Is it enough to wait until all hosts have seen the delete msg?
    Sync would carry, for deleted files, the set of hosts who have seen the delete
  "Wait until everyone has seen the delete" doesn't work:
    H1:          ->H3                      forget
    H2: f=1 ->H1,H3   del,seen ->H1        ->H1
    H3:               seen ->H1
    H2 needs to re-tell H1 about f, the deletion, and f's VT
      H2 doesn't know that H3 has seen the delete
    So H3 might synchronize with H1 and it *would* then tell H1 of f
      It would be illegal for f to disappear on H1 and then re-appear
    So -- this scheme doesn't allow hosts to forget reliably

Working VT GC scheme from the Ficus replicated file system
  Phase 1: accumulate the set of nodes that have seen the delete
    terminates when == the complete set of nodes
  Phase 2: accumulate the set of nodes that have completed Phase 1
    when == all nodes, can totally forget the file
  If H1 then syncs against H2, H2 must be in Phase 2, or have completed Phase 2
    if in Phase 2, H2 knows H1 once saw the delete, so it need not tell H1 about the file
    if H2 has completed Phase 2, it doesn't know about the file either

A classic problem with VTs: many hosts -> big VTs
  Easy for the VT to be bigger than the data!
  No very satisfying solution
Many file synchronizers don't use VTs -- e.g. Unison, rsync (as mentioned earlier)
  File modification times are enough if only two parties, or a star topology
  Need to remember "modified since last sync"
VTs needed if you want any-to-any sync with > 2 hosts
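
A minimal sketch of "treat delete like a file modification": keep a tombstone
entry with its VT instead of forgetting the file, and let the usual VT
comparison decide whether the delete wins, the write wins, or it's a conflict.
The types and the SyncOne helper are hypothetical, and the Ficus-style GC
phases above are not shown.

  // Sketch: deletes are just another modification, so version-timestamp
  // comparison handles delete-vs-write outcomes and conflicts uniformly.
  package fsync

  type VT map[string]int // host -> number of modifications seen from that host

  type File struct {
    Deleted bool // tombstone: the entry is kept even after deletion
    Data    []byte
    Vt      VT
  }

  func descends(a, b VT) bool {
    for h, n := range b {
      if a[h] < n {
        return false
      }
    }
    return true
  }

  // SyncOne merges the remote copy of one file into the local replica.
  // Returning false signals a conflict (e.g. delete vs. concurrent write)
  // that the user must resolve, as in the H1/H2 example above.
  func SyncOne(local, remote *File) bool {
    switch {
    case descends(local.Vt, remote.Vt):
      return true // local already saw everything remote did; keep local
    case descends(remote.Vt, local.Vt):
      *local = *remote // remote is strictly newer: adopt its data or its tombstone
      return true
    default:
      return false // concurrent histories: delete/write or write/write conflict
    }
  }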
Bayou:

Paper context:
  Early 1990s (like Ficus)
  Dawn of PDAs, laptops, tablets
    H/W clunky but clear potential
    Commercial devices did not have wireless
    No pervasive WiFi or cellular data

Let's build a meeting scheduler
  Only one meeting allowed at a time (one room).
  Each entry has a time and a description.
  We want everyone to end up seeing the same set of entries.

Traditional approach: one server
  Server processes requests one at a time
    Checks for a conflicting time, says yes or no
    Updates the DB
    Proceeds to the next request
  Server implicitly chooses an order for concurrent requests

Why aren't we satisfied with a central server?
  I want my calendar on my iPhone.
    I.e. the database is replicated on every node.
    Modify on any node, as well as read.
  Periodic connectivity to the net.
  Periodic bluetooth contact with other calendar users.

Straw man 1: merge DBs.
  Similar to iPhone calendar sync, or file sync.
  Might require lots of network b/w.
  What if there's a conflict? I.e. two meetings at the same time.
    The iPhone just schedules them both!
    But we want automatic conflict resolution.

Idea: update functions
  Have the update be a function, not a new value.
  Read the current state of the DB, decide the best change.
    E.g. "Meet at 9 if the room is free at 9, else 10, else 11."
    Rather than just "Meet at 9"
  The function must be deterministic
    Otherwise nodes will get different answers

Challenge:
  A: staff meeting at 10:00 or 11:00
  B: hiring meeting at 10:00 or 11:00
  X syncs w/ A, then B
  Y syncs w/ B, then A
  Will X put A's meeting at 10:00, and Y put A's at 11:00?

Goal: eventual consistency
  OK for X and Y to disagree initially
  But after enough syncing, everyone should agree

Idea: ordered update log
  Ordered list of updates at each node.
  The DB is the result of applying the updates in order.
  Syncing == ensure both nodes have the same updates in the log.

How can nodes agree on update order?
  Update ID:
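
The notes cut off at the update ID, but the ordered-log idea above can be
sketched. Here update IDs are assumed to be (timestamp, node ID) pairs compared
lexicographically (one common choice; the truncated notes don't define it), and
all names below are hypothetical. Because the update functions are
deterministic, any two nodes that hold the same log replay it to the same
calendar.

  // Sketch: each node keeps an ordered log of deterministic update functions
  // and rebuilds its calendar by replaying the log from scratch.
  package bayou

  import "sort"

  // UpdateID orders updates; the (timestamp, node ID) form is an assumption here.
  type UpdateID struct {
    Time int64
    Node string
  }

  func less(a, b UpdateID) bool {
    if a.Time != b.Time {
      return a.Time < b.Time
    }
    return a.Node < b.Node
  }

  // Calendar maps an hour (e.g. 9, 10, 11) to a meeting description.
  type Calendar map[int]string

  // Update is a deterministic function: it reads the current DB state and
  // decides what to change, e.g. "meet at 9 if free, else 10, else 11".
  type Update struct {
    ID    UpdateID
    Apply func(c Calendar)
  }

  type Log []Update

  // Insert keeps the log sorted by UpdateID so replay order is the same everywhere.
  func (l Log) Insert(u Update) Log {
    l = append(l, u)
    sort.Slice(l, func(i, j int) bool { return less(l[i].ID, l[j].ID) })
    return l
  }

  // Replay recomputes the DB by applying every update in log order.
  func (l Log) Replay() Calendar {
    c := Calendar{}
    for _, u := range l {
      u.Apply(c)
    }
    return c
  }

  // ScheduleFirstFree builds an update function like the staff-meeting example:
  // take the first free slot among the given hours, else do nothing.
  func ScheduleFirstFree(desc string, hours ...int) func(Calendar) {
    return func(c Calendar) {
      for _, h := range hours {
        if _, taken := c[h]; !taken {
          c[h] = desc
          return
        }
      }
    }
  }

Syncing two nodes then amounts to exchanging missing log entries; once X and Y
hold the same entries, replaying the log places A's staff meeting and B's
hiring meeting in the same slots on both.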