Callback cache coherence

1. Write-through with callbacks

Recall: write-through means all writes go to the server; reads are cached. We'll see write-back (writes can be cached) in a bit.

Leases can be inefficient: the client needs to check with the server repeatedly even if the data hasn't changed. Suppose the lease is infinite: then, to safely modify the data, the server needs to ask each reader to give up its lease, via a callback.

Record state at the server as to who has which cached copy, so that the server can inform a client when its cached data is invalid.

States for write-through cache coherence: state is per-client / per-memory object.
  invalid, read-only
Label the transitions between the states.

Careful implementation needed: when does the write return? As soon as the server is updated? Before the invalidations, or afterward?

Careful implementation needed: deadlock is possible if the client waits for the server and the server waits for the client.

For linearizability, when a node updates the server (write-through):
a. the server invalidates everyone who has a copy.
b. the server doesn't apply the write until all copies have been invalidated.
c. the server returns to the client only after it has applied the write.
d. after invalidation, clients will go back to the server to get the new copy when needed.
e. the server needs to queue new requests for that location until the write completes.
(A sketch of this write path appears at the end of this section.)

Illustrate with two concurrent writes and two concurrent readers. The readers have the item cached. The writers send their changes through to the server; the order is the order in which they reach the server. The server uses its callback state to invalidate the caches. Then a reader takes a cache miss and fetches the new value from the server.

Note the race conditions in the implementation: a client might have its data invalidated, and go back to the server, before the remaining clients have received the invalidation.
  What if we provided the old data while an invalidation is in progress? Not even eventually consistent!
  What if we provide the new data? Might be OK, but we need to ensure that updates appear at each client in the order they are applied at the server. E.g., if we update A, B, C, then I shouldn't be able to see C' while I can still read A (while some other node can read A' and C) -- that's not serializable.
  And if the server is sharded, we need to ensure either that the client sends only one request at a time, or that the updates are applied in order. (Note in the lab: one request at a time per client!)

So the state machine at the server is quite a bit more complicated:
  not cached; cached read-only {set of readers}; write pending {invalidations sent}; new read/write request pending
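Here is a minimal sketch of the server-side write path for write-through coherence with callbacks, following steps (a)-(e) above. It's written in Go; the Server and object types, the field names, and the invalidate stub are illustrative assumptions, not taken from the notes or the lab.

    package coherence

    import "sync"

    type Server struct{}

    // invalidate sends a callback to one client and waits for its ack.  In a
    // real system this is an RPC; here it is a stub so the sketch is
    // self-contained.
    func (s *Server) invalidate(client, key string) {}

    type object struct {
        mu      sync.Mutex      // serializes requests: new reads/writes queue behind a pending write (step e)
        value   []byte
        readers map[string]bool // callback state: which clients hold a cached copy
    }

    // write follows steps (a)-(e): invalidate every cached copy and wait for
    // the acks (a, b), apply the write (c), and only then return to the
    // writer.  Invalidated clients re-fetch from the server on their next
    // miss (d).
    func (s *Server) write(key string, obj *object, writer string, newValue []byte) {
        obj.mu.Lock()
        defer obj.mu.Unlock()
        for r := range obj.readers {
            if r != writer {
                // Deadlock hazard from above: this blocks on the client, so
                // the client must not itself be blocked waiting on this server.
                s.invalidate(r, key)
            }
        }
        obj.readers = make(map[string]bool) // everyone re-fetches the new value on demand
        obj.value = newValue
        // Returning here corresponds to replying to the writer, after the write is applied.
    }

Holding the per-object lock across the invalidations is one way to queue new requests for that location until the write completes (step e); a sharded or multithreaded server needs the same effect by some other means.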
Relative to strong leases: no assumption about clock synchronization, but no progress if the network fails. If the server can tell -- for sure -- that a client has failed, then it can revoke the lease safely, as long as the client reconnects when it comes back up.

So, for instance, you can combine the techniques: use a short lease on the client being up; the client must continually revalidate that it is alive, and if it fails to hear from the server, it has to revalidate everything in its cache when it reconnects to the server. Then the individual items can be cached using callbacks.

Write-through means that we must contact the server on every file modification. The pesky little brother now says: there, I changed the data. There, I changed it again. And again. Pretty soon you'd say: enough already! Just tell me when you log out!

2. Write-back cache coherence

Write-back cache coherence allows changes to be kept at the client and flushed to the server later. (Of course, this means that if the client crashes in the meantime, you might lose some of its updates.)

Illustrate the state machine for write-back cache coherence: owned, read-only, invalid, for each client x each memory object.
Constraints:
  owned by at most one client (at a time)
  read-only at any client => not owned by any client
What causes the transitions for each state?

Illustrate the original example, using write-back cache coherence:
  CPU0: v0 = f0(); done0 = true;
  CPU1: while (done0 == false) ; v1 = f1(v0); done1 = true;
  CPU2: while (done1 == false) ; v2 = f2(v0, v1);
What cache misses occur, and what data is transferred? Initially, nothing is cached. Eventually, v1 and v2 get the right values.

Efficiency of write-back versus write-through: write-back is more efficient if there are repeated writes to the same location; in this example there are no repeated writes. So would write-through do the same thing as write-back, or does write-back have to do more work if there are no repeated writes?

Where is write-back coherence used?
a) shared-memory multiprocessors
b) distributed shared memory (DSM)

DSM idea: run a shared-memory parallel program across a network of servers. Each page can be invalid, read-only, or owned on each server. Virtual memory faults trigger coherence actions (see the sketch below):
  read to an invalid page
  write to an invalid page
  remote read to an owned page
  remote write to an owned page
(Could also use compiler techniques to do coherence on a per-object basis.)
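To make the transitions concrete, here is a minimal sketch of the manager-side (directory) state machine for this kind of ownership-based, write-back coherence, again in Go. The Manager type, the stubbed-out messages, and all names are illustrative assumptions, not any particular system's actual protocol.

    package dsm

    // Manager is the directory: per page, it tracks who owns it and who holds
    // read-only copies.  The message-sending methods are stubs so the sketch
    // is self-contained; in a real DSM they are RPCs to the clients.
    type Manager struct{}

    func (m *Manager) downgrade(owner string, p *page)   {} // owner flushes the dirty page, keeps a read-only copy
    func (m *Manager) revoke(owner string, p *page)      {} // owner flushes the dirty page and invalidates it
    func (m *Manager) invalidate(client string, p *page) {}
    func (m *Manager) sendCopy(client string, p *page)   {}

    type pageState int

    const (
        invalid  pageState = iota // no client has a copy
        readOnly                  // one or more clients hold read-only copies
        owned                     // exactly one client holds the page read-write
    )

    type page struct {
        st     pageState
        owner  string          // meaningful only when st == owned
        copies map[string]bool // the read-only copyset
    }

    // readFault: a client read an invalid page.  If someone owns it, downgrade
    // them to read-only first; then add the requester to the copyset.
    func (m *Manager) readFault(p *page, client string) {
        if p.copies == nil {
            p.copies = make(map[string]bool)
        }
        if p.st == owned {
            m.downgrade(p.owner, p)
            p.copies[p.owner] = true
        }
        p.copies[client] = true
        p.st = readOnly
        m.sendCopy(client, p)
    }

    // writeFault: a client wants the page read-write.  Take it away from the
    // current owner, or invalidate all read-only copies, then transfer ownership.
    func (m *Manager) writeFault(p *page, client string) {
        switch p.st {
        case owned:
            m.revoke(p.owner, p)
        case readOnly:
            for c := range p.copies {
                if c != client {
                    m.invalidate(c, p)
                }
            }
        }
        p.copies = nil
        p.owner = client
        p.st = owned
        m.sendCopy(client, p)
    }

Note that the constraints above fall out of the handlers: at most one client owns a page at a time, and any read-only copy implies that no client holds the page read-write.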
We've been assuming that each variable is independently cached. What happens if some variables are used differently, but sit in the same unit of sharing (cache line, distributed object, file)? That is called false sharing. To avoid it, the compiler/programmer needs to analyze the sharing pattern and put each shared variable on a page with other variables used at the same time / in the same way.

Here's another example: granularity of sharing. Consider a mesh computation, with shared data at the edges between each CPU's region. What if the mesh is large, so that each page of data represents a row? Then we need to transfer the entire mesh on each time step! (A possible solution, used in SVM systems: send only the diffs between versions of the page.)

Earlier I said RPC and cache coherence are duals -- move the computation to the data versus move the data to the computation. Which is the more efficient way to implement the program above?

Other questions:
**What state do we need to keep for the various levels of coherence? TTL: none!
***Scalability of directory information: to be serializable, we need a bit per processor, or a list of the processors caching each block. (This is why there isn't cache coherence for web pages.) We could try to scale by distributing the callback state as a tree of callbacks.
***We mentioned that NFS allows for transparent recovery when the server or clients fail, using idempotent operations and TTL-based coherence. Can we get transparent server recovery with callbacks? (Recover by asking the clients what the server state was.)
****Why use write-back coherence vs. just write-through? (If data is written repeatedly.)
***Why not always use write-back coherence? Any data that is owned by a failed node cannot be accessed until it recovers, unless local changes are logged (e.g., to a network-attached disk) to allow remote recovery.

Some examples of cache coherence in practice.

Ivy: the illusion of shared memory, using virtual memory paging hardware. Page table entries are marked invalid, read-only, or read-write (owned). Take a page fault to transition between states.

Why is Ivy cool? All the advantages of *very* expensive parallel hardware, on a cheap network of workstations, with no h/w modifications required!

Do we want a single cache-coherent address space? Or perhaps we just want the programmer/compiler to issue remote references to remote data, where the programmer manages bringing the data back and forth?

Assume that each variable is on its own virtual memory page. What happens if two variables are on the same page?

Does Ivy provide linearizability or sequential consistency?

Invariants:
1. every page has exactly one current owner
2. the current owner has a copy of the page
3. if mapped r/w by the owner, there are no other copies
4. if mapped r/o by the owner, there may be identical r/o copies elsewhere
5. the manager knows about all copies

Ivy does seem to use our two sequential-consistency implementation rules; you can always construct a total order:
1. Each CPU executes its reads/writes in program order, one at a time.
2. All writes are in a total order (the manager orders them).
3. Once a read observes the effect of a write, it is partially ordered behind that write. Order the reads in any total order that obeys the partial order.

If you study the protocol carefully, it is possible to argue that no partial order created by the protocol fails to embed in a total order. All CPUs observe their own local operations in a local total order (1). All CPUs observe other CPUs' operations in an order consistent with a total order. For writes this is easy to see, because the writes form a total order: there is only a single writer at a time (2). For reads the argument is more complex, because reads can happen truly concurrently, but it is never the case that a read on one processor observes a result that is inconsistent, in a total order, with an observation on another processor. This could only happen if a scenario like A or B above were possible, and the confirmation messages plus locks ensure that it never happens (3).

Examples of systems that provide weaker consistency guarantees.

DNS: multiple replicas, needed for throughput -- a huge number of clients read from the DNS servers at the same time. E.g., the root DNS has several hundred replicas spread around the globe, plus caching to reduce the load on the servers. How does an update work? Update one replica; that replica replies, and in the background it propagates the change to the other replicas. Implication: you may not read your own writes! Update DNS, then read the value, and if you happen to connect to a different replica, you'll see the old value! The Bayou paper is an attempt to fix this: what constraints might an application need, even if we aren't serializable?

Coda: what about disconnected operation?
a) prefetch things you will need in the future
b) write back when connected
c) conflict resolution?
Coda's conflict resolution policy: changes appear in processor order, applied at the time the notebook reconnects. Is that sequentially consistent? Serializable?

A few questions: how would you implement Google Docs or Windows Live today? Assume perfect connectivity? If we wanted to allow offline work, it seems like you would sync the entire directory, and not rely on automated hoarding. Would an explicit check-in/check-out model work better? Email sync takes minutes, even fully connected.

Bayou: p2p disconnected operation, at the application level. The idea is that you should be able to sync a set of portables without access to a server. Exchange updates with a neighbor; an update is not committed until everyone sees it (and you know that no other earlier updates can occur).
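To make that last point concrete, here is a minimal sketch of that kind of pairwise sync, again in Go. The Update/Replica types, the timestamp-plus-node ordering, and the commit check are simplified illustrative assumptions, not the actual Bayou protocol.

    package bayou

    import "sort"

    // Update is one tentative, application-level operation, tagged so that all
    // devices eventually agree on a single order.
    type Update struct {
        Time int64  // logical timestamp assigned by the originating device
        Node string // originating device; breaks timestamp ties
        Op   string // the application-level operation (e.g., a calendar edit)
    }

    type Replica struct {
        Name string
        Log  []Update
    }

    // Sync is one pairwise exchange: each side receives the updates it is
    // missing, and both re-sort, since tentative updates can still be
    // reordered as earlier-timestamped ones arrive from other devices.
    func Sync(a, b *Replica) {
        merge(a, b)
        merge(b, a)
    }

    func merge(dst, src *Replica) {
        have := make(map[Update]bool)
        for _, u := range dst.Log {
            have[u] = true
        }
        for _, u := range src.Log {
            if !have[u] {
                dst.Log = append(dst.Log, u)
            }
        }
        sort.Slice(dst.Log, func(i, j int) bool {
            if dst.Log[i].Time != dst.Log[j].Time {
                return dst.Log[i].Time < dst.Log[j].Time
            }
            return dst.Log[i].Node < dst.Log[j].Node
        })
    }

    // Committed reports whether u can be treated as stable under the rule in
    // the notes: every device in the group has seen it.  (A complete check
    // also needs to know that no device can still introduce an update with an
    // earlier timestamp.)
    func Committed(u Update, seenBy map[string]bool, group []string) bool {
        for _, device := range group {
            if !seenBy[device] {
                return false
            }
        }
        return true
    }

Until an update commits, the result of applying the log is only tentative: a later exchange can insert an earlier-timestamped update and change what the application sees.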