Today:
- finish up BigTable
- overview of transactions

Two implementations of transactions on BigTable
- Megastore
- Spanner

Recent UW research: TAPIR

----------

[from last week, since we didn't get through most of this]

BigTable
- stores (semi-)structured data at Google
  - URLs -> contents, metadata, links
  - user -> preferences, recent queries
  - location -> map data
- large scale
  - capacity: 100B pages * 10 versions * ~20 KB/page => 20 PB
  - throughput: 100M+ users, millions of queries per second
  - latency: can only afford a few milliseconds per lookup

Why not use a commercial DB?
- scale is too large, cost would be very high
- low-level storage optimizations help significantly
  - hard to do this on top of a database layer
- data model exposes locality and performance tradeoffs
- can remove "unnecessary" features for better scalability
  - secondary indexes
  - multi-row transactions
  - integrity constraints
  [although people later decided they wanted each of these!]

Data model
- a big, sparse, multi-dimensional table
  - (row, column, timestamp) -> cell contents
- fast lookup on a key
- also, rows are ordered lexicographically, so scans are in order
  - can do a sort or similar processing

Consistency
- is it ACID?
- strongly consistent: operations are processed by a single tablet server, in the order received
  - durability & atomicity via a commit log stored in GFS
- what about transactions?
  - single-row only (read-modify-write, atomic compare-and-swap, etc.)

Implementation
- divide the table into tablets [~100 MB], grouped by row range
- each tablet is stored on a tablet server that manages 10-1000 tablets
  - means that range scans are efficient, because nearby rows are in the same tablet => same server
- there's a master that assigns tablets to servers, rebalances across new/deleted/overloaded servers, and coordinates splits
- and a client library that locates the data

Is this just like GFS?
- vaguely the same architecture, but...
- can leverage GFS and Chubby
  - tablet servers and the master are essentially stateless
  - tablet data gets stored in GFS, coordinated w/ Chubby
  - master stores most config data in Chubby
  - if a tablet server fails, the master just hands its list of tablets over to a new server, which opens the GFS files
  - if the master fails, a new master acquires a lock in Chubby, reads the list of servers from Chubby, and asks tablet servers what tablets they are responsible for
- scalable metadata assignment
  - don't store the entire list of tablet -> server mappings in the master
    - would make the master a bottleneck, especially for reads
  - hierarchical approach:
    - maintain a table mapping tablet (row range) -> tablet server IP/port
    - store that table in a series of tablets
    - store the list of those in one tablet
    - store that tablet's location in Chubby

Storage on one tablet [sketched in code after the retrospective below]
- most data is stored in a set of SSTable files: sorted key-value pairs
- writes go into an in-memory table (memtable) + a log in GFS
- periodically move data from the memtable to SSTables
  - basic approach: read all the SSTables in, remove stale data, merge them, write them back out
- each machine is responsible for about 100 tablets
  - so if it fails, get another 100 machines to pick up 1 tablet each

BigTable retrospective
- definitely a scalable system!
- still used at Google
- motivated a lot of the NoSQL world
- biggest mistake: not supporting distributed transactions [per Jeff Dean]
  - became really important as incremental updates started to take over
  - lots of people wanted them, tried to implement them themselves (often incorrectly!)
  - at least three papers subsequently fixed this in different ways
  - we'll read two next week!
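To make the "storage on one tablet" path concrete, here is a minimal Python sketch of the write/read flow. It is illustration only: the names and the flush threshold are invented, real SSTables are indexed files in GFS, and compaction/garbage collection are omitted.

    # Sketch of a tablet's storage path: writes go to an append-only log plus an
    # in-memory memtable; when the memtable gets big it becomes an immutable,
    # sorted SSTable; reads check the memtable first, then SSTables newest-first.
    MEMTABLE_LIMIT = 4          # made-up threshold

    class Tablet:
        def __init__(self):
            self.log = []           # stands in for the commit log in GFS
            self.sstables = []      # newest first; each one is immutable
            self.memtable = {}      # (row, column, timestamp) -> value

        def write(self, row, column, ts, value):
            self.log.append((row, column, ts, value))      # durability first
            self.memtable[(row, column, ts)] = value
            if len(self.memtable) >= MEMTABLE_LIMIT:
                self.flush()

        def flush(self):
            # memtable becomes a new SSTable; the log could now be truncated
            self.sstables.insert(0, dict(self.memtable))
            self.memtable = {}

        def read(self, row, column, ts):
            key = (row, column, ts)
            if key in self.memtable:
                return self.memtable[key]
            for sst in self.sstables:                      # newest first
                if key in sst:
                    return sst[key]
            return None

    t = Tablet()
    t.write("com.example/index.html", "contents", 1, "<html>...")

The point is just the layering: the log gives durability, the memtable gives fast writes, and reads consult the newest data first, so stale SSTable entries are shadowed until a merge/compaction removes them.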
----------

Transactions

Goal: group a bunch of individual operations into a larger atomic action
- e.g., savings_balance -= 100; checking_balance += 100
- don't want to see one without the other
  - even if the system crashes [atomicity/durability]
  - even if other transactions are running concurrently [isolation]
    - we are mostly going to talk about this one

How do we traditionally achieve these?
- i.e., in a traditional single-node database
- atomicity/durability: write-ahead logging
  - write each operation that we want to do to a log on disk
  - then write a commit record that makes all the ops commit
  - only tell the client the transaction is done after writing the commit record
- isolation: concurrency control, usually strict two-phase locking
  - keep a lock per object/DB row, usually single-writer/multi-reader
  - when reading or writing an object, acquire the appropriate lock
  - hold all locks until after commit (and the commit record is written)
  - release locks together after commit
  [a toy sketch of both mechanisms appears just before the 2PC discussion below]

Note that I am oversimplifying here
- just the single-node case is the topic of half of a database class
- actually making those solutions work and be efficient is fiendishly difficult
- ...so let's go do something even more complicated
- what makes it harder in a distributed system?
  - savings_balance and checking_balance might be on different nodes
  - they might be replicated or cached
  - need to coordinate the ordering of operations across them too!

What does correctness mean for isolation, anyway?
- usual definition: serializability
  - each transaction's reads and writes are consistent with what you'd see if the transactions executed in some serial order, one transaction at a time, on a single processor
- strict serializability = linearizability
  - same definition + a real-time requirement
- traditional database papers usually don't talk about this much because it goes without saying
  - single-node S2PL provides this

Can we have weaker levels of isolation?
- after all, we had weaker levels of consistency:
  - causal consistency, eventual consistency, etc.
- yes: these allow different anomalies, i.e., non-serial behavior of transactions
  - e.g., see part of one transaction without the rest
  - snapshot isolation, repeatable read, read committed, etc.

Weak isolation vs weak consistency
- at the strongest levels these are the same: linearizability/serializability
  - there is a single serial order of all transactions
- weaker isolation:
  - transactions aren't necessarily atomic, e.g., this interleaving can occur:
      savings -= 100
      read savings, checking      <- sees half of the transfer
      checking += 100
  - but everyone agrees on what sequence of events was seen
- weaker consistency:
  - transactions are atomic, but replicas might disagree on order, e.g.,
    - A sees: [savings -= 100; checking += 100] [read savings, checking]
    - B sees: [read savings, checking] [savings -= 100; checking += 100]
- you could combine both weak isolation & weak consistency
  - here there be dragons
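Here is the promised toy sketch of the single-node recipe: a write-ahead log for atomicity/durability plus strict two-phase locking for isolation. Everything is simplified for illustration: exclusive locks only (no reader/writer locks), no aborts, no deadlock handling, no crash recovery, and the "disk" log is just a Python list.

    import threading

    class Database:
        def __init__(self):
            self.data = {}       # object -> value
            self.locks = {}      # object -> lock (exclusive, for simplicity)
            self.log = []        # stands in for the on-disk write-ahead log

        def lock_for(self, obj):
            return self.locks.setdefault(obj, threading.Lock())

    class Transaction:
        def __init__(self, db, txid):
            self.db, self.txid, self.held = db, txid, []

        def _acquire(self, obj):
            if obj not in self.held:
                self.db.lock_for(obj).acquire()   # growing phase: lock before use
                self.held.append(obj)

        def read(self, obj):
            self._acquire(obj)
            return self.db.data.get(obj, 0)

        def write(self, obj, value):
            self._acquire(obj)
            self.db.log.append(("write", self.txid, obj, value))  # log first
            self.db.data[obj] = value

        def commit(self):
            self.db.log.append(("commit", self.txid))  # commit record = point of no return
            # ...only now tell the client the transaction is done...
            for obj in self.held:                      # strictness: release only after commit
                self.db.lock_for(obj).release()
            self.held = []

    # e.g. the transfer from above:
    db = Database()
    t = Transaction(db, txid=1)
    t.write("savings", t.read("savings") - 100)
    t.write("checking", t.read("checking") + 100)
    t.commit()

Note where the locks are released: only after the commit record is in the log, which is exactly the "strict" part of strict 2PL.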
The traditional database solution to atomic/isolated transactions in distributed systems: two-phase commit (2PC)
- model: now the database is partitioned over different hosts; still only one copy of each data item, though
- as before, 2-phase locking: when reading or writing an object, acquire its lock
- when ready to commit a transaction: the coordinator sends a prepare message to all shards
  - each shard either responds prepare_ok, promising to be able to commit the transaction later, or abort -- this is its last chance to abort
  - making this promise usually requires writing a durable log entry!
- then the coordinator sends a commit message to all shards
  - each shard then writes a commit record to disk and releases its locks

Why isn't this the end of the story?
- what if there are failures, either of shards or of the coordinator?
  - 2PC is a blocking protocol: a shard that has prepared can't make progress until it hears the outcome
  - some protocols address this, e.g., coordinator recovery
- performance:
  - can we really afford to hold locks for the entire duration of a transaction?

----------

Megastore

Subsequent storage system, built on top of BigTable
- provides an interface that looks more like SQL [but not exactly!]
- provides multi-object transactions
- as usual, Google was vague about what was actually using it
  - but later revealed: Gmail, Picasa, Calendar, etc.; also available through Google App Engine

Conventional wisdom
- hard to have consistency & performance in the wide area
  - consistency requires communication to coordinate; that's expensive in the wide area
- hard to have consistency and availability at once
  - need 2PC across copies; what about partitions?
- one solution: relaxed consistency [next week]
- Megastore: try to provide it all!
  (but note: not exactly a high-performance system!)

Megastore architecture
- each data center has
  - app servers w/ the Megastore library
  - a replication server
  - a coordinator
  - a BigTable cluster
- the data stored in BigTable is the same at every data center

Setting
- browser web requests may arrive at any replica
  - i.e., at the application server at any replica
- no designated primary replica!
- so there could easily be concurrent transactions on the same data from multiple replicas!

Data model
- a schema has a set of tables, containing a set of entities with a set of properties
- looks basically like a SQL table, but...
  - annotations about which data are accessed together (IN TABLE, etc.)
  - annotations about which data can be updated together: entity groups

Aside: some of these design choices are very non-traditional for a database

Key principle for relational DBs: data independence
- users should specify the schema for their data, define the operations they want to do on it, and let the DBMS figure out how to store it!
- consequence: performance impact is not transparent
  - easy to write a query that will take forever
  - especially in a distributed environment: might have to join data from all of your partitions

Megastore argument:
- make the performance choices explicit
- make users do things that could be really expensive, like joins, themselves
  (do you agree?)

Translation from schema to BigTable
- keeps related rows adjacent => on the same tablet for easy scans

Entity groups
- transactions can only use data within a single entity group
- one row or a set of related rows, defined by the application
  - e.g., all my Gmail messages in one entity group, yours in a different one
- example transaction: move message 321 from [my] Inbox to Personal
- not a possible transaction: deliver a message to Dan, Haichen, and Adriana

How to implement transactions?
- each entity group has a transaction log, stored in BigTable
- the data in BigTable is the result of executing the log operations
- to commit a transaction, use Paxos to add an entry to the log
- basically, like lab 3, except that log entries are transactions instead of single ops!

More specifically (sketched in code below):
- find the highest Paxos log entry number (N)
- read data from the local BigTable
- accumulate writes in temporary storage
- create a log entry that is the set of writes
- use Paxos to agree that the new entry is log entry N+1
- apply the writes in the log entry to the BigTable data
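As a rough sketch of that commit path: the per-entity-group log is simulated by a local Python list here, and propose() stands in for a full Paxos round across the replicas in other data centers (including repair of missing log entries, which is omitted entirely).

    class EntityGroup:
        def __init__(self):
            self.log = []        # per-entity-group replicated transaction log
            self.data = {}       # result of applying the log to BigTable

        def propose(self, slot, entry):
            # Paxos guarantees at most one value is chosen per slot; a simple
            # length check plays that role in this single-process sketch.
            if slot == len(self.log):
                self.log.append(entry)
                return True
            return False         # lost the race: someone else got this slot

        def apply(self, entry):
            self.data.update(entry["writes"])

    def run_transaction(group, body):
        while True:
            n = len(group.log)              # next slot; highest existing entry is n-1
            writes = {}                     # accumulate writes in temporary storage
            body(dict(group.data), writes)  # read local state, buffer the writes
            if group.propose(n, {"writes": writes}):
                group.apply(group.log[n])   # apply the new entry to the local data
                return                      # committed
            # lost Paxos for this slot => a conflicting transaction won; retry

    # e.g., move a message between labels:
    g = EntityGroup()
    run_transaction(g, lambda snapshot, w: w.update({"msg321.label": "Personal"}))

Losing the race for slot N+1 is exactly how Megastore detects a conflicting transaction: the loser retries from scratch, which is why there is effectively no concurrency within an entity group.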
What does this mean?
- need to wait for inter-data-center messages to commit!
  - but only a majority of replicas need to respond
- a replica might be missing log entries; need to repair this later
- there isn't a stable leader (unlike Chubby), so there's a definite risk of conflicts
  - i.e., there might be conflicting transactions

What about concurrent transactions?
- suppose two concurrent transactions try to commit, both modifying X
- Megastore must abort one and allow the other to commit
- it achieves this by catching conflicts during Paxos agreement
  - Paxos allows only one value to be chosen for log entry N+1
  - the other application server will learn that it didn't win
  - it needs to retry the whole transaction
- does this work?
  - actually, it prohibits *any* concurrency within an entity group!
  - even if the transactions do not actually conflict!
  - traditional DB locking would not have this problem: it would allow concurrency on non-overlapping data

Paxos tricks
- distinguished proposer
  - we talked about this before in the context of Multi-Paxos and Chubby
  - Megastore uses a slightly different trick: the last writer to an entity group's log gets to be the distinguished proposer for the next write
- witnesses
  - can have up to f replicas that don't actually apply the log, just write down log entries
  - still running Paxos, but they don't have to update their BigTable
  - can be promoted to full replicas if one fails

What about transactions across entity groups?
- out of luck?
- two-phase commit?

Performance
- a couple of transactions per second per entity group
- 10s of milliseconds for a read
- 100s of milliseconds for a write
- is this OK?

----------

Spanner

- subsequent system [2012] from Google
- backend for the F1 database, which stores Google's ads data
  - before that, it was a mess of partitioned & replicated MySQL instances
- addresses limitations of Megastore
  - no notion of entity groups, no restrictions on transaction scope
  - can have more than one concurrent transaction at a time
  - supports lock-free read-only transactions

Example: social network
- simple schema: user posts and friends lists
- sharded across thousands of machines
- data replicated across multiple continents

Read-only transaction example:
- generate a page of friends' recent posts
  - read my friends list, then read their posts
- what if I remove friend X, then post a mean comment about him?
- locking answer:
  - acquire read locks on the friends lists, acquire read locks on the posts
  - prevents them from being modified while we run this transaction
  - this could be really slow!
- hence, want to support lock-free r/o transactions

Spanner architecture
- each shard (tablet) is stored in a Paxos group
  - replicated across multiple data centers
  - has a relatively stable leader
  - on each replica, basically stored like BigTable, in GFS
  - use Paxos to ensure that each replica has the same state
- transactions can span Paxos groups
  - implemented as 2PC on top of Paxos groups!
- the leader of each Paxos group manages a lock table for concurrency control
  - could have every replica do it; keeping it only at the leader is an optimization
- one Paxos group leader becomes the 2PC coordinator, the others are participants

Basic 2PC/Paxos approach (see the sketch after this list):
- during execution, read and write objects
  - contact the appropriate Paxos group leader, acquire locks
- the client decides to commit, notifies the coordinator
- the coordinator contacts all shards, sends a PREPARE message
  - they Paxos-replicate a prepare log entry (including the locks held), and vote either ok or abort
- if all shards vote OK, the coordinator sends a commit message
  - each shard Paxos-replicates a commit entry
  - the leader releases its locks
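A sketch of the list above, with each Paxos group simulated by a local object. replicate() is where the real system would run a Paxos round across the group's replicas; lock conflicts, aborts, and failure handling are all left out.

    class PaxosGroup:
        def __init__(self, name):
            self.name = name
            self.log = []         # Paxos-replicated log (just a local list here)
            self.locks = set()    # lock table kept at this group's leader

        def replicate(self, record):
            # in the real system this is a Paxos round across the group's
            # replicas, replacing the durable disk write of classic 2PC
            self.log.append(record)

    def commit_transaction(groups):
        # groups[0]'s leader acts as the 2PC coordinator; execution has already
        # acquired locks at each group's leader
        coordinator, participants = groups[0], groups
        # phase 1: every shard replicates a prepare record (with its locks) and votes
        votes = []
        for g in participants:
            g.replicate(("prepare", sorted(g.locks)))
            votes.append("ok")               # a real shard might vote "abort" here
        decision = "commit" if all(v == "ok" for v in votes) else "abort"
        coordinator.replicate(("decision", decision))
        # phase 2: every shard replicates the outcome and releases its locks
        for g in participants:
            g.replicate((decision,))
            g.locks.clear()
        return decision

    # e.g. a transaction touching two shards:
    a, b = PaxosGroup("users"), PaxosGroup("posts")
    a.locks.add("alice"); b.locks.add("post-99")
    print(commit_transaction([a, b]))        # -> "commit"

The structure is ordinary 2PC; the only change from the single-node picture is that every durable log write has become a replicated one.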
This is just standard technique
- 2PC, but replacing disk log writes with a Paxos-replicated log
- the same architecture appears in various (mostly research) systems

What's left?
- lock-free r/o transactions
- how to do this? timestamps!
  - assign a meaningful timestamp to each transaction
  - order transactions meaningfully
  - then it's reasonable to say: r/o transaction X reads at timestamp 10

TrueTime
- many people think the magic here is atomic clocks and GPS
  - this is actually pretty standard stuff! (NTP, etc.)
- the key idea here is exposing the clock *uncertainty*!

TrueTime API:
- TTinterval tt = TT.now()
- the "correct" time is "guaranteed" to be between tt.earliest and tt.latest
- what does this actually mean?

Implementing TrueTime
- have a set of time masters (GPS clocks, atomic clocks) in each data center
- NTP or a similar protocol syncs with multiple masters, detects/rejects faulty ones
- TT returns the local clock value, plus the uncertainty
  - uncertainty = time since last sync * 200 usec/sec
  - where did that number come from?

Assigning a timestamp to a transaction
- consider the 2PC commit procedure: when can we assign a timestamp?
  - any time between when all locks are held and when the first lock is released (why?)
- what timestamp do we pick? one that TrueTime says is in the future
- how do we know that it's consistent with global time?
  - wait until TrueTime says the chosen timestamp is in the past
  - this requires waiting out the clock uncertainty! [sketched in code at the end of these notes]
- leads to external consistency = linearizability

Spanner actually does something a little more complicated
- the 2PC coordinator gathers a number of TrueTime timestamps
  - the prepare timestamps from the non-coordinator shards
  - the timestamp at which the coordinator received the commit message
  - the highest timestamp assigned to any previous transaction
- it picks a timestamp at least as large as all of these, and at least TT.now().latest
  - like a Lamport clock

What does commit wait mean?
- a larger uncertainty bound from TrueTime => a longer commit wait period
- a longer commit wait period => locks held longer => lower throughput
- so time uncertainty means Spanner gets slower!

What does this buy us?
- can now do a read-only transaction at a particular timestamp and have it guaranteed to be meaningful
  - assuming a multiversion store where we keep the old versions around
- can use this to do serializable ("globally consistent") transactions in the past
  - or in the present: pick a future timestamp, wait for it to be in the past

What if TrueTime fails?
- Google argument: the bound was picked using engineering considerations; a violation is less likely than a CPU failure, and we don't handle those anyway
- but what if it went wrong?
  - can cause very long commit wait periods, making the system very slow
  - can break ordering guarantees: no longer have external consistency (linearizability)
  - but serializability is still guaranteed, because the approach of taking the max of the gathered timestamps is essentially a Lamport clock
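Finally, the commit-wait sketch referenced above, assuming a TrueTime-like tt_now() that returns (earliest, latest) bounds on the true time. The 4 ms uncertainty is made up; real bounds come from the time masters plus the assumed local drift rate.

    import time

    def tt_now():
        # pretend uncertainty: +/- 4 ms around the local clock
        eps = 0.004
        t = time.time()
        return (t - eps, t + eps)

    def assign_commit_timestamp():
        # pick a timestamp no earlier than any clock could currently read
        _, latest = tt_now()
        return latest

    def commit_wait(s):
        # hold the commit (and the locks) until s is guaranteed to be in the
        # past, i.e. until even the earliest possible current time exceeds s
        while True:
            earliest, _ = tt_now()
            if earliest > s:
                return
            time.sleep(s - earliest)

    s = assign_commit_timestamp()
    commit_wait(s)     # larger uncertainty => longer wait => locks held longer
    # only now report "committed" and release locks

The wait is roughly twice the instantaneous uncertainty, which is why tighter clock synchronization translates directly into shorter lock hold times.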