Today: three real-world systems from Google
- GFS: large-scale storage
- BigTable: scalable storage of semi-structured data
- Chubby: coordination to support other services

Each of these systems has been very influential
- lots of open-source clones [not exactly identical!]
  - GFS -> HDFS
  - BigTable -> HBase, Cassandra
  - Chubby -> ZooKeeper
- also 10+ years old (GFS 2003, BigTable 2006, Chubby 2006)
  - and they had been in use for a few years before then (until Google felt they could tell us about them!)
- so these systems have undergone major changes since then
  - some have been replaced by successors (e.g., GFS -> Colossus)
  - some have successors but are still around
    - need to support legacy code even internally!

These are real systems
- as are most of the papers we'll read for the rest of the class: either real systems from industry or recent research systems [as opposed to theoretical papers like Paxos]
- they are not necessarily the best design
  - in fact, sometimes the authors have pointed this out!
- lots of opportunities for discussion about whether this is the best solution for this problem, or for other problems
  - or whether the problem they were solving was even the right one!
- lots of interesting anecdotes about side problems, esp. in the Chubby paper

First system: Chubby
- one of the first distributed "coordination services"
  - allows client apps to synchronize themselves and their environment
  - e.g., select a GFS master
  - e.g., find the BigTable directory
  - e.g., be the view service from Lab 2
- internally, a Paxos-replicated system

History:
- you might imagine:
  - Google has a lot of services that need reliable coordination
  - they are doing ad-hoc things that have bugs
  - Paxos is a better answer, but it's complicated
  - so they decided to build a better service
- in fact:
  - the first attempt did not use Paxos
  - it used Berkeley DB's commercially available replication [see "Paxos Made Live"]
  - this did not go well; it was a major source of bugs!
    - not clear what algorithm it uses or whether it is correct
  - migrating data to/from BDB was also a major source of bugs for the early Paxos-based Chubby

Interface
- simple file-system-like interface
  - hierarchical files, usually pretty small (a couple of KB)
- Open/Close
- GetContents, SetContents, Delete
- Acquire, TryAcquire, Release
- GetSequencer, SetSequencer, CheckSequencer [more later]
  [plus a few other details; not many]
- non-mandatory (advisory) locking
  - why?
    - it's really up to the application to decide what holding a lock means -- it might cover more than just the file contents!
    - failure recovery, debugging, and administrator access all want to be able to bypass the lock

Example - primary election
  x = Open("/ls/cell/service/primary")
  if (TryAcquire(x) == success) {
    // I'm the primary, tell everyone
    SetContents(x, my-address)
  } else {
    // I'm not the primary, find out who is
    primary = GetContents(x)
    // also set up notifications in case the primary changes
  }
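To make the non-primary path concrete, here is a slightly fuller sketch of the same election pattern in Python. The client object and its method names (open, try_acquire, set_contents, get_contents, watch) are hypothetical stand-ins for a Chubby client library, not its real API:

  # Sketch of Chubby-style primary election, assuming a hypothetical
  # client object. The lock is advisory: holding it only *means*
  # "I am primary" because every participant follows this protocol.
  def run_election(chubby, my_address, on_primary_change):
      f = chubby.open("/ls/cell/service/primary")
      if chubby.try_acquire(f):
          # We hold the lock: advertise our address to everyone else.
          chubby.set_contents(f, my_address)
          return my_address
      # Someone else is primary: read their address, and register for
      # notifications so we find out if the primary ever changes.
      primary = chubby.get_contents(f)
      chubby.watch(f, on_primary_change)
      return primary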
Why this interface?
- why not a Paxos consensus library?
  - one answer: Paxos is hard, nobody knows how to use it
    - apparently they don't know how to use locks either, but they think they do :-)
    - has this changed since then?
  - maybe a better answer: backwards compatibility -- locks are easier to retrofit onto existing code
- other answer: want to advertise results outside the system
  - e.g., tell all the clients where the BigTable root is, not just the master
- another answer: want a separate set of nodes to run consensus rather than your service itself
  - allows the system to keep working even with majority failures (only the coordination service needs a majority)
  - e.g., chain replication

Recall state machine replication:
- system maintains state, (state, input) -> (new state, output)
  - i.e., the system's state & output are entirely determined by its inputs
- then replication is just a problem of agreeing on the order of inputs
  - and Paxos can help
- what does this require?
  - need determinism: clocks, randomness, etc. need to be handled specially
  - parallelism is tricky
  - servers can't communicate "outside" of the system, i.e., except through state machine ops
- state machine replication is a great way to build a distributed system from scratch!
  - but really hard to retrofit onto an existing system!

Big picture of the implementation:
- 5 replicas in a cell, one is the master, Paxos replicates a log of ops
- should sound familiar!

Challenge: performance
- note that Chubby is not a high-performance system! < 1,000 writes/sec originally
- but it definitely sees more incoming RPCs than that
- and we don't want to replicate *every* operation!

Baseline: Multi-Paxos (from last lecture)
- quick recap: one node is the leader; it runs the first phase of Paxos ahead of time for all instances, reserving the right to choose the value in any slot of the log
- then the leader can commit new operations with one phase (one round trip to a majority of replicas)
- if another node suspects the leader is faulty, it runs a view change protocol: the full two-phase version of Paxos
  - this makes sure it learns all the ops the old leader committed
  - and the first phase locks out the old leader from committing new ops

Paxos performance:
- last time: batching and partitioning
- other ideas in the paper: leases and caching
- other ideas?

Leases:
- observe: in a Paxos system, much like Lab 2, the primary can't unilaterally respond to any request, even reads!
  - why? what goes wrong?
    - the rest of the system decides the primary has failed and assembles a new quorum
    - the rest of the system starts committing new writes without telling the old primary
    - but the old primary thinks it's still the primary, so it still answers reads with the old version
- the standard answer is to use coordination (Paxos) on all requests, but this can be slow
- pretty common optimization: give the leader a "lease"
  - time-based, say 10 seconds (renewable by contacting a majority of replicas)
  - the leader is allowed to respond to reads unilaterally while it holds the lease
- what do we have to do when we change leaders? wait out the lease period -- why?
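The lease rule answers that last question: a new leader waits out the old lease so that at no point can two nodes both believe they may answer reads locally. A minimal sketch, with assumed names, an assumed 10-second lease, and clock skew ignored (a real implementation has to bound it):

  import time

  LEASE_SECONDS = 10.0   # assumed lease length, renewed via a majority

  class LeaseHolder:
      """Leader-side view of its read lease (assumed helper, not Chubby's API)."""
      def __init__(self):
          self.lease_expires = 0.0              # on the local monotonic clock

      def on_majority_renewal_ack(self):
          # Called once a majority of replicas has acknowledged a renewal.
          self.lease_expires = time.monotonic() + LEASE_SECONDS

      def can_answer_read_locally(self):
          # Safe only while the lease is valid; otherwise the read has to
          # go through Paxos like any other operation.
          return time.monotonic() < self.lease_expires

  def wait_out_old_lease():
      # A new leader conservatively waits a full lease length before
      # answering reads locally, so any lease the old leader might still
      # hold has expired by then.
      time.sleep(LEASE_SECONDS)

The same trick shows up again below: Chubby's client cache entries and sessions also carry leases, so the master can make progress even when a client disappears.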
Caching:
- our usual strategy, and our usual questions about caching
- what does Chubby cache?
  - file data, metadata
  - including the fact that a file is not present
- what consistency level does Chubby provide?
  - strictly consistent: linearizable operations
  - note that ZooKeeper does not do this -- its designers thought it was unnecessary
  - and the Chubby paper says it was stronger than needed for some operations

How does Chubby implement caching?
- kind of like the DSM paper
- the client maintains a local cache
- the master keeps a list of which clients might be caching each file
- the master sends invalidations (not updates -- why?)
  - a client might never need the newly updated version, and then you're pushing updates out forever
- cache entries have leases -- so they expire automatically after a few seconds
  - this gives us an answer if a client doesn't respond promptly to an invalidation

A surprising use case:
- programmers often wound up using Chubby for service discovery instead of DNS
- why? better consistency
- how does DNS caching work?
  - purely time-based: each entry has a configurable expiration time
  - if it's really high: slow to update (one day is common!)
  - if it's low: the cache is no good!
- what would Chubby do instead?
  - clients get to keep entries in their cache
  - the server invalidates them when they change
  - clients just need to check in periodically to renew the cache lease
- is this better?
  - definitely, if items change infrequently but we want prompt notice when they do
  - way more complicated for the server -- so DNS probably shouldn't do this!

Client failure:
- clients have a persistent connection (session) to Chubby
  - they need to acknowledge it with periodic keep-alives (~10 seconds)
- if none is received, the Chubby server treats the client as failed: it removes the client's cache entries from its table, drops any locks the client holds, etc.

Master failure:
- from the client's perspective:
  - it doesn't hear from the master within a timeout
  - it declares the session "in jeopardy" and tells the app
  - it clears its cache; any Chubby ops block
  - after a grace period (45 secs), it gives up and assumes the session was lost
- from the system's perspective:
  - run a Paxos round to elect a new master
  - increment a master epoch number (like a view number!) to keep masters straight
  - the new master rebuilds its database with all the ops from the old master
    - including which clients have which files open, pending notifications, etc.
  - wait for the old master's lease to expire
  - tell all the clients that there was a failover -- why?
    - did the old master send invalidations for all cache events and deliver all the appropriate notifications? can't be sure!
    - this event is supposed to tell the client that notifications may have been lost
    - though some apps just panicked and crashed on this notification!

Performance:
- ~50k clients per cell (~30k proxied)
- ~22k files, more than half open at once
  - most are less than 1 KB
- 2k RPCs/sec, 93% of them KeepAlives
  - < 0.07% are modifications!

Interesting asides about real-world distributed systems engineering:

"Readers will be unsurprised to learn that the fail-over code, which is exercised far less often than other parts of the system, has been a rich source of interesting bugs"

"In a few dozen cell-years of operation, we have lost data on six occasions, due to database software errors (4) and operator error (2)"
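Before moving on to GFS, here is a minimal sketch tying together the client-side session behavior above (keep-alives, jeopardy, grace period). The intervals come from the notes; the structure, function names, and callbacks are assumed, not Chubby's actual client library:

  import time

  KEEPALIVE_INTERVAL = 10.0   # ~10s keep-alives (from the notes)
  GRACE_PERIOD = 45.0         # ~45s grace period before giving up

  def session_loop(send_keepalive, on_jeopardy, on_recovered, on_session_lost):
      # send_keepalive() blocks until the next keep-alive exchange finishes
      # and returns True iff the master acknowledged it in time.
      last_ack = time.monotonic()
      in_jeopardy = False
      while True:
          if send_keepalive():
              last_ack = time.monotonic()
              if in_jeopardy:
                  on_recovered()        # a (possibly new) master picked us up
                  in_jeopardy = False
              continue
          silent = time.monotonic() - last_ack
          if not in_jeopardy and silent >= KEEPALIVE_INTERVAL:
              in_jeopardy = True
              on_jeopardy()             # clear the cache, block Chubby ops
          if silent >= GRACE_PERIOD:
              on_session_lost()         # give up: assume the session is gone
              return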
----------

GFS

Google needed a distributed file system for their search engine (late '90s)
- why not use an off-the-shelf one? (NFS, AFS, ...)
  - very different workload characteristics!
  - and they were able to design GFS for Google apps *and* design Google apps for GFS (i.e., backwards compatibility was not a huge issue!)

What was the workload they were considering?
- hundreds of web-crawling clients
- periodic batch analytics jobs: MapReduce
- big data sets for the time: 1,000 nodes, 300 TB of storage
- note that this workload has changed over time! will discuss that later
  - and not just in the sense that 300 TB isn't that much anymore!

Workload specifics:
- a few million 100MB+ files, nothing smaller, some much bigger
- reads: small random reads and large streaming reads
  - index lookups, MapReduce reads
- writes:
  - many files written once, never rewritten [MapReduce intermediate steps]
  - random writes not supported!
  - mostly appends, if anything

GFS interface
- app-level library, not a POSIX file system
- create, delete, open, close, read, write
  - concurrent writes might not be consistent!
- record append
  - supports consistent concurrent appends
- snapshot

Life without random writes:
- suppose we want to update a previous crawl result
    www.page1.com -> www.my.blogspot.com
    www.page2.com -> www.my.blogspot.com
- we want to change it because page2 no longer has the link, but a new page does
    www.page1.com -> www.my.blogspot.com
    www.page3.com -> www.my.blogspot.com
- option: delete the old record, insert a new record
  - complex, requires locking!
- GFS model: just delete the old file and create a new file where the program can append new records atomically

GFS architecture
- each file is stored as 64MB chunks (so it had better be a large file!)
- each chunk is replicated onto 3+ chunkservers
- a single master stores the metadata
  - file names
  - mapping from files to chunks
  - mapping from each chunk to its list of replicas

Single-master architecture
- all metadata stored in memory -- ~64 bytes/chunk
- the master never stores data!
- replicated with shadow masters

Fault tolerance:
- single master, but with a set of replicas -- the master is chosen w/ Chubby
- the master has an operation log for persistent logging of critical metadata updates
  - each log write is 2PC to the shadow masters
- checkpoint the log state periodically: essentially, take a snapshot of the database state, then switch to a new log
  - why?

Write operations:
- app originates a write request
- GFS client translates from (fname, data) -> (fname, chunk-idx)
- master responds w/ the chunk handle & primary/secondary replica locations
- client pushes the write data to all replicas, which buffer it internally [data-plane operation]
- client sends the write command to the primary
- primary determines a serial order and applies the writes in that order
- primary sends the serial order to the secondaries, which apply the writes
- secondaries respond to the primary, then the primary responds to the client

Caching: none!
- avoids the complexity of cache coherence
- why is it OK to do without here but needed in Chubby?

Discussion: Is this a good design?
- ~15 years later:
  - scale is much bigger: now hundreds of PB instead of TB
  - also, more like 10k servers
- bigger change: not everything is batch updates to small files!
  - ~2010: Google moves to incremental updates of the index rather than periodically recomputing it w/ MapReduce

GFS endgame:
- scaled to ~50M files, ~10 PB
- developers had to organize their apps around large files
- latency-sensitive applications suffered
- replaced with a new design, Colossus (not many details published)

What would you change to improve these problems?

Main scalability limit: the single metadata master
- same problem in HDFS (NameNode scalability)
- approach: partition the metadata among multiple masters
  - what are the challenges here?
- works out to ~100M files per master
- also supports smaller chunks: 1MB instead of 64MB

Erasure coding vs replication:
- 3 copies of each chunk is a lot!
- erasure coding gives a more flexible tradeoff: n data pieces + m check pieces
  - e.g., RAID-5: 2 data disks, 1 parity disk (the XOR of the other two)
- less storage overhead
- sub-chunk writes are more expensive (read-modify-write)
  - but GFS-style workloads probably don't care...
- recovery is harder
  - generally: need to contact the other replicas to get their pieces, then do some math to build a new piece
- Colossus: (6,3) Reed-Solomon: 1.5x overhead, tolerates 3 failures
- Facebook HDFS: RS(10,4): 1.4x overhead, tolerates 4 failures, expensive recovery
- Azure: a more advanced code, (12,4): 1.33x overhead, tolerates 4 failures, same recovery cost as Colossus
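To make the tradeoff concrete, here is a toy sketch of the simplest case above: one parity piece that is the XOR of the data pieces (the RAID-5-style code). Real systems like Colossus use Reed-Solomon codes, which generalize this to m check pieces; this is only the idea, not their implementation:

  # Toy single-parity erasure code: n data pieces plus one XOR parity
  # piece. Storage overhead is (n+1)/n, and any ONE lost piece can be
  # rebuilt -- but rebuilding requires reading every surviving piece,
  # which is why recovery is more expensive than with plain replication.
  def xor_blocks(a: bytes, b: bytes) -> bytes:
      return bytes(x ^ y for x, y in zip(a, b))

  def encode(data_pieces):
      parity = data_pieces[0]
      for piece in data_pieces[1:]:
          parity = xor_blocks(parity, piece)
      return list(data_pieces) + [parity]

  def recover(pieces, lost_index):
      # XOR of all surviving pieces reproduces the missing one,
      # whether it was a data piece or the parity piece.
      survivors = [p for i, p in enumerate(pieces) if i != lost_index]
      rebuilt = survivors[0]
      for piece in survivors[1:]:
          rebuilt = xor_blocks(rebuilt, piece)
      return rebuilt

  # e.g., encode([b'aaaa', b'bbbb']) stores 3 equal-size pieces (1.5x
  # overhead); recover(encoded, 1) rebuilds b'bbbb' from the other two.

With m check pieces, the overhead is (n+m)/n, which is where the 1.5x, 1.4x, and 1.33x figures above come from.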
----------

BigTable
- stores (semi-)structured data at Google
  - URL -> contents, metadata, links
  - user -> preferences, recent queries
  - location -> map data
- large scale
  - capacity: 100B pages * 10 versions => ~20 PB
  - throughput: 100M+ users, millions of queries per second
  - latency: can only afford a few milliseconds per lookup

Why not use a commercial DB?
- the scale is too large, and the cost would be very high
- low-level storage optimizations help significantly
  - hard to do this on top of a database layer
- the data model exposes locality and performance tradeoffs
- can remove "unnecessary" features for better scalability
  - secondary indexes
  - multi-row transactions
  - integrity constraints
  [although people later decided they wanted each of these!]

Data model
- a big, sparse, multi-dimensional table
- (row, column, timestamp) -> cell contents
- fast lookup on a key
- also, rows are ordered lexicographically, so scans proceed in order
  - can do a sort or similar processing

Consistency
- is it ACID?
- strongly consistent: operations are processed by a single tablet server, in the order received
  - durability & atomicity via a commit log stored in GFS
- what about transactions?
  - single-row only (read-modify-write, atomic compare-and-swap, etc.)

Implementation
- divide the table into tablets [~100MB], grouped by row range
- each tablet is stored on a tablet server that manages 10-1000 tablets
  - means that range scans are efficient, because nearby rows are in the same tablet => same server
- there's a master that assigns tablets to servers, rebalances across new/deleted/overloaded servers, and coordinates splits
- and a client library that locates the data

Is this just like GFS?
- vaguely the same architecture, but...
- it can leverage GFS and Chubby
  - tablet servers and the master are essentially stateless
  - tablet data is stored in GFS, coordinated w/ Chubby
  - the master stores most config data in Chubby
  - if a tablet server fails, the master just hands its list of tablets over to a new server, which opens the GFS files
  - if the master fails, the new master acquires a lock in Chubby, reads the list of servers from Chubby, and asks the tablet servers which tablets they are responsible for
- scalable metadata assignment
  - don't store the entire list of tablet -> server mappings in the master
    - that would make the master a bottleneck, especially for reads
  - hierarchical approach:
    - maintain a table mapping each tablet (row range) -> its server's IP/port
    - store that table in a series of metadata tablets
    - store the list of those metadata tablets in one root tablet
    - store the root tablet's location in Chubby

Storage on one tablet
- most data is stored in a set of SSTable files: sorted key-value pairs
- writes go into an in-memory table (memtable) + a commit log in GFS
- periodically move data from the memtable to SSTables
  - basic approach: read the SSTables in, remove stale data, merge them, write them back out
- each machine is responsible for about 100 tablets
  - so if it fails, get another 100 machines to each pick up 1 tablet
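A minimal sketch of that write/read path (memtable plus log, flush to sorted immutable SSTables, reads checking the newest data first). This is the general log-structured idea the notes describe, with assumed names and in-memory dicts standing in for files:

  # Sketch of a tablet's storage path: writes append to a commit log
  # (for durability) and go into an in-memory memtable; when the
  # memtable gets big it is flushed as a sorted, immutable SSTable;
  # reads check the memtable, then SSTables from newest to oldest.
  class TabletStore:
      def __init__(self, log_append, flush_threshold=1024):
          self.log_append = log_append      # stand-in for the GFS log
          self.memtable = {}
          self.sstables = []                # newest last; dicts stand in for files
          self.flush_threshold = flush_threshold

      def write(self, key, value):
          self.log_append((key, value))     # log first, then apply
          self.memtable[key] = value
          if len(self.memtable) >= self.flush_threshold:
              self.flush()

      def flush(self):
          # An SSTable is just the memtable written out in sorted order.
          self.sstables.append(dict(sorted(self.memtable.items())))
          self.memtable = {}

      def read(self, key):
          if key in self.memtable:
              return self.memtable[key]
          for sstable in reversed(self.sstables):
              if key in sstable:
                  return sstable[key]
          return None

The periodic merge in the notes is then just rewriting several SSTables into one, dropping stale versions along the way.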
BigTable retrospective
- definitely a scalable system!
- still used at Google
- motivated a lot of the NoSQL world
- biggest mistake: not supporting distributed transactions [per Jeff Dean]
  - became really important as incremental updates started to take over
  - lots of people wanted them and tried to implement them themselves (often incorrectly!)
  - at least three papers subsequently fixed this in different ways
  - we'll read two of them next week!