Today: three real-world systems from Google
- GFS: large-scale storage
- BigTable: scalable storage of semi-structured data
- Chubby: coordination to support other services

Each of these systems has been very influential
- lots of open-source clones [not exactly identical!]
  - GFS -> HDFS
  - BigTable -> HBase, Cassandra
  - Chubby -> ZooKeeper
- also 10+ years old (GFS 2003, BigTable 2006, Chubby 2006)
  - and they had been in use for a few years before then (until Google felt they could tell us about them!)
- so these systems have undergone major changes since then
  - some have been replaced by successors (e.g., GFS -> Colossus)
  - some have successors but are still around
    - need to support legacy code even internally!

These are real systems
- as are most of the papers we'll read for the rest of the class: either real systems from industry or recent research systems [as opposed to theoretical papers like Paxos]
- they are not necessarily the best design
  - in fact, sometimes the authors have pointed this out!
- lots of opportunities for discussion about whether this is the best solution for this problem, or for other problems
  - or whether the problem they were solving was even the right one!
- lots of interesting anecdotes about side problems, esp. in the Chubby paper

First system: Chubby
- one of the first distributed "coordination services"
  - allows client apps to synchronize themselves and their environment
  - e.g., select a GFS master
  - e.g., find the BigTable directory
  - e.g., be the view service from Lab 2
- internally, a Paxos-replicated system

History:
- you might imagine:
  - Google has a lot of services that need reliable coordination
  - they are doing ad-hoc things that have bugs
  - Paxos is a better answer, but it's complicated
  - so they decided to build a better service
- in fact:
  - the first attempt did not use Paxos
  - it used Berkeley DB's commercially available replication [see "Paxos Made Live"]
  - this did not go well; it was a major source of bugs!
    - not clear what algorithm it uses or whether it is correct
  - migrating data to/from BDB was also a major source of bugs for the early Paxos-based Chubby

Interface
- simple file-system-like interface
  - hierarchical files, usually pretty small (a couple of KB)
- Open/Close
- GetContents, SetContents, Delete
- Acquire, TryAcquire, Release
- GetSequencer, SetSequencer, CheckSequencer [more later]
  [plus a few other details; not many]
- non-mandatory (advisory) locking
  - why?
    - it's really up to the application to decide what holding a lock means -- it might cover more than just the file contents!
    - failure recovery, debugging, and administrator access all want to be able to bypass the lock

Example - primary election
  x = Open("/ls/cell/service/primary")
  if (TryAcquire(x) == success) {
    // I'm the primary, tell everyone
    SetContents(x, my-address)
  } else {
    // I'm not the primary, find out who is
    primary = GetContents(x)
    // also set up notifications in case the primary changes
  }
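To make the non-primary path concrete, here is a slightly fuller sketch of the same election pattern in Python. The client object and its method names (open, try_acquire, set_contents, get_contents, watch) are hypothetical stand-ins for a Chubby client library, not its real API:

  # Sketch of Chubby-style primary election, assuming a hypothetical
  # client object. The lock is advisory: holding it only *means*
  # "I am primary" because every participant follows this protocol.
  def run_election(chubby, my_address, on_primary_change):
      f = chubby.open("/ls/cell/service/primary")
      if chubby.try_acquire(f):
          # We hold the lock: advertise our address to everyone else.
          chubby.set_contents(f, my_address)
          return my_address
      # Someone else is primary: read their address, and register for
      # notifications so we find out if the primary ever changes.
      primary = chubby.get_contents(f)
      chubby.watch(f, on_primary_change)
      return primary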
Why this interface?
- why not a Paxos consensus library?
  - one answer: Paxos is hard, nobody knows how to use it
    - apparently they don't know how to use locks either, but they think they do :-)
    - has this changed since then?
  - maybe a better answer: backwards compatibility -- locks are easier to retrofit onto existing code
- other answer: want to advertise results outside the system
  - e.g., tell all the clients where the BigTable root is, not just the master
- another answer: want a separate set of nodes to run consensus rather than your service itself
  - allows the system to keep working even with majority failures (only the coordination service needs a majority)
  - e.g., chain replication

Recall state machine replication:
- system maintains state, (state, input) -> (new state, output)
  - i.e., the system's state & output are entirely determined by its inputs
- then replication is just a problem of agreeing on the order of inputs
  - and Paxos can help
- what does this require?
  - need determinism: clocks, randomness, etc. need to be handled specially
  - parallelism is tricky
  - servers can't communicate "outside" of the system, i.e., except through state machine ops
- state machine replication is a great way to build a distributed system from scratch!
  - but really hard to retrofit onto an existing system!

Big picture of the implementation:
- 5 replicas in a cell, one is the master, Paxos replicates a log of ops
- should sound familiar!

Challenge: performance
- note that Chubby is not a high-performance system! < 1,000 writes/sec originally
- but it definitely sees more incoming RPCs than that
- and we don't want to replicate *every* operation!

Baseline: Multi-Paxos (from last lecture)
- quick recap: one node is the leader; it runs the first phase of Paxos ahead of time for all instances, reserving the right to choose the value in any slot of the log
- then the leader can commit new operations with one phase (one round trip to a majority of replicas)
- if another node suspects the leader is faulty, it runs a view change protocol: the full two-phase version of Paxos
  - this makes sure it learns all the ops the old leader committed
  - and the first phase locks out the old leader from committing new ops

Paxos performance:
- last time: batching and partitioning
- other ideas in the paper: leases and caching
- other ideas?

Leases:
- observe: in a Paxos system, much like Lab 2, the primary can't unilaterally respond to any request, even reads!
  - why? what goes wrong?
    - the rest of the system decides the primary has failed and assembles a new quorum
    - the rest of the system starts committing new writes without telling the old primary
    - but the old primary thinks it's still the primary, so it still answers reads with the old version
- the standard answer is to use coordination (Paxos) on all requests, but this can be slow
- pretty common optimization: give the leader a "lease"
  - time-based, say 10 seconds (renewable by contacting a majority of replicas)
  - the leader is allowed to respond to reads unilaterally while it holds the lease
- what do we have to do when we change leaders? wait out the lease period -- why?
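The lease rule answers that last question: a new leader waits out the old lease so that at no point can two nodes both believe they may answer reads locally. A minimal sketch, with assumed names, an assumed 10-second lease, and clock skew ignored (a real implementation has to bound it):

  import time

  LEASE_SECONDS = 10.0   # assumed lease length, renewed via a majority

  class LeaseHolder:
      """Leader-side view of its read lease (assumed helper, not Chubby's API)."""
      def __init__(self):
          self.lease_expires = 0.0              # on the local monotonic clock

      def on_majority_renewal_ack(self):
          # Called once a majority of replicas has acknowledged a renewal.
          self.lease_expires = time.monotonic() + LEASE_SECONDS

      def can_answer_read_locally(self):
          # Safe only while the lease is valid; otherwise the read has to
          # go through Paxos like any other operation.
          return time.monotonic() < self.lease_expires

  def wait_out_old_lease():
      # A new leader conservatively waits a full lease length before
      # answering reads locally, so any lease the old leader might still
      # hold has expired by then.
      time.sleep(LEASE_SECONDS)

The same trick shows up again below: Chubby's client cache entries and sessions also carry leases, so the master can make progress even when a client disappears.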
Caching:
- our usual strategy, and our usual questions about caching
- what does Chubby cache?
  - file data, metadata
  - including the fact that a file is not present
- what consistency level does Chubby provide?
  - strictly consistent: linearizable operations
  - note that ZooKeeper does not do this -- its designers thought it was unnecessary
  - and the Chubby paper says it was stronger than needed for some operations

How does Chubby implement caching?
- kind of like the DSM paper
- the client maintains a local cache
- the master keeps a list of which clients might be caching each file
- the master sends invalidations (not updates -- why?)
  - a client might never need the newly updated version, and then you're pushing updates out forever
- cache entries have leases -- so they expire automatically after a few seconds
  - this gives us an answer if a client doesn't respond promptly to an invalidation

A surprising use case:
- programmers often wound up using Chubby for service discovery instead of DNS
- why? better consistency
- how does DNS caching work?
  - purely time-based: each entry has a configurable expiration time
  - if it's really high: slow to update (one day is common!)
  - if it's low: the cache is no good!
- what would Chubby do instead?
  - clients get to keep entries in their cache
  - the server invalidates them when they change
  - clients just need to check in periodically to renew the cache lease
- is this better?
  - definitely, if items change infrequently but we want prompt notice when they do
  - way more complicated for the server -- so DNS probably shouldn't do this!

Client failure:
- clients have a persistent connection (session) to Chubby
  - they need to acknowledge it with periodic keep-alives (~10 seconds)
- if none is received, the Chubby server treats the client as failed: it removes the client's cache entries from its table, drops any locks the client holds, etc.

Master failure:
- from the client's perspective:
  - it doesn't hear from the master within a timeout
  - it declares the session "in jeopardy" and tells the app
  - it clears its cache; any Chubby ops block
  - after a grace period (45 secs), it gives up and assumes the session was lost
- from the system's perspective:
  - run a Paxos round to elect a new master
  - increment a master epoch number (like a view number!) to keep masters straight
  - the new master rebuilds its database with all the ops from the old master
    - including which clients have which files open, pending notifications, etc.
  - wait for the old master's lease to expire
  - tell all the clients that there was a failover -- why?
    - did the old master send invalidations for all cache events and deliver all the appropriate notifications? can't be sure!
    - this event is supposed to tell the client that notifications may have been lost
    - though some apps just panicked and crashed on this notification!

Performance:
- ~50k clients per cell (~30k proxied)
- ~22k files, more than half open at once
  - most are less than 1 KB
- 2k RPCs/sec, 93% of them KeepAlives
  - < 0.07% are modifications!

Interesting asides about real-world distributed systems engineering:

"Readers will be unsurprised to learn that the fail-over code, which is exercised far less often than other parts of the system, has been a rich source of interesting bugs"

"In a few dozen cell-years of operation, we have lost data on six occasions, due to database software errors (4) and operator error (2)"
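Before moving on to GFS, here is a minimal sketch tying together the client-side session behavior above (keep-alives, jeopardy, grace period). The intervals come from the notes; the structure, function names, and callbacks are assumed, not Chubby's actual client library:

  import time

  KEEPALIVE_INTERVAL = 10.0   # ~10s keep-alives (from the notes)
  GRACE_PERIOD = 45.0         # ~45s grace period before giving up

  def session_loop(send_keepalive, on_jeopardy, on_recovered, on_session_lost):
      # send_keepalive() blocks until the next keep-alive exchange finishes
      # and returns True iff the master acknowledged it in time.
      last_ack = time.monotonic()
      in_jeopardy = False
      while True:
          if send_keepalive():
              last_ack = time.monotonic()
              if in_jeopardy:
                  on_recovered()        # a (possibly new) master picked us up
                  in_jeopardy = False
              continue
          silent = time.monotonic() - last_ack
          if not in_jeopardy and silent >= KEEPALIVE_INTERVAL:
              in_jeopardy = True
              on_jeopardy()             # clear the cache, block Chubby ops
          if silent >= GRACE_PERIOD:
              on_session_lost()         # give up: assume the session is gone
              return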
----------

GFS

Google needed a distributed file system for their search engine (late '90s)
- why not use an off-the-shelf one? (NFS, AFS, ...)
  - very different workload characteristics!
  - and they were able to design GFS for Google apps *and* design Google apps for GFS (i.e., backwards compatibility was not a huge issue!)

What was the workload they were considering?
- hundreds of web-crawling clients
- periodic batch analytics jobs: MapReduce
- big data sets for the time: 1,000 nodes, 300 TB of storage
- note that this workload has changed over time! will discuss that later
  - and not just in the sense that 300 TB isn't that much anymore!

Workload specifics:
- a few million 100MB+ files, nothing smaller, some much bigger
- reads: small random reads and large streaming reads
  - index lookups, MapReduce reads
- writes:
  - many files written once, never rewritten [MapReduce intermediate steps]
  - random writes not supported!
  - mostly appends, if anything

GFS interface
- app-level library, not a POSIX file system
- create, delete, open, close, read, write
  - concurrent writes might not be consistent!
- record append
  - supports consistent concurrent appends
- snapshot

Life without random writes:
- suppose we want to update a previous crawl result
    www.page1.com -> www.my.blogspot.com
    www.page2.com -> www.my.blogspot.com
- we want to change it because page2 no longer has the link, but a new page does
    www.page1.com -> www.my.blogspot.com
    www.page3.com -> www.my.blogspot.com
- option: delete the old record, insert a new record
  - complex, requires locking!
- GFS model: just delete the old file and create a new file where the program can append new records atomically

GFS architecture
- each file is stored as 64MB chunks (so it had better be a large file!)
- each chunk is replicated onto 3+ chunkservers
- a single master stores the metadata
  - file names
  - mapping from files to chunks
  - mapping from each chunk to its list of replicas

Single-master architecture
- all metadata stored in memory -- ~64 bytes/chunk
- the master never stores data!
- replicated with shadow masters

Fault tolerance:
- single master, but with a set of replicas -- the master is chosen w/ Chubby
- the master has an operation log for persistent logging of critical metadata updates
  - each log write is 2PC to the shadow masters
- checkpoint the log state periodically: essentially, take a snapshot of the database state, then switch to a new log
  - why?

Write operations:
- app originates a write request
- GFS client translates from (fname, data) -> (fname, chunk-idx)
- master responds w/ the chunk handle & primary/secondary replica locations
- client pushes the write data to all replicas, which buffer it internally [data-plane operation]
- client sends the write command to the primary
- primary determines a serial order and applies the writes in that order
- primary sends the serial order to the secondaries, which apply the writes
- secondaries respond to the primary, then the primary responds to the client

Caching: none!
- avoids the complexity of cache coherence
- why is it OK to do without here but needed in Chubby?

Discussion: Is this a good design?
- ~15 years later:
  - scale is much bigger: now hundreds of PB instead of TB
  - also, more like 10k servers
- bigger change: not everything is batch updates to small files!
  - ~2010: Google moves to incremental updates of the index rather than periodically recomputing it w/ MapReduce

GFS endgame:
- scaled to ~50M files, ~10 PB
- developers had to organize their apps around large files
- latency-sensitive applications suffered
- replaced with a new design, Colossus (not many details published)

What would you change to improve these problems?

Main scalability limit: the single metadata master
- same problem in HDFS (NameNode scalability)
- approach: partition the metadata among multiple masters
  - what are the challenges here?
- works out to ~100M files per master
- also supports smaller chunks: 1MB instead of 64MB

Erasure coding vs replication:
- 3 copies of each chunk is a lot!
- erasure coding gives a more flexible tradeoff: n data pieces + m check pieces
  - e.g., RAID-5: 2 data disks, 1 parity disk (the XOR of the other two)
- less storage overhead
- sub-chunk writes are more expensive (read-modify-write)
  - but GFS-style workloads probably don't care...
- recovery is harder
  - generally: need to contact the other replicas to get their pieces, then do some math to build a new piece
- Colossus: (6,3) Reed-Solomon: 1.5x overhead, tolerates 3 failures
- Facebook HDFS: RS(10,4): 1.4x overhead, tolerates 4 failures, expensive recovery
- Azure: a more advanced code, (12,4): 1.33x overhead, tolerates 4 failures, same recovery cost as Colossus
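To make the tradeoff concrete, here is a toy sketch of the simplest case above: one parity piece that is the XOR of the data pieces (the RAID-5-style code). Real systems like Colossus use Reed-Solomon codes, which generalize this to m check pieces; this is only the idea, not their implementation:

  # Toy single-parity erasure code: n data pieces plus one XOR parity
  # piece. Storage overhead is (n+1)/n, and any ONE lost piece can be
  # rebuilt -- but rebuilding requires reading every surviving piece,
  # which is why recovery is more expensive than with plain replication.
  def xor_blocks(a: bytes, b: bytes) -> bytes:
      return bytes(x ^ y for x, y in zip(a, b))

  def encode(data_pieces):
      parity = data_pieces[0]
      for piece in data_pieces[1:]:
          parity = xor_blocks(parity, piece)
      return list(data_pieces) + [parity]

  def recover(pieces, lost_index):
      # XOR of all surviving pieces reproduces the missing one,
      # whether it was a data piece or the parity piece.
      survivors = [p for i, p in enumerate(pieces) if i != lost_index]
      rebuilt = survivors[0]
      for piece in survivors[1:]:
          rebuilt = xor_blocks(rebuilt, piece)
      return rebuilt

  # e.g., encode([b'aaaa', b'bbbb']) stores 3 equal-size pieces (1.5x
  # overhead); recover(encoded, 1) rebuilds b'bbbb' from the other two.

With m check pieces, the overhead is (n+m)/n, which is where the 1.5x, 1.4x, and 1.33x figures above come from.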
----------

BigTable
- stores (semi-)structured data at Google
  - URL -> contents, metadata, links
  - user -> preferences, recent queries
  - location -> map data
- large scale
  - capacity: 100B pages * 10 versions => ~20 PB
  - throughput: 100M+ users, millions of queries per second
  - latency: can only afford a few milliseconds per lookup

Why not use a commercial DB?
- the scale is too large, and the cost would be very high
- low-level storage optimizations help significantly
  - hard to do this on top of a database layer
- the data model exposes locality and performance tradeoffs
- can remove "unnecessary" features for better scalability
  - secondary indexes
  - multi-row transactions
  - integrity constraints
  [although people later decided they wanted each of these!]

Data model
- a big, sparse, multi-dimensional table
- (row, column, timestamp) -> cell contents
- fast lookup on a key
- also, rows are ordered lexicographically, so scans proceed in order
  - can do a sort or similar processing

Consistency
- is it ACID?
- strongly consistent: operations are processed by a single tablet server, in the order received
  - durability & atomicity via a commit log stored in GFS
- what about transactions?
  - single-row only (read-modify-write, atomic compare-and-swap, etc.)

Implementation
- divide the table into tablets [~100MB], grouped by row range
- each tablet is stored on a tablet server that manages 10-1000 tablets
  - means that range scans are efficient, because nearby rows are in the same tablet => same server
- there's a master that assigns tablets to servers, rebalances across new/deleted/overloaded servers, and coordinates splits
- and a client library that locates the data

Is this just like GFS?
- vaguely the same architecture, but...
- it can leverage GFS and Chubby
  - tablet servers and the master are essentially stateless
  - tablet data is stored in GFS, coordinated w/ Chubby
  - the master stores most config data in Chubby
  - if a tablet server fails, the master just hands its list of tablets over to a new server, which opens the GFS files
  - if the master fails, the new master acquires a lock in Chubby, reads the list of servers from Chubby, and asks the tablet servers which tablets they are responsible for
- scalable metadata assignment
  - don't store the entire list of tablet -> server mappings in the master
    - that would make the master a bottleneck, especially for reads
  - hierarchical approach:
    - maintain a table mapping each tablet (row range) -> its server's IP/port
    - store that table in a series of metadata tablets
    - store the list of those metadata tablets in one root tablet
    - store the root tablet's location in Chubby

Storage on one tablet
- most data is stored in a set of SSTable files: sorted key-value pairs
- writes go into an in-memory table (memtable) + a commit log in GFS
- periodically move data from the memtable to SSTables
  - basic approach: read the SSTables in, remove stale data, merge them, write them back out
- each machine is responsible for about 100 tablets
  - so if it fails, get another 100 machines to each pick up 1 tablet
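A minimal sketch of that write/read path (memtable plus log, flush to sorted immutable SSTables, reads checking the newest data first). This is the general log-structured idea the notes describe, with assumed names and in-memory dicts standing in for files:

  # Sketch of a tablet's storage path: writes append to a commit log
  # (for durability) and go into an in-memory memtable; when the
  # memtable gets big it is flushed as a sorted, immutable SSTable;
  # reads check the memtable, then SSTables from newest to oldest.
  class TabletStore:
      def __init__(self, log_append, flush_threshold=1024):
          self.log_append = log_append      # stand-in for the GFS log
          self.memtable = {}
          self.sstables = []                # newest last; dicts stand in for files
          self.flush_threshold = flush_threshold

      def write(self, key, value):
          self.log_append((key, value))     # log first, then apply
          self.memtable[key] = value
          if len(self.memtable) >= self.flush_threshold:
              self.flush()

      def flush(self):
          # An SSTable is just the memtable written out in sorted order.
          self.sstables.append(dict(sorted(self.memtable.items())))
          self.memtable = {}

      def read(self, key):
          if key in self.memtable:
              return self.memtable[key]
          for sstable in reversed(self.sstables):
              if key in sstable:
                  return sstable[key]
          return None

The periodic merge in the notes is then just rewriting several SSTables into one, dropping stale versions along the way.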
BigTable retrospective
- definitely a scalable system!
- still used at Google
- motivated a lot of the NoSQL world
- biggest mistake: not supporting distributed transactions [per Jeff Dean]
  - became really important as incremental updates started to take over
  - lots of people wanted them and tried to implement them themselves (often incorrectly!)
  - at least three papers subsequently fixed this in different ways
  - we'll read two of them next week!