Today:
- finish up BigTable
- overview of transactions

Two implementations of transactions on BigTable
- Megastore
- Spanner

Recent UW research: TAPIR

----------

[from last week, since we didn't get through most of this]

BigTable
- stores (semi-)structured data at Google
  - URLs -> contents, metadata, links
  - user -> preferences, recent queries
  - location -> map data
- large scale
  - capacity: 100B pages * 10 versions * ~20 KB/page => 20 PB
  - throughput: 100M+ users, millions of queries per second
  - latency: can only afford a few milliseconds per lookup

Why not use a commercial DB?
- scale is too large, cost would be very high
- low-level storage optimizations help significantly
  - hard to do this on top of a database layer
- data model exposes locality and performance tradeoffs
- can remove "unnecessary" features for better scalability
  - secondary indexes
  - multi-row transactions
  - integrity constraints
  [although people later decided they wanted each of these!]

Data model
- a big, sparse, multi-dimensional table
  - (row, column, timestamp) -> cell contents
- fast lookup on a key
- also, rows are ordered lexicographically, so scans are in order
  - can do a sort or similar processing

Consistency
- is it ACID?
- strongly consistent: operations are processed by a single tablet server, in the order received
  - durability & atomicity via a commit log stored in GFS
- what about transactions?
  - single-row only (read-modify-write, atomic compare-and-swap, etc.)

Implementation
- divide the table into tablets [~100 MB], grouped by row range
- each tablet is stored on a tablet server that manages 10-1000 tablets
  - means that range scans are efficient, because nearby rows are in the same tablet => same server
- there's a master that assigns tablets to servers, rebalances across new/deleted/overloaded servers, and coordinates splits
- and a client library that locates the data

Is this just like GFS?
- vaguely the same architecture, but...
- can leverage GFS and Chubby
  - tablet servers and the master are essentially stateless
  - tablet data gets stored in GFS, coordinated w/ Chubby
  - master stores most config data in Chubby
  - if a tablet server fails, the master just hands its list of tablets over to a new server, which opens the GFS files
  - if the master fails, a new master acquires a lock in Chubby, reads the list of servers from Chubby, and asks tablet servers what tablets they are responsible for
- scalable metadata assignment
  - don't store the entire list of tablet -> server mappings in the master
    - would make the master a bottleneck, especially for reads
  - hierarchical approach:
    - maintain a table mapping tablet (row range) -> tablet server IP/port
    - store that table in a series of tablets
    - store the list of those in one tablet
    - store that tablet's location in Chubby

Storage on one tablet [sketched in code after the retrospective below]
- most data is stored in a set of SSTable files: sorted key-value pairs
- writes go into an in-memory table (memtable) + a log in GFS
- periodically move data from the memtable to SSTables
  - basic approach: read all the SSTables in, remove stale data, merge them, write them back out
- each machine is responsible for about 100 tablets
  - so if it fails, get another 100 machines to pick up 1 tablet each

BigTable retrospective
- definitely a scalable system!
- still used at Google
- motivated a lot of the NoSQL world
- biggest mistake: not supporting distributed transactions [per Jeff Dean]
  - became really important as incremental updates started to take over
  - lots of people wanted them, tried to implement them themselves (often incorrectly!)
  - at least three papers subsequently fixed this in different ways
  - we'll read two next week!
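To make the "storage on one tablet" path concrete, here is a minimal Python sketch of the write/read flow. It is illustration only: the names and the flush threshold are invented, real SSTables are indexed files in GFS, and compaction/garbage collection are omitted.

    # Sketch of a tablet's storage path: writes go to an append-only log plus an
    # in-memory memtable; when the memtable gets big it becomes an immutable,
    # sorted SSTable; reads check the memtable first, then SSTables newest-first.
    MEMTABLE_LIMIT = 4          # made-up threshold

    class Tablet:
        def __init__(self):
            self.log = []           # stands in for the commit log in GFS
            self.sstables = []      # newest first; each one is immutable
            self.memtable = {}      # (row, column, timestamp) -> value

        def write(self, row, column, ts, value):
            self.log.append((row, column, ts, value))      # durability first
            self.memtable[(row, column, ts)] = value
            if len(self.memtable) >= MEMTABLE_LIMIT:
                self.flush()

        def flush(self):
            # memtable becomes a new SSTable; the log could now be truncated
            self.sstables.insert(0, dict(self.memtable))
            self.memtable = {}

        def read(self, row, column, ts):
            key = (row, column, ts)
            if key in self.memtable:
                return self.memtable[key]
            for sst in self.sstables:                      # newest first
                if key in sst:
                    return sst[key]
            return None

    t = Tablet()
    t.write("com.example/index.html", "contents", 1, "<html>...")

The point is just the layering: the log gives durability, the memtable gives fast writes, and reads consult the newest data first, so stale SSTable entries are shadowed until a merge/compaction removes them.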
----------

Transactions

Goal: group a bunch of individual operations into a larger atomic action
- e.g., savings_balance -= 100; checking_balance += 100
- don't want to see one without the other
  - even if the system crashes [atomicity/durability]
  - even if other transactions are running concurrently [isolation]
    - we are mostly going to talk about this one

How do we traditionally achieve these?
- i.e., in a traditional single-node database
- atomicity/durability: write-ahead logging
  - write each operation that we want to do to a log on disk
  - then write a commit record that makes all the ops commit
  - only tell the client the transaction is done after writing the commit record
- isolation: concurrency control, usually strict two-phase locking
  - keep a lock per object/DB row, usually single-writer/multi-reader
  - when reading or writing an object, acquire the appropriate lock
  - hold all locks until after commit (and the commit record is written)
  - release locks together after commit
  [a toy sketch of both mechanisms appears just before the 2PC discussion below]

Note that I am oversimplifying here
- just the single-node case is the topic of half of a database class
- actually making those solutions work and be efficient is fiendishly difficult
- ...so let's go do something even more complicated
- what makes it harder in a distributed system?
  - savings_balance and checking_balance might be on different nodes
  - they might be replicated or cached
  - need to coordinate the ordering of operations across them too!

What does correctness mean for isolation, anyway?
- usual definition: serializability
  - each transaction's reads and writes are consistent with what you'd see if the transactions executed in some serial order, one transaction at a time, on a single processor
- strict serializability = linearizability
  - same definition + a real-time requirement
- traditional database papers usually don't talk about this much because it goes without saying
  - single-node S2PL provides this

Can we have weaker levels of isolation?
- after all, we had weaker levels of consistency:
  - causal consistency, eventual consistency, etc.
- yes: these allow different anomalies, i.e., non-serial behavior of transactions
  - e.g., see part of one transaction without the rest
  - snapshot isolation, repeatable read, read committed, etc.

Weak isolation vs weak consistency
- at the strongest levels these are the same: linearizability/serializability
  - there is a single serial order of all transactions
- weaker isolation:
  - transactions aren't necessarily atomic, e.g., this interleaving can occur:
      savings -= 100
      read savings, checking      <- sees half of the transfer
      checking += 100
  - but everyone agrees on what sequence of events was seen
- weaker consistency:
  - transactions are atomic, but replicas might disagree on order, e.g.,
    - A sees: [savings -= 100; checking += 100] [read savings, checking]
    - B sees: [read savings, checking] [savings -= 100; checking += 100]
- you could combine both weak isolation & weak consistency
  - here there be dragons
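Here is the promised toy sketch of the single-node recipe: a write-ahead log for atomicity/durability plus strict two-phase locking for isolation. Everything is simplified for illustration: exclusive locks only (no reader/writer locks), no aborts, no deadlock handling, no crash recovery, and the "disk" log is just a Python list.

    import threading

    class Database:
        def __init__(self):
            self.data = {}       # object -> value
            self.locks = {}      # object -> lock (exclusive, for simplicity)
            self.log = []        # stands in for the on-disk write-ahead log

        def lock_for(self, obj):
            return self.locks.setdefault(obj, threading.Lock())

    class Transaction:
        def __init__(self, db, txid):
            self.db, self.txid, self.held = db, txid, []

        def _acquire(self, obj):
            if obj not in self.held:
                self.db.lock_for(obj).acquire()   # growing phase: lock before use
                self.held.append(obj)

        def read(self, obj):
            self._acquire(obj)
            return self.db.data.get(obj, 0)

        def write(self, obj, value):
            self._acquire(obj)
            self.db.log.append(("write", self.txid, obj, value))  # log first
            self.db.data[obj] = value

        def commit(self):
            self.db.log.append(("commit", self.txid))  # commit record = point of no return
            # ...only now tell the client the transaction is done...
            for obj in self.held:                      # strictness: release only after commit
                self.db.lock_for(obj).release()
            self.held = []

    # e.g. the transfer from above:
    db = Database()
    t = Transaction(db, txid=1)
    t.write("savings", t.read("savings") - 100)
    t.write("checking", t.read("checking") + 100)
    t.commit()

Note where the locks are released: only after the commit record is in the log, which is exactly the "strict" part of strict 2PL.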
The traditional database solution to atomic/isolated transactions in distributed systems: two-phase commit (2PC)
- model: now the database is partitioned over different hosts; still only one copy of each data item, though
- as before, 2-phase locking: when reading or writing an object, acquire its lock
- when ready to commit a transaction: the coordinator sends a prepare message to all shards
  - each shard either responds prepare_ok, promising to be able to commit the transaction later, or abort -- this is its last chance to abort
  - making this promise usually requires writing a durable log entry!
- then the coordinator sends a commit message to all shards
  - each shard then writes a commit record to disk and releases its locks

Why isn't this the end of the story?
- what if there are failures, either of shards or of the coordinator?
  - 2PC is a blocking protocol: a shard that has prepared can't make progress until it hears the outcome
  - some protocols address this, e.g., coordinator recovery
- performance:
  - can we really afford to hold locks for the entire duration of a transaction?

----------

Megastore

Subsequent storage system, built on top of BigTable
- provides an interface that looks more like SQL [but not exactly!]
- provides multi-object transactions
- as usual, Google was vague about what was actually using it
  - but later revealed: Gmail, Picasa, Calendar, etc.; also available through Google App Engine

Conventional wisdom
- hard to have consistency & performance in the wide area
  - consistency requires communication to coordinate; that's expensive in the wide area
- hard to have consistency and availability at once
  - need 2PC across copies; what about partitions?
- one solution: relaxed consistency [next week]
- Megastore: try to provide it all!
  (but note: not exactly a high-performance system!)

Megastore architecture
- each data center has
  - app servers w/ the Megastore library
  - a replication server
  - a coordinator
  - a BigTable cluster
- the data stored in BigTable is the same at every data center

Setting
- browser web requests may arrive at any replica
  - i.e., at the application server at any replica
- no designated primary replica!
- so there could easily be concurrent transactions on the same data from multiple replicas!

Data model
- a schema has a set of tables, containing a set of entities with a set of properties
- looks basically like a SQL table, but...
  - annotations about which data are accessed together (IN TABLE, etc.)
  - annotations about which data can be updated together: entity groups

Aside: some of these design choices are very non-traditional for a database

Key principle for relational DBs: data independence
- users should specify the schema for their data, define the operations they want to do on it, and let the DBMS figure out how to store it!
- consequence: performance impact is not transparent
  - easy to write a query that will take forever
  - especially in a distributed environment: might have to join data from all of your partitions

Megastore argument:
- make the performance choices explicit
- make users do things that could be really expensive, like joins, themselves
  (do you agree?)

Translation from schema to BigTable
- keeps related rows adjacent => on the same tablet for easy scans

Entity groups
- transactions can only use data within a single entity group
- one row or a set of related rows, defined by the application
  - e.g., all my Gmail messages in one entity group, yours in a different one
- example transaction: move message 321 from [my] Inbox to Personal
- not a possible transaction: deliver a message to Dan, Haichen, and Adriana

How to implement transactions?
- each entity group has a transaction log, stored in BigTable
- the data in BigTable is the result of executing the log operations
- to commit a transaction, use Paxos to add an entry to the log
- basically, like lab 3, except that log entries are transactions instead of single ops!

More specifically (sketched in code below):
- find the highest Paxos log entry number (N)
- read data from the local BigTable
- accumulate writes in temporary storage
- create a log entry that is the set of writes
- use Paxos to agree that the new entry is log entry N+1
- apply the writes in the log entry to the BigTable data
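As a rough sketch of that commit path: the per-entity-group log is simulated by a local Python list here, and propose() stands in for a full Paxos round across the replicas in other data centers (including repair of missing log entries, which is omitted entirely).

    class EntityGroup:
        def __init__(self):
            self.log = []        # per-entity-group replicated transaction log
            self.data = {}       # result of applying the log to BigTable

        def propose(self, slot, entry):
            # Paxos guarantees at most one value is chosen per slot; a simple
            # length check plays that role in this single-process sketch.
            if slot == len(self.log):
                self.log.append(entry)
                return True
            return False         # lost the race: someone else got this slot

        def apply(self, entry):
            self.data.update(entry["writes"])

    def run_transaction(group, body):
        while True:
            n = len(group.log)              # next slot; highest existing entry is n-1
            writes = {}                     # accumulate writes in temporary storage
            body(dict(group.data), writes)  # read local state, buffer the writes
            if group.propose(n, {"writes": writes}):
                group.apply(group.log[n])   # apply the new entry to the local data
                return                      # committed
            # lost Paxos for this slot => a conflicting transaction won; retry

    # e.g., move a message between labels:
    g = EntityGroup()
    run_transaction(g, lambda snapshot, w: w.update({"msg321.label": "Personal"}))

Losing the race for slot N+1 is exactly how Megastore detects a conflicting transaction: the loser retries from scratch, which is why there is effectively no concurrency within an entity group.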
What does this mean?
- need to wait for inter-data-center messages to commit!
  - but only a majority of replicas need to respond
- a replica might be missing log entries; need to repair this later
- there isn't a stable leader (unlike Chubby), so there's a definite risk of conflicts
  - i.e., there might be conflicting transactions

What about concurrent transactions?
- suppose two concurrent transactions try to commit, both modifying X
- Megastore must abort one and allow the other to commit
- it achieves this by catching conflicts during Paxos agreement
  - Paxos allows only one value to be chosen for log entry N+1
  - the other application server will learn that it didn't win
  - it needs to retry the whole transaction
- does this work?
  - actually, it prohibits *any* concurrency within an entity group!
  - even if the transactions do not actually conflict!
  - traditional DB locking would not have this problem: it would allow concurrency on non-overlapping data

Paxos tricks
- distinguished proposer
  - we talked about this before in the context of Multi-Paxos and Chubby
  - Megastore uses a slightly different trick: the last writer to an entity group's log gets to be the distinguished proposer for the next write
- witnesses
  - can have up to f replicas that don't actually apply the log, just write down log entries
  - still running Paxos, but they don't have to update their BigTable
  - can be promoted to full replicas if one fails

What about transactions across entity groups?
- out of luck?
- two-phase commit?

Performance
- a couple of transactions per second per entity group
- 10s of milliseconds for a read
- 100s of milliseconds for a write
- is this OK?

----------

Spanner

- subsequent system [2012] from Google
- backend for the F1 database, which stores Google's ads data
  - before that, it was a mess of partitioned & replicated MySQL instances
- addresses limitations of Megastore
  - no notion of entity groups, no restrictions on transaction scope
  - can have more than one concurrent transaction at a time
  - supports lock-free read-only transactions

Example: social network
- simple schema: user posts and friends lists
- sharded across thousands of machines
- data replicated across multiple continents

Read-only transaction example:
- generate a page of friends' recent posts
  - read my friends list, then read their posts
- what if I remove friend X, then post a mean comment about him?
- locking answer:
  - acquire read locks on the friends lists, acquire read locks on the posts
  - prevents them from being modified while we run this transaction
  - this could be really slow!
- hence, want to support lock-free r/o transactions

Spanner architecture
- each shard (tablet) is stored in a Paxos group
  - replicated across multiple data centers
  - has a relatively stable leader
  - on each replica, basically stored like BigTable, in GFS
  - use Paxos to ensure that each replica has the same state
- transactions can span Paxos groups
  - implemented as 2PC on top of Paxos groups!
- the leader of each Paxos group manages a lock table for concurrency control
  - could have every replica do it; keeping it only at the leader is an optimization
- one Paxos group leader becomes the 2PC coordinator, the others are participants

Basic 2PC/Paxos approach (see the sketch after this list):
- during execution, read and write objects
  - contact the appropriate Paxos group leader, acquire locks
- the client decides to commit, notifies the coordinator
- the coordinator contacts all shards, sends a PREPARE message
  - they Paxos-replicate a prepare log entry (including the locks held), and vote either ok or abort
- if all shards vote OK, the coordinator sends a commit message
  - each shard Paxos-replicates a commit entry
  - the leader releases its locks
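A sketch of the list above, with each Paxos group simulated by a local object. replicate() is where the real system would run a Paxos round across the group's replicas; lock conflicts, aborts, and failure handling are all left out.

    class PaxosGroup:
        def __init__(self, name):
            self.name = name
            self.log = []         # Paxos-replicated log (just a local list here)
            self.locks = set()    # lock table kept at this group's leader

        def replicate(self, record):
            # in the real system this is a Paxos round across the group's
            # replicas, replacing the durable disk write of classic 2PC
            self.log.append(record)

    def commit_transaction(groups):
        # groups[0]'s leader acts as the 2PC coordinator; execution has already
        # acquired locks at each group's leader
        coordinator, participants = groups[0], groups
        # phase 1: every shard replicates a prepare record (with its locks) and votes
        votes = []
        for g in participants:
            g.replicate(("prepare", sorted(g.locks)))
            votes.append("ok")               # a real shard might vote "abort" here
        decision = "commit" if all(v == "ok" for v in votes) else "abort"
        coordinator.replicate(("decision", decision))
        # phase 2: every shard replicates the outcome and releases its locks
        for g in participants:
            g.replicate((decision,))
            g.locks.clear()
        return decision

    # e.g. a transaction touching two shards:
    a, b = PaxosGroup("users"), PaxosGroup("posts")
    a.locks.add("alice"); b.locks.add("post-99")
    print(commit_transaction([a, b]))        # -> "commit"

The structure is ordinary 2PC; the only change from the single-node picture is that every durable log write has become a replicated one.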
This is just standard technique
- 2PC, but replacing disk log writes with a Paxos-replicated log
- the same architecture appears in various (mostly research) systems

What's left?
- lock-free r/o transactions
- how to do this? timestamps!
  - assign a meaningful timestamp to each transaction
  - order transactions meaningfully
  - then it's reasonable to say: r/o transaction X reads at timestamp 10

TrueTime
- many people think the magic here is atomic clocks and GPS
  - this is actually pretty standard stuff! (NTP, etc.)
- the key idea here is exposing the clock *uncertainty*!

TrueTime API:
- TTinterval tt = TT.now()
- the "correct" time is "guaranteed" to be between tt.earliest and tt.latest
- what does this actually mean?

Implementing TrueTime
- have a set of time masters (GPS clocks, atomic clocks) in each data center
- NTP or a similar protocol syncs with multiple masters, detects/rejects faulty ones
- TT returns the local clock value, plus the uncertainty
  - uncertainty = time since last sync * 200 usec/sec
  - where did that number come from?

Assigning a timestamp to a transaction
- consider the 2PC commit procedure: when can we assign a timestamp?
  - any time between when all locks are held and when the first lock is released (why?)
- what timestamp do we pick? one that TrueTime says is in the future
- how do we know that it's consistent with global time?
  - wait until TrueTime says the chosen timestamp is in the past
  - this requires waiting out the clock uncertainty! [sketched in code at the end of these notes]
- leads to external consistency = linearizability

Spanner actually does something a little more complicated
- the 2PC coordinator gathers a number of TrueTime timestamps
  - the prepare timestamps from the non-coordinator shards
  - the timestamp at which the coordinator received the commit message
  - the highest timestamp assigned to any previous transaction
- it picks a timestamp at least as large as all of these, and at least TT.now().latest
  - like a Lamport clock

What does commit wait mean?
- a larger uncertainty bound from TrueTime => a longer commit wait period
- a longer commit wait period => locks held longer => lower throughput
- so time uncertainty means Spanner gets slower!

What does this buy us?
- can now do a read-only transaction at a particular timestamp and have it guaranteed to be meaningful
  - assuming a multiversion store where we keep the old versions around
- can use this to do serializable ("globally consistent") transactions in the past
  - or in the present: pick a future timestamp, wait for it to be in the past

What if TrueTime fails?
- Google argument: the bound was picked using engineering considerations; a violation is less likely than a CPU failure, and we don't handle those anyway
- but what if it went wrong?
  - can cause very long commit wait periods, making the system very slow
  - can break ordering guarantees: no longer have external consistency (linearizability)
  - but serializability is still guaranteed, because the approach of taking the max of the gathered timestamps is essentially a Lamport clock
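Finally, the commit-wait sketch referenced above, assuming a TrueTime-like tt_now() that returns (earliest, latest) bounds on the true time. The 4 ms uncertainty is made up; real bounds come from the time masters plus the assumed local drift rate.

    import time

    def tt_now():
        # pretend uncertainty: +/- 4 ms around the local clock
        eps = 0.004
        t = time.time()
        return (t - eps, t + eps)

    def assign_commit_timestamp():
        # pick a timestamp no earlier than any clock could currently read
        _, latest = tt_now()
        return latest

    def commit_wait(s):
        # hold the commit (and the locks) until s is guaranteed to be in the
        # past, i.e. until even the earliest possible current time exceeds s
        while True:
            earliest, _ = tt_now()
            if earliest > s:
                return
            time.sleep(s - earliest)

    s = assign_commit_timestamp()
    commit_wait(s)     # larger uncertainty => longer wait => locks held longer
    # only now report "committed" and release locks

The wait is roughly twice the instantaneous uncertainty, which is why tighter clock synchronization translates directly into shorter lock hold times.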