Facebook Memcache architecture

1. Facebook's scaling problem

Rapidly increasing user base (e.g., 2x every 9 months), from a very small starting point. By 2013: ~1B users spread across the globe, each updating information many times per day. The amount of data is huge, and the infrastructure has to keep pace.

1.1 Goals

Scale
Performance
Fault tolerance
"Best effort eventual consistency" -- whenever you hear the words "best effort", replace them with "not"

1.2 Strategy

Adapt off-the-shelf components where possible.
Application logic needs to support rapid change => the speed of adding new features is crucial, even more important than efficient operation (throw hardware at the problem of making app development faster).
Support third-party apps (SOA).
Fix as they go: no overarching plan (they double every four to six months).
<A rule of thumb: you need to rethink after every order of magnitude -- so for them, rearchitect every year or so.>
Issues arise in going from a few servers to many, from many to a really large #, and then to multiple data centers.

1.3 Workload: the social graph

Each user's page is unique and draws on events posted by a number of other users.
Q: Is clustering users likely to work?
No -- users can't be clustered effectively, because the social graph is a "small world" network: the maximum path length from any user to any other user is small, and users are (mostly) not in cliques.
User popularity is Zipf-distributed; some users' posts affect very large #'s (millions) of other pages.

1.4 Workload: access pattern

Lots of small lookups, with lots of dependencies.
Low spatial locality (all-to-all).
App logic does many diffuse, chained reads, so the latency of each read is crucial.
Much smaller update rate, but still large in absolute terms.

1.5 Network

Underlying network technology: the data center network is non-uniform, and the latency of cross-data-center communication is large.

2. FB solutions

2.1 Three-level architecture

Front-end web servers: app logic, stateless (if a front-end web server fails, the client can be redirected to another server).
Memcache: a lookaside cache layer.
Fault-tolerant storage layer: authoritative storage. Could be (and originally was) a SQL database; could be a NoSQL bag of bits.

2.2 Scale by hashing

Hash users to front-end web servers.
Hash keys to memcache servers.
Hash files to SQL servers.
App code is all-to-all: a given user will pull data from a large # of memcache and storage servers.

Q: What happens if a front-end web server goes down? How do we reassign its work?
Q: What happens when we add a front-end web server? How do we reassign work so that it gets its share?

One option would be for the front-end load balancers to keep a table of which users are assigned to which front ends, and to update those tables as nodes come and go. But we'd also need another table for finding which memcache server to use, and another for which SQL server. And we'd need to do the reassignment of work consistently across the cluster (if we want the data returned to be up to date).

Instead: consistent hashing. Hash both the requests and the servers onto the same ID space. Sort the servers by their hash values, H(Si) < H(Sj) < ... Each server takes all requests that hash between its value and the next server's value.

Q: How unbalanced is consistent hashing? Basic 312/332/421 result: what's the expected deviation in the size of a hash bucket? Roughly the square root of the average bucket size (bucket counts are approximately Poisson, hence roughly Gaussian for large buckets).

The upside is fault tolerance: when a node fails, we reassign its work to the next server in the list. The expected deviation in workload does not change! With Zipf workloads, the expected worst case is worse than that -- we'll need some way to handle intensely used keys.

We can do a bit better: give every server 100 (or 1000) hash values ("virtual nodes"), so that it takes the work between each of its values and the next value on the ring. The expected deviation is much smaller, and when a node fails, its work is (most likely) spread over 100 different nodes. Bookkeeping stays easy: you just have to broadcast which servers are up. A sketch of the scheme follows.
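A minimal sketch of consistent hashing with virtual nodes, in Python. This is illustrative only -- the VNODES constant, the use of MD5, and the class shape are assumptions, not what Facebook's routing layer actually does:

    import bisect, hashlib

    VNODES = 100  # virtual nodes per server; more => smoother load spread

    def h(s):
        # Map a string onto a large shared ID space.
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class ConsistentHash:
        def __init__(self, servers):
            # Each server appears VNODES times on the sorted ring.
            self.ring = sorted((h(f"{s}#{i}"), s)
                               for s in servers for i in range(VNODES))

        def lookup(self, key):
            # The first virtual node clockwise from the key's hash owns it.
            i = bisect.bisect(self.ring, (h(key), ""))
            return self.ring[i % len(self.ring)][1]

        def remove(self, server):
            # On failure, the server's vnodes vanish; each of its key
            # ranges falls to the next vnode, most likely on a
            # different surviving server.
            self.ring = [(v, s) for (v, s) in self.ring if s != server]

Note that remove() is the whole failover story: no per-key tables to update, just recompute the ring from the list of live servers that was broadcast.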
2.3 Scale by caching

Memcache is a lookaside cache, used to avoid going to the storage server whenever possible. Goal: use memory as efficiently as possible (limited replication of data, and only for very frequently used data).

Q: Is a lookaside cache serializable? (Note: we're only talking about per-key consistency here; lookaside caches clearly aren't serializable across keys.)

Thread A (reader):
  Read cache
  If missing, fetch from database
  Store back to cache

Thread B (writer):
  Change database
  Delete cache entry

Key idea: any # of readers and writers can interleave. Analyze all possible cases, to see which ones would lead to inconsistency. Example:

  A: Read cache (miss)
  A: Read database
  B: Change database
  B: Delete cache entry
  A: Store back to cache

A now caches the old value, and since B's delete has already happened, the stale entry can persist until it is evicted.

2.3.1 Lookaside cache with leases

Goals: reduce (eliminate?) the # of inconsistencies, and reduce cache-miss swarms.

On a read miss:
  Leave a marker in the cache (fetch in progress) and return a timestamp (lease token).
  Check the token when filling the cache; if it has changed, the value has (likely) changed: don't overwrite.
  If another thread read-misses in the meantime, it finds the marker and waits for the update (retries later).

Q: What if a web server crashes while holding a lease? The lease expires, and then the next web server to read-miss retries.
Q: Is this sequentially consistent for single-key reads and updates? How would you try to answer that question?

(Both protocols are sketched below.)
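First, the plain lookaside protocol from 2.3 as a minimal Python sketch. The dicts stand in for the memcache tier and the storage tier; the names are illustrative, not Facebook's actual API:

    cache = {}   # stands in for the memcache tier
    db = {}      # stands in for authoritative storage

    def read(key):
        # Lookaside: the application, not the cache, talks to the database.
        value = cache.get(key)
        if value is None:             # miss: go to authoritative storage
            value = db.get(key)
            cache[key] = value        # fill so later readers hit
        return value

    def write(key, value):
        db[key] = value               # update authoritative storage first
        cache.pop(key, None)          # then invalidate the cached copy

Writers delete the cached copy instead of updating it: deletes are idempotent, so they are safe to repeat or reorder. The bad interleaving analyzed in 2.3 is exactly the case this simple version gets wrong.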
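Second, a sketch of the lease mechanism from 2.3.1 layered on top. Real memcached does not store entries of this shape -- the ("LEASE", token, expiry) encoding and the LEASE_TTL constant are assumptions made for illustration:

    import time, uuid

    cache = {}          # key -> ("VALUE", v) or ("LEASE", token, expiry)
    db = {}
    LEASE_TTL = 10.0    # lease expires if the holder crashes mid-fill

    def read(key):
        entry = cache.get(key)
        if entry and entry[0] == "VALUE":
            return entry[1]                      # hit
        if entry and entry[0] == "LEASE" and entry[2] > time.time():
            return None                          # fetch in progress: retry later
        token = uuid.uuid4().hex                 # miss: take the lease
        cache[key] = ("LEASE", token, time.time() + LEASE_TTL)
        value = db.get(key)
        fill(key, token, value)
        return value

    def fill(key, token, value):
        entry = cache.get(key)
        # Fill only if our own lease marker is still present. If a
        # writer's delete removed it, our value may be stale: don't store.
        if entry and entry[0] == "LEASE" and entry[1] == token:
            cache[key] = ("VALUE", value)

    def write(key, value):
        db[key] = value
        cache.pop(key, None)         # the delete also revokes any lease

Replaying the bad interleaving from 2.3: B's delete removes A's lease marker, so A's fill() becomes a no-op and the stale value never lands in the cache. The crash case is covered too: after LEASE_TTL the marker expires and the next missing reader takes a fresh lease.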
2.3.2 Optimizations

Improve latency with parallel lookups, batching, and incast control.

Q: How do we balance among the various types of items in the memcache pool?
  Shared pool: estimate the likelihood of future reference and the cost of a miss.
  Or: separate pools for different types of objects.

Gutter cache for outages (a small spare pool that absorbs a failed server's load; gutter entries auto-expire rather than being invalidated).

The all-to-all workload means that adding servers doesn't help as much as you might think: it reduces the effectiveness of batching.

2.4 Scale by partitioning

Multiple independent clusters within a data center, with data replicated in the caches of each cluster => web-server-driven invalidation doesn't work well: a web server would need to invalidate every cluster on every update.

Replace it with mcsqueal: background invalidation driven off the database, batched, eventually consistent.

Q: Is eventual consistency good enough?

The web server making an update also invalidates the memcache copy in its own cluster, to reduce the likelihood of the user seeing an inconsistency.

2.4.1 Optimizations

Regional pools for important but rarely used data (where you only want one copy).

Cold cluster warmup: use another cluster as a cooperative cache. However, this means that the invalidation mechanism doesn't work! Their fix: don't allow cache fills for 2 seconds after a previous delete. Why not simply say: if the data has been deleted since the cold start, go to the database?

2.5 Scale by geography

Cross-data-center issues: for any particular data object, one DC is the master and the others are (possibly out-of-date) slaves.

Updates go to the remote master, which propagates changes to the other DCs. Reads go to the local DC, and read misses are served from the local DC's replica database.

On an update, temporarily redirect new cache misses to the remote master -- the local database won't have the update yet -- by leaving a token (a "remote marker") in the local memcache.

3. Do you think their solution works well? How might you have solved the problem differently?

Why don't they do something obvious (e.g., use coherent caching)? With all of that machinery: 0.1% of stores leave inconsistent cached copies that last at least a day ;-(

4. What technical challenges do they face? Ex: incast.

5. TxCache: memcache with cache coherence

FB provides: read-my-writes, eventually consistent. Why is read-my-writes hard? Can't I just cache all my writes, and return them if the writer (!) asks? FB "solves" it by making an attempt to keep each user assigned to the memcache that stored their latest update.

TxCache's goal: snapshot isolation for read-only transactions -- each transaction is consistent with respect to some recent version of the database.

There are lots of different consistency models, e.g., the database isolation levels. The idea was to allow apps that don't care to proceed without affecting the apps that do care (everyone pays only for what they need).
  Is it OK to read a write before it is committed? (That means reading a value that may never occur.)
  Is it OK to release a read lock before the transaction commits? (Non-repeatable reads.)

Snapshot isolation says: a read-only transaction occurs at a consistent point in time, just not necessarily the present.

How:
  Modify the database to expose version #'s.
  Allow fetches of old versions.
  The cache maintains validity ranges for each item (and may hold multiple versions of the same data).

Issues:
  Do we need to refetch the entire contents of the cache every few seconds? No: cached contents can be marked as valid in the present, as long as invalidations are applied before the lookup.
  A transaction may need to do lookups in various caches -- which version should it use? The various caches and databases may have different versions cached.
    Option 1: fetch all the metadata, then pick the best version.
    Option 2: take a snapshot every few seconds; it's always OK to use the latest one.
  TxCache uses a somewhat more complex scheme than this, which appears to approximate option 2.
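A minimal Python sketch of the validity-interval idea. The interval representation, the invalidate() hook, and the fixed snapshot timestamps are assumptions made for illustration; TxCache's real protocol is more involved:

    class VersionedCache:
        def __init__(self):
            self.entries = {}   # key -> list of (start, end, value)

        def put(self, key, value, start, end=None):
            # end=None means "valid from `start` up to the present".
            self.entries.setdefault(key, []).append((start, end, value))

        def invalidate(self, key, version):
            # A write at `version` closes any still-open interval for key.
            self.entries[key] = [
                (s, version if e is None else e, v)
                for (s, e, v) in self.entries.get(key, [])]

        def get(self, key, ts):
            # Return the value that was valid at snapshot timestamp ts.
            for (s, e, v) in self.entries.get(key, []):
                if s <= ts and (e is None or ts < e):
                    return v
            return None   # miss: fetch this version from the database

    # A read-only transaction picks one snapshot timestamp (option 2:
    # say, a version published every few seconds) and uses it for every
    # lookup, so all of its reads reflect the same point in time.
    cache = VersionedCache()
    cache.put("x", "old", start=1)
    cache.invalidate("x", version=5)    # a write at version 5
    cache.put("x", "new", start=5)
    assert cache.get("x", ts=3) == "old"
    assert cache.get("x", ts=7) == "new"

Because entries carry intervals rather than a single "current" flag, the cache can hold multiple versions of the same item and still answer consistently for any recent snapshot.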