                          _____________

                            SECTION3

                   Raymond Cheng & Irene Zhang
                          _____________


Table of Contents
_________________

1 Lab 2b architecture
2 Client
.. 2.1 State
.. 2.2 Functions
.. 2.3 Tasks
3 Servers
.. 3.1 State
.. 3.2 Functions
.. 3.3 Primary
.. 3.4 Backup
4 At-most-once RPC design
5 Ordering of events


1 Lab 2b architecture
=====================

  Overview:
  - In this part of the lab, you are building key-value servers
  - Correctness must be maintained as long as (1) the view service is
    never down, and (2) at least one server is up at all times
  - Liveness/progress is guaranteed as long as servers are up and resent
    messages are eventually delivered (the common asynchronous network
    model)
  - Works with network partitions and packet drops
  - Only 1 primary active at a time

  Assumptions:
  - The view service never crashes
  - Each client has only 1 outstanding put/get


2 Client
========

2.1 State
~~~~~~~~~

  - current view
  - op id
  - no lock! Clients are closed-loop, so each one executes its next op
    only after the last one returns


2.2 Functions
~~~~~~~~~~~~~

  - MakeClerk (initialization)
  - Get
  - Put


2.3 Tasks
~~~~~~~~~

  1. Get the view from the view service
     - When there is no view
     - When an operation has failed
  2. Send gets and puts to the primary
     - What happens if the primary doesn't respond or responds that it is
       not the primary?
       - Sleep for a while and check the view service again
     - How does a server deduplicate 2 identical requests (retries)?
       - Label operations with a unique ID


3 Servers
=========

  - The state machine is a cross product of the state of the nodes (view
    service, primary, backup, idle servers, client)
  - Any client op causes state changes at each node and messages from one
    node to another (with possible retries)


3.1 State
~~~~~~~~~

  - current view
  - lock
  - data: map[string]string
  - log of operations


3.2 Functions
~~~~~~~~~~~~~

  - primary put handler, backup put handler
  - primary get handler, backup get handler
  - primary update view, backup update view
  - tick


3.3 Primary
~~~~~~~~~~~

  1. Find the view from the view service. Do this in the tick() function.
     - What if the primary is out of date between ticks?
       - Stop processing ops! But it doesn't need to update its view
         immediately.
     - How does the primary find out that it is out of date?
       - The backup may refuse an op.
  2. Respond to the view service with the view.
     - What happens if the primary does not respond?
       - The view service might declare the primary dead if something
         holds a lock for too long and delays the ping
     - When does the primary switch to the new view?
  3. On a new view with a new backup, send the complete key/value
     database to the backup
     - What else might have to be transferred?
       - The operation log, which we'll cover later
     - What happens if it can't contact the backup?
       - It can't make progress
  4. Handle Get and Put. Store keys and values in a map[string]string
     - What about concurrent puts and gets?
  5. Forward ops to the backup.
     - What if the primary can't reach the backup?
       - The primary can't proceed, because the backup might have been
         promoted!
     - What about gets?
       - They need to be forwarded too; otherwise an old primary might
         serve out-of-date gets. Forwarding them to the backup ensures
         this "split brain" case doesn't happen. It also makes it easy to
         ensure that the primary and backup give the same response to a
         read even if they receive it at different times (although this
         is not crucial for this lab)


3.4 Backup
~~~~~~~~~~

  1. Find the view from the view service. Do this in the tick() function.
     - What if the backup is out of date between ticks?
       - It can ignore this or check the view (it doesn't matter)
     - How does the backup find out that it is out of date?
       - It might get a message from a client intended for the primary
         and find out that it has been promoted
  2. Respond to the view service with the view.
     - When does the backup switch its view?
       - Only after it finishes getting the latest state from the primary
  3. Handle Get and Put. Store keys and values in a map[string]string
     - What about concurrent puts and gets?
       - The primary should only send one at a time
  4. Tell the primary if it has been demoted but is still sending ops
     - In what case does this happen?
       - When the primary can contact the backup but not the view service


4 At-most-once RPC design
=========================

  1. Use a unique client ID and an op id for each put or get
  2. Have servers store the response to each op
  3. Have the primary check whether it has already seen and processed
     each op. If so, return the stored response
     - What happens when the primary fails?
       - The backup should have all of the op state, so it can detect the
         same set of duplicates
     - What state does the backup need to store?
       - The primary needs to send the operation log in the transfer. The
         backup then needs to update its operation log as it receives
         forwarded ops from the primary


5 Ordering of events
====================

  When to contact the view service
  1. The server should only Ping periodically (in tick()), never in the
     Put/Get RPC handlers
  2. The client should only ask for a new view if it:
     - doesn't know who the primary is
     - is talking to the wrong primary
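
  The at-most-once design from section 4 can be illustrated with a
  minimal Go sketch. It is not the lab's actual code: the names PutArgs,
  PutReply, Server, ClientID and OpID are hypothetical, and view handling
  and forwarding to the backup are left out. The server keeps the stored
  reply for each (client ID, op id) pair and replays it on a retry
  instead of applying the op a second time.

```go
package main

import (
	"fmt"
	"sync"
)

// PutArgs and PutReply are hypothetical RPC types for this sketch;
// the field names are assumptions, not the lab's definitions.
type PutArgs struct {
	ClientID int64 // unique per client
	OpID     int64 // unique per operation within one client
	Key      string
	Value    string
}

type PutReply struct {
	Err string
}

// opKey identifies one operation across retries.
type opKey struct {
	clientID int64
	opID     int64
}

// Server holds the key/value data plus the stored reply for every
// operation it has already processed (the "log of operations").
type Server struct {
	mu   sync.Mutex
	data map[string]string
	seen map[opKey]PutReply
}

func NewServer() *Server {
	return &Server{
		data: make(map[string]string),
		seen: make(map[opKey]PutReply),
	}
}

// Put applies the operation at most once. A retry with the same
// (ClientID, OpID) gets the stored reply and does not touch the data.
func (s *Server) Put(args PutArgs) PutReply {
	s.mu.Lock()
	defer s.mu.Unlock()

	k := opKey{args.ClientID, args.OpID}
	if reply, ok := s.seen[k]; ok {
		return reply // duplicate: replay the stored response
	}
	s.data[args.Key] = args.Value
	reply := PutReply{Err: "OK"}
	s.seen[k] = reply
	return reply
}

// Get returns the current value for key ("" if absent).
func (s *Server) Get(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

func main() {
	s := NewServer()
	a := PutArgs{ClientID: 1, OpID: 1, Key: "k", Value: "from-A"}

	s.Put(a)                                                        // client A's original request
	s.Put(PutArgs{ClientID: 2, OpID: 1, Key: "k", Value: "from-B"}) // client B overwrites
	s.Put(a)                                                        // A retries after a lost reply

	fmt.Println(s.Get("k")) // prints from-B: the retry was not re-applied
}
```

  Without the seen table, A's retry would overwrite B's later write. In
  the full design the backup needs the same table, which is why the
  operation log is part of the state transfer and is updated on every
  forwarded op.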