Lecture 2: RPC and Design Docs

The reading for today

Hey, first of all, there are optional readings sometimes! Also notes. None of it is required, but I hope it might help you. The only required readings are the research papers you write blog post responses to during the last two weeks of the course.

Anyway, to today's reading:

  • Lots of good discussion, if a bit dated.
  • Check out our sweet animations
  • In general, readings for first 8 weeks supplement lectures and are good reference material.
    • Optional
    • Ok to read fast and come back to it later as needed. (Good skill.)

The eight fallacies

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. The network topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

More course mechanics stuff

Teaching philosophy

  • To understand X, build an X
  • Systems <-> Theory

Project info

Overview

  • The course has a large quarter-long programming project
  • You will build a sharded, linearizable, fault-tolerant key-value store with dynamic load balancing and atomic multi-key transactions.
  • You are not expected to know what any of those words mean yet! You will by the end of the quarter.
  • Some (informal) definitions to get us started:
    • key-value store: persistent hash map (persistent meaning clients can store some data, go away and come back later and still find the data they stored earlier)
    • linearizable: from the client's perspective, the service "looks like" a single (and single-threaded) server
    • fault-tolerant: system continues to work despite some of its components failing
    • sharded: split key space across multiple nodes to increase performance
    • dynamic load balancing: keys can move between nodes, e.g. to spread work more evenly
    • atomic multi-key transactions: support linearizable operations over multiple keys, even if those keys are stored on different shards
  • It's still ok if this doesn't make much sense yet :)

Project Mechanics

  • Lab 0: introduction to our framework and tools
    • Read through lab 0 and set up tools before section this week
    • Nothing to turn in
  • Lab 1: exactly once RPC, key-value store
    • Due Friday, January 13
    • Completed individually
  • Lab 2: primary/backup
    • Intro to replication for fault tolerance
    • Has some important limitations
  • Lab 3: Paxos
    • Realistic fault tolerance algorithm
    • Addresses many limitations of lab 2
  • Lab 4: Sharding, Load balancing, multi-key cross-shard transactions

Project Tools

  • Automated testing and grading
    • All tests and grading data are available to you while you complete the project
    • Two kinds of tests
      • "Run tests": specific scenarios constructed by staff to exercise important parts of your code
      • "Search tests": exhaustively search the space of executions using a model checker, makes sure your code works under all possible message delivery orderings and node failures
  • Visual debugger
    • Control and replay over message delivery and node failures
  • Implement in Java
    • Several restrictions to support model checking
    • Model checker needs to:
      • Detect all possible "next steps" from a given state
      • Select and execute each such step, in an order of its choosing
      • Collapse equivalent states for efficiency
    • Several consequences:
      • Our systems must be deterministic (given same inputs, perform exactly the same actions)
      • Must not "hide" any state from the model checker (static mutable fields, "real" I/O, random number generators, current date/time, ...)
    • A great example of engineering a system to be tested
      • In this class, we take this to the extreme: testing is the primary concern, because correctness is so challenging in distributed systems and is essential to understanding. We won't run our systems except in the test framework, so it becomes a kind of simulator.
      • In the real world, we would also need to take other concerns into account (performance on real networks and machines, etc.)
      • In fact, our test framework can be adapted to run on real networks (see Slabs paper)
      • In the real world, we recommend you don't place any less emphasis on correctness than we do in this class, but instead simply add further design constraints on performance, etc.
      • Systems engineered in this way tend to look quite different from the "naive" approach of "just write some code that calls the network API in various places".

Remote Procedure Call (RPC)

  • Recall last time we set the stage for RPC.
  • "Like a procedure call, but remote"
  • Ideally want it to be as easy to use for both the client and the server as a local procedure call.

Intro to RPC

RPC from a programmer's perspective

  • Key difference: function to call is on a different machine
  • Goal: make calling remote functions as easy as local functions
    • Want the application programmer to be able to just say f(10) or whatever, and have that automatically invoke f on a remote machine, if that's where f lives.
  • Mechanism for achieving this: an RPC framework that will send request messages to the server that provides f and response messages back to the caller
    • Most of the same details from the local case apply
    • Need to handle arguments, which function to call (the label), where to return to, and returned values.
    • Instead of jumping, we're going to send network messages.

Here's how it's done:

  • In the high-level language, we write the name of the function and some expressions to pass as arguments
  • The compiler and RPC framework:
    • orchestrate evaluating the arguments on the client
    • the client makes a normal, local procedure call to something called a "stub"
    • the stub is a normal local function implemented/autogenerated by the RPC framework that:
      • serializes the arguments (converts them into an array of bytes that can be shipped over the network)
      • sends the function name being called and the arguments to the server (this is called a request message)
      • waits for the server to respond with the returned value (this is called a response message)
      • deserializes the returned value and then does a normal, local return from the stub to return that value to the caller
  • The server sits there waiting for requests. When it receives one, it:
    • parses the message to figure out what function is being requested
    • deserializes the arguments buffer according to the type signature of the function
    • invokes the requested function on the deserialized arguments
    • serializes the returned value and sends a response to the client

Here is a diagram of what happens when a client invokes an RPC:

Now is a good time to get familiar with this style of diagram. We will be seeing them a lot this quarter. Key points:

  • Time flows down
  • Columns are nodes
  • Diagonal arrows are messages (they arrive after they were sent, hence pointing slightly down)

A silly example

Suppose we have this function on the server:

int total = 0;

int incrementBy(int n) {
  total += n;
  return total;
}
And suppose we set up our server with our RPC framework to accept calls to incrementBy (and probably other functions too).

Now we have a client that wants to call incrementBy, as in:

void main() {
  int x = 10;
  int ans = incrementBy(x);  // *
  print(ans);
}
And suppose we set up our RPC framework on the client and told it where to find the server, and we've told the framework about incrementBy and its type signature.

Here is what will happen:

  • The client starts running until line *.
  • When the client invokes incrementBy, it's really invoking a stub
  • The client stub for incrementBy will:
    • Construct a message containing something like "Please call incrementBy on argument 10"
      • This involves serializing all the arguments (in this case, just 10)
    • Send this message to the server and wait for a response
  • When the request message arrives at the server, the RPC framework will:
    • Parse the message, figure out what function is being requested, and deserialize the data to the right argument types
    • Invoke the "real" implementation of incrementBy on the provided arguments
    • Construct a message containing the result
      • This involves serializing the result (in this case, just 10, assuming this is the first time incrementBy has run, so total was still 0)
    • Send this message back to the client
    • Such messages are called "responses"
  • When the response arrives back on the client, the stub continues executing:
    • Parse the response message
    • Return the result from the stub to the caller.
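The whole round trip above can be sketched in one Java program. Everything here is illustrative of what an RPC framework autogenerates; the "network" is a direct method call so the flow fits in a single process:

```java
import java.nio.ByteBuffer;

// Illustrative sketch of the incrementBy round trip. A real RPC framework
// would ship these byte arrays over actual sockets instead of a method call.
public class IncrementByRpc {
    // ---- server side ----
    static int total = 0;

    static int incrementBy(int n) {
        total += n;
        return total;
    }

    static final int INCREMENT_BY = 1; // function identifier used in request messages

    // Parse the request, deserialize the argument, invoke, serialize the result.
    static byte[] handleRequest(byte[] request) {
        ByteBuffer in = ByteBuffer.wrap(request);
        int function = in.getInt();
        int arg = in.getInt();
        if (function != INCREMENT_BY) {
            throw new IllegalArgumentException("unknown function " + function);
        }
        int result = incrementBy(arg);
        return ByteBuffer.allocate(4).putInt(result).array();
    }

    // ---- client side: the stub ----
    static int incrementByStub(int n) {
        // Serialize "please call incrementBy on argument n" ...
        byte[] request = ByteBuffer.allocate(8).putInt(INCREMENT_BY).putInt(n).array();
        // ... "send" it and wait for the response ...
        byte[] response = handleRequest(request);
        // ... then deserialize the returned value and return it normally.
        return ByteBuffer.wrap(response).getInt();
    }

    public static void main(String[] args) {
        System.out.println(incrementByStub(10)); // first call: total becomes 10
        System.out.println(incrementByStub(10)); // second call: total becomes 20
    }
}
```

Note how the caller only ever sees `incrementByStub(10)`, which looks exactly like a local call.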

The key points to notice about RPCs are:

  • When we implement incrementBy on the server, we just wrote a normal function
  • When we wanted to call incrementBy on the client, we just wrote a normal function call
  • The RPC library has to do a bunch of work to make this nice interface possible

Finally, notice that this RPC is stateful: it causes the server to update some state

  • In this very simple example, just a global variable on the server
  • But state is very common, and often the server would actually store the state in a database or some other backend (and the server would communicate with that backend via further RPCs)
  • If you have an RPC that doesn't manipulate state on the server, then several optimizations (e.g., caching) become possible that are not possible for general stateful RPCs.

"Naive" RPC as a distributed protocol

  • So far we've focused on the programmer's perspective.
  • Now let's look at our RPC mechanism as a distributed protocol.
  • We will call the RPC mechanism we've described above "naive RPC", because as it turns out, it is lacking in several respects, and we will need to improve it further for it to be useful.

Naive RPC as a distributed protocol:

  • Preface:
    • Goals:
      • Allow clients to call functions that are executed on the server.
    • Desired fault model: Asynchronous unreliable message delivery with fail-stop node crashes
    • Challenges:
      • This is challenging because the network is unreliable and the server might crash, so it is difficult for the clients to tell whether their requests have been executed by the server or not.
      • (Handling server crashes will take us a few weeks to work up to properly, but it is an important eventual goal! For lab 1, we will mostly focus on tolerating network failures.)
  • Protocol:
    • Kinds of node:
      • There are two kinds of nodes: clients and servers.
      • There can be any number of clients and any number of servers.
    • State at each kind of node:
      • Client: None.
      • Server: None beyond what is required to execute the functions provided over RPC.
    • Messages:
      • Request message
        • Source: Clients
        • Destination: Servers
        • Contents:
          • What function the client is requesting to call.
          • The (serialized) arguments to pass to that function.
        • When is it sent?
          • Whenever a client wants to invoke an RPC.
        • What happens at the destination when it is received?
          • The server calls the (local) function on the (deserialized) provided arguments and sends a response message back to the client with the (serialized) return value.
      • Response message
        • Source: Servers
        • Destination: Clients
        • Contents:
          • The (serialized) return value from the function.
        • When is it sent?
          • When a server finishes executing a request.
        • What happens at the destination when it is received?
          • The client deserializes the return value and returns it to the application layer.

This is our first example of a distributed protocol. A few things to notice:

  • The description of the protocol is detailed, but higher level than code.
    • It should contain enough information that another 452 student could implement the protocol without having to make any "distributed" design decisions.
      • (They might have to make some "local" design decisions about what data structure to use to store state on each node. That's fine and in fact should not appear in the protocol.)
  • The description follows a structured format. We will talk more about this later. Most importantly, it lists the kinds of nodes, what state they store, what messages they exchange, and what they do when each message is delivered.
  • (Later protocols will use timers as well as messages.)

Our next question is: Does the protocol meet its goal?

  • Well, yes and no.
  • If there are no failures and the network works reliably, then this protocol works fine and achieves its goal.
  • What about if there are failures or the network is not reliable?
    • That's the fault model we actually care about!
  • It turns out, the answer is no. This protocol tolerates almost no failures.
  • To see why, we need to do a fault tolerance analysis.

Failures in Naive RPC

A fault tolerance analysis consists of answers to the following questions:

  • For each message, how does the protocol handle delays, reorderings, drops, and duplicates?
  • For each node, how does the protocol handle the case where that node crashes fail-stop?

Some important failure cases for RPC

  • What kinds of failures can happen in our failure model?
    • Messages delayed, reordered, dropped, and duplicated.
    • Nodes crash.
  • Ok, so what are our messages, and what are our nodes?
    • Messages: Request and Response
    • Nodes: Client and Server

So we have a bunch of questions to answer—one for each combination of message and network failure, and one for each node crashing.

Here are some good ones to start with for our discussion:

  • What if the request message gets dropped?
  • What if the server crashes fail-stop?
    • Before the request is sent?
    • After the request is sent but before it arrives at the server?
    • After the request is received by the server but before a response is sent?
    • After the response is sent?
  • What if the response message gets dropped?
  • What if a very old request message gets delivered later?
  • What if a very old response message gets delivered later?

The naive protocol fails in many of these cases:

  • What if the request message gets dropped?
    • The client waits forever and never gets its RPC executed. Bad.
  • What if the server crashes fail-stop?
    • Before the request is sent?
      • The client waits forever.
    • After the request is sent but before it arrives at the server?
      • The client waits forever.
    • After the request is received by the server but before a response is sent?
      • The client waits forever.
    • After the response is sent?
      • If the response is delivered to the client, life is good. No thanks to the protocol though :)
  • What if the response message gets dropped?
    • The client waits forever.
  • What if a very old request message gets delivered later?
    • The server (mistakenly?) executes it again. Bad if RPCs manipulate state.
  • What if a very old response message gets delivered later?
    • The client might (mistakenly?) assume the response is for the current request it sent rather than an old one, and return the wrong result to the application layer.

Detecting failures

In order to recover from failures, systems often take some corrective action (retransmit the message, talk to somebody else instead, start up a new machine, etc.).

A fundamental limitation in distributed computing is that it is impossible to accurately detect many kinds of failure:

  • Some are easy: if messages arrive out of order, you can tell by just numbering them sequentially.
  • Some are hard: if a node is really slow, another node might think it has crashed.
    • The only way to check if a node is up is to send it a message and hope for a response
    • But if the node is just super slow, you won't get a response for a while, and during that time you have no way to know whether it's because the node crashed or is slow (or maybe the network is slow or dropped your message or the response)
  • Another hard one: "network partition"
    • Nodes 1 and 2 can talk to each other just fine, and nodes 3 and 4 can do the same among themselves, but neither 1 nor 2 can talk to 3 or 4 or vice versa.
    • Very confusing if your mental model is "a node is either up or down"
      • In a sense, nodes 1 and 2 are "down" from the perspective of 3 and 4.
    • Important to realize that this is not a "new" kind of failure:
      • If you assume the network can drop any message, then the network can "simulate" a network partition.

The takeaway here is that nodes have only partial information about the global state of the system. We can't know if the server received our message unless we hear back from the server.

Towards tolerating failures in RPC

The first step is figuring out what we want to happen.

  • When a client sends a request message, what guarantees does it have about the RPC getting executed?
    • We call these guarantees "RPC semantics" because they define what an RPC means.
      • "semantics" is a fancy word for "meaning".

Four options for RPC semantics:

  • Naive (above, broken, no guarantees)
  • At least once (NFS, DNS, lab 1b, only possible if you are willing to block forever in the case that the network goes down permanently or the server goes down permanently)
  • At most once (common)
  • Exactly once (lab 1c, only possible if you are willing to block forever in the case that the network goes down permanently or the server goes down permanently)

Identifying requests

  • The most basic problem with Naive RPC is that the client cannot distinguish two different response messages from the server.
  • Consider this scenario from the client's perspective:
    • client sends request f(10)
    • client receives a response with result 20
    • client sends request f(15)
    • client receives a response with result 20
  • If we don't know anything about f, how would we know whether the second response is for f(15) or a late-arriving duplicate of the first response?
    • We can't!
  • To solve this problem, we need to uniquely identify requests, and then tie the response message to the request using the identifier.
  • Typical solution is to use per-client sequence numbers:
    • Each client numbers its requests with integers starting at 0.
    • Server includes the request number in its responses.
  • This solves the above scenario like this:
    • client sends request f(10) with sequence number 0
    • client receives a response with result 20 for sequence number 0
    • client sends request f(15) with sequence number 1
    • client receives a response with result 20 for sequence number 0
  • Now we can tell that the response was actually a duplicate.
  • If instead the client received a response saying the result for sequence number 1 is 20, then we would know that it wasn't a duplicate and that the result just happened to be 20 again.
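The client-side bookkeeping can be sketched like this (all names are illustrative, not from any real framework):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of client-side request identification: number requests, remember
// which are outstanding, and ignore responses that don't match.
public class SeqClient {
    int currentSeq = 0;
    final Set<Integer> outstanding = new HashSet<>();

    // Start a new RPC: stamp it with the current sequence number.
    int sendRequest() {
        int n = currentSeq++;
        outstanding.add(n);
        return n;
    }

    // Deliver a response. Returns true if it matches an outstanding request;
    // false means it is a duplicate (or stale) and should be ignored.
    boolean deliverResponse(int seq) {
        return outstanding.remove(seq);
    }

    public static void main(String[] args) {
        SeqClient c = new SeqClient();
        c.sendRequest();                          // f(10), seq 0
        System.out.println(c.deliverResponse(0)); // true: genuine response
        c.sendRequest();                          // f(15), seq 1
        System.out.println(c.deliverResponse(0)); // false: duplicate of seq 0
        System.out.println(c.deliverResponse(1)); // true: response to f(15)
    }
}
```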

Naive RPC with request identifiers

Here is an updated protocol now that we have added request identifiers to Naive RPC.

  • Protocol:
    • Kinds of node:
      • There are two kinds of nodes: clients and servers.
      • There can be any number of clients and any number of servers.
    • State at each kind of node:
      • Client:
        • current sequence number, an integer, initially 0.
        • set of outstanding requests, a set of sequence numbers, initially empty.
      • Server: None beyond what is required to execute the functions provided over RPC.
    • Messages:
      • Request message
        • Source: Clients
        • Destination: Servers
        • Contents:
          • What function the client is requesting to call.
          • The (serialized) arguments to pass to that function.
          • A sequence number (integer)
        • When is it sent?
          • Whenever a client wants to invoke an RPC.
          • To start a new RPC, the client takes its current sequence number, \(n\), and sends a Request message to the server with \(n\) (and the function name and arguments). It then adds \(n\) to its set of outstanding requests and finally increments its current sequence number.
        • What happens at the destination when it is received?
          • The server calls the (local) function on the (deserialized) provided arguments and sends a response message back to the client with the (serialized) return value and the same sequence number as the request.
      • Response message
        • Source: Servers
        • Destination: Clients
        • Contents:
          • The (serialized) return value from the function.
          • A sequence number (integer)
        • When is it sent?
          • When a server finishes executing a request.
        • What happens at the destination when it is received?
          • The client checks if the sequence number is in its set of outstanding requests. If not, the message is ignored.
          • Otherwise, the client removes the message's sequence number from its set of outstanding requests, deserializes the return value and returns it to the application layer.

This protocol solves the problem of confusing response messages from different requests. However, it still suffers from many other problems, such as the client waiting forever if certain messages get dropped, and the server executing the same request multiple times.

RPC Semantics

Our options in more detail

Naive

  • The client might never receive a response (e.g., request gets dropped)
  • If the client receives a response message to a request:
    • we know nothing because we cannot distinguish response messages from different requests, so it's possible this response was to a previous request.
      • (technically, if this is the first request the client has sent, then it knows the request has been executed at least once, because there are no other previous response messages that this one could be a duplicate of.)
  • If the client does not receive a response message:
    • we know nothing: the request might have been executed (or it might not, or it might have been executed more than once)

Naive with request identifiers

  • The client might never receive a response (e.g., request gets dropped)
  • If the client receives a response message to a request:
    • then the request was executed at least one time (perhaps more than once)
  • If the client does not receive a response message:
    • we know nothing: the request might have been executed (or it might not, or it might have been executed more than once)

At least once

  • The client will eventually receive a response message (or die trying).
  • If the client receives a response message to a request:
    • then the request was executed at least one time (perhaps more than once)
  • If the client does not receive a response message:
    • it blocks forever retransmitting the request and waiting until it receives a response

At most once

  • The client might never receive a response (e.g., request gets dropped)
  • If the client receives a response message to a request:
    • then the request was executed exactly once
  • If the client does not receive a response message:
    • then the request might have been executed (or it might not, but definitely not more than once)

Exactly once

  • The client will eventually receive a response message (or die trying).
  • If the client receives a response message to a request:
    • then the request was executed exactly once
  • If the client does not receive a response message:
    • it blocks forever retransmitting the request and waiting until it receives a response

Implementing at least once

  • send request and wait for response
  • if you wait for a while and hear nothing, re-transmit request
  • if you re-transmit a few times and still hear nothing, keep trying
    • In practice, give up at some point and return error to the calling application
      • (Note that if you give up, then this is not really "at least once", since we're not sure whether the request was executed or not.)
    • Typically an RPC framework would throw some kind of exception from the stub, or maybe return a special error value to indicate that the call failed
    • This is pretty different from a local function call! It's an error that says "I couldn't call the function (I think, but I might have)"
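A minimal sketch of this client-side loop, where `Transport` and `sendAndWait` are names invented for this sketch (standing in for "transmit and wait up to a timeout"):

```java
// Sketch of an at-least-once client: retransmit a few times, then give up.
// As noted above, giving up means we no longer know whether the request ran.
public class AtLeastOnceClient {
    interface Transport {
        // Returns the response bytes, or null if nothing arrived in time.
        byte[] sendAndWait(byte[] request, long timeoutMillis);
    }

    static final int MAX_ATTEMPTS = 4;

    static byte[] call(Transport transport, byte[] request) {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            byte[] response = transport.sendAndWait(request, 100);
            if (response != null) {
                return response; // heard back: the request ran at least once
            }
            // else: timed out; loop around and retransmit
        }
        // This error means "I couldn't call the function (I think, but I might have)".
        throw new RuntimeException("RPC gave up; request may or may not have executed");
    }
}
```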

Advantages:

  • The server keeps no state (beyond what is required to execute the functions locally) -- just executes the requests it receives

Disadvantages:

  • Can be difficult to build applications that tolerate operations happening more than once

When to use it:

  • Operations are pure (no side effects) or idempotent (doing it more than once is the same as doing it once)
  • For example:
    • reading some data
    • taking the max with some new data
      • say n is a global int on the server, then the operation n = max(n, x) is idempotent (where x is an argument to the request)
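We can check idempotence concretely: applying n = max(n, x) twice leaves the server in the same state as applying it once, while total += x does not.

```java
// Tiny demonstration that max is idempotent and increment is not.
public class IdempotenceDemo {
    int n = 0;      // state updated by the idempotent operation
    int total = 0;  // state updated by the non-idempotent operation

    int applyMax(int x) {    // idempotent: duplicate deliveries are harmless
        n = Math.max(n, x);
        return n;
    }

    int incrementBy(int x) { // not idempotent: duplicates corrupt state
        total += x;
        return total;
    }

    public static void main(String[] args) {
        IdempotenceDemo s = new IdempotenceDemo();
        s.applyMax(7);
        s.applyMax(7);                // duplicate delivery
        System.out.println(s.n);      // still 7
        s.incrementBy(7);
        s.incrementBy(7);             // duplicate delivery
        System.out.println(s.total);  // 14, not 7: the duplicate was observed
    }
}
```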

Implementing at most once and exactly once

Key ideas: filter out duplicate requests on server; retransmit forever to get exactly once.

  • Client sends request and waits for response
  • If client doesn't hear a response after a while, re-transmit the request with the same sequence number
  • For at most once semantics: if still nothing after a few retries, give up and return an error
  • For exactly once semantics: keep retrying forever instead of giving up ("exactly once (or die trying)")
  • On the server, keep track of which (client, sequence number) pairs have been executed, and don't re-execute duplicate requests. Instead, return previously-computed response.
    • Important to realize that two different clients can use the same sequence number (they all start at 0!), so you need to store client names along with the sequence numbers
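A sketch of that server-side bookkeeping, reusing the incrementBy state from earlier (note the map key includes the client name, since every client starts numbering at 0):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of duplicate filtering for at-most-once semantics: execute each
// (client, sequence number) pair once and replay the remembered response
// for any duplicate.
public class DedupServer {
    int total = 0; // application state, as in the incrementBy example

    // Remembered responses, keyed by client name and sequence number.
    final Map<String, Integer> executed = new HashMap<>();

    int handleRequest(String client, int seq, int arg) {
        String key = client + "/" + seq;
        Integer cached = executed.get(key);
        if (cached != null) {
            return cached;        // duplicate: do NOT re-execute
        }
        total += arg;             // execute incrementBy(arg)
        executed.put(key, total); // remember the response for replays
        return total;
    }
}
```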

Advantages:

  • Usually easier to program on top of
  • Works well for stateful operations

Disadvantages:

  • Server has to keep state proportional to the number of requests (but see below for optimizations)

Implementation challenges:

  • If the client crashes and reboots, it may lose track of the last sequence number it used, and restart numbering at 0, which would cause bugs.
    • In the labs, we assume a fail-stop model where nodes crash but don't reboot.
    • In practice, one way to handle this is to change the client's name every time it restarts.
  • How can we reduce the server state?
    • Option 1: client tells server "I have heard your response to all sequence numbers \(\le x\)"
      • server can discard responses for those requests (and remember that it discarded them somehow!)
    • Option 2: only allow one outstanding RPC per client at a time
      • when request numbered \(x + 1\) arrives, can discard all previous state about that client
    • The labs use option 2.
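Option 2 can be sketched like this: since each client has at most one RPC outstanding, the server keeps a single slot per client, and a request with a newer sequence number overwrites (discards) everything remembered about that client:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of option 2: one outstanding RPC per client means one remembered
// (seq, response) slot per client, instead of state for every request ever.
public class OneSlotServer {
    int total = 0;

    static class Slot {
        final int seq;
        final int response;
        Slot(int seq, int response) { this.seq = seq; this.response = response; }
    }

    final Map<String, Slot> latest = new HashMap<>();

    int handleRequest(String client, int seq, int arg) {
        Slot slot = latest.get(client);
        if (slot != null && seq <= slot.seq) {
            // Duplicate (or stale) request: replay the remembered response.
            // A client that has moved past this seq ignores it anyway.
            return slot.response;
        }
        // seq > slot.seq implies the client heard the previous response,
        // so the old slot can be discarded; the put below overwrites it.
        total += arg;
        Slot fresh = new Slot(seq, total);
        latest.put(client, fresh);
        return fresh.response;
    }
}
```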

Protocol for at least once

Here is a written out protocol for At Least Once RPC.

The changes relative to Naive RPC with Request Identifiers are marked with "NEW".

  • Protocol:
    • Kinds of node:
      • There are two kinds of nodes: clients and servers.
      • There can be any number of clients and any number of servers.
    • State at each kind of node:
      • Client:
        • current sequence number, an integer, initially 0.
        • set of outstanding requests, a map from sequence numbers to request messages, initially empty.
          • NEW changed from set to map, so that we can store the requests. We'll need them to retransmit later.
      • Server: None beyond what is required to execute the functions provided over RPC.
    • Messages:
      • Request message
        • Source: Clients
        • Destination: Servers
        • Contents:
          • What function the client is requesting to call.
          • The (serialized) arguments to pass to that function.
          • A sequence number (integer)
        • When is it sent?
          • Whenever a client wants to invoke an RPC.
          • To start a new RPC, the client takes its current sequence number, \(n\), and sends a Request message to the server with \(n\) (and the function name and arguments). It then adds \(n \mapsto Request(n, f, x)\) to its map of outstanding requests and increments its current sequence number. Finally, the client sets a RequestRetransmit timer with sequence number \(n\).
          • NEW: changed the set to a map, also added setting the timer
        • What happens at the destination when it is received?
          • The server calls the (local) function on the (deserialized) provided arguments and sends a response message back to the client with the (serialized) return value and the same sequence number as the request.
      • Response message
        • Source: Servers
        • Destination: Clients
        • Contents:
          • The (serialized) return value from the function.
          • A sequence number (integer)
        • When is it sent?
          • When a server finishes executing a request.
        • What happens at the destination when it is received?
          • The client checks if the sequence number is in its map of outstanding requests. If not, the message is ignored.
          • Otherwise, the client removes the message's sequence number from its map of outstanding requests, deserializes the return value and returns it to the application layer.
    • Timers:
      • RequestRetransmit
        • Set by clients
        • Contents: a sequence number (integer)
        • Set whenever a client sends a new RPC
        • What happens when it fires?
          • The client checks if the timer's sequence number is still in its map of outstanding requests. If not, the timer is ignored.
          • Otherwise, the client retransmits the request message stored in the outstanding request map for this sequence number, and then resets the timer again (with same sequence number).
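The client half of this protocol (the outstanding-request map plus the RequestRetransmit timer handler) can be sketched as follows; the `wire` list is invented for the sketch to record what would be sent over the network:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the At Least Once client: requests are stored in a map so the
// RequestRetransmit timer handler can resend them until they are answered.
public class AtLeastOnceClientState {
    static class Request {
        final int seq;
        final String body; // stands in for function name + serialized args
        Request(int seq, String body) { this.seq = seq; this.body = body; }
    }

    int currentSeq = 0;
    final Map<Integer, Request> outstanding = new HashMap<>();
    final List<String> wire = new ArrayList<>(); // records messages "sent"

    // Start a new RPC: send the request, record it, set the timer.
    int callRpc(String body) {
        int n = currentSeq++;
        Request r = new Request(n, body);
        outstanding.put(n, r);
        wire.add(r.body);
        // ... also set RequestRetransmit(n); firing is modeled by calling
        // onRetransmitTimer(n) directly in this sketch.
        return n;
    }

    // RequestRetransmit timer fired.
    void onRetransmitTimer(int seq) {
        Request r = outstanding.get(seq);
        if (r == null) return; // already answered: ignore, don't reset timer
        wire.add(r.body);      // retransmit, then reset the timer
    }

    // Response message delivered.
    void onResponse(int seq) {
        outstanding.remove(seq); // future timer firings become no-ops
    }
}
```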

Does TCP solve all our problems?

TCP: reliable bi-directional byte stream between two nodes

  • Retransmit lost packets
  • Detect and filter out duplicate packets
  • Useful! Most RPCs sent over TCP in practice

But TCP itself can time out

  • For example, if the server crashes or the network goes down for long enough.
  • Usually TCP will retransmit a few times, but after say 30 seconds, it gives up and returns an error.
  • Application needs to be able to recover, which usually involves establishing a new TCP connection.
  • Question: on reconnection, were my old requests executed or not?
  • Answer: TCP sure won't tell you that, so you need to implement an application-layer mechanism (including application-layer sequence numbers, probably) to figure it out.

What if the server crashes?

  • If the list of all previous responses is stored in server memory, it will be lost on reboot
  • After reboot, server would incorrectly re-execute old requests if it received a duplicate.
  • One option would be to keep the state in non-volatile storage (hdd, ssd)
  • Another option is to give server new address on reboot (fail stop model)
    • this is what the labs do

More about RPC in practice

Serialization

  • Refers to converting the in-memory representation of some (typed) data into a linear sequence of bytes
  • Usually so we can write those bytes to the network or to disk.
  • Above we did an example that took an int and returned an int.
  • Serialization just encodes the int as its four bytes.
  • Other primitive types (float, long, etc.) work similarly.
  • What about pointers?
  • Should we just send the pointer over the network?
    • No, doesn't make sense because the server won't be able to dereference it
    • Whatever that pointer points to on the client is probably not even in the server's memory, much less at the same address.
  • Could convert to a "global reference", or use some kind of global addressing system
    • Definitely possible! Complicated.
  • Instead, most of the time what we do is pass a copy of the data pointed to by the pointer.
  • For example, if we have an RPC to write to a file on the server like void write(char* data) or something, the client will send the server a copy of the data buffer.
  • Similarly, if we have an RPC to read the file like char* read(), then we want to get a copy of that returned data back on the client.
  • More interestingly, what if we had a void read(char* dest) function that wrote the answer into its argument?
  • The dest pointer is an "out pointer", it is passed just so that the function read can write to it.
  • Then we don't need to send anything, really, to the server
  • But we need the server to send us back the contents of dest after the function call!
  • Such "out pointers" have to be handled specially by RPC frameworks
  • In the labs, we use Java, which has serialization built in. This will make implementing our RPCs relatively easy.
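For example, Java's built-in serialization turns any `Serializable` value into bytes and back; note that deserialization yields a copy, matching the "pass a copy of the pointed-to data" rule above:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;

// Round trip through Java's built-in serialization. Arrays are Serializable,
// so this also illustrates "pass a copy of the data" for pointer-like values.
public class SerializeDemo {
    static byte[] serialize(Object o) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(o);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        int[] original = {1, 2, 3};
        int[] copy = (int[]) deserialize(serialize(original));
        System.out.println(copy != original);              // distinct object: a copy
        System.out.println(Arrays.equals(original, copy)); // same contents
    }
}
```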

RPC vs procedure calls

  • From the application programmer's perspective, very similar to a normal (local) procedure call.
  • Some additional complexities under the hood that don't show up with local procedure calls.
  • "Binding": the client RPC library needs to know where to find the server
    • Need to make sure the function we want to call actually exists on the server
    • And that the server is running the version of the software we expect
    • Binding is often solved through "service discovery"
    • Have one well-known server whose job it is to keep track of all the servers, their names/addresses, what RPCs they support, what version they're running, etc.
    • Then clients first talk to the service discovery server to find a server that will work for them.
  • Implementing the stubs
    • The RPC framework often has a compiler-like thing that will autogenerate code to do serialization, send, recv, deserialization, etc.
    • Takes as input the signatures of the procedures.
  • Performance
  • A local procedure call is very fast, on the order of 10 instructions (a few nanoseconds)
  • RPC to a machine in the same data center: about 100 microseconds (10k times slower than local call)
  • RPC to a machine on other side of planet: about 100 milliseconds (10 million times slower than a local call)
  • Solutions:
    • Issue multiple requests in parallel
    • Batch requests together
    • Cache results of requests

Design documents

  • A design document describes a distributed protocol at a high-level but with enough detail that a competent programmer could implement it without having to think about any distributed systems-y aspects of the problem (a big ask!).
  • We will follow a highly structured template in this class, given here.

Design doc for lab 1

  • You will write a design document for lab 1 part c (exactly once semantics).
  • Follow the template.
    • Don't forget to include timers (which were not present in Naive RPC).
    • You can submit your design doc in any readable format as a PDF on Gradescope.
      • Handwritten is fine, so is plain text ascii, word doc, \(\LaTeX\), or anything else.
    • Be sure these sections appear explicitly with exactly these titles in large font in your doc
      • Preface
      • Protocol
      • Analysis
      • Conclusion
    • Besides those section titles, you do not have to follow the rest of the template exactly, but you must include the spirit of all the information requested in the template in some format.
    • Feel free to include diagrams or informal discussion wherever you see fit.
  • For lab 1, you can omit these subsections from the Analysis section:
    • Invariants and stable properties (we will add this starting in lab 2)
    • Performance analysis (we will add this starting in lab 4)