Design documents

This page describes how to write a design document.

A design document describes a distributed protocol at a high-level but with enough detail that a competent programmer could implement it without having to think about any distributed systems-y aspects of the problem (a big ask!).
Notice that this definition of a design document means that it is very different from other forms of software documentation that you may be familiar with. It is not a goal of a design document to describe how your implementation works at a low level. Instead, you should describe conceptually how your system solves the problem it is trying to solve, in such a way that someone else could implement it without looking at your implementation.
We will follow a highly structured template in this class, given below.

Advice on writing a design doc

You can generally omit pieces of "state" that never change, such as the addresses of other nodes. Think of these as configuration parameters.
Try to use math/high-level concepts, rather than Java-specific concepts. For example, say "map" instead of "HashMap". Similarly, prefer saying that some piece of state is "optional" rather than "null". (Though null is ok too, but you should explicitly document any piece of state that is allowed to be null.)
When describing a piece of state that is optional/possibly null, be sure to describe what it means if that state is not present. Can the state only be missing at the beginning of time, or can it be present sometimes and then later absent again?
Feel free to include commentary that describes alternative designs you considered, and why you chose your design over those alternatives.
Timers should almost always follow one of these two design patterns:
- Tick pattern:
  - Timer contains no data
  - Timer is set in the initial state or near the "beginning of time"
  - Timer is unconditionally reset every time it fires
- Discard pattern:
  - Timer contains enough data to tell whether it is "current"
  - Timer is set when some region of time starts
  - When the timer fires, it is checked whether it is still "current", and if not, it is ignored and not reset. Otherwise it is processed and reset.

Template with commentary

(Commentary is in [square brackets].)

Preface
- Goals: [What is the system trying to achieve?]
- Desired fault model: [What failures does the system handle automatically?]
- Challenges: [What makes the goal hard to achieve in the fault model? (If it was easy, we probably wouldn't need a design document.)]
- Assumptions: [What assumptions about client behavior or fault model does the protocol make in order to work?]
Protocol
- Kinds of node: [For example, "client" or "server". Each node is permanently of one kind. You cannot start a node as one kind and then later switch it to another kind.]
  - For each kind of node, what temporary roles it can play, if any. [For example, "primary server" or "backup server" or "idle server". A role is temporary and might change over time.]
- State at each kind of node: [What data does each kind of node need to do its job? Is certain data only present when the node is performing a certain role?]
  - For each piece of state, what type of data is it (integer, set of strings, etc.)?
    - If you refer to any auxiliary types, be sure to define them!
  - For each piece of state, what does it mean intuitively (if not obvious)?
  - For each piece of state, how is it initialized at the beginning of time?
  - For each piece of state, are there any important constraints on how it evolves over time (monotonically increasing, etc.)?
- Messages: [How do the nodes communicate with each other over the network?]
  - List all the kinds of messages
  - For each kind of message, list:
    - Source: [What kind of node (and role, if applicable) sends the message?]
    - Destination: [What kind of node (and role, if applicable) receives the message?]
    - Contents: [What data is sent in the message?]
    - When is it sent? [In other words, under what conditions?]
      - Can this message be sent spontaneously by a node, or only in response to another message/timer?
      - If yes, describe what else (if anything) happens when the message is sent spontaneously by a node.
    - What happens at the destination when it is received?
      - [When should the message be ignored (e.g., because it is old or no longer relevant, or because it is "from the future" and not relevant yet)?]
      - [If a message is not ignored, how does the receiver's state get updated as a result?]
      - [What messages get sent as a result of receiving this message?]
      - [What timers get set as a result of receiving this message?]
- Timers:
  - List all the kinds of timers
  - For each kind of timer, list:
    - What kind of node sets it? [What kind of node (and role, if applicable)?]
    - Contents: [What data (if any) is stored in the timer?]
    - When is it set? [In other words, under what conditions?]
    - What happens when it fires?
      - [When should the timer be ignored (e.g., because it is old or no longer relevant)?]
      - [If a timer is not ignored, how does the local node's state get updated as a result?]
      - [What messages get sent as a result of this timer firing?]
      - [Should this timer get reset?]
      - [What (other) timers get set as a result of this timer firing?]
Analysis
- Fault tolerance analysis
  - For each message, how does the protocol handle delays, reorderings, drops, and duplicates?
    - [There is some redundancy here. If you handle (arbitrarily long) delays and duplicates, that is enough, because a drop is "just an infinitely long delay" and a reordering is just a message getting delayed longer than the next message. If you choose to omit drops, be sure your discussion of delays includes the case where the delay is extremely long. (So, don't say something like "if the message is delayed, it's fine because it will eventually be delivered". Instead, describe how your system would handle the case where the message takes so long to arrive that you're not willing to wait that long.) Alternatively, you can describe how you handle drops explicitly.]
  - For each node, how does the protocol handle fail-stop crashes?
- Invariants and stable properties: (omit until lab 2, and omit in lab 3-4 because we decided it was a bad idea)
  - What important facts are invariants (always true) about the state of each node?
    - List the invariants
      - [As a rule of thumb, any time two different nodes store state that is "related" somehow (e.g., because they are replicas of each other), you should have an invariant that describes the relationship (e.g., how out of sync the replicas can possibly be.)]
    - For each invariant:
      - [It is common for one invariant or stable property to depend on another. Document this explicitly whenever it happens.]
      - Why is the invariant true at the beginning of time before any events have occurred?
      - For each message, if the invariant is true, why does delivering that message not cause the invariant to become false?
        
        [You can skip "boring" messages that are not relevant to the invariant.]
      - For each timer, if the invariant is true, why does firing that timer not cause the invariant to become false?
        
        [You can skip "boring" timers that are not relevant to the invariant.]
  - What important facts can a node depend on being stable (once it is true, it never becomes false again)?
    - List the stable properties
      - [As a rule of thumb, you should probably have at least one stable property per message, which tells you what the receiver can safely conclude about the state of the sender when the receiver receives the message. (Stability is essential here because the message will get received (much) later than it was sent.)]
    - For each stable property:
      - [It is common for one invariant or stable property to depend on another. Document this explicitly whenever it happens.]
      - For each message, if the property is true, why does delivering that message not cause the property to become false?
        
        [You can skip "boring" messages that are not relevant to the property.]
      - For each timer, if the property is true, why does firing that timer not cause the property to become false?
        
        [You can skip "boring" timers that are not relevant to the property.]
- Performance analysis: (omit until lab 4)
  - During a "normal" workflow without any network problems or node failures, how many messages does the system send in order to execute a single client request? How much data is exchanged in those messages?
  - How does the answer to the previous question change when various messages get dropped?
  - How does the answer to the previous question change when various nodes crash fail-stop?
  - After \(n\) client requests have been successfully executed, how much state does the system store?
Conclusion
- Goals achieved:
  - For each goal, explain why the protocol and analysis (invariants, stable properties, fault tolerance analysis, and performance analysis together) guarantee that the system achieves the goal.
    - [Be specific! Say exactly what stable property or invariant guarantees the goal and how.]
- Limitations:
  - What goals are only achieved sometimes? [For example, because too many failures occurred.]

Template without commentary (copy-paste-able)

Preface
- Goals:
- Desired fault model:
- Challenges:
- Assumptions:
Protocol
- Kinds of node:
  - For each kind of node, what temporary roles it can play, if any.
- State at each kind of node:
  - For each piece of state, what type of data is it (integer, set of strings, etc.)?
    - If you refer to any auxiliary types, be sure to define them!
  - For each piece of state, what does it mean intuitively (if not obvious)?
  - For each piece of state, how is it initialized at the beginning of time?
  - For each piece of state, are there any important constraints on how it evolves over time (monotonically increasing, etc.)?
- Messages:
  - List all the kinds of messages
  - For each kind of message, list:
    - Source:
    - Destination:
    - Contents:
    - When is it sent?
      - Can this message be sent spontaneously by a node, or only in response to another message/timer?
      - If yes, describe what else (if anything) happens when the message is sent spontaneously by a node.
    - What happens at the destination when it is received?
- Timers:
  - List all the kinds of timers
  - For each kind of timer, list:
    - What kind of node sets it?
    - Contents:
    - When is it set?
    - What happens when it fires?
Analysis
- Fault tolerance analysis
  - For each message, how does the protocol handle delays, reorderings, drops, and duplicates?
  - For each node, how does the protocol handle fail-stop crashes?
- Invariants and stable properties: (omit until lab 2, and omit in lab 3-4 because we decided it was a bad idea)
  - What important facts are invariants (always true) about the state of each node?
    - List the invariants
    - For each invariant:
      - Why is the invariant true at the beginning of time before any events have occurred?
      - For each message, if the invariant is true, why does delivering that message not cause the invariant to become false?
      - For each timer, if the invariant is true, why does firing that timer not cause the invariant to become false?
  - What important facts can a node depend on being stable (once it is true, it never becomes false again)?
    - List the stable properties
    - For each stable property:
      - For each message, if the property is true, why does delivering that message not cause the property to become false?
      - For each timer, if the property is true, why does firing that timer not cause the property to become false?
- Performance analysis: (omit until lab 4)
  - During a "normal" workflow without any network problems or node failures, how many messages does the system send in order to execute a single client request? How much data is exchanged in those messages?
  - How does the answer to the previous question change when various messages get dropped?
  - How does the answer to the previous question change when various nodes crash fail-stop?
  - After \(n\) client requests have been successfully executed, how much state does the system store?
Conclusion
- Goals achieved:
  - For each goal, explain why the protocol and analysis (invariants, stable properties, fault tolerance analysis, and performance analysis together) guarantee that the system achieves the goal.
- Limitations:
  - What goals are only achieved sometimes?