Design documents

This page describes how to write a design document.

  • A design document describes a distributed protocol at a high-level but with enough detail that a competent programmer could implement it without having to think about any distributed systems-y aspects of the problem (a big ask!).
  • Notice that this definition of a design document means that it is very different from other forms of software documentation that you may be familiar with. It is not a goal of a design document to describe how your implementation works at a low level. Instead, you should describe conceptually how your system solves the problem it is trying to solve, in such a way that someone else could implement it without looking at your implementation.
  • We will follow a highly structured template in this class, given below.

Advice on writing a design doc

  • You can generally omit pieces of "state" that never change, such as the addresses of other nodes. Think of these as configuration parameters.
  • Try to use math/high-level concepts, rather than Java-specific concepts. For example, say "map" instead of "HashMap". Similarly, prefer saying that some piece of state is "optional" rather than "null". (Though null is ok too, but you should explicitly document any piece of state that is allowed to be null.)
  • When describing a piece of state that is optional/possibly null, be sure to describe what it means if that state is not present. Can the state only be missing at the beginning of time, or can it be present sometimes and then later absent again?
  • As you will see in lecture and section, we often explain distributed systems protocols using pictures. You will likely also find pictures as a way to organize/explain your thoughts. Note that a picture isn't a complete design - you also need to include text that describes specific actions when a specific message is received. We have found the website draw.io to be easy to use (it is reportedly what Amazon uses internally), but feel free to draw on paper and include scans.
    • A basic figure will include just the normal case data flow (what happens when there are no failures or packet losses).
    • Another figure might include failure cases, such as what happens on a timeout (eg., retransmission).
    • Another figure might include background maintenance, such as heartbeats or garbage collection.
  • Feel free to include commentary that describes alternative designs you considered, and why you chose your design over those alternatives.
  • Timers should almost always follow one of these two design patterns:
    • Tick pattern:
      • Timer contains no data
      • Timer is set in the initial state or near the "beginning of time"
      • Timer is unconditionally reset every time it fires
    • Discard pattern:
      • Timer contains enough data to tell whether it is "current"
      • Timer is set when some region of time starts
      • When the timer fires, it is checked whether it is still "current", and if not, it is ignored and not reset. Otherwise it is processed and reset.

More advice

  • You are not bound to the implementation you describe in your design doc. You may (should!) change your implementation after you submit or after you see your feedback, if that you think that will produce a cleaner design and/or one that is more likely to work.
  • Please read your specific feedback. You likely got feedback on every question/section, even if you got full credit. To find the feedback, you must click through every page for every section.
  • We recommend that you follow the template. The questions in the template force you to address important edge cases.
  • The bulk of your protocol section is in the answer to the question “what happens when this message is received at this destination?”, so be comprehensive.
    • What edge cases might you be checking for before you do a certain execution?
    • Do certain states change how you handle a specific message?
  • Multiple short bullet points are way more readable than fewer long bullet points.
  • Feel free to give messages/objects/things names, define them, and reference them throughout your design doc. This is preferred because it makes the doc more concise and clear.
  • Try to make your design doc application agnostic (avoid mentioning a specific application and commands, like KVStore and Get/Puts) and language agnostic (avoid Java specific details).
  • Make note of the length of your timers, especially in relation to other timers (eg, how many times will your ping timer go off versus your ping check timer?)
  • Correctness section should look something like this:
    • For each message type, state why the protocol remains correct and live in the following cases:
      • One or more of the message is delayed
      • One or more of the message is dropped
      • One or more of the message is duplicated
      • One or more of the message is reordered
    • State why the protocol remains correct and live even when nodes crash at any point in time in the execution (barring any limitations/assumptions you have previously stated)
    • State why the protocol correct/live even when a group of nodes are partitioned from some but not all nodes, and then the partition is resolved (messages resume between nodes that had been partitioned).

Template with commentary

(Commentary is in [square brackets].)

  • Preface
    • Goals: [What is the system trying to achieve?]
    • Desired fault model: [What failures does the system handle automatically?]
    • Challenges: [What makes the goal hard to achieve in the fault model? (If it was easy, we probably wouldn't need a design document.)]
    • Assumptions: [What assumptions about client behavior or fault model does the protocol make in order to work?]
  • Protocol
    • Kinds of node: [For example, "client" or "server". Each node is permanently of one kind. You cannot start a node as one kind and then later switch it to another kind.]
      • For each kind of node, what temporary roles it can play, if any. [For example, "primary server" or "backup server" or "idle server". A role is temporary and might change over time.]
    • State at each kind of node: [What data does each kind of node need to do its job? Is certain data only present when the node is performing a certain role?]
      • For each piece of state, what type of data is it (integer, set of strings, etc.)?
        • If you refer to any auxiliary types, be sure to define them!
      • For each piece of state, what does it mean intuitively (if not obvious)?
      • For each piece of state, how is it initialized at the beginning of time?
      • For each piece of state, are there any important constraints on how it evolves over time (monotonically increasing, etc.)?
    • Messages: [How do the nodes communicate with each other over the network?]
      • List all the kinds of messages, ideally with a diagram showing the relationship of nodes and messages to each other - who sends what - in both the normal (non-failure case) and when there are packet losses/timeouts
      • For each kind of message, list:
        • Source: [What kind of node (and role, if applicable) sends the message?]
        • Destination: [What kind of node (and role, if applicable) receives the message?]
        • Contents: [What data is sent in the message?]
        • When is it sent? [In other words, under what conditions?]
          • Can this message be sent spontaneously by a node, or only in response to another message/timer?
          • If yes, describe what else (if anything) happens when the message is sent spontaneously by a node.
        • What happens at the destination when it is received?
          • [When should the message be ignored (e.g., because it is old or no longer relevant, or because it is "from the future" and not relevant yet)?]
          • [If a message is not ignored, how does the receiver's state get updated as a result?]
          • [What messages get sent as a result of receiving this message?]
          • [What timers get set as a result of receiving this message?]
    • Timers:
      • List all the kinds of timers, ideally with a diagram
      • For each kind of timer, list:
        • What kind of node sets it? [What kind of node (and role, if applicable)?]
        • Contents: [What data (if any) is stored in the timer?]
        • When is it set? [In other words, under what conditions?]
        • What happens when it fires?
          • [When should the timer be ignored (e.g., because it is old or no longer relevant)?]
          • [If a timer is not ignored, how does the local node's state get updated as a result?]
          • [What messages get sent as a result of this timer firing?]
          • [Should this timer get reset?]
          • [What (other) timers get set as a result of this timer firing?]
  • Correctness/Liveness Analysis
    • As an overview, provide a table: on one side message types, in the other dimension possible failure conditions (drop, delay, reorder, duplicate), where the text in each box outlines why the protocol works for that case (or references to a longer explanation in the text). Some table entries may be not applicable or use the same explanation as some other table entry; that is useful to have organized in one place.
    • For each message, why is the protocol correct/live even if the network delays, reorders, drops, or duplicates that message? [There is overlap in the cases - a long enough delay could be a reorder or cause a timeout.]
    • Why is the protocol correct/live even if the network delays or drops a group of messages? [For example, what if the network drops both the reply and the resend of an RPC, or a long enough sequence of resends, or a large portion of messages sent?]
    • For each node, why is the protocol correct/live even if that node crashes or disappears (doesn't send or recieve messages) for a long time and then later is reconnected? [Note that this overlaps with the previous case - disappearing is like dropping or delaying a group of messages.]
    • Why is the protocol correct/live even when a group of nodes are partitioned from some but not all nodes, and then the partition is resolved. [Again, note the overlap with the prior case. You may assume that the network is always commutative and transitive - if A can reach B, then B can reach A, and if A can reach B and B can reach C, then A can reach C - provided they retry for long enough. This is a simplification since real networks - eg, the Internet - are sometimes are not commutative or not transitive.]
    • [Correctness: the program doesn't do the wrong thing; liveness: the program does something useful (esp if it is also the correct thing to do. In general, the failure model of each particular lab will define which cases need to be handled for correctness and/or liveness, as the previous bullet explains for the network model used across the labs.]
    • [If you choose to ignore drops because they are similar to arbitrary delays, be sure your discussion of delays includes the case where the delay is extremely long. (So, don't say something like "if the message is delayed, it's fine because it will eventually be delivered". Instead, describe how your system would handle the case where the message takes so long to arrive that you're not willing to wait that long.) Alternatively, you can describe how you handle drops explicitly.]
  • Conclusion
    • Goals achieved:
      • For each goal, summarize why the protocol guarantees that the system achieves the goal.
    • Limitations:
      • What goals are only achieved sometimes? [For example, because too many failures occurred.]

Template without commentary (copy-paste-able)

  • Preface
    • Goals:
    • Desired fault model:
    • Challenges:
    • Assumptions:
  • Protocol
    • Kinds of node:
      • For each kind of node, what temporary roles it can play, if any.
    • State at each kind of node:
      • For each piece of state, what type of data is it (integer, set of strings, etc.)?
        • If you refer to any auxiliary types, be sure to define them!
      • For each piece of state, what does it mean intuitively (if not obvious)?
      • For each piece of state, how is it initialized at the beginning of time?
      • For each piece of state, are there any important constraints on how it evolves over time (monotonically increasing, etc.)?
    • Messages:
      • List all the kinds of messages, e.g., in a diagram
      • For each kind of message, list:
        • Source:
        • Destination:
        • Contents:
        • When is it sent?
          • Can this message be sent spontaneously by a node, or only in response to another message/timer?
          • If yes, describe what else (if anything) happens when the message is sent spontaneously by a node.
        • What happens at the destination when it is received?
    • Timers:
      • List all the kinds of timers
      • For each kind of timer, list:
        • What kind of node sets it?
        • Contents:
        • When is it set?
        • What happens when it fires?
  • Correctness/Liveness Analysis
    • As an overview, provide a table: on one side message types, in the other possible failure cases.
    • For each message, why is the protocol correct/live even if the network delays, reorders, drops, or duplicates that message?
    • Why is the protocol correct/live even if the network delays or drops a group of messages?
    • For each node, why is the protocol correct/live even if that node crashes or disappears (doesn't send or recieve messages) for a long time and then later is reconnected?
    • Why is the protocol correct/live even when a group of nodes are partitioned from some but not all nodes, and then the partition is resolved.
  • Conclusion
    • Goals achieved:
      • For each goal, summarize why the protocol guarantees that the system achieves the goal.
    • Limitations:
      • What goals are only achieved sometimes?