Google's introduction to distributed system design

What is a distributed system?
- "an application that executes a collection of protocols to coordinate the
  actions of multiple processes on a network, such that all components
  cooperate together to perform a single or small set of related tasks."

- application -- what are some examples of distributed systems?
  - web, email, DNS, NFS, NTP, IP routing, ...
  - is an OS a distributed system? (increasingly, yes)

- multiple processes -- implications?
  + if designed well, scalable
  + if designed well, {fault tolerant, highly available} through redundancy
  - need to worry about partial failures (of processes, of the network).
    This is the biggest complication in distributed systems, and a very
    fundamental truth: design for failure.
  - need to worry about consistency of state
    - can reduce this to the order in which messages are processed -- the
      replicated state machine model (see the replicated state machine
      sketch at the end of these notes)
  - need to worry about security
    - honest but mutually distrusting processes
    - or, worst case: byzantine processes

- actions -- what are these?
  - several kinds are possible
    - mutating local state (volatile or durable)
    - sending a message to another process
    - external output: dispensing cash, launching missiles, turning on a pixel

- protocols -- why?
  - processes are separated by a network and coordinate through messages, so
    they need rules for the format, sequencing, and meaning of those messages
    -- basically a "network API"
  - what's hard about this?
    - all the usual difficulties of API design -- generality vs. specificity,
      precision of semantics, etc.
    - the possibility of multiple versions coexisting (see the protocol
      versioning sketch at the end of these notes)

- network -- implications?
  - have to worry about naming/addressing -- who do I communicate with?
    We often know what data we want, but need some kind of naming/location
    system to translate that into the host/IP we contact to get the data
    (see the naming/location sketch at the end of these notes)
    - lots of interesting subproblems -- load balancing, closest-server
      selection, ...
  - have to worry about network performance characteristics
    - throughput, and congestion
    - latency (possibly unbounded)
    - packet loss rates
    - possibly even message duplication
  - network failures/partitions can happen
    - makes it hard to achieve consistency and availability
  - the only way to monitor the health of a remote process is through message
    exchange; in any practical (asynchronous) network, you cannot distinguish
    between a slow host, a failed host, and an unreachable host (see the
    heartbeat sketch at the end of these notes)
    - if I send you a message and don't get a response, I can't tell whether
      you received/processed the message or not, which makes it hard to come
      to agreement
  - messages can be spoofed -- need authentication and integrity mechanisms
    (see the message integrity sketch at the end of these notes)

- cooperate to perform a task -- what does this mean in practice? deep
  architectural considerations
  - where is the "truth" of the state kept?
    - one place: bad for scalability, availability, and read performance
    - many places: bad for consistency and write performance
  - where does code execute?
    - client/server, peer-to-peer, etc.

Failures, failures, failures
- many kinds
  - of processes
    - halting -- the component just stops
    - fail-stop -- a failure message is sent out
    - byzantine -- arbitrary behavior, including malicious
  - of networks
    - timing -- a message is delayed
    - drops -- packets are lost, often due to congestion (on wired networks)
    - corruption -- malicious or accidental
    - link failure -- implications for capacity, maybe reachability
    - partition -- the network splits into disjoint subnetworks
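
Sketches (all in Go; every name, type, and value below is an illustrative
assumption, not anything taken from the Google article)

Replicated state machine sketch. A minimal illustration of why consistency
reduces to message order: two replicas that apply the same deterministic
operations in the same order end in identical states. The Op and Replica
types are invented for this example.

    package main

    import "fmt"

    // Op is a deterministic state-machine operation (a toy key/value write).
    type Op struct {
        Key, Value string
    }

    // Replica holds one copy of the state.
    type Replica struct {
        state map[string]string
    }

    func NewReplica() *Replica { return &Replica{state: map[string]string{}} }

    // Apply must be deterministic: no randomness, no clocks, no local I/O.
    func (r *Replica) Apply(op Op) { r.state[op.Key] = op.Value }

    func main() {
        // The agreed-upon message order -- this ordering is exactly what
        // consensus protocols exist to establish.
        ops := []Op{{"x", "1"}, {"y", "2"}, {"x", "3"}}
        a, b := NewReplica(), NewReplica()
        for _, op := range ops {
            a.Apply(op)
            b.Apply(op)
        }
        fmt.Println(a.state) // map[x:3 y:2]
        fmt.Println(b.state) // map[x:3 y:2] -- same order, same state
    }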
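
Protocol versioning sketch. One common way to let multiple protocol versions
coexist on the same network is an explicit version/type header parsed before
the body; the Envelope type and the JSON wire format here are assumptions for
illustration, not a prescribed design.

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // Envelope is a hypothetical wire format: version and type are decoded
    // first, and only then is the body interpreted.
    type Envelope struct {
        Version int             `json:"version"`
        Type    string          `json:"type"`
        Body    json.RawMessage `json:"body"` // meaning depends on Version+Type
    }

    func main() {
        raw := []byte(`{"version":2,"type":"get","body":{"key":"x"}}`)
        var env Envelope
        if err := json.Unmarshal(raw, &env); err != nil {
            fmt.Println("malformed message:", err)
            return
        }
        if env.Version > 2 {
            fmt.Println("peer speaks a newer version: negotiate down or reject")
            return
        }
        fmt.Printf("v%d %s request, body %s\n", env.Version, env.Type, env.Body)
    }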
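
Naming/location sketch. The translation step described above -- from the name
of the thing we want to addresses we can actually contact -- using Go's
standard DNS resolver (the hostname is just an example, and the call needs
network access at runtime).

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        // We know *what* we want ("example.com"); DNS tells us *where*.
        // It may return several candidate addresses, which is where
        // subproblems like load balancing and closest-server selection arise.
        addrs, err := net.LookupHost("example.com")
        if err != nil {
            fmt.Println("lookup failed:", err)
            return
        }
        fmt.Println(addrs)
    }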
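
Heartbeat sketch. Monitoring a remote process by message exchange, with a
channel standing in for the network. The key point is what the timeout branch
does not tell us: a slow peer, a crashed peer, and an unreachable peer all
look exactly the same from here.

    package main

    import (
        "fmt"
        "time"
    )

    // ping waits up to timeout for a heartbeat reply. A false result means
    // only "no reply arrived in time" -- slow, failed, and partitioned are
    // indistinguishable.
    func ping(replies <-chan struct{}, timeout time.Duration) bool {
        select {
        case <-replies:
            return true
        case <-time.After(timeout):
            return false
        }
    }

    func main() {
        replies := make(chan struct{}, 1)
        // Simulated peer that is merely slow, not dead...
        go func() {
            time.Sleep(200 * time.Millisecond)
            replies <- struct{}{}
        }()
        // ...yet a 100ms timeout declares it failed anyway.
        fmt.Println("peer alive?", ping(replies, 100*time.Millisecond))
    }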
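
Message integrity sketch. One standard mechanism against spoofing and
tampering, HMAC-SHA256 from Go's standard library: the receiver recomputes
the tag and rejects any message not produced by a holder of the shared key.
This assumes the key was already distributed securely, which is its own
problem.

    package main

    import (
        "crypto/hmac"
        "crypto/sha256"
        "fmt"
    )

    // sign computes an authentication tag over msg under a shared key.
    func sign(key, msg []byte) []byte {
        mac := hmac.New(sha256.New, key)
        mac.Write(msg)
        return mac.Sum(nil)
    }

    // verify recomputes the tag and compares in constant time; a mismatch
    // means the message was altered or didn't come from a key holder.
    func verify(key, msg, tag []byte) bool {
        return hmac.Equal(tag, sign(key, msg))
    }

    func main() {
        key := []byte("pre-shared secret") // assumption: established out of band
        msg := []byte("transfer 100 to bob")
        tag := sign(key, msg)
        fmt.Println(verify(key, msg, tag))                           // true
        fmt.Println(verify(key, []byte("transfer 900 to eve"), tag)) // false
    }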