A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing
S. Floyd, V. Jacobson, C. Liu, S. McCanne, and L. Zhang

This paper describes the Scalable Reliable Multicast (SRM) system, which is based on Clark and Tennenhouse's Application Level Framing (ALF) principle. The paper proposes a reliable multicast system that intentionally leaves ordered delivery as an add-on feature. Another key point about SRM is that it heavily leverages the existing IP multicast protocol, in which receivers announce interest in joining a multicast group without any knowledge of the group's membership. The paper also does a good job of explaining why many unicast concepts do not apply well in a multicast world, such as loss detection/recovery, scalability, and shared communication state.

The authors also describe "wb", a distributed whiteboard tool built on the SRM framework. I found this really helpful as a concrete example of a multicast application that has limited use for ordering. One particularly interesting anecdote is that drawing operations are idempotent and therefore need no ordering information, with the exception of "delete", which can be "patched after the fact". The assumptions behind wb's design also explain much of the SRM design, such as unique and persistent names for data and sources, and the lack of any distinction between senders and receivers within a multicast group.

Loss detection and repair are the main focus of this paper. Each individual receiver is responsible for detecting loss and then requesting retransmission. The request itself is multicast, so other nodes that suffered the same loss are spared from issuing duplicate requests. What is interesting about this model is how to prevent too many duplicate requests when multiple nodes miss the same packet, and how to keep useless repair messages from consuming bandwidth in parts of the network that did not experience the loss. The first issue is solved with timing or randomness, depending on the network topology; the second is addressed using the concept of local neighborhoods.

Chains are topologies that can easily take advantage of "deterministic suppression", using timers to control when losses are reported so that nodes closer to the point of failure respond first. However, when the authors state that "we assume that packets take one unit of time to travel each link", I did wonder what this implied about the use of synchronized clocks. Stars use "probabilistic suppression", which relies on randomness since time/distance is indistinguishable across any pair of nodes. Finally, bounded-degree trees use a combination of the two. (A rough sketch of how the request timers combine both forms of suppression appears at the end of this review.) The paper describes experiments to validate the approaches, but notes that no single set of timing values fits all scenarios. The authors then describe an adaptive algorithm that adjusts some of the timing values based on factors such as message transmission frequency; the second sketch below illustrates the flavor of such an adjustment rule.

The local recovery section was interesting, but I felt it wasn't fully fleshed out. How do you know when to send to your local neighborhood instead of the global network? Wouldn't you really want to send to one node beyond the local neighborhood to find out how limited the problem is? I did notice that the future work section mentions spending more time on this.

Overall, I thought this was a much more constructive look at the problems with CATOCS that Cheriton and Skeen had raised.
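To make the suppression mechanism concrete, here is a minimal Python sketch of how a receiver might schedule a repair request. The constants C1 and C2 echo the paper's request-timer parameters, but the Receiver class, its event hooks, and the doubling backoff factor are my own illustrative assumptions, not the paper's code.

```python
import random

# Request-timer constants in the paper's notation; the values here
# are placeholders, since the paper shows no single setting fits all.
C1, C2 = 2.0, 2.0

def request_delay(dist_to_source):
    """Backoff before multicasting a repair request.

    dist_to_source is this receiver's estimated one-way delay to the
    original sender. The C1 * dist term is the deterministic part
    (nodes nearer the loss fire first, as on a chain); the random C2
    term is the probabilistic part (it desynchronizes nodes at equal
    distance, as in a star). Trees benefit from both effects at once.
    """
    return random.uniform(C1 * dist_to_source,
                          (C1 + C2) * dist_to_source)

class Receiver:
    """Hypothetical receiver-side bookkeeping; not the paper's code."""

    def __init__(self, dist_to_source):
        self.dist = dist_to_source
        self.pending = {}  # seqno -> time its request timer fires

    def on_loss_detected(self, seqno, now):
        # Don't request immediately; arm a suppression timer instead.
        self.pending[seqno] = now + request_delay(self.dist)

    def on_request_heard(self, seqno, now):
        # Another node asked first: suppress our own request and
        # back off (the paper uses an exponential backoff here).
        if seqno in self.pending:
            self.pending[seqno] = now + 2 * request_delay(self.dist)

    def due_requests(self, now):
        # Requests whose timers expired without being suppressed
        # are multicast to the whole group.
        return [s for s, t in self.pending.items() if t <= now]
```

Repair timers on the responding side work symmetrically, except that they are keyed on the distance to the requester rather than to the original source.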
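And here is the flavor of the adaptive idea, sketched as a post-recovery adjustment of the same constants. The trigger conditions loosely mirror the paper's description, but the thresholds and step sizes are invented for illustration.

```python
def adjust_request_constants(c1, c2, dup_requests, avg_req_delay):
    """Tune (C1, C2) after each loss-recovery round.

    dup_requests: duplicate requests observed for the last loss.
    avg_req_delay: average delay (in units of distance-to-source)
    before a request actually went out.
    """
    if dup_requests > 1:
        # Too many nodes fired at once: widen the random window
        # so suppression has time to work.
        c2 += 0.1
    elif avg_req_delay > 2.0:
        # No duplicate storm, but recovery is slow: tighten the
        # timers so requests go out sooner.
        c1 = max(c1 - 0.1, 0.1)
        c2 = max(c2 - 0.1, 0.1)
    return c1, c2
```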