A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing
Floyd, Jacobson, Liu, McCanne, and Zhang

I will start by saying that this paper (which I hadn't read before, only some of the other papers on SRM) is much more upfront about the number one thing that has always annoyed me about SRM: that it is a toolkit and a (family of) algorithms, and not a component or layer on its own. Yes, there are often good reasons to tie the application tightly into a network protocol like this (this is the ALF argument, which they cite, though it still seems a bit of a stretch here), but there are also lots of good reasons not to, and I think most programmers would rather trade performance for a simple, easy-to-describe interface and a well-encapsulated implementation. Now, if someone could show more rigorously that there *isn't* a reasonable way to encapsulate reliability (or, for that matter, ordering or congestion control) in a multicast world, that would be something else. Can we come up with a defensible understanding of what makes multicast different from unicast in this regard? The unicast equivalent, TCP, doesn't seem to have been nearly so hard to figure out or get pretty much right, particularly if you aren't worried about congestion control.

I would point out here, and largely ignore the issue from here on, that SRM only looks at one of the TCP three. It totally ignores ordering (saying only that, by using ALF, applications will be able to talk about out-of-order data using application names) and congestion control (wb just fixes a per-sender bandwidth, and gets away with it because the drawing traffic is small potatoes). Given that this is already a framework, I don't see any problem with that. The only reasons the three are tied together in TCP are to make it easier to specify which combination you want (and you can't argue that "any color you want, as long as it's black" isn't *easy*), and because all three use information gleaned from the same stream of acks, and a layering model doesn't want to pass that dense information around between layers.

As a final rant on the subject: since SRM is really a toolkit as much as a protocol, it would be better if there were a fuller description of how the toolkit interface works. Otherwise it's hard to evaluate the work from the programmer's point of view: how different is building an SRM application from building a TCP application?

Stepping down to the protocol level, the issue that annoys me most is the arguable reliance on every receiver knowing its RTT not only to the sender, but to every other receiver as well (modulo the hierarchical session messages sketched at the end). The notion of exchanging messages (albeit at a lower rate than the data itself) between all pairs seems in opposition to many of the goals of IP multicast, both in terms of scalability and anonymity of receivers. I think it's significant that wb, the only implementation they describe, does not implement this, and simply drops a constant of 30 seconds into the places in the formula where the RTT would go. They don't analyze the case where the RTT is stubbed out this way, and I have to imagine that it degrades the theoretical performance substantially.

Whew. Okay, I suppose I ought to stop ranting and go over the basic technique. As an aside, I will likely use "bottleneck" interchangeably with "link (or output port) where the loss happened," because that's the way I think about it.
The idea is to shift the burden of retransmission from the original sender of the data to the receivers close to where the loss occurred. Note that this is not the same as either active networks or link-level retransmission, because when we say "receivers near the bottleneck" we do *not* mean "routers near the bottleneck." If there isn't a receiver immediately after the bottleneck, so be it; there is always a "nearest" receiver even if it isn't very near. The potential upside of this is actually quite significant: if the loss happens far from the sender and close to a number of receivers, it is likely that some receiver who did get the missing data is substantially closer to those who missed it than the original sender is. In the best case, then, we might be able to fill in the hole in less than a full round-trip time. In practice they never actually manage this, and it's easy to imagine that the slop built into the timeouts will always exceed typical RTTs.

But the real advantage of querying nearby receivers to find a copy of the data is that we don't need to involve the original sender in recovery at all. This matters both to reduce state and computation at the sender, and because it's easy to construct situations where even a small loss rate at each receiver adds up to nearly every packet being lost somewhere in the tree. If all recovery went through the sender, most of its work would end up going to retransmissions once network loss rates rose much above zero. This isn't even a congestion-collapse situation, because we aren't talking about resending unnecessarily: somewhere the data really was lost.

As another aside (I think I saw this once in the context of SRM, but it wasn't in the paper), the process of having the receivers who lost a packet find the nearest receiver who didn't gives us a pretty good idea of where in the tree the losses are happening. One could imagine applying this knowledge to FEC or to routing, although it makes a lot more sense in an application-level multicast scheme than it does for IP multicast.

Everything about the basic protocol centers on random backoff timers (a la Ethernet) and broadcast of NACKs and retransmissions to the entire group. When a receiver discovers it is missing a packet (by seeing later sequence numbers), it chooses a delay, waits that long, and, if it hasn't seen either someone else's NACK or the retransmission in the interim, broadcasts a NACK for the missing data item. When a receiver receives a NACK for an item it has, it chooses a similar delay and then sends the retransmission if nobody else beats it to the punch. The timeouts are all based on min and max constants, multiplied by the RTT between the two nodes in question (the receiver and the sender for a NACK, and the requester and the responder for a retransmission); sketches of the timer logic, of the session-message exchange the distance estimates come from, and of the adaptive tuning of the constants appear below.

The analysis in the paper consists of showing, in various cases, the deterministically known or expected performance on three standard topologies: chains, stars, and trees. They try to map out the space of the time-to-recover vs. duplicate-retransmission tradeoff as the parameters are varied. They then show an algorithm for tuning the four constants on the fly, using what seems like a fairly typical add/subtract algorithm with hard bounds on the ranges of the constants, and show that it converges fairly quickly to a much smaller number of duplicate retransmissions without significantly impacting time-to-repair.
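To make the timer scheme concrete, here is a minimal sketch of the request and repair delays in the interval form the paper describes (uniform on a range scaled by the estimated distance). The constant values are arbitrary placeholders, not recommended settings, and the distance arguments stand in for whatever estimates the session messages provide.

```python
import random

# The four tunable constants from the paper; the values here are arbitrary
# placeholders, not the paper's recommended or adaptively tuned settings.
C1, C2 = 2.0, 2.0
D1, D2 = 1.0, 1.0

def request_timer(dist_to_source):
    """Delay before multicasting a NACK (request) for missing data.

    dist_to_source is this receiver's estimated delay to the original
    sender.  Drawn uniformly from [C1*d, (C1+C2)*d], so receivers closer
    to the source tend to fire first and suppress everyone else.
    """
    return random.uniform(C1 * dist_to_source, (C1 + C2) * dist_to_source)

def repair_timer(dist_to_requester):
    """Delay before multicasting a repair in answer to someone's request.

    dist_to_requester is the estimated delay to the host that sent the
    NACK; nearby holders of the data tend to answer first.
    """
    return random.uniform(D1 * dist_to_requester, (D1 + D2) * dist_to_requester)

# Suppression: if, while one of these timers is pending, we hear another
# host's request for the same item (or the repair itself), we back off or
# cancel our own timer instead of adding to the implosion.
```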
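And since I complained above about every receiver needing a distance estimate to every other receiver, here is roughly how those estimates fall out of the session messages: each host echoes back the timestamp it last saw from a peer along with how long it held it, so no clock synchronization is needed. A sketch with made-up numbers:

```python
# Pairwise distance estimation from session messages, in the spirit of the
# paper's mechanism.  Only differences measured on our own clock matter.

def distance_estimate(t_sent, t_echo_arrived, hold_time):
    """Estimate the one-way delay to a peer.

    t_sent         -- our clock when we multicast our session message
    t_echo_arrived -- our clock when the peer's session message, which
                      echoes our timestamp, arrives back at us
    hold_time      -- how long the peer reports holding our timestamp
                      before sending its own session message
    """
    rtt = (t_echo_arrived - t_sent) - hold_time
    return max(rtt, 0.0) / 2.0

# Example: sent at t=10.0s, echo arrived at t=70.0s, peer held it for
# 59.8s -> RTT of about 0.2s, so a distance estimate of about 0.1s.
print(distance_estimate(10.0, 70.0, 59.8))
```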
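Finally, the adaptive tuning. The sketch below is not the paper's exact update rule (which conditions on observed duplicates and recovery delay more carefully); it just shows the add/subtract-with-hard-bounds shape described above, for a single constant.

```python
# Illustrative additive adjust-with-bounds loop for one timer constant.
# Bounds and step size are placeholder values, not the paper's.

LOW, HIGH = 0.5, 10.0   # hard bounds on the constant
STEP = 0.1              # additive adjustment per recovery round

def adjust(c, duplicates_seen, recovery_was_slow):
    if duplicates_seen:
        c = min(c + STEP, HIGH)   # spread the timers out more
    elif recovery_was_slow:
        c = max(c - STEP, LOW)    # tighten the timers to answer sooner
    return c

print(adjust(2.0, duplicates_seen=True, recovery_was_slow=False))  # 2.1
```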
A couple of nits on methodology and data presentation. Note that the tree topology isn't really a cross between the star and the chain, because all of the internal nodes are receivers. There is therefore always a unique node closest to the bottleneck, and so deterministic timeouts are always the right choice in theory. Only if the loss occurred immediately above some internal node that was not a receiver (only a forwarding router not participating in recovery) could all of its children be the same distance from the loss. One thing that is inadequately explained is why, in figure 4, the number of repairs is so much larger than the number of requests. The claim is that this is a result of the distance from the bottleneck to the detecting receiver, but I don't see why this makes things asymmetric.

As yet another semi-related notion: they mention layered multicast towards the end of the paper. This is usually proposed for video, where each layer adds quality while keeping all receivers synchronized, but they mention offhand using the extra layers to speed up some receivers (which rather assumes that people are trying to download some chunk of data and then leave). There is actually a very simple, pretty much application-independent layered coding that occurs to me: layer 0 sends all the data in order (and then loops, as do all the layers); layer 1 sends the same data offset by half the length of the stream; layer 2 sends at twice the base rate, carrying both a 1/4 and a 3/4 offset; layer 3 sends at four times the base rate, with 1/8, 3/8, 5/8, and 7/8 offsets. I believe that any receiver can pick up any prefix of the layers at any time, listen for 1/(2^l) of the stream length (where l is the top layer received), and get all the data (a small sanity check of this schedule appears at the end of these notes).

Given that IP multicast looks to be headed for the network bit bucket, the relevance of SRM seems tied up with the move towards application-level multicast. One could imagine implementing SRM fairly directly in an ALM system, and it would benefit from the elimination of the primitive star topology that makes life difficult: since each splitting node in an ALM system is also a receiver, we always have the chain-style situation with a unique receiver closest to any loss. On the other hand, one could imagine building a simpler, nearly reliable ALM system using TCP for the unicast channels and keeping hard state at the nodes (the hard state is the reason this is only almost reliable), and that might turn out to be easier and good enough.
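Returning to the layered-coding aside above, here is a small sanity check of that schedule. The chunk count and number of layers are arbitrary; the check just confirms that a receiver joining at any point and listening to layers 0..L for 1/(2^L) of the loop hears every chunk (the aggregate rate of layers 0..L is 2^L times the base rate, so the byte count comes out exactly right as well).

```python
# Layer 0 loops through the N chunks in order; layer l (l >= 1) carries
# 2**(l-1) copies of the loop, offset by the odd multiples of N / 2**l.
# N and MAX_LAYER are arbitrary values for the check.

N = 64            # chunks in the stream (divisible by 2**MAX_LAYER)
MAX_LAYER = 3

def offsets(layer):
    if layer == 0:
        return [0]
    step = N // 2 ** layer
    return [k * step for k in range(1, 2 ** layer, 2)]   # odd multiples

def chunks_heard(start, top_layer):
    heard = set()
    for t in range(start, start + N // 2 ** top_layer):  # listening window
        for layer in range(top_layer + 1):
            for off in offsets(layer):
                heard.add((t + off) % N)
    return heard

for top in range(MAX_LAYER + 1):
    assert all(chunks_heard(s, top) == set(range(N)) for s in range(N))
    print(f"layers 0..{top}: full stream heard in {N // 2 ** top} of {N} slots")
```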