A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing
Floyd, Jacobson, Liu, McCanne, and Zhang

I will start by saying that this paper (which I hadn't read before, only some of the other papers on SRM) is much more upfront about the number one thing that has always annoyed me about SRM: that it is a toolkit and a (family of) algorithms, and not a component or layer on its own. Yes, there are often good reasons to tie the application tightly into a network protocol like this (this is the ALF argument, which they cite, though it still seems a bit of a stretch here), but there are also lots of good reasons not to, and I think most programmers would rather trade performance for a simple, easy-to-describe interface and a well-encapsulated implementation. Now, if someone could show more rigorously that there *isn't* a reasonable way to encapsulate reliability (or, for that matter, ordering or congestion control) in a multicast world, that would be something else. Can we come up with a defensible understanding of what makes multicast different from unicast in this regard? The unicast equivalent, TCP, doesn't seem to have been nearly so hard to figure out or get pretty much right, particularly if you aren't worried about congestion control.

I would point out here, and largely ignore the issue from here on, that SRM only looks at one of the TCP three. It totally ignores ordering (saying only that, by using ALF, applications will be able to talk about out-of-order data using application names) and congestion control (wb just fixes a per-sender bandwidth, and gets away with it because the drawing traffic is small potatoes). Given that this is already a framework, I don't see any problem with that. The only reasons the three are tied together in TCP are to make it easier to specify which combination you want (and you can't argue that "any color you want, as long as it's black" isn't *easy*), and because all three use information gleaned from the same stream of acks, and a layering model doesn't want to pass that dense information around between layers.

As a final rant on the subject: since SRM is really a toolkit as much as a protocol, it would be better if there were a fuller description of how the toolkit interface works. Otherwise it's hard to evaluate the work from the programmer's point of view: how different is building an SRM application from building a TCP application?

Stepping down to the protocol level, the issue that annoys me most is the arguable reliance on every receiver knowing its RTT not only to the sender, but to every other receiver as well (modulo the hierarchical session messages sketched at the end). The notion of exchanging messages (albeit at a lower rate than the data itself) between all pairs seems in opposition to many of the goals of IP multicast, both in terms of scalability and anonymity of receivers. I think it's significant that wb, the only implementation they describe, does not implement this, and simply drops a constant of 30 seconds into the places in the formula where the RTT would go. They don't analyze the case where the RTT is stubbed out this way, and I have to imagine that it degrades the theoretical performance substantially.

Whew. Okay, I suppose I ought to stop ranting and go over the basic technique. As an aside, I will likely use "bottleneck" interchangeably with "link (or output port) where the loss happened," because that's the way I think about it.
The idea is to shift the burden of retransmission from the original sender of the data to the receivers close to where the loss occurred. Note that this is not the same as either active networks or link-level retransmission, because when we say "receivers near the bottleneck" we do *not* mean "routers near the bottleneck." If there isn't a receiver immediately after the bottleneck, so be it; there is always a "nearest" receiver even if it isn't very near. The potential upside of this is actually quite significant: if the loss happens far from the sender and close to a number of receivers, it is likely that some receiver who did get the missing data is substantially closer to those who missed it than the original sender is. In the best case, then, we might be able to fill in the hole in less than a full round-trip time. In practice they never actually manage this, and it's easy to imagine that the slop built into the timeouts will always exceed typical RTTs.

But the real advantage of querying nearby receivers to find a copy of the data is that we don't need to involve the original sender in recovery at all. This matters both to reduce state and computation at the sender, and because it's easy to construct situations where even a small loss rate at each receiver adds up to nearly every packet being lost somewhere in the tree. If all recovery went through the sender, most of its work would end up going to retransmissions once network loss rates rose much above zero. This isn't even a congestion-collapse situation, because we aren't talking about resending unnecessarily: somewhere the data really was lost.

As another aside (I think I saw this once in the context of SRM, but it wasn't in the paper), the process of having the receivers who lost a packet find the nearest receiver who didn't gives us a pretty good idea of where in the tree the losses are happening. One could imagine applying this knowledge to FEC or to routing, although it makes a lot more sense in an application-level multicast scheme than it does for IP multicast.

Everything about the basic protocol centers on random backoff timers (a la Ethernet) and broadcast of NACKs and retransmissions to the entire group. When a receiver discovers it is missing a packet (by seeing later sequence numbers), it chooses a delay, waits that long, and, if it hasn't seen either someone else's NACK or the retransmission in the interim, broadcasts a NACK for the missing data item. When a receiver receives a NACK for an item it has, it chooses a similar delay and then sends the retransmission if nobody else beats it to the punch. The timeouts are all based on min and max constants, multiplied by the RTT between the two nodes in question (the receiver and the sender for a NACK, and the requester and the responder for a retransmission); sketches of the timer logic, of the session-message exchange the distance estimates come from, and of the adaptive tuning of the constants appear below.

The analysis in the paper consists of showing, in various cases, the deterministically known or expected performance on three standard topologies: chains, stars, and trees. They try to map out the space of the time-to-recover vs. duplicate-retransmission tradeoff as the parameters are varied. They then show an algorithm for tuning the four constants on the fly, using what seems like a fairly typical add/subtract algorithm with hard bounds on the ranges of the constants, and show that it converges fairly quickly to a much smaller number of duplicate retransmissions without significantly impacting time-to-repair.
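To make the timer scheme concrete, here is a minimal sketch of the request and repair delays in the interval form the paper describes (uniform on a range scaled by the estimated distance). The constant values are arbitrary placeholders, not recommended settings, and the distance arguments stand in for whatever estimates the session messages provide.

```python
import random

# The four tunable constants from the paper; the values here are arbitrary
# placeholders, not the paper's recommended or adaptively tuned settings.
C1, C2 = 2.0, 2.0
D1, D2 = 1.0, 1.0

def request_timer(dist_to_source):
    """Delay before multicasting a NACK (request) for missing data.

    dist_to_source is this receiver's estimated delay to the original
    sender.  Drawn uniformly from [C1*d, (C1+C2)*d], so receivers closer
    to the source tend to fire first and suppress everyone else.
    """
    return random.uniform(C1 * dist_to_source, (C1 + C2) * dist_to_source)

def repair_timer(dist_to_requester):
    """Delay before multicasting a repair in answer to someone's request.

    dist_to_requester is the estimated delay to the host that sent the
    NACK; nearby holders of the data tend to answer first.
    """
    return random.uniform(D1 * dist_to_requester, (D1 + D2) * dist_to_requester)

# Suppression: if, while one of these timers is pending, we hear another
# host's request for the same item (or the repair itself), we back off or
# cancel our own timer instead of adding to the implosion.
```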
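And since I complained above about every receiver needing a distance estimate to every other receiver, here is roughly how those estimates fall out of the session messages: each host echoes back the timestamp it last saw from a peer along with how long it held it, so no clock synchronization is needed. A sketch with made-up numbers:

```python
# Pairwise distance estimation from session messages, in the spirit of the
# paper's mechanism.  Only differences measured on our own clock matter.

def distance_estimate(t_sent, t_echo_arrived, hold_time):
    """Estimate the one-way delay to a peer.

    t_sent         -- our clock when we multicast our session message
    t_echo_arrived -- our clock when the peer's session message, which
                      echoes our timestamp, arrives back at us
    hold_time      -- how long the peer reports holding our timestamp
                      before sending its own session message
    """
    rtt = (t_echo_arrived - t_sent) - hold_time
    return max(rtt, 0.0) / 2.0

# Example: sent at t=10.0s, echo arrived at t=70.0s, peer held it for
# 59.8s -> RTT of about 0.2s, so a distance estimate of about 0.1s.
print(distance_estimate(10.0, 70.0, 59.8))
```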
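Finally, the adaptive tuning. The sketch below is not the paper's exact update rule (which conditions on observed duplicates and recovery delay more carefully); it just shows the add/subtract-with-hard-bounds shape described above, for a single constant.

```python
# Illustrative additive adjust-with-bounds loop for one timer constant.
# Bounds and step size are placeholder values, not the paper's.

LOW, HIGH = 0.5, 10.0   # hard bounds on the constant
STEP = 0.1              # additive adjustment per recovery round

def adjust(c, duplicates_seen, recovery_was_slow):
    if duplicates_seen:
        c = min(c + STEP, HIGH)   # spread the timers out more
    elif recovery_was_slow:
        c = max(c - STEP, LOW)    # tighten the timers to answer sooner
    return c

print(adjust(2.0, duplicates_seen=True, recovery_was_slow=False))  # 2.1
```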
A couple of nits on methodology and data presentation. Note that the tree topology isn't really a cross between the star and the chain, because all of the internal nodes are receivers. There is therefore always a unique node closest to the bottleneck, and so deterministic timeouts are always the right choice in theory. Only if the loss occurred immediately above some internal node that was not a receiver (only a forwarding router not participating in recovery) could all of its children be the same distance from the loss. One thing that is inadequately explained is why, in figure 4, the number of repairs is so much larger than the number of requests. The claim is that this is a result of the distance from the bottleneck to the detecting receiver, but I don't see why this makes things asymmetric.

As yet another semi-related notion: they mention layered multicast towards the end of the paper. This is usually proposed for video, where each layer adds quality while keeping all receivers synchronized, but they mention offhand using the extra layers to speed up some receivers (which rather assumes that people are trying to download some chunk of data and then leave). There is actually a very simple, pretty much application-independent layered coding that occurs to me: layer 0 sends all the data in order (and then loops, as do all the layers); layer 1 sends the same data offset by half the length of the stream; layer 2 sends at twice the base rate, carrying both a 1/4 and a 3/4 offset; layer 3 sends at four times the base rate, with 1/8, 3/8, 5/8, and 7/8 offsets. I believe that any receiver can pick up any prefix of the layers at any time, listen for 1/(2^l) of the stream length (where l is the top layer received), and get all the data (a small sanity check of this schedule appears at the end of these notes).

Given that IP multicast looks to be headed for the network bit bucket, the relevance of SRM seems tied up with the move towards application-level multicast. One could imagine implementing SRM fairly directly in an ALM system, and it would benefit from the elimination of the primitive star topology that makes life difficult: since each splitting node in an ALM system is also a receiver, we always have the chain-style situation with a unique receiver closest to any loss. On the other hand, one could imagine building a simpler, nearly reliable ALM system using TCP for the unicast channels and keeping hard state at the nodes (the hard state is the reason this is only almost reliable), and that might turn out to be easier and good enough.
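Returning to the layered-coding aside above, here is a small sanity check of that schedule. The chunk count and number of layers are arbitrary; the check just confirms that a receiver joining at any point and listening to layers 0..L for 1/(2^L) of the loop hears every chunk (the aggregate rate of layers 0..L is 2^L times the base rate, so the byte count comes out exactly right as well).

```python
# Layer 0 loops through the N chunks in order; layer l (l >= 1) carries
# 2**(l-1) copies of the loop, offset by the odd multiples of N / 2**l.
# N and MAX_LAYER are arbitrary values for the check.

N = 64            # chunks in the stream (divisible by 2**MAX_LAYER)
MAX_LAYER = 3

def offsets(layer):
    if layer == 0:
        return [0]
    step = N // 2 ** layer
    return [k * step for k in range(1, 2 ** layer, 2)]   # odd multiples

def chunks_heard(start, top_layer):
    heard = set()
    for t in range(start, start + N // 2 ** top_layer):  # listening window
        for layer in range(top_layer + 1):
            for off in offsets(layer):
                heard.add((t + off) % N)
    return heard

for top in range(MAX_LAYER + 1):
    assert all(chunks_heard(s, top) == set(range(N)) for s in range(N))
    print(f"layers 0..{top}: full stream heard in {N // 2 ** top} of {N} slots")
```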