An Analysis of Wide-Area Name Server Traffic
Danzig, Obraczka, and Kumar

The main point of this paper is to describe (or perhaps vent frustration at) the blowup in DNS traffic due to bugs in various server and resolver implementations. Because they find the blowup, at about a factor of twenty, so egregious, they end up spending less of their discussion on the absolute numbers and what they imply about the future of DNS.

In contrast to the previous paper, this one is based on actual observation of requests to four name servers at USC/ISI, including one root server (a benefit of doing the work at USC). Beyond that, it's not worth repeating most of the details of the data collection here; nothing about it leapt out as abnormal.

As with most studies of this sort, they began by enumerating all the odd behavior they saw. I always find it impressive how much effort goes into tracking down all the oddities and understanding them, let alone developing analyses to distinguish the various cases. The difference is that most studies do this so that the buggy and abnormal cases can be removed and the remaining behavior analyzed for trends or for models of "normal" functioning. Here, they focus nearly all of their effort on understanding the abnormalities, and in fact end up characterizing them by comparison to an unmeasured mental model of how the system should work.

The strength of the paper is the exhaustive categorization and listing of the bugs they found and their effects on traffic load. However, as I'll get into below, I think the broader sense it gives of the kinds of problems you see in large systems like this is more interesting than the bugs themselves.

The biggest weakness I saw was the section at the end describing error-detecting servers. That section describes some excellent ideas for instrumenting a server to collect additional data for a similar sort of study, but I'd be leery of letting such a server run free, either wired up to an auto-emailer or even without fairly close supervision to make sure it wasn't breaking things. (A sketch of the more conservative instrumentation I'd be comfortable with appears at the end of this review.)

In terms of the bugs and abnormalities themselves, in some ways the list is less interesting than the broader picture of how they crop up. As I see it, there are two broad categories of errors: those that arise when two different server implementations each turn out to be overly generous in what they send and stingy in what they accept (i.e., the reverse of the internet's robustness motto), and those that follow from the previous paper's observation that programmers will code only until the system seems to work in testing, ignoring both failure cases and performance implications. The first category would include the zero-answer bug and many of the recursion cases, while things like server failure detection, name error bugs, and the failure to implement centralized caching are typical of the second; a small sketch of what I mean by the second appears below.

Beyond that, this paper is a good example of the truism that, if you run a large enough measurement study, anything that can happen will happen, an awful lot of things you thought couldn't happen will happen too, and you will see all of it. The internet is a wide and wild place, and because it is built on protocols that usually tolerate both errors and a wide variety of interpretations of the specs, weird behavior goes unnoticed all the time. I think that this is probably the most relevant point today.
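To make the second category concrete, here is a rough Python sketch (entirely mine, not anything from the paper; send_query() is a hypothetical stand-in for a real UDP query, and all the constants are invented) contrasting the "retry until it seems to work" pattern with the backoff and negative caching the spec intends:

    import time
    import random

    NEGATIVE_CACHE = {}   # names an authority has already said do not exist

    def send_query(server, name):
        """Hypothetical query primitive: returns an answer, returns None for an
        authoritative 'no such name', or raises TimeoutError."""
        raise TimeoutError   # pretend the server is unreachable for this sketch

    def naive_lookup(servers, name, attempts=10):
        # The "code until it seems to work in testing" pattern: on any failure,
        # immediately retry the same servers in a tight loop.  Multiply this by
        # every resolver on a campus and you get the kind of blowup they measure.
        for _ in range(attempts):
            for server in servers:
                try:
                    return send_query(server, name)
                except TimeoutError:
                    continue   # no backoff, no memory of the failure
        return None

    def careful_lookup(servers, name, attempts=4):
        # Roughly what the spec intends: back off between rounds of retries and
        # negatively cache name errors so the same bad query is not re-sent forever.
        if name in NEGATIVE_CACHE:
            return None
        delay = 1.0
        for _ in range(attempts):
            for server in servers:
                try:
                    answer = send_query(server, name)
                    if answer is None:
                        NEGATIVE_CACHE[name] = time.time()
                    return answer
                except TimeoutError:
                    pass
            time.sleep(delay + random.random())   # exponential backoff with jitter
            delay *= 2
        return None

The point is only that the difference is a handful of lines; the paper's traffic numbers suggest those lines were routinely left out.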
I have to hope that most of the bugs they found have since been fixed, and most of the systems running the broken versions updated, but I expect that, if we were to repeat this study, we'd see about the same thing: new bugs, but things still basically working (something the paper really glosses over; from their tone you'd think the DNS was going to collapse within a year).
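As for the error-detecting server, the kind of instrumentation I'd actually be comfortable running is purely passive bookkeeping along these lines (again my own sketch with invented names and thresholds, not the paper's design); it flags suspicious query patterns for a human to look at rather than mailing or blocking anyone:

    import time
    from collections import defaultdict

    QUERY_WINDOW = 60.0      # seconds over which identical queries are counted
    REPEAT_THRESHOLD = 20    # repeats within the window that look like a retry loop

    recent = defaultdict(list)   # (client, name) -> timestamps of identical queries
    suspects = set()             # flagged for a human to review; nothing automated

    def observe_query(client, name, now=None):
        """Record one incoming query and flag clients that appear to be looping."""
        now = time.time() if now is None else now
        key = (client, name)
        recent[key] = [t for t in recent[key] if now - t < QUERY_WINDOW]
        recent[key].append(now)
        if len(recent[key]) >= REPEAT_THRESHOLD:
            suspects.add(key)    # just log it; don't email anyone, don't block anything

    # A resolver stuck in a tight retry loop trips the detector.
    for i in range(25):
        observe_query("192.0.2.1", "no-such-host.example.com", now=float(i))
    print(suspects)   # {('192.0.2.1', 'no-such-host.example.com')}

Anything more active than this seems like a good way for the monitoring itself to become one of the anomalies being measured.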