An Analysis of Wide-Area Name Server Traffic
Danzig, Obraczka, and Kumar

The main point of this paper is to describe (or perhaps vent frustration at) the blowup in DNS traffic due to bugs in various server and resolver implementations. Because they find the blowup, at about a factor of twenty, so egregious, they end up spending less of their discussion on the absolute numbers and what they imply about the future of DNS.

In contrast to the previous paper, this one is based on actual observation of requests to four name servers at USC/ISI, including one root server (a benefit of doing the work at USC). Beyond that, it's not worth repeating most of the details of the data collection here; nothing about it leapt out as abnormal.

As with most studies of this sort, they began by enumerating all the odd behavior they saw. I always find it impressive how much effort goes into tracking down all the oddities and understanding them, let alone developing analyses to distinguish the various cases. The difference is that most studies do this so that the buggy and abnormal cases can be removed and the remaining behavior analyzed for trends or for models of "normal" functioning. Here, they focus nearly all of their effort on understanding the abnormalities, and in fact end up characterizing them by comparison to an unmeasured mental model of how the system should work.

The strength of the paper is the exhaustive categorization and listing of the bugs they found and their effects on traffic load. However, as I'll get into below, I think the broader sense it gives of the kinds of problems you see in large systems like this is more interesting than the bugs themselves.

The biggest weakness I saw was the section at the end describing error-detecting servers. That section describes some excellent ideas for instrumenting a server to collect additional data for a similar sort of study, but I'd be leery of letting such a server run free, either wired up to an auto-emailer or even without fairly close supervision to make sure it wasn't breaking things. (A sketch of the more conservative instrumentation I'd be comfortable with appears at the end of this review.)

In terms of the bugs and abnormalities themselves, in some ways the list is less interesting than the broader picture of how they crop up. As I see it, there are two broad categories of errors: those that arise when two different server implementations each turn out to be overly generous in what they send and stingy in what they accept (i.e., the reverse of the internet's robustness motto), and those that follow from the previous paper's observation that programmers will code only until the system seems to work in testing, ignoring both failure cases and performance implications. The first category would include the zero-answer bug and many of the recursion cases, while things like server failure detection, name error bugs, and the failure to implement centralized caching are typical of the second; a small sketch of what I mean by the second appears below.

Beyond that, this paper is a good example of the truism that, if you run a large enough measurement study, anything that can happen will happen, an awful lot of things you thought couldn't happen will happen too, and you will see all of it. The internet is a wide and wild place, and because it is built on protocols that usually tolerate both errors and a wide variety of interpretations of the specs, weird behavior goes unnoticed all the time. I think that this is probably the most relevant point today.
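To make the second category concrete, here is a rough Python sketch (entirely mine, not anything from the paper; send_query() is a hypothetical stand-in for a real UDP query, and all the constants are invented) contrasting the "retry until it seems to work" pattern with the backoff and negative caching the spec intends:

    import time
    import random

    NEGATIVE_CACHE = {}   # names an authority has already said do not exist

    def send_query(server, name):
        """Hypothetical query primitive: returns an answer, returns None for an
        authoritative 'no such name', or raises TimeoutError."""
        raise TimeoutError   # pretend the server is unreachable for this sketch

    def naive_lookup(servers, name, attempts=10):
        # The "code until it seems to work in testing" pattern: on any failure,
        # immediately retry the same servers in a tight loop.  Multiply this by
        # every resolver on a campus and you get the kind of blowup they measure.
        for _ in range(attempts):
            for server in servers:
                try:
                    return send_query(server, name)
                except TimeoutError:
                    continue   # no backoff, no memory of the failure
        return None

    def careful_lookup(servers, name, attempts=4):
        # Roughly what the spec intends: back off between rounds of retries and
        # negatively cache name errors so the same bad query is not re-sent forever.
        if name in NEGATIVE_CACHE:
            return None
        delay = 1.0
        for _ in range(attempts):
            for server in servers:
                try:
                    answer = send_query(server, name)
                    if answer is None:
                        NEGATIVE_CACHE[name] = time.time()
                    return answer
                except TimeoutError:
                    pass
            time.sleep(delay + random.random())   # exponential backoff with jitter
            delay *= 2
        return None

The point is only that the difference is a handful of lines; the paper's traffic numbers suggest those lines were routinely left out.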
I have to hope that most of the bugs they found have since been fixed, and most of the systems running the broken versions updated, but I expect that, if we were to repeat this study, we'd see about the same thing: new bugs, but things still basically working (something the paper really glosses over; from their tone you'd think the DNS was going to collapse within a year).
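As for the error-detecting server, the kind of instrumentation I'd actually be comfortable running is purely passive bookkeeping along these lines (again my own sketch with invented names and thresholds, not the paper's design); it flags suspicious query patterns for a human to look at rather than mailing or blocking anyone:

    import time
    from collections import defaultdict

    QUERY_WINDOW = 60.0      # seconds over which identical queries are counted
    REPEAT_THRESHOLD = 20    # repeats within the window that look like a retry loop

    recent = defaultdict(list)   # (client, name) -> timestamps of identical queries
    suspects = set()             # flagged for a human to review; nothing automated

    def observe_query(client, name, now=None):
        """Record one incoming query and flag clients that appear to be looping."""
        now = time.time() if now is None else now
        key = (client, name)
        recent[key] = [t for t in recent[key] if now - t < QUERY_WINDOW]
        recent[key].append(now)
        if len(recent[key]) >= REPEAT_THRESHOLD:
            suspects.add(key)    # just log it; don't email anyone, don't block anything

    # A resolver stuck in a tight retry loop trips the detector.
    for i in range(25):
        observe_query("192.0.2.1", "no-such-host.example.com", now=float(i))
    print(suspects)   # {('192.0.2.1', 'no-such-host.example.com')}

Anything more active than this seems like a good way for the monitoring itself to become one of the anomalies being measured.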