From: Cliff Schmidt (cliff_at_bea.com)
Date: Mon Mar 08 2004 - 01:32:20 PST
This paper presents an analysis of a 200-day trace of UW Kazaa P2P
traffic, describes a model to represent the observations from the trace,
and explores the potential impact of locality awareness in Kazaa.
Some of the key observations about the extensive trace of Kazaa traffic
were:
- Unlike "Web traffic", the users are willing to wait an hour in most
cases (typically audio files), and even many hours or days in other
cases (typically video files). Users treat this sort of P2P file
sharing more as a batch process, rather than an interactive session,
where users may get impatience after only a few seconds of delay.
- Older clients tend to use the system less than newer ones. This
is partly due to attrition (clients appearing to permanently leave the
network), but also due to older clients actually asking for less in
each interaction.
- Most transactions fail (about two-thirds), leading to many
transactions per user request.
- Clients have very short lifetimes, many leaving the system after
only one request. However, I expect that this will occur less
frequently as this technology becomes more commonly used.
- The Kazaa workload appears to be a blend of two classes of
objects: small objects under 10MB, roughly the upper bound for
audio files, and large objects over 100MB, roughly the lower bound
for full-length movies. The authors recognized this distinction
and analyzed the two object classes separately.
- Finally, the most significant finding relative to other studies
of Internet content traffic was that clients typically (94% of the
time) fetch an object at most once. This is far from the typical
behavior of Web clients, for which the authors measured the
corresponding figure at 57%. Related to this finding is that
popular objects are often short-lived and tend to have been
recently born. This is the key finding of the paper, and it drives
the design of the model and the caching proposals.
Before discussing their model, the authors explore the idea that
"fetch-at-most-once" request behavior does not produce the typical
Zipf distribution curve, instead giving a flatter head. One part
of this discussion that surprised me was the demonstration that a
previous paper (A. Dan et al., "Scheduling policies for an
on-demand video server with batching") incorrectly categorized
video popularity as Zipf-like because the curves look similar when
plotted on a linear scale. Gummadi et al. show that, when the same
curve is plotted on a log-log scale, it is in fact better
categorized as a flattened-head Zipf distribution, similar to what
they observed with Kazaa traffic.
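To convince myself of the flattened-head effect, I put together a
tiny simulation of my own (this is not the authors' model or data;
the object count, client count, requests per client, and Zipf
exponent below are all made-up numbers). Clients draw objects from
a Zipf popularity distribution, and in the fetch-at-most-once
variant a client redraws whenever it picks an object it already
holds. Comparing per-object request counts for the top ranks shows
how the head gets clipped:

    import random
    from itertools import accumulate
    from collections import Counter

    random.seed(0)

    NUM_OBJECTS = 2_000        # object population (made-up size)
    NUM_CLIENTS = 500          # simulated clients (made up)
    REQUESTS_PER_CLIENT = 50   # requests issued per client (made up)
    ZIPF_ALPHA = 1.0           # exponent of the underlying Zipf popularity

    # Objects are indexed in popularity order, so object i has Zipf rank i+1.
    weights = [1.0 / ((rank + 1) ** ZIPF_ALPHA) for rank in range(NUM_OBJECTS)]
    cum_weights = list(accumulate(weights))

    def simulate(fetch_at_most_once):
        """Count requests per object, optionally suppressing repeat fetches."""
        counts = Counter()
        for _ in range(NUM_CLIENTS):
            have = set()
            for _ in range(REQUESTS_PER_CLIENT):
                while True:
                    obj = random.choices(range(NUM_OBJECTS),
                                         cum_weights=cum_weights)[0]
                    if not fetch_at_most_once or obj not in have:
                        break   # under fetch-at-most-once, redraw held objects
                have.add(obj)
                counts[obj] += 1
        return counts

    zipf = simulate(fetch_at_most_once=False)
    famo = simulate(fetch_at_most_once=True)

    # Under fetch-at-most-once no object can be requested more than
    # NUM_CLIENTS times, so the top ranks get far fewer requests than
    # plain Zipf sampling predicts -- the flattened head.
    for rank in range(10):
        print(f"rank {rank + 1:2d}: zipf={zipf[rank]:5d}   "
              f"fetch-at-most-once={famo[rank]:5d}")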
I was impressed with how closely their model matched their Kazaa
observations. Having built an accurate model, the authors were then
able to adjust other factors to observe the effect. For instance,
they found that a shared proxy cache would be little help for such
a system if there were no new objects being introduced to the
system. The reason is that once clients have downloaded the most
popular items, their future downloads are spread widely across
unpopular objects and are poor candidates for caching. The authors
observe that "the decrease in hit-rate over time is a strong
property of fetch-at-most-once behavior." However, they then showed
that introducing a relatively small rate of new objects into the
system made caching effective, whereas introducing new clients
(which would benefit from a cache during their one-time fetch of
the most popular files) had a much more limited impact on cache hit
rate.
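To get a feel for why the hit rate decays, I also tried a
deliberately crude toy model of a shared proxy cache (mine, not the
authors' trace-driven simulation; every parameter below is made
up). It uses a fixed set of "popular" objects with Zipf-ish weights
plus an unpopular tail that I approximate as so enormous that no
two clients ever request the same tail object. Under
fetch-at-most-once, each client eventually exhausts the popular
set, its remaining requests fall into the never-repeating tail, and
the cache hit rate should fall off; with a steady trickle of newly
born popular objects the head is replenished and the hit rate
should hold up. (The paper's third case, adding new clients, is not
modeled here.)

    import random
    from itertools import accumulate

    random.seed(1)

    NUM_CLIENTS = 300     # made-up organization size
    HEAD_OBJECTS = 60     # initial "popular" objects, Zipf-weighted
    HEAD_PROB = 0.7       # fraction of requests aimed at popular objects
    REQUESTS = 60_000     # total requests, in arrival order
    WINDOW = 5_000        # window size for measuring hit rate over time

    def run(new_object_every):
        """Windowed hit rates of an infinite shared cache; optionally one
        new popular object is born every new_object_every requests."""
        weights = [1.0 / (r + 1) for r in range(HEAD_OBJECTS)]
        cum = list(accumulate(weights))
        have = [set() for _ in range(NUM_CLIENTS)]  # popular objects each client holds
        cache = set()                               # infinite shared proxy cache
        next_tail_id = 10**9                        # tail requests use always-fresh ids
        window_hits, rates = 0, []
        for t in range(REQUESTS):
            if new_object_every and t % new_object_every == 0:
                weights.append(weights[0])          # newborn object, maximally popular
                cum = list(accumulate(weights))
            c = random.randrange(NUM_CLIENTS)
            # A request targets the popular head with probability HEAD_PROB,
            # unless this client has already fetched every popular object.
            if random.random() < HEAD_PROB and len(have[c]) < len(weights):
                while True:                         # fetch-at-most-once: redraw held objects
                    obj = random.choices(range(len(weights)), cum_weights=cum)[0]
                    if obj not in have[c]:
                        break
                have[c].add(obj)
            else:
                obj = next_tail_id                  # tail object nobody else will request
                next_tail_id += 1
            window_hits += obj in cache
            cache.add(obj)
            if (t + 1) % WINDOW == 0:
                rates.append(round(window_hits / WINDOW, 2))
                window_hits = 0
        return rates

    print("no new popular objects:        ", run(0))
    print("one new popular object per 200:", run(200))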
The last section of the paper suggests a redirector service, which
could locate clients on the local network that had a copy of another
client's requested file. They propose this as an alternative to a
proxy cache, since there could be legal issues with an organization
caching copyrighted multimedia files. While I found this analysis
interesting (especially the part about how effective caching could
be even with very limited client availability, and how much better
it could be with marginal availability improvements to only the
most available clients), I didn't really understand why this was a
better legal answer than using a proxy cache. Wasn't Napster shut
down because it had a central server that pointed to illegal copies
of media? Wouldn't a redirector be a similar mechanism? So, while a
decentralized approach might work, the centralized redirector
didn't make much sense to me.
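That said, the mechanism itself is easy to picture. Here is a rough
sketch of the kind of index a redirector would need to maintain (my
own illustration; the class and method names are not from the
paper): a map from content hashes to the local peers currently
advertising a copy, consulted before falling back to a fetch from
the external Kazaa network.

    from collections import defaultdict
    from typing import Optional

    class Redirector:
        """Toy redirector: maps content hashes to local peers holding a copy."""

        def __init__(self):
            self._holders = defaultdict(set)   # content hash -> local peer addresses

        def register(self, content_hash: str, peer: str) -> None:
            """A local peer announces that it holds (or just fetched) an object."""
            self._holders[content_hash].add(peer)

        def unregister(self, peer: str) -> None:
            """Forget a peer that has gone offline, so stale holders are not returned."""
            for holders in self._holders.values():
                holders.discard(peer)

        def lookup(self, content_hash: str) -> Optional[str]:
            """Return some local peer holding the object, or None to fetch externally."""
            holders = self._holders.get(content_hash)
            return next(iter(holders)) if holders else None

    # Usage: the second request for a file is redirected to the first downloader.
    r = Redirector()
    r.register("hash:abc123", "10.0.0.5")
    print(r.lookup("hash:abc123"))   # -> 10.0.0.5 (serve from the local network)
    print(r.lookup("hash:zzz999"))   # -> None (fetch from the external network)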
One last thought: what does it really mean to talk about locality?
In this study, they note significant locality in the Kazaa
workload; while it is true that the clients that were studied did
tend to request the same popular files, there was no discussion of
how this group's requests differed from the requests made at
another university, by geographically local home users, or by
business-based clients. In other words, is "locality" the correct
word if you have only shown similarity of requests within one
sampled group, without having shown that these requests differ
from those of any other group of clients in the network?