From: Cem Paya (cemp_at_microsoft.com)
Date: Mon Mar 08 2004 - 16:08:04 PST
Paper review: P2P file-sharing workload
Cem Paya, CSE551P
The authors studied a network trace of Kazaa P2P file-sharing activity spanning
6 months on the UW campus. The analysis turns up 3 major surprises:
very different dynamics for content popularity, clear divergence from
the expected Zipf distribution, and untapped potential for locality in
the construction of P2P networks. To explain the observed
behavior the authors constructed a model for the P2P workload, choosing
parameters based on the trace when necessary; the model closely
approximated observed behavior in the trace.
Data collection was done on a very large scale: there is a huge amount of
traffic, 20TB total, representing all traffic between the UW internal network
and external sites for 200 days. Interestingly enough, a single 1.7GHz
machine was enough to log all this traffic to a second machine without
dropping any packets. (The paper does not state the total bandwidth UW has
to the outside world.) Hinted at but not discussed in the methodology are
privacy concerns: given the litigious environment surrounding P2P
file-sharing, any sizable collection of data about the downloading habits of
a college population is a prime target for discovery/subpoena. The only
comforting factor is one comment in passing to the effect that "we made
all sensitive information anonymous," but the paper is short on details.
This is important because past studies suffered from incorrect
assumptions around sanitizing, such as depending on one-way hashing when
the space of identifiers is too small to preclude brute-force attacks.
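To make the sanitizing pitfall concrete, here is a minimal sketch (with an invented campus /16 address range, not anything from the paper) of why hashing identifiers drawn from a small space is not real anonymization: the attacker simply hashes every candidate.

```python
import hashlib

def anonymize(ip: str) -> str:
    # "Anonymize" an IP the way a naive study might: one-way hash it.
    return hashlib.sha256(ip.encode()).hexdigest()

def deanonymize(digest: str):
    # Brute force: the campus range 10.0.0.0/16 has only 65,536
    # addresses, so enumerating all of them takes well under a second.
    for a in range(256):
        for b in range(256):
            candidate = f"10.0.{a}.{b}"
            if anonymize(candidate) == digest:
                return candidate
    return None

secret_ip = "10.0.37.142"       # the value the trace tried to hide
leaked = anonymize(secret_ip)   # what appears in the released data
print(deanonymize(leaked))      # prints "10.0.37.142"
```

The same attack works against hashed usernames, student IDs, or file names drawn from any enumerable dictionary; a keyed construction (e.g. HMAC with a secret key) is needed to prevent it.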
First among many surprises is that P2P users are a patient bunch, a
characteristic one does not usually ascribe to college students. The
majority of requests fail, and those that succeed can take inordinate
amounts of time. Even the web, once derided as the "world wide wait," is
blazing fast in comparison. As the authors point out, the web is an
interactive medium but P2P operates in batch mode. Other insights into
the psychology of the average P2P user come from download and
popularity trends. "New" nodes have the highest request rates, and
interest gradually drops over time as the novelty of P2P wears off.
Popular objects are also an ever-evolving group, as each new piece of
content introduced into the network enjoys its 15 days of fame before
being replaced by the next fashionable item du jour.
Perhaps the biggest surprise is the divergence from the Zipf model. First
observed in the statistics of natural language, namely that the
frequency of letters and words in English follows a probability
distribution that goes inversely as their rank, Zipf distributions crop
up everywhere in both natural and engineered systems. But not in the Kazaa
workload: popular objects are not as popular, and obscure ones not as
obscure, as one would expect a priori from the best-fitting Zipf
distribution. The authors offer a very plausible explanation based on two
arguments: updates in the network are driven by new content injection,
and each client downloads a given piece of content exactly once. Both are
rooted in the same basic rule governing P2P semantics: content is immutable,
unlike web scenarios, where the content associated with a given URL can be
updated. One of the ironies of the paper is its revisiting of an older
data set for video rentals dating to 1991 to look over the data a
second time. It shows the same pattern of divergence from the Zipf
distribution, starting off lower and dropping off slower than expected
when plotted on a log-log graph. (Bad news for the video-on-demand
papers that based their Zipf distribution assumption on this data set.)
For nervous system administrators watching their network utilization
perilously climb as users embrace P2P file-sharing undeterred by the
legal risks, there is hope in section #5. The basic result is that about 86%
of the external bandwidth could have been saved by satisfying requests
from nodes inside the UW network, without involving external peers.
Kazaa itself does not have any notion of network affinity or locality,
and the paper discusses some possible ways to get the same effect using
redirectors and proxies. The 86% figure is somewhat optimistic and assumes
an idealized reuse scenario where any content downloaded is available
for serving at all times after that point. The next couple of sections
introduce refinements where each peer is only available for the times around
which it was observed engaging in activity. Still unaddressed in this
calculation is one of the biggest challenges of scaling P2P networks:
free-riding. One study of Gnutella found that the majority of content is
made available by a very small percentage of nodes, and identified a
large chunk of "free-riders" that contribute no content. In fact, given
that disabling uploads is one of the ways to limit legal liability, one
would expect more and more nodes to follow this route. In short, the
relevant parameter will not be the availability of nodes (which, assuming
a desktop computer left powered on and plugged into a broadband connection,
is nearly 100%) but their willingness to share.
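The idealized-reuse calculation behind the 86% figure amounts to a simple trace replay, sketched here on an invented toy trace (not UW data): the first internal download of an object costs external bandwidth, and every later request for it is assumed to be servable from inside the network.

```python
def external_bytes_saved(requests):
    """requests: list of (object_id, size_bytes) in arrival order.

    Returns the fraction of total bytes that could have been served
    internally under ideal reuse: only the first fetch of each object
    must go to an external peer.
    """
    seen = set()
    total = external = 0
    for obj, size in requests:
        total += size
        if obj not in seen:
            external += size   # cold miss: fetched from outside
            seen.add(obj)
    return 1 - external / total

# Toy trace: object "a" is requested three times, "b" and "c" once each.
trace = [("a", 700), ("b", 300), ("a", 700), ("a", 700), ("c", 100)]
print(external_bytes_saved(trace))  # prints 0.56
```

The paper's refinements tighten the `seen` condition, counting a prior downloader as a usable source only during the periods it was actually observed online, which is what pulls the savings below the ideal 86%.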
Cem
This archive was generated by hypermail 2.1.6 : Mon Mar 08 2004 - 16:10:02 PST