Review: P2P file-sharing workload

From: Cem Paya (cemp_at_microsoft.com)
Date: Mon Mar 08 2004 - 16:08:04 PST

  • Next message: Sellakumaran Kanagarathnam: "Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload"

     

    Paper review: P2P file-sharing workload

    Cem Paya, CSE551P

     

    The authors studied a network trace of Kazaa P2P file-sharing activity
    spanning roughly six months on the UW campus. The analysis turns up
    three major surprises: very different dynamics for content popularity,
    a clear divergence from the expected Zipf distribution, and untapped
    potential for locality in the construction of P2P networks. To explain
    the observed behavior the authors constructed a model of the P2P
    workload, choosing parameters based on the trace where necessary, and
    the model closely approximated the behavior observed in the trace.

     

    Data collection was done on a very large scale: 20TB of traffic in
    total, representing all traffic between the UW internal network and
    external sites over 200 days. Interestingly enough, a single 1.7 GHz
    machine was enough to log all this traffic to a second machine without
    dropping any packets. (The paper does not state the total bandwidth UW
    has to the outside world.) Hinted at but not discussed in the
    methodology are privacy concerns: given the litigious environment
    surrounding P2P file-sharing, any sizable collection of data about the
    downloading habits of a college population is a prime target for
    discovery or subpoena. The only comforting factor is one comment in
    passing to the effect that "we made all sensitive information
    anonymous," but the paper is short on details. This matters because
    past studies suffered from incorrect assumptions around sanitizing,
    such as depending on one-way hashing when the space of identifiers is
    too small to preclude brute-force attacks.
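    The brute-force caveat is worth spelling out: hashing does not hide an
    identifier drawn from a small space, because an attacker can simply
    enumerate the whole space. A toy illustration (the 4-digit user ID and
    the choice of SHA-256 are hypothetical, not details from the paper):

    ```python
    import hashlib

    # Toy demonstration: hashing identifiers from a small space does not
    # anonymize them. Here the "sensitive" identifier is a hypothetical
    # 4-digit user number, so there are only 10,000 candidates to try.
    def anonymize(user_id: str) -> str:
        return hashlib.sha256(user_id.encode()).hexdigest()

    secret = anonymize("1729")  # what a "sanitized" trace would contain

    # The attacker hashes every candidate and compares.
    recovered = next(
        uid for uid in (str(i).zfill(4) for i in range(10000))
        if anonymize(uid) == secret
    )
    print(recovered)  # → 1729
    ```

    The same attack works against hashed IP addresses or usernames; any
    sanitization scheme has to account for the entropy of the identifier
    space, not just the strength of the hash.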

     

    First among many surprises is that P2P users are a patient bunch, a
    characteristic one does not usually ascribe to college students. The
    majority of requests fail, and those that succeed can take inordinate
    amounts of time. Even the web, once derided as the "world wide wait,"
    is blazing fast in comparison. As the authors point out, the web is an
    interactive medium while P2P operates in batch mode. Other insights
    into the psychology of the average P2P user come from download and
    popularity trends. "New" nodes have the highest request rates, and
    interest gradually drops over time as the novelty of P2P wears off.
    Popular objects are also an ever-evolving group, as each new piece of
    content introduced into the network enjoys its 15 days of fame before
    being replaced by the next fashionable item du jour.

     

    Perhaps the biggest surprise is the divergence from the Zipf model.
    First observed in the statistics of natural language, where the
    frequency of letters and words in English follows a probability
    distribution that goes inversely as their rank, Zipf distributions
    crop up everywhere in both natural and engineered systems. But not in
    the Kazaa workload: popular objects are not as popular, and obscure
    ones not as obscure, as one would expect a priori from a best-fit Zipf
    distribution. The authors offer a very plausible explanation based on
    two arguments: updates in the network are driven by new content
    injection, and each client downloads a given piece of content exactly
    once. Both are rooted in the same basic rule governing P2P semantics:
    content is immutable, unlike web scenarios, where the content
    associated with a given URL can be updated. One of the ironies of the
    paper is that it revisits an older data set of video rentals dating to
    1991 and takes a second look at the data. It shows the same pattern of
    divergence from a Zipf distribution, starting off lower and dropping
    off more slowly than expected when plotted on a log-log graph. (Bad
    news for the video-on-demand papers which based their Zipf
    distribution assumption on this data set.)
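    The fetch-at-most-once argument is easy to see in a toy simulation.
    The sketch below is not the paper's model; all parameters are made up.
    It draws requests from a pure Zipf distribution but lets each client
    download an object at most once, which caps the most popular object at
    one download per client, well below what pure Zipf would predict, and
    so flattens the head of the popularity curve:

    ```python
    import random
    from collections import Counter

    random.seed(0)

    N_OBJECTS = 1000       # catalog size (made-up parameter)
    N_CLIENTS = 500        # number of peers (made-up parameter)
    REQS_PER_CLIENT = 50   # requests each peer would like to issue
    ALPHA = 1.0            # Zipf exponent

    # Underlying Zipf popularity: P(object of rank r) proportional to 1/r^ALPHA.
    weights = [1.0 / r ** ALPHA for r in range(1, N_OBJECTS + 1)]

    downloads = Counter()
    for _ in range(N_CLIENTS):
        wants = random.choices(range(N_OBJECTS), weights=weights,
                               k=REQS_PER_CLIENT)
        # Fetch-at-most-once: content is immutable, so a peer never
        # re-downloads an object it already has; duplicates are dropped.
        for obj in set(wants):
            downloads[obj] += 1

    # Under pure Zipf the top-ranked object would receive a share of all
    # requests equal to its weight; fetch-at-most-once caps it at one
    # download per client.
    rank1 = downloads.most_common(1)[0][1]
    pure_zipf = weights[0] / sum(weights) * N_CLIENTS * REQS_PER_CLIENT
    print(rank1, "downloads, capped at", N_CLIENTS,
          "vs pure-Zipf estimate of", round(pure_zipf))
    ```

    With these (arbitrary) parameters the pure-Zipf estimate for the top
    object is several times larger than the number of clients, so the cap
    binds and the observed curve diverges from Zipf exactly at the head.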

     

    For nervous system administrators watching their network utilization
    perilously climb as users embrace P2P file-sharing undeterred by the
    legal risks, there is hope in section 5. The basic result is that
    about 86% of the external bandwidth could have been saved by
    satisfying requests from nodes inside the UW network, without
    involving external peers. Kazaa itself does not have any notion of
    network affinity or locality, and the paper discusses some possible
    ways to get the same effect using redirectors and proxies. The 86%
    figure is somewhat optimistic and assumes an idealized reuse scenario
    where any content downloaded is available for serving at all times
    after that point. The next couple of sections introduce refinements
    where each peer is only available around the times it was observed
    engaging in activity. Still unaddressed in this calculation is one of
    the biggest challenges of scaling P2P networks: free-riding. One study
    of Gnutella found that the majority of content is made available by a
    very small percentage of nodes, and identified a large chunk of
    "free-riders" that contribute no content. In fact, given that
    disabling uploads is one way to limit legal liability, one would
    expect more and more nodes to follow this route. In short, the
    relevant parameter will not be the availability of nodes (which,
    assuming a desktop computer left powered on and plugged into a
    broadband connection, is nearly 100%) but their willingness to share.
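    The idealized reuse scenario amounts to an infinite, always-on cache
    over the whole internal network: the first request for any object goes
    outside, and every later request for it is a hit. A minimal sketch of
    that byte-level accounting, on a made-up toy trace (the paper's actual
    trace-driven analysis is far more detailed):

    ```python
    # Idealized byte-hit-rate calculation: the first request for an object
    # must be fetched externally; every subsequent request is served from
    # inside the network. Trace entries are (object_id, size_in_bytes);
    # the values here are invented for illustration.
    trace = [("a", 700), ("b", 50), ("a", 700),
             ("c", 300), ("a", 700), ("b", 50)]

    seen = set()
    external = internal = 0
    for obj, size in trace:
        if obj in seen:
            internal += size   # hit: some internal peer already has it
        else:
            seen.add(obj)
            external += size   # miss: must fetch from an external peer

    saved = internal / (internal + external)
    print(f"fraction of external bandwidth saved: {saved:.0%}")
    ```

    The refinements in the paper then shrink `seen` to reflect when peers
    are actually online, and a free-riding model would further shrink it
    to the peers willing to serve.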

     

    Cem

     



    This archive was generated by hypermail 2.1.6 : Mon Mar 08 2004 - 16:10:02 PST