Gummadi et al. "Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload"

From: Cliff Schmidt (cliff_at_bea.com)
Date: Mon Mar 08 2004 - 01:32:20 PST


    This paper presents an analysis of a 200-day trace of UW Kazaa P2P
    traffic, describes a model to represent the observations from the trace,
    and explores the potential impact of locality awareness in Kazaa.

    Some of the key observations about the extensive trace of Kazaa traffic
    were:
    - Unlike "Web traffic", the users are willing to wait an hour in most
    cases (typically audio files), and even many hours or days in other
    cases (typically video files). Users treat this sort of P2P file
    sharing more as a batch process, rather than an interactive session,
    where users may get impatience after only a few seconds of delay.

    - Older clients tend to use the system less than newer ones. This
    is partly due to attrition (clients appearing to permanently leave the
    network), but also due to older clients actually asking for less in
    each interaction.

    - Most transactions fail (about two-thirds), leading to many
    transactions per user request.

    - Clients have very short lifetimes, many leaving the system after
    only one request. However, I expect that this will occur less
    frequently as this technology becomes more commonly used.

    - The Kazaa workload appears to be a blend of two types of objects:
    small objects under 10MB, which is roughly the upper bound for
    audio files, and large objects over 100MB, which is roughly the
    lower bound for full-length movies. The authors recognized this
    distinction and analyzed these two object classes separately.

    - Finally, the most significant finding relative to other studies of
    Internet content traffic was that clients typically (94% of the time)
    fetch an object at most once. This is far from the typical behavior
    of Web clients, for which the authors measured the corresponding
    at-most-once figure at only 57%. Associated with this finding is
    that popular objects are often short-lived and tend to have been
    recently born.
    This is the key finding of this paper that drives the design of the
    model and caching proposals.

    Before discussing their model, the authors explore the idea that
    fetch-at-most-once request behavior does not exhibit the typical
    Zipf distribution curve, instead giving a flatter head. One part of
    this discussion that I found surprising was the demonstration that
    a previous paper (the A. Dan et al. paper, "Scheduling policies for
    an on-demand video server with batching") incorrectly categorized
    video popularity as Zipf-like as a result of observing relatively
    similar curves when plotted on a linear scale. Gummadi et al. show
    that, when this curve is plotted on a log-log scale, it is in fact
    better characterized as a flattened-head Zipf distribution, similar
    to what they observed with Kazaa traffic.
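
    To convince myself of this flattened-head effect, I sketched a tiny
    simulation (Python, with made-up parameters; this is my own toy, not
    the authors' model): clients draw requests from an underlying Zipf
    distribution but never re-fetch an object. The most popular objects
    saturate near one fetch per client, which is exactly the bend away
    from a straight line that shows up on a log-log rank/count plot.

        # Toy illustration only: all parameters are invented, not from the paper.
        import random
        from collections import Counter

        NUM_OBJECTS = 1000
        NUM_CLIENTS = 500
        DRAWS_PER_CLIENT = 100
        ZIPF_ALPHA = 1.0

        # Underlying interest: the object at rank i has weight 1 / (i+1)^alpha.
        weights = [1.0 / (i + 1) ** ZIPF_ALPHA for i in range(NUM_OBJECTS)]

        observed = Counter()
        for _ in range(NUM_CLIENTS):
            already_fetched = set()
            for _ in range(DRAWS_PER_CLIENT):
                obj = random.choices(range(NUM_OBJECTS), weights=weights)[0]
                if obj in already_fetched:
                    continue          # fetch-at-most-once: skip repeats
                already_fetched.add(obj)
                observed[obj] += 1

        # The top-ranked objects all end up near NUM_CLIENTS fetches, so the
        # head of the rank/count curve is flat rather than Zipf-straight.
        for rank, (obj, count) in enumerate(observed.most_common(10), 1):
            print(rank, count)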

    I was impressed with how closely their model matched their Kazaa
    observations. Having built an accurate model, the authors were then
    able to adjust other factors to observe the effect. For instance,
    they found that a shared proxy cache would be little help for such
    a system if there were no new objects being introduced to the
    system. The reason is that once clients have downloaded the most
    popular items, their future downloads are widely distributed across
    unpopular objects and are poor candidates for caching. The authors
    observe that "the decrease in hit-rate over time is a strong
    property of fetch-at-most-once behavior." However, they
    then showed that introducing a relatively small rate of new objects
    into the system made caching effective, but introducing new clients
    into the system (which would benefit from a cache during their one-
    time fetch of the most popular files) had a much more limited impact
    on cache hit rate.
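
    The arithmetic behind this is easy to convince yourself of with a
    back-of-the-envelope sketch (all numbers below are mine, invented
    for illustration; the paper's trace-driven simulation is the real
    evidence). Early on, each popular object misses once and then serves
    every other client from the cache; once clients move on to the tail,
    almost every request is for an object nobody else has fetched, so
    almost every request misses.

        # Invented numbers, for illustrating the argument only.
        clients = 1000
        popular = 100           # objects every client eventually fetches once
        tail_pool = 1_000_000   # unpopular objects, requested roughly uniformly
        tail_fetches = 50       # tail objects each client fetches afterwards

        # Head phase: each popular object misses once, then is a hit for
        # every remaining client.
        head_requests = clients * popular
        head_hit_rate = 1 - popular / head_requests
        print(f"head-phase hit rate ~ {head_hit_rate:.3f}")   # ~0.999

        # Tail phase: requests are spread so thinly that most objects are
        # being fetched for the first time, so most requests miss.
        tail_requests = clients * tail_fetches
        expected_distinct = tail_pool * (1 - (1 - 1 / tail_pool) ** tail_requests)
        tail_hit_rate = 1 - expected_distinct / tail_requests
        print(f"tail-phase hit rate ~ {tail_hit_rate:.3f}")   # a few percent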

    The last section of the paper suggests a redirector service, which
    could locate clients on the local network that had a copy of another
    client's requested file. They propose this as an alternative to a
    proxy cache, since there could be legal issues with an organization
    caching copyrighted multimedia files. While I found this analysis
    interesting (especially the part about how good caching could be
    with a very limited availability, and how much better it could be
    with marginal improvements of only the most available clients), I
    didn't really understand why this was a better legal answer than
    using a proxy cache. Wasn't Napster shut down because it had a
    central server that pointed to illegal copies of media? Wouldn't
    a redirector be a similar mechanism? So, while the decentralized
    proposal might work, the centralized redirector approach didn't
    make much sense to me.
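
    For what it's worth, here is roughly how I picture the redirector
    working (my own sketch, with invented names, not the paper's
    design): it only indexes which local peers hold which objects and
    never stores any content itself, which I assume is the legal
    distinction the authors are relying on, even though, as I said
    above, it still looks a lot like Napster's central index.

        # My own sketch of a locality-aware redirector; names are invented.
        from collections import defaultdict

        class Redirector:
            def __init__(self):
                # object id -> set of local peers known to hold a copy
                self.holders = defaultdict(set)

            def register(self, peer, obj):
                """A local peer announces that it holds a copy of obj."""
                self.holders[obj].add(peer)

            def unregister(self, peer, obj):
                """A peer leaves the network or deletes the object."""
                self.holders[obj].discard(peer)

            def redirect(self, requester, obj):
                """Return a local peer to fetch obj from, or None to go external."""
                candidates = self.holders[obj] - {requester}
                return next(iter(candidates), None)

        # A client asks the redirector first and falls back to the wide-area
        # Kazaa network only on a miss.
        r = Redirector()
        r.register("peer-a", "some-song.mp3")
        peer = r.redirect("peer-b", "some-song.mp3")
        print(peer or "fetch from the external network")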

    One last thought: what does it really mean to talk about locality?
    In this study, they note significant locality in the Kazaa workload;
    while it is true that the clients that were studied did tend to
    request the same popular files, there was no discussion of how this
    group's requests differed from those made at another university, by
    geographically local home users, or by business-based clients. In
    other words, is "locality" the correct word if you
    have only shown similarity of requests within one sampled group
    without having shown that these requests were different from the
    ones by any other group of clients in the network?

