From: shearerje_at_comcast.net
Date: Thu Mar 04 2004 - 23:21:23 PST
This extremely data-rich paper convincingly argues that real-world peer-to-peer access to large files exhibits several traits: (1) Most of the data being accessed is immutable (e.g., music and movies). (2) Because the data is immutable, it is typically fetched at most once per client. (3) Because the data is fetched at most once per client, its access does not follow a Zipf distribution. (4) Because the data is fetched in batch mode, clients can afford to be incredibly patient, on the order of days. (5) New clients request more files per unit time than older clients; new clients are initializing their libraries. (6) Old clients access new content, not old content; they already have the old content and are keeping their libraries up to date. (7) The size of requested objects varies by several orders of magnitude, and the distribution is skewed toward small objects, so the majority of requests are for small objects but the majority of data transferred is in the large objects. (8) Popular small objects lose their popularity faster than popular large objects. (9) A whopping 66.2% of transactions fail, usually returning HTTP 500 "Unexpected Condition". I may have missed a few; the paper covers a lot of ground.
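To see why (2) leads to (3), here is a quick toy simulation I put together (my own sketch with made-up parameters, not the authors' simulator). It draws requests from a Zipf popularity curve twice, once letting clients re-fetch freely and once with fetch-at-most-once clients, and shows that the head of the second distribution is capped near the number of clients.

# Toy sketch (not the paper's code): fetch-at-most-once clients flatten
# the head of an otherwise Zipf-distributed request stream.
import random
from itertools import accumulate
from collections import Counter

random.seed(0)
NUM_OBJECTS = 1000           # assumed catalogue size
NUM_CLIENTS = 200            # assumed client population
REQUESTS_PER_CLIENT = 50     # assumed per-client demand

objects = list(range(NUM_OBJECTS))
weights = [1.0 / (r + 1) for r in range(NUM_OBJECTS)]   # Zipf with alpha = 1
cum_weights = list(accumulate(weights))

def zipf_pick():
    return random.choices(objects, cum_weights=cum_weights, k=1)[0]

# Baseline: clients re-fetch freely, so the request stream stays Zipf-shaped.
repeat_counts = Counter(zipf_pick() for _ in range(NUM_CLIENTS * REQUESTS_PER_CLIENT))

# Fetch-at-most-once: each client skips anything already in its library.
famo_counts = Counter()
for _ in range(NUM_CLIENTS):
    library = set()
    while len(library) < REQUESTS_PER_CLIENT:
        obj = zipf_pick()
        if obj not in library:           # immutable object, never fetched twice
            library.add(obj)
            famo_counts[obj] += 1

print("top-5 counts, repeated fetch:     ", [c for _, c in repeat_counts.most_common(5)])
print("top-5 counts, fetch-at-most-once: ", [c for _, c in famo_counts.most_common(5)])
# The second list is capped near NUM_CLIENTS, so the head is much flatter than Zipf.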
There were some places where I took issue with the analyses, however, as I will describe in the following paragraphs.
Figure 6 is flawed. There is a fixed number of clients, each constrained to make 1000 UNIQUE requests. Thus, it is not surprising that (1) the 100 or so most popular objects are hit exactly once by each of the clients (market saturation), causing the flat top of the profile, and (2) each client is constrained to make 900 more requests, artificially pushing the sloped part of the curve to the right. So the fact that this curve has a flat top and bulges to the right is not very interesting. In real life both the client arrival rate and the client departure rate are non-zero.
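To illustrate the saturation effect I am complaining about, here is a rough sketch of the setup as I read it (my guess, with invented parameters, not the authors' actual simulator). It fixes the client population, forces each client to make exactly 1000 unique requests, and reports how many of the most popular objects end up requested by every single client.

# Rough sketch of how I read the Figure 6 setup (invented parameters):
# a fixed population in which every client makes exactly 1000 UNIQUE requests.
import random
from itertools import accumulate
from collections import Counter

random.seed(1)
NUM_OBJECTS = 2000           # assumed catalogue size
NUM_CLIENTS = 100            # assumed fixed population (small, for speed)
UNIQUE_REQUESTS = 1000       # the per-client constraint I object to

objects = list(range(NUM_OBJECTS))
weights = [1.0 / (r + 1) for r in range(NUM_OBJECTS)]   # Zipf with alpha = 1
cum_weights = list(accumulate(weights))

counts = Counter()
for _ in range(NUM_CLIENTS):
    library = set()
    while len(library) < UNIQUE_REQUESTS:
        obj = random.choices(objects, cum_weights=cum_weights, k=1)[0]
        if obj not in library:           # each request must be unique
            library.add(obj)
            counts[obj] += 1

saturated = sum(1 for obj in range(100) if counts[obj] == NUM_CLIENTS)
print(f"{saturated} of the 100 most popular objects were fetched by every client")
# Those saturated objects are what produce the flat top, and the hundreds of
# remaining per-client requests push the rest of the mass to the right.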
Section 3.2.1 claims “Although proxy cache miss distributions may visually resemble our measured Kazaa popularity distributions, we believe the two workloads diverge from Zipf for different reasons.” No, clearly the users ARE proxy caches. They download a song once, play it as requested (by them or someone else) over and over, and exhibit cache-miss behavior when they want a song they don’t already have in their collection. This is exactly the behavior of proxy caches getting web pages.
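A toy model of what I mean (my own, with made-up numbers): treat one user as a cache, let their plays follow a Zipf preference, serve repeat plays from the local library, and count only the misses as network traffic.

# Toy model of the "users are proxy caches" point (made-up parameters):
# plays are Zipf-distributed, but only cache misses ever reach the network,
# and each song can miss at most once per user.
import random
from itertools import accumulate
from collections import Counter

random.seed(2)
NUM_SONGS = 1000             # assumed catalogue size
PLAYS = 5000                 # assumed number of listens by one user

songs = list(range(NUM_SONGS))
weights = [1.0 / (r + 1) for r in range(NUM_SONGS)]     # Zipf with alpha = 1
cum_weights = list(accumulate(weights))

library = set()              # the user's local "cache"
plays = Counter()            # what the user listens to (Zipf-shaped)
fetches = Counter()          # what the network sees (misses only)

for _ in range(PLAYS):
    song = random.choices(songs, cum_weights=cum_weights, k=1)[0]
    plays[song] += 1
    if song not in library:  # cache miss: fetch once, keep forever
        library.add(song)
        fetches[song] += 1

print("top-5 play counts:  ", [c for _, c in plays.most_common(5)])
print("top-5 fetch counts: ", [c for _, c in fetches.most_common(5)])
# The plays still look Zipf; the fetches are capped at one per song, which is
# exactly the miss stream of a proxy cache in front of a Zipf workload.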
Section 4.4 is also not convincing because, while it looks at the impact of new clients, it doesn't account for old clients leaving the system. Thus, the total population of clients increases and ages (kind of like Social Security). It would be interesting to see a steady-state simulation where clients retire at the same rate they join.
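Here is roughly the steady-state experiment I have in mind, sketched with invented parameters: one client retires and one joins every round, so the age mix of the population, and hence its demand, should settle rather than drift.

# Sketch of the steady-state experiment (all parameters invented): clients
# retire at the same rate they join, so the population neither grows nor ages.
import random
from itertools import accumulate
from collections import deque

random.seed(3)
NUM_OBJECTS = 2000           # assumed catalogue size
POPULATION = 100             # fixed population size
ROUNDS = 500                 # simulated time steps
NEW_PER_ROUND = 20           # assumed request attempts while filling a library
OLD_PER_ROUND = 2            # assumed request attempts once "caught up"
MATURE_AT = 200              # library size at which a client slows down

objects = list(range(NUM_OBJECTS))
weights = [1.0 / (r + 1) for r in range(NUM_OBJECTS)]   # Zipf with alpha = 1
cum_weights = list(accumulate(weights))

clients = deque(set() for _ in range(POPULATION))       # each client = its library
fetches_per_round = []

for _ in range(ROUNDS):
    clients.popleft()            # the oldest client retires...
    clients.append(set())        # ...and a brand-new client joins (equal rates)
    fetches = 0
    for library in clients:
        attempts = NEW_PER_ROUND if len(library) < MATURE_AT else OLD_PER_ROUND
        for _ in range(attempts):
            obj = random.choices(objects, cum_weights=cum_weights, k=1)[0]
            if obj not in library:           # fetch-at-most-once
                library.add(obj)
                fetches += 1
    fetches_per_round.append(fetches)

early = sum(fetches_per_round[100:200]) / 100
late = sum(fetches_per_round[400:500]) / 100
print(f"mean fetches per round, rounds 100-199: {early:.1f}")
print(f"mean fetches per round, rounds 400-499: {late:.1f}")
# With balanced churn the demand levels off; a population that only grows and
# ages, as in Section 4.4, would keep drifting instead.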
Section 5.1 claims that file-sharing caches are problematic for legal and political reasons, and then Section 5.3 proposes data replication. You can't have it both ways. If the University gateway actively replicates the incoming data (as opposed to pointing to someone else who has already replicated the data), then the University is responsible for piracy whether the gateway caches the data locally or places it on some poor unsuspecting student's hard drive.
A final whine... For those of us who don't remember our statistics so well, it wouldn't have hurt to spell out Cumulative Distribution Function once before using CDF all over the document.
In conclusion, this was a really informative paper. I had to dig pretty deep to find issues with it.