From: Alexander G Balikov (alexgb_at_u.washington.edu)
Date: Mon Feb 02 2004 - 17:41:34 PST
In this paper the authors review the problems encountered while running Grapevine - a distributed system for message delivery, name resolution and authentication. The paper was very interesting to me, because I think the problems described are still very relevant today and designing such a distributed system is still a cutting-edge problem. The authors describe the problems they encountered in scaling the system.
The system had to scale out and had to support reliable message delivery for a growing user population over a network of slow, unreliable links. As the system matured, new usage patterns appeared which had not been foreseen and caused a new set of problems - from a message delivery system, it was evolving into a message storage system (essentially a contemporary mail server). Recovery algorithms could actually lead to a chain of failures and bring the entire system to a halt, even though those algorithms were designed to keep the system stable.
It was interesting to note the parallels with what I am seeing today - once a system becomes accepted and widely used, new unforeseen problems start to surface, but it becomes more and more difficult to fix them for fear of destabilizing the system. This illustrates the need for very good upfront design.
The second problem is that trying to make a system hide its internal structure may simply lead to more confusion - the Grapevine system was designed to hide the details of its distributed nature and load balancing from the end users. Still, this led to delays in update propagation and consequently to seemingly lost updates or duplicate message delivery.
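To make the "seemingly lost update" concrete, here is a minimal sketch of my own, not Grapevine's actual protocol. It assumes a toy registry with two replicas where an update is applied to one replica at once and reaches the other only when a background propagation step runs. The names Registry, update, lookup and propagate are hypothetical and chosen only for illustration.

```python
from collections import deque


class Registry:
    """Toy replicated registry (illustrative only, not Grapevine's design)."""

    def __init__(self, replica_count):
        # Each replica holds its own independent copy of the data.
        self.replicas = [{} for _ in range(replica_count)]
        # Updates accepted by one replica but not yet delivered to the others.
        self.pending = deque()

    def update(self, replica, key, value):
        # The update is visible immediately only on the replica that took it.
        self.replicas[replica][key] = value
        for other in range(len(self.replicas)):
            if other != replica:
                self.pending.append((other, key, value))

    def lookup(self, replica, key):
        # The user does not choose the replica; the system does.
        return self.replicas[replica].get(key)

    def propagate(self):
        # Background delivery of queued updates to the remaining replicas.
        while self.pending:
            replica, key, value = self.pending.popleft()
            self.replicas[replica][key] = value


registry = Registry(replica_count=2)
registry.update(0, "mailbox-site", "server-A")
before = registry.lookup(1, "mailbox-site")  # stale: update has not arrived yet
registry.propagate()
after = registry.lookup(1, "mailbox-site")   # now consistent with replica 0
```

Because the user cannot see which replica answered, the stale read before propagation looks exactly like a lost update, even though nothing was lost.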
It was also interesting to note the facilities for remote administration and supervision built into the system.