From: David Coleman (dcoleman_at_cs.washington.edu)
Date: Wed Feb 25 2004 - 14:30:23 PST
I really liked this paper. I’ve never been very enamored with
component-based systems because in practice it seems like things always
end up being too tightly coupled, the framework prevents scaling,
performance is bad, configuration management is a nightmare, etc. For
years we’ve been promised the Utopia of a system composed of small
finite components working together successfully. I believe the design
concepts behind the workers could be applied very broadly to build more
useful component-based systems. I also appreciated the concept of the
stubs, which let the implementation focus on the actual task at hand.
I have several questions about the load balancing, fault tolerance and
recovery design. First, how do you spawn a component (e.g. a distiller) on a
node or restart a crashed process? I’m interested in the actual
mechanics of it. Next, what if the crash (of the manager, for instance)
was caused by some lower-level system corruption (such as the problem
that struck the Mars rover when its flash memory filled up and there
wasn’t enough room to boot the system and empty the flash)? Are there
checks in place to identify such a crash/restart cycle and attempt to
break it?
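One common way to break such a cycle is to cap the number of restarts allowed within a sliding time window and stop (or escalate to an operator) once the cap is exceeded. The sketch below is only an illustration of that idea; the class name `RestartGuard`, the thresholds, and the policy itself are assumptions and are not taken from the paper.

```python
from collections import deque


class RestartGuard:
    """Hypothetical helper: decides whether a crashed process may be restarted."""

    def __init__(self, max_restarts=3, window_seconds=60.0):
        # Assumed policy: at most `max_restarts` crashes per sliding window.
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._crash_times = deque()

    def should_restart(self, now):
        """Record a crash at time `now` (seconds) and return True if a restart is allowed."""
        # Forget crashes that have fallen out of the sliding window.
        while self._crash_times and now - self._crash_times[0] > self.window_seconds:
            self._crash_times.popleft()
        self._crash_times.append(now)
        # Too many recent crashes suggests a crash loop: stop restarting.
        return len(self._crash_times) <= self.max_restarts


guard = RestartGuard(max_restarts=3, window_seconds=60.0)
decisions = [guard.should_restart(t) for t in (0.0, 10.0, 20.0, 30.0)]
print(decisions)  # [True, True, True, False]
```

A supervisor that receives `False` would leave the process down and raise an alarm rather than restart it again, since a fault such as a full flash device will not be cured by another restart.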
My understanding of multicast/broadcast is somewhat limited, but it
seems like a very large system with many workers and many different
types of workers would end up saturating the network or at least the
manager with load updates during periods of heavy data traffic. There
are several different types of broadcasts in this system, and with a
large number of components I think the system might not scale well.
This was addressed somewhat in the paper, but I wish that it had been
discussed more thoroughly.
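A rough back-of-the-envelope estimate makes the concern easier to reason about. The sketch below assumes every worker sends one fixed-size load report to the manager per reporting interval; the function name and all of the numbers are hypothetical and are not figures from the paper.

```python
def report_load(num_workers, report_bytes, interval_seconds):
    """Return (messages per second, bits per second) arriving at the manager.

    Assumes each worker sends exactly one report of `report_bytes` bytes
    every `interval_seconds` seconds (an illustrative model only).
    """
    msgs_per_sec = num_workers / interval_seconds
    bits_per_sec = msgs_per_sec * report_bytes * 8
    return msgs_per_sec, bits_per_sec


# Hypothetical inputs: 1000 workers, 100-byte reports, one report every 0.5 s.
msgs, bits = report_load(num_workers=1000, report_bytes=100, interval_seconds=0.5)
print(msgs, bits)  # 2000.0 1600000.0
```

Under this model the traffic grows linearly with the number of workers, so the question is whether the manager's per-message processing cost, rather than raw bandwidth, becomes the limit first.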
I enjoyed the paper and its applicability to a wide variety of systems,
beyond “just” a scalable cluster-based network service.