From: shearerje_at_comcast.net
Date: Sun Jan 25 2004 - 13:46:53 PST
“The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors” defines and compares a variety of implementations for assigning threads to processors in a shared-memory multiprocessor. The experiments were conducted on a Sequent Symmetry Model A machine with up to 20 active processors, though other architectures were briefly discussed. Thread management implementations were evaluated for their impact on “latency,” the cost of thread management when there is no contention for locks (e.g., many more processors than threads), and on “throughput,” the rate at which threads can be created, started, and finished when there is contention (e.g., many more threads than processors).
Thread management implementations must address the organization and access control of several resources: the data structures themselves (a single lock covering all of them vs. a separate lock for each), ready queues of threads waiting for a processor (a single dispatch queue vs. a queue per processor), free lists of available data structure space (none, vs. centralized, vs. per processor, vs. a per-processor/global-pool hybrid), and an idle queue of processors waiting for work (none vs. centralized). Five overall architectures were considered, mixing and matching these elements in various ways. Queuing models and experimental measurements showed that applications composed of many threads (fine-grained parallelism) running on machines with many processors are throughput-limited either by waiting for locks on the aforementioned resources or by the bandwidth of the bus. Different lock-waiting techniques were tried, including spin-waiting, block-and-relinquish, and a back-off strategy. The paper shows that while some techniques seem universally bad (e.g., a single data-structure lock, or block-and-relinquish), the best technique varies with characteristics of the load, including thread starvation, processor starvation, and average thread lifetime from creation to destruction (i.e., how much work each thread performs before completing).
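To make one of those free-list alternatives concrete, here is a rough C sketch (mine, not the paper's) of a per-processor free list of thread control blocks backed by a lock-protected global pool. The names, sizes, and the use of a pthread mutex in place of the paper's spinlocks are all invented for illustration; the point is just that the common-case allocation touches no shared lock.

/* Hypothetical sketch of one alternative discussed above: per-processor
 * free lists of thread control blocks, backed by a shared global pool.
 * Names and constants are made up for illustration. */

#include <pthread.h>

#define NPROC     20    /* processors, as on the Sequent Symmetry */
#define LOCAL_MAX 16    /* TCBs kept privately on each processor  */

typedef struct tcb {    /* thread control block */
    struct tcb *next;
    /* ... register save area, stack pointer, etc. ... */
} tcb_t;

static struct {
    tcb_t *head;
    int    count;
} local_free[NPROC];                   /* unlocked: private per processor */

static tcb_t          *global_free;    /* shared pool, lock protected */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocate a TCB, touching the shared lock only when the local list
 * is empty (the contended slow path). */
tcb_t *tcb_alloc(int cpu)
{
    tcb_t *t = local_free[cpu].head;
    if (t) {                           /* common case: no contention */
        local_free[cpu].head = t->next;
        local_free[cpu].count--;
        return t;
    }
    pthread_mutex_lock(&global_lock);  /* slow path: global pool */
    t = global_free;
    if (t)
        global_free = t->next;
    pthread_mutex_unlock(&global_lock);
    return t;                          /* NULL if the pool is exhausted */
}

/* Free a TCB locally; spill to the global pool when the local list grows
 * too long, so memory is not stranded on an idle processor. */
void tcb_free(int cpu, tcb_t *t)
{
    if (local_free[cpu].count < LOCAL_MAX) {
        t->next = local_free[cpu].head;
        local_free[cpu].head = t;
        local_free[cpu].count++;
        return;
    }
    pthread_mutex_lock(&global_lock);
    t->next = global_free;
    global_free = t;
    pthread_mutex_unlock(&global_lock);
}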
It turns out that lock and bus contention are both very sensitive to the choice of resource management methods and to small changes in the complexity of their implementation. A few extra cross-bus accesses or a few extra instructions in the lock management code can choke the whole system. This became a factor when the authors considered using a hybrid of techniques to combine their advantages: the added complexity may outweigh the benefits. Curiously, the authors did not consider a design that, rather than being a static hybrid, switched between models at different load levels. If this (obviously expensive) switching did not happen very often, the machine should be able to spend more time operating in the model best suited to the load.
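For what it's worth, here is a rough sketch of the kind of switching I have in mind. The thresholds, names, and policies are entirely made up; the paper describes no such scheme. The idea is only that the check runs rarely, so the expensive reorganization of queues does not itself become a source of contention.

/* Purely illustrative: pick between a centralized and a per-processor
 * ready queue based on observed load.  Hypothetical thresholds. */

#define NPROC 20

enum policy { CENTRAL_QUEUE, PER_CPU_QUEUES };

static enum policy current = CENTRAL_QUEUE;

/* Called occasionally (say, once per few thousand thread creations). */
void maybe_switch_policy(int runnable_threads, int idle_cpus)
{
    if (current == CENTRAL_QUEUE && runnable_threads > 4 * NPROC) {
        /* thread-rich load: spread the queues to cut lock contention */
        current = PER_CPU_QUEUES;
        /* ... drain the central queue into per-processor queues ... */
    } else if (current == PER_CPU_QUEUES && idle_cpus > NPROC / 2) {
        /* processor-rich load: centralize to keep dispatch latency low */
        current = CENTRAL_QUEUE;
        /* ... merge the per-processor queues back into one ... */
    }
}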