From: Chuck Reeves (creeves_at_windows.microsoft.com)
Date: Mon Jan 26 2004 - 10:51:22 PST
The paper "The Performance Implications of Thread Management
Alternatives for Shared-Memory Multiprocessors" was written by a number
of researchers from the University of Washington around 1989. It
provides an analysis of the design and performance of a number of
algorithms for thread management in shared-memory multiprocessors. One
recurrent theme in the paper is that small changes in the thread
management execution path can have dramatic effects on the performance
and scalability of multiprocessors. Section 3 of the document outlines a
number of high-level approaches to the management and synchronization of
access to data structures associated with thread management. The
distinctions among these alternatives center primarily on the
decentralization of ready queues and free lists. This approach allows
execution on each processor to proceed more independently. I found the
description of how idle processor queues could be used to improve the
performance of thread creation interesting.
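The idle-processor idea can be sketched in a few lines. This is my own illustrative C sketch of the design (the structures, names, and single-threaded simulation are mine, not the paper's code): each processor has a local ready queue, and thread creation first checks a stack of idle processors so a new thread can be handed directly to a processor that has nothing to do.

```c
#include <assert.h>

/* Illustrative sketch of per-processor ready queues with an idle-processor
 * list, in the spirit of the paper's decentralized designs. All names and
 * sizes here are my own choices; locking is omitted for clarity. */

#define NPROC 4
#define QCAP  16

typedef struct { int items[QCAP]; int head, tail; } queue_t;

static queue_t ready[NPROC];   /* one ready queue per processor */
static int idle_stack[NPROC];  /* processors currently known to be idle */
static int n_idle = 0;

static void q_push(queue_t *q, int tid) { q->items[q->tail++ % QCAP] = tid; }

static int q_pop(queue_t *q) {
    return (q->head == q->tail) ? -1 : q->items[q->head++ % QCAP];
}

/* Thread creation: if some processor is idle, place the new thread
 * directly on that processor's local queue so it can start immediately;
 * otherwise enqueue on the creating processor's own local queue. */
void create_thread(int tid, int creating_proc) {
    int target = (n_idle > 0) ? idle_stack[--n_idle] : creating_proc;
    q_push(&ready[target], tid);
}
```

The appeal is that the common case touches only per-processor state, so creation does not contend on a central ready queue.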
The performance measurement results in section 3F emphasize the benefits
of using local ready queues. While the million null thread test was
interesting, the linear scalability shown in Figure 3 (where the thread
actually did something) for local ready queues is the most compelling.
The measurements did identify one problem with the per-processor ready
queue design. When the number of threads is near the number of
processors, the latency in starting new threads seems to rise
exponentially. The authors analyze this problem in some detail, but I
don't think I understood their explanation very well.
Section 4 analyzes the costs of spin-locks and bus contention
experienced when multiple processors require access to critical
resources. The description of how the xchgb instruction works was quite
well written.
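The xchgb-based lock can be expressed compactly with C11 atomics in place of Symmetry assembly (my translation, not code from the paper): atomic_exchange plays the role of xchgb, atomically swapping in "locked" and returning the previous value.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the xchgb-style spin lock using C11 atomics. atomic_exchange
 * stands in for the xchgb instruction: it atomically stores true and
 * returns the old value in one indivisible step. */

typedef atomic_bool spinlock_t;

void spin_lock(spinlock_t *l) {
    /* Keep exchanging until the returned (old) value is false, meaning
     * the lock was free and we are the one who set it. */
    while (atomic_exchange(l, true))
        ;
}

void spin_unlock(spinlock_t *l) {
    atomic_store(l, false);
}
```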
I was surprised by the behavior of processor cache on the Symmetry Model
xchgb instruction. I don't understand why a failed execution of this
instruction requires an invalidation of the cache value for all
processors.
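My reading of the cache behavior is that xchgb is unconditionally a write, so even a failed attempt invalidates every other processor's cached copy of the lock word. The common "test-and-test-and-set" refinement (my sketch here; I am not claiming it is the paper's exact variant) spins on plain reads, which can be satisfied from the local cache, and only issues the invalidating exchange when the lock looks free.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-test-and-set sketch: spin read-only while the lock is held,
 * and pay the write (and its cache invalidations) only when there is a
 * real chance of acquiring the lock. */

typedef atomic_bool spinlock_t;

void spin_lock_tts(spinlock_t *l) {
    for (;;) {
        while (atomic_load(l))          /* read-only spin: cache-friendly */
            ;
        if (!atomic_exchange(l, true))  /* one write per genuine attempt */
            return;                     /* old value was false: acquired */
    }
}
```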
The back-off algorithm described in section 4d seemed like a logical
approach. But the loss of execution order seemed like a contentious
trade-off where a processor could experience starvation in high-stress
conditions. The queueing mechanism described in section 4f seems like a
logical way of balancing out this problem; I'm surprised it wasn't
mentioned in the context of the original problem.
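A rough sketch of the section 4d back-off idea (the delay loop, doubling rule, and cap below are illustrative choices of mine, not the paper's constants): after each failed acquire, wait an increasing interval before retrying, which thins out bus traffic under contention.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Spin lock with bounded exponential back-off. A processor that keeps
 * losing waits longer and longer between attempts, which is exactly the
 * source of the ordering/starvation worry: nothing guarantees it will
 * ever win against newly arriving, short-delay contenders. */

typedef atomic_bool spinlock_t;

void spin_lock_backoff(spinlock_t *l) {
    unsigned delay = 1;
    while (atomic_exchange(l, true)) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;                           /* busy-wait for the back-off period */
        if (delay < 1024)
            delay *= 2;                 /* bounded exponential growth */
    }
}
```

A queueing lock avoids this because waiters acquire the lock in arrival order, which is why it reads as the natural fix for the starvation concern.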
I had a difficult time understanding the analytical models presented in
the final section. I may not have understood the syntax but the diagrams
did not seem to directly imply any mathematical behavior.
This archive was generated by hypermail 2.1.6 : Mon Jan 26 2004 - 10:51:49 PST