The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors

From: Chuck Reeves (creeves_at_windows.microsoft.com)
Date: Mon Jan 26 2004 - 10:51:22 PST

  • Next message: Gail Rahn: "Review of Anderson, Lazowska and Levy"

    The paper "The Performance Implications of Thread Management
    Alternatives for Shared-Memory Multiprocessors" was written by a number
    of researchers from the University of Washington around 1989. It
    analyzes the design and performance of several algorithms for thread
    management in shared-memory multiprocessors. One
    recurrent theme in the paper is that small changes in the thread
    management execution path can have dramatic effects on the performance
    and scalability of multiprocessors. Section 3 of the document outlines a
    number of high-level approaches to the management and synchronization of
    access to data structures associated with thread management. The
    distinctions among these alternatives center primarily on the
    decentralization of ready queues and free lists. This approach allows
    execution on each processor to proceed more independently. I found the
    description of how idle processor queues could be used to improve the
    performance of thread creation interesting.
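    The decentralized design described above can be sketched in C. This is
    a minimal single-threaded illustration of the data-structure idea only
    (per-processor FIFO ready queues, with an idle processor scanning its
    neighbors' queues when its own is empty); the names `ready_q`,
    `thread_t`, `enqueue`, and `dequeue` are mine, not the paper's, and the
    real algorithms also need per-queue locking.

    ```c
    #include <assert.h>

    /* Hypothetical sketch of per-processor ready queues: each processor
     * works on its own queue and falls back to scanning other
     * processors' queues only when its own is empty. */

    #define NPROC 4
    #define QCAP  64

    typedef struct { int tid; } thread_t;

    typedef struct {
        thread_t items[QCAP];
        int head, tail;            /* monotonically increasing FIFO indices */
    } ready_q;

    static ready_q queues[NPROC];  /* one ready queue per processor */

    static int enqueue(int cpu, thread_t t) {
        ready_q *q = &queues[cpu];
        if (q->tail - q->head == QCAP) return 0;   /* local queue full */
        q->items[q->tail++ % QCAP] = t;
        return 1;
    }

    /* Dequeue from the local queue; if it is empty, steal work by
     * scanning the other processors' queues in round-robin order. */
    static int dequeue(int cpu, thread_t *out) {
        for (int i = 0; i < NPROC; i++) {
            ready_q *q = &queues[(cpu + i) % NPROC];
            if (q->tail > q->head) {
                *out = q->items[q->head++ % QCAP];
                return 1;
            }
        }
        return 0;                  /* no runnable thread anywhere */
    }
    ```

    The local-first policy is what lets each processor proceed mostly
    independently; cross-queue scanning happens only on the idle path.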
    The performance measurement results in section 3F emphasize the benefits
    of using local ready queues. While the million null thread test was
    interesting, the near-linear scalability shown in Figure 3 (where each
    thread actually did work) for local ready queues is the most compelling
    result.
    The per-processor ready queue design did reveal one problem. When the
    number of threads is near the number of processors, the latency in
    starting new threads seems to rise
    exponentially. The authors analyze this problem in some detail, but I
    don't think I understood their explanation very well.
    Section 4 analyzes the costs of spin-locks and bus contention
    experienced when multiple processors require access to critical
    resources. The description of how the xchgb instruction works was quite
    well written.
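    The kind of spin lock the paper builds on xchgb can be sketched with C11
    atomics, where atomic_exchange plays the role xchgb does on the
    Symmetry (the function names here are mine, not the paper's). A detail
    worth noting in the code: the exchange is unconditionally a write, even
    when the lock is already held, which is presumably why each failed
    attempt forces cache invalidations on the other processors.

    ```c
    #include <stdatomic.h>

    /* Sketch of a test-and-set spin lock built on an atomic exchange.
     * atomic_exchange returns the previous value: seeing 0 means we
     * were the one to set the byte, so we hold the lock. */

    static atomic_char lock_byte = 0;

    static void spin_acquire(atomic_char *l) {
        /* Each iteration writes to the lock byte even on failure,
         * generating coherence (bus) traffic while we spin. */
        while (atomic_exchange(l, 1) != 0)
            ;
    }

    static void spin_release(atomic_char *l) {
        atomic_store(l, 0);
    }
    ```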

    I was surprised by the processor cache behavior of the Symmetry Model's
    xchgb instruction. I don't understand why a failed execution of this
    instruction requires an invalidation of the cache value for all
    processors.
    The back-off algorithm described in section 4D seemed like a logical
    approach, but the loss of execution order seemed like a questionable
    trade-off in which a processor could experience starvation under high
    contention. The queueing mechanism described in section 4F seems like a
    logical way of addressing this problem; I'm surprised it wasn't
    mentioned in the context of the original problem.
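    The FIFO property that makes queueing attractive can be illustrated with
    a minimal ticket-style lock in C11 atomics. This is my own sketch of the
    queueing idea, not the paper's exact array-based mechanism: each
    acquirer takes the next ticket and waits for its turn, so handoff is
    strictly in arrival order and no processor can be starved.

    ```c
    #include <stdatomic.h>

    /* Ticket-style sketch of a queueing lock (illustrative, not the
     * paper's design).  Acquirers are served in FIFO ticket order. */

    static atomic_uint next_ticket = 0;   /* next ticket to hand out  */
    static atomic_uint now_serving = 0;   /* ticket currently allowed */

    static void q_acquire(void) {
        unsigned my = atomic_fetch_add(&next_ticket, 1);
        /* Spin on reads only; no writes while waiting, so waiters do
         * not generate the write traffic that plain test-and-set does. */
        while (atomic_load(&now_serving) != my)
            ;
    }

    static void q_release(void) {
        /* Hand the lock to the next ticket holder in order. */
        atomic_fetch_add(&now_serving, 1);
    }
    ```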
    I had a difficult time understanding the analytical models presented in
    the final section. I may not have understood the syntax, but the
    diagrams did not seem to directly imply any mathematical behavior.



    This archive was generated by hypermail 2.1.6 : Mon Jan 26 2004 - 10:51:49 PST