Review of Thread Management Alternatives

From: Nathan Dire (ndire_at_cs.washington.edu)
Date: Sun Jan 25 2004 - 18:52:16 PST

  • Next message: Steve Arnold: "Review: Performance Implications of Thread Management"

    In "The Performance Implications of Thread Management Alternatives for
    Shared-Memory Multiprocessors", Anderson et al. present performance
    results for thread management alternatives and for spin-lock
    alternatives, showing that small changes in data structures and locking
    mechanisms can have a significant impact on performance.

    For thread management, the paper explores five designs for organizing
    the data structures, along with the performance trade-offs that each
    approach makes. The first is a set of central data structures protected
    by a single lock (item A). This approach does not parallelize well,
    since processors are serialized through the single lock, which motivates
    the addition of separate locks so that each processor locks only the
    data it needs (item B). This improves throughput by allowing better
    parallelization, but increases latency since more locks must be acquired.
    To avoid locking, the next approach listed makes the free list of stack
    and control blocks local to each processor (item C), which trades the
    space of local lists for the time of finding free resources. To exploit
    parallelism in creating threads, the approach under (D) adds a central
    queue of idle processors, so that thread creation searches for an idle
    processor on which to run the thread. But this approach reduces latency
    only when idle processors are available. The
    approach in item E inverts the previous organization by adding
    per-processor ready queues. This leads to a problem of balance, so
    processors must be allowed to look for work on, and add work to, any
    queue.
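
    The per-processor ready queues of item E, together with the balancing
    rule that a processor may search any queue for work, might be sketched
    as follows. This is a minimal illustration in C with one mutex per
    queue; the names (ready_queue, put_work, get_work), the control-block
    layout, and the fixed processor count are my own assumptions, not the
    paper's implementation.

```c
/* Sketch of per-processor ready queues (item E): one locked list per
 * processor, with a dequeue that falls back to scanning other queues
 * so that load stays balanced. Names and layout are illustrative. */
#include <pthread.h>
#include <stddef.h>

#define NPROC 4

typedef struct thread_cb {          /* thread control block */
    struct thread_cb *next;
    void (*func)(void *);
    void *arg;
} thread_cb;

typedef struct {
    pthread_mutex_t lock;           /* one lock per queue, not one global */
    thread_cb *head;
} ready_queue;

static ready_queue queues[NPROC];

/* Enqueue work on a chosen processor's queue. */
static void put_work(int proc, thread_cb *tcb) {
    ready_queue *q = &queues[proc];
    pthread_mutex_lock(&q->lock);
    tcb->next = q->head;
    q->head = tcb;
    pthread_mutex_unlock(&q->lock);
}

/* Dequeue: try the local queue first; if it is empty, scan the other
 * processors' queues for work. */
static thread_cb *get_work(int self) {
    for (int i = 0; i < NPROC; i++) {
        ready_queue *q = &queues[(self + i) % NPROC];
        pthread_mutex_lock(&q->lock);
        thread_cb *tcb = q->head;
        if (tcb)
            q->head = tcb->next;
        pthread_mutex_unlock(&q->lock);
        if (tcb)
            return tcb;
    }
    return NULL;                    /* no work anywhere */
}
```

    Because each processor normally touches only its own queue, contention
    on any single lock is low; the cross-queue scan is what restores the
    balance that a purely local scheme would lose.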

    To evaluate these approaches, the authors implement each on a Sequent
    Symmetry Model A shared-memory multiprocessor. The performance is
    compared using a test that measures the time to create, start, and
    finish one million "null" threads. While all the approaches perform
    similarly on a few processors, the approach listed in (E), local ready
    queues, scales significantly better than the other alternatives as the
    number of processors exceeds 8.

    Given the impact of different data structure organizations on
    performance, an investigation of the underlying synchronization
    mechanism seems in order, and the paper includes an evaluation of three
    different approaches
    to spin-waiting. The first approach is looping on the xchgb
    instruction, which has the drawback of using bus resources on each test.
    An alternative to this is to do a read of the lock value instead of the
    xchgb instruction. This approach uses no bus resources while the lock
    is held, though several processors trying to acquire the same lock will
    repeatedly invalidate each other's caches. The final approach is
    Ethernet-style backoff, inspired by the algorithm used to avoid transmit
    collisions on Ethernet networks. This approach has a fairness problem,
    since a processor that has just started waiting for a lock may acquire
    it before one that has been waiting longer, and it may lengthen the gap
    between one processor releasing a lock and another acquiring it. As
    with many of
    the thread management alternatives, this approach seems to trade latency
    for throughput.
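
    The three spin-waiting strategies might be sketched as follows, using
    C11 atomics as a portable stand-in for the Symmetry's xchgb
    instruction. The delay constants and function names are illustrative
    assumptions, not the paper's code.

```c
/* Sketches of the three spin-wait strategies on an atomic_int lock
 * (0 = free, 1 = held). atomic_exchange plays the role of xchgb. */
#include <stdatomic.h>

typedef atomic_int spinlock_t;

/* 1. Spin on the exchange itself: every iteration is a write-intent
 *    bus transaction, even while the lock is held by someone else. */
static void lock_xchg(spinlock_t *l) {
    while (atomic_exchange_explicit(l, 1, memory_order_acquire))
        ;
}

/* 2. Spin on a read: wait in the local cache while the lock is held,
 *    and attempt the expensive exchange only once it looks free. */
static void lock_spin_read(spinlock_t *l) {
    for (;;) {
        while (atomic_load_explicit(l, memory_order_relaxed))
            ;                       /* no bus traffic while held */
        if (!atomic_exchange_explicit(l, 1, memory_order_acquire))
            return;
    }
}

/* 3. Ethernet-style backoff: after a failed attempt, delay for an
 *    exponentially growing interval before trying again. */
static void lock_backoff(spinlock_t *l) {
    unsigned delay = 1;
    while (atomic_exchange_explicit(l, 1, memory_order_acquire)) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;                       /* busy-wait away from the bus */
        if (delay < 1024)
            delay *= 2;             /* cap is an illustrative choice */
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}
```

    The fairness and release-to-acquire delay noted above fall out of the
    backoff variant directly: a processor sleeping through a long delay
    interval can neither contend promptly nor notice a release promptly.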

    The spin lock approaches were tested by having various numbers of
    processors cooperatively incrementing and testing a shared counter in a
    critical section 1 million times. The backoff approach performs best,
    both with respect to the number of processors and to the critical
    section time. The measurements of spin-xchgb and spin-read reflect the
    intuitive observations about the behavior of the approaches, with
    spin-xchgb performing at a constant rate with respect to the critical
    section time, whereas spin-read improves as the critical sections grow
    longer.
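
    A scaled-down version of this counter benchmark might look like the
    following sketch. The thread and iteration counts are illustrative and
    far below the paper's one million; the lock here is the simple
    test-and-set variant, and the setup is my own, not the authors'
    harness.

```c
/* Sketch of the shared-counter benchmark: each thread increments a
 * counter inside a spin-lock-protected critical section; the final
 * count checks that no increments were lost. */
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4
#define ITERS    10000

static atomic_flag counter_lock = ATOMIC_FLAG_INIT;
static long counter = 0;            /* protected by counter_lock */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        while (atomic_flag_test_and_set_explicit(&counter_lock,
                                                 memory_order_acquire))
            ;                       /* spin until acquired */
        counter++;                  /* the critical section */
        atomic_flag_clear_explicit(&counter_lock, memory_order_release);
    }
    return NULL;
}
```

    Timing how long NTHREADS workers take to finish, while varying the
    lock implementation and the length of the critical section, mirrors
    the structure of the paper's spin-lock measurements.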

    The authors also present a queueing-model analysis of the behavior of
    the approaches. Its results closely match the empirical data, showing
    that this form of modeling is effective for such complex interactions.

    The evaluations presented in this paper seem relatively limited. While
    they are designed to focus on the studied effects, I would be interested
    to see more diverse benchmarking results, both at the application level
    and the multi-processor architecture level. I think the results provide
    a compelling argument to developers of threading architecture: low-level
    design decisions have a significant impact on performance.



    This archive was generated by hypermail 2.1.6 : Sun Jan 25 2004 - 18:52:18 PST