Review: The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors

From: Manish Mittal (manishm_at_microsoft.com)
Date: Mon Jan 26 2004 - 12:14:51 PST


    This paper explores various thread management alternatives and studies
    their performance implications on shared-memory multiprocessors. On
    uniprocessors, threads are used as a program structuring aid or to
    overlap I/O with processing; on multiprocessors, they are also used to
    exploit parallelism. In a shared-memory environment, access to shared
    data structures must be serialized to ensure consistency and
    correctness. This is done through locking, which lowers the rate at
    which threads run to completion (throughput) and increases the time
    spent in thread management (latency). The goal is to maximize
    throughput while minimizing latency. The majority of the article
    discusses thread management techniques, the costs associated with each
    technique, and measurement results from tests carried out on a Sequent
    Symmetry shared-memory multiprocessor. The paper then compares blocking
    with spin-waiting and describes various approaches to spin-waiting.

    Five thread management alternatives are listed along with their
    advantages and disadvantages: single lock, multiple locks, local free
    list, idle queue, and local ready queue. The single-lock approach is
    simple, but its throughput is limited. Multiple locks, in which each
    data structure (free list, ready queue, and control block list) has
    its own lock, improve throughput but increase latency. The tradeoff
    between latency and throughput can be avoided by using local free
    lists, with a global pool used to balance them (a sketch of this
    structure follows below). Local free lists trade space for time, which
    is acceptable for control blocks but not practical for stacks. The
    idle queue technique exploits parallelism in thread creation: it
    introduces a central queue of idle processors alongside the queue of
    threads, so that work searches for processors instead of processors
    searching for work. In the local ready queue algorithm, each processor
    has its own ready queue, so enqueueing and dequeueing threads can
    occur simultaneously.
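
    To make the local free list idea concrete, here is a minimal sketch in
    C. It is not the paper's code; MAX_CPUS, cpu_id(), and the spin-lock
    primitives are illustrative assumptions.

        #define MAX_CPUS 16

        typedef struct tcb {                  /* thread control block */
            struct tcb *next;
            /* ... register save area, stack, arguments ... */
        } tcb_t;

        static tcb_t *local_free[MAX_CPUS];   /* per-processor lists: no lock needed */
        static tcb_t *global_pool;            /* balancing pool, lock protected */
        static volatile int pool_lock;

        extern int  cpu_id(void);             /* assumed: current processor's index */
        extern void lock(volatile int *);     /* assumed spin-lock primitives */
        extern void unlock(volatile int *);

        tcb_t *alloc_tcb(void)
        {
            int me = cpu_id();
            tcb_t *t = local_free[me];
            if (t != 0) {                     /* common case: no locking at all */
                local_free[me] = t->next;
                return t;
            }
            lock(&pool_lock);                 /* slow path: refill from global pool */
            t = global_pool;
            if (t != 0)
                global_pool = t->next;
            unlock(&pool_lock);
            return t;                         /* 0 means caller allocates fresh */
        }

        void free_tcb(tcb_t *t)
        {
            int me = cpu_id();
            t->next = local_free[me];         /* local and unlocked: space for time */
            local_free[me] = t;
        }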

    The measurement results show that: adding a single lock to the thread
    management path increases latency significantly; multiple locks result
    in lower performance than a single lock; using per-processor data
    structures to avoid locking is essential to reducing latency; bus
    contention lowers throughput for the local ready queue alternative
    even though there is no lock contention; and the idle queue is faster
    when there are idle processors but slower when there are more runnable
    threads than processors. Another advantage of the local ready queue is
    that cache locality is maintained.

    The paper also covers three different approaches to spin-waiting: spin
    on xchgb (exchange byte), spin on read, and Ethernet-style backoff
    (all three are sketched below). In the spin-on-xchgb approach, the
    processor loops on the xchgb instruction; each xchgb consumes bus
    resources and thus slows down the holder of the lock. In spin on read,
    the processor spins reading the lock's memory location until the lock
    is released. A problem can arise when a number of processors are
    waiting for a small critical section, although it is insignificant
    with long critical sections. Ethernet-style backoff makes each
    processor wait a period of time that depends on its number of past
    failures. This lock mechanism can be unfair: a long-spinning processor
    may take longer to acquire a newly freed lock. When there are large
    numbers of spinning processors, the bus can become saturated, slowing
    down the processor executing in the critical section.
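
    As a rough illustration, the three variants might look like the
    following C sketch. This is not the paper's Symmetry code; GCC's
    __atomic builtins stand in for the xchgb instruction, and the delay
    constants are illustrative assumptions.

        /* Spin on the exchange itself: every attempt is a bus transaction,
           so waiters steal bus cycles from the lock holder. */
        void lock_spin_on_xchg(volatile int *l)
        {
            while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE) != 0)
                ;
        }

        /* Spin on read (test-and-test-and-set): waiters spin in their own
           cache and retry the exchange only when the lock looks free. */
        void lock_spin_on_read(volatile int *l)
        {
            for (;;) {
                while (*l != 0)
                    ;                 /* cached read: no bus traffic while held */
                if (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE) == 0)
                    return;
            }
        }

        /* Ethernet-style backoff: after each failure, wait a period that
           grows with the number of past failures, trading fairness for
           reduced bus contention. */
        void lock_backoff(volatile int *l)
        {
            unsigned delay = 1;
            while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE) != 0) {
                for (volatile unsigned i = 0; i < delay; i++)
                    ;                 /* crude busy-wait delay loop */
                if (delay < 1024)     /* cap is an assumed tuning choice */
                    delay *= 2;
            }
        }

        void unlock(volatile int *l)
        {
            __atomic_store_n(l, 0, __ATOMIC_RELEASE);
        }
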
    Some of the optimization techniques described in the initial section
    are worth mentioning, such as copying a thread's arguments into its
    control block when the thread is created, to avoid allocating a stack
    until startup, or storing deallocated control blocks and stacks in
    free lists for reuse. Without these optimizations, it would be
    difficult to gauge the benefits of each thread management technique
    from the performance statistics.
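
    A minimal sketch of the argument-copying optimization follows; the
    helper names (alloc_tcb, alloc_stack, make_runnable) are assumptions,
    since the paper's actual interfaces are not reproduced here.

        typedef struct tcb {
            void (*func)(void *);             /* thread start routine */
            void *arg;                        /* argument copied at creation */
            char *stack;                      /* 0 until the thread first runs */
        } tcb_t;

        extern tcb_t *alloc_tcb(void);        /* assumed: e.g. from a free list */
        extern char  *alloc_stack(void);      /* assumed stack allocator */
        extern void   make_runnable(tcb_t *); /* assumed: put on a ready queue */

        void thread_create(void (*func)(void *), void *arg)
        {
            tcb_t *t = alloc_tcb();
            t->func  = func;                  /* creation touches no stack memory */
            t->arg   = arg;
            t->stack = 0;
            make_runnable(t);
        }

        void thread_startup(tcb_t *t)         /* run by the processor that dequeues t */
        {
            if (t->stack == 0)
                t->stack = alloc_stack();     /* stack allocated only at startup */
            /* ... switch onto t->stack, then ... */
            t->func(t->arg);
        }
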
    This paper was very easy and interesting to read. The techniques used
    to test the performance of the various thread alternatives are
    explained very well (with code snippets), and the implementation and
    analysis section is thorough. In the end, the author concludes that
    simplicity is crucial to implementing a fast thread package.

