Review: Thomas Anderson, et al. The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors.

From: Richard Jackson (richja_at_expedia.com)
Date: Sun Jan 25 2004 - 22:23:14 PST

  • Next message: Song Xue: "Review of "The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors""

    This 1989 paper by Anderson, Lazowska and Levy contained a very
    interesting discussion of various performance issues related to
    threading in a multiprocessor environment. The paper is a very good
    combination of general theory, testing, and analysis of results.
     
    The paper has four main sections. First, it discusses five different
    methods of thread management and provides quantitative results for each.
    Next, it focuses on the locking issues which affect each of the thread
    management techniques, and three different locking methods are
    discussed. Third, the paper discusses an analytic approach for modeling
    thread management. Finally, the paper contains some general wisdom
    about creating a successful multi-threaded system.
     
    The five thread management techniques range from completely centralized
    data structures to completely localized ones, with the latter giving the
    best performance. The five methods are:

    1) "single lock" - centralized data protected by a single lock.
    2) "multiple locks" - centralized data with one central lock per data
       type (ready queue, stack free list, control block free list).
    3) "local free list" - both free lists are kept on the local processor
       and are not locked; the ready queue is still maintained centrally.
       The main issue with this approach is keeping the local free lists
       balanced among processors.
    4) "idle queue" - free lists remain local, but a central queue of idle
       processors coordinates pre-allocation of data structures among the
       idle processors.
    5) "local ready queue" - free lists are local, and each processor also
       keeps a local version of the ready queue. Idle processors must scan
       one another's queues to ensure that any ready threads are executed;
       otherwise, some processors would be busy while others sit idle.

    Overall, the theme of this section is that local (per-processor)
    control of data improves performance and reduces bottlenecks. The
    authors' measurements generally showed that option #5 performed best.
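     
    The "local ready queue" idea can be sketched in a few lines of C. This
    is a minimal single-threaded illustration, not the paper's
    implementation: the names (Proc, take_work, pop_local) and the fixed
    queue sizes are my own assumptions, and real code would need per-queue
    locking for the cross-processor scan.

    ```c
    #include <assert.h>
    #include <stdio.h>

    #define NPROC 4   /* illustrative number of processors */
    #define QCAP  8   /* illustrative per-processor queue capacity */

    /* Each processor owns a small queue of ready thread ids. */
    typedef struct {
        int threads[QCAP];
        int head, tail;   /* valid entries are threads[head..tail-1] */
    } Proc;

    /* Take a thread from one processor's local queue, or -1 if empty. */
    static int pop_local(Proc *p) {
        if (p->head == p->tail) return -1;
        return p->threads[p->head++];
    }

    /* An idle processor first checks its own queue, then scans the
     * others' queues so that no ready thread is left waiting while a
     * processor sits idle (the key point of method 5). */
    static int take_work(Proc procs[], int self) {
        int t = pop_local(&procs[self]);
        if (t >= 0) return t;
        for (int i = 0; i < NPROC; i++) {
            if (i == self) continue;
            t = pop_local(&procs[i]);   /* take from a peer's queue */
            if (t >= 0) return t;
        }
        return -1;                      /* system-wide idle */
    }

    int main(void) {
        Proc procs[NPROC] = {0};
        /* Only processor 2 has ready threads; processor 0 is idle. */
        procs[2].threads[procs[2].tail++] = 42;
        procs[2].threads[procs[2].tail++] = 43;

        assert(take_work(procs, 2) == 42);  /* local hit */
        assert(take_work(procs, 0) == 43);  /* idle proc 0 scans peers */
        assert(take_work(procs, 0) == -1);  /* everyone idle now */
        printf("ok\n");
        return 0;
    }
    ```

    Because each processor allocates and frees from its own structures in
    the common case, the central bottleneck disappears; contention arises
    only during the occasional cross-queue scan.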
     
    The section on locking issues addresses three different approaches to
    spin-locking:

    1) "spin on xchgb" - xchgb is the multiprocessor's basic test-and-set
       primitive. This is the busiest form of waiting: the bus is heavily
       used and processors are in constant contention with one another.
    2) "spin on read" - this assumes a cache-coherent memory system; each
       waiter reads the cached lock value until the lock becomes free, and
       only then tries to acquire it via xchgb. Overall, this method is
       better than #1, but it can still cause bursts of bus contention
       when many processors simultaneously retry the same lock.
    3) "Ethernet-style backoff" - here, a simple scheme delays each retry
       of the lock request, based on the number of previous failures. A
       few variations of this method are discussed. It is clearly the best
       of the three.
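     
    The three spin-wait strategies can be sketched with C11 atomics, using
    atomic_exchange in place of the xchgb instruction (it plays the same
    test-and-set role). The delay constants below are illustrative guesses,
    not values from the paper:

    ```c
    #include <assert.h>
    #include <stdatomic.h>

    typedef atomic_int spinlock_t;   /* 0 = free, 1 = held */

    /* 1) "spin on xchgb": hammer the test-and-set primitive directly.
     * Every failed iteration is a bus transaction. */
    static void lock_tas(spinlock_t *l) {
        while (atomic_exchange(l, 1) == 1)
            ;
    }

    /* 2) "spin on read": wait on the (cached) value, then attempt the
     * test-and-set only once the lock looks free. */
    static void lock_ttas(spinlock_t *l) {
        for (;;) {
            while (atomic_load(l) == 1)
                ;                        /* spins in the cache, off the bus */
            if (atomic_exchange(l, 1) == 0)
                return;                  /* won the race */
        }
    }

    /* 3) "Ethernet-style backoff": delay each retry, with the delay
     * growing with the number of failures so far. */
    static void lock_backoff(spinlock_t *l) {
        int delay = 1;
        while (atomic_exchange(l, 1) == 1) {
            for (volatile int i = 0; i < delay; i++)
                ;                        /* crude busy-wait delay */
            if (delay < 1024)
                delay *= 2;              /* exponential backoff, capped */
        }
    }

    static void unlock(spinlock_t *l) {
        atomic_store(l, 0);
    }
    ```

    The backoff variant wins under contention because failed waiters space
    out their retries instead of all slamming the bus the moment the lock
    is released.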
     
    The section on analytic modeling provided an alternate method for
    predicting the performance of each of the thread management approaches.
    Once this model was developed, the authors compared it to the tested
    results and found that their model was fairly accurate. I found this
    section somewhat vague and confusing, especially compared to the
    earlier parts of the paper, which were very good. I think this section
    could have been excluded.
     
    Finally, the paper includes some general advice on designing a thread
    system: use per-processor data structures to improve throughput; keep
    the implementation simple (a small number of instructions) to ensure
    performance; and recognize that the fundamental tradeoff (among many
    mentioned) is between throughput and latency.
     



    This archive was generated by hypermail 2.1.6 : Sun Jan 25 2004 - 22:23:26 PST