From: Richard Jackson (richja_at_expedia.com)
Date: Sun Jan 25 2004 - 22:23:14 PST
This 1989 paper by Anderson, Lazowska, and Levy contains a very
interesting discussion of performance issues related to threading in a
multiprocessor environment. The paper is a very good combination of
general theory, measurement, and analysis of results.
The paper has four main sections. First, it discusses five different
methods of thread management and provides quantitative results for each.
Next, it focuses on the locking issues that affect each of the thread
management techniques and discusses three different locking methods.
Third, it describes an analytic approach for modeling thread
management. Finally, it offers some general wisdom about building a
successful multi-threaded system.
The five thread management techniques ranged from completely centralized
data structures to completely localized data structures, with the most
localized designs giving the best performance. The five methods were:

1) "single lock" - centralized data protected by a single lock.

2) "multiple locks" - centralized data with one central lock per data
type (ready queue, stack free list, control block free list).

3) "local free list" - both free lists are kept on the local processor
and are not locked; the ready queue is still maintained centrally. The
main issue with this approach is keeping the local free lists balanced
among processors.

4) "idle queue" - free lists are again local, but a central queue of
idle processors is kept, which helps coordinate pre-allocation of data
structures among idle processors.

5) "local ready queue" - free lists are local, and each processor also
keeps its own ready queue. Idle processors must scan one another's
queues to ensure that any ready thread gets executed; otherwise some
processors would sit idle while others are busy. (A rough sketch of
this approach appears after this list.)

Overall, the theme of this section is that local (per-processor) control
of data improves performance and reduces bottlenecks. The measurements
of these five methods generally showed that option #5 performed the best.
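
To make option #5 concrete, here is a minimal C sketch of the
local-ready-queue idea as I understand it. This is not the paper's code:
the names (cpu_t, thread_t, NCPU, make_ready, next_thread) are my own,
and C11 atomics stand in for whatever lock primitive the machine
provides. Note that even a "local" ready queue still needs a lock,
because idle processors scan queues they do not own; only the local free
list can go completely unlocked.

    /* Sketch of the "local ready queue" alternative; illustrative only. */
    #include <stdatomic.h>
    #include <stddef.h>

    #define NCPU 8

    typedef struct thread {
        struct thread *next;
    } thread_t;

    typedef struct {
        atomic_flag lock;       /* still needed: other processors scan us */
        thread_t   *ready_head; /* per-processor ready queue */
        thread_t   *free_list;  /* per-processor free list, never locked:
                                   only the owning processor touches it */
    } cpu_t;

    static cpu_t cpus[NCPU];

    void cpus_init(void)
    {
        for (int i = 0; i < NCPU; i++)
            atomic_flag_clear(&cpus[i].lock);
    }

    static void q_lock(cpu_t *c)   { while (atomic_flag_test_and_set(&c->lock)) ; }
    static void q_unlock(cpu_t *c) { atomic_flag_clear(&c->lock); }

    /* Enqueue a ready thread on the caller's own processor. */
    void make_ready(int self, thread_t *t)
    {
        cpu_t *c = &cpus[self];
        q_lock(c);
        t->next = c->ready_head;
        c->ready_head = t;
        q_unlock(c);
    }

    /* Pick the next thread: look locally first, then scan the other
     * processors' queues so no ready thread waits while we sit idle. */
    thread_t *next_thread(int self)
    {
        for (int i = 0; i < NCPU; i++) {
            cpu_t *c = &cpus[(self + i) % NCPU];
            q_lock(c);
            thread_t *t = c->ready_head;
            if (t)
                c->ready_head = t->next;
            q_unlock(c);
            if (t)
                return t;
        }
        return NULL;   /* nothing runnable anywhere: stay idle and retry */
    }

The point of the sketch is the division of labor: thread creation and
wakeup touch only local state, and the cross-processor traffic is pushed
onto the idle processors, which have nothing better to do anyway.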
The section on locking issues addressed three different approaches to
spin-waiting (a sketch of all three follows below):

1) "spin on xchgb" - xchgb is the multiprocessor's basic atomic
test-and-set primitive. This is the busiest form of waiting: every retry
is a bus transaction, so waiting processors are in constant contention
with each other.

2) "spin on read" - this assumes a cache-coherent memory system. Each
waiter reads the cached lock value until the lock appears free, and only
then tries to acquire it via xchgb. Overall this method is better than
#1, but it can still produce bursts of bus contention when many
processors try to grab a just-released lock at the same time.

3) "Ethernet-style backoff" - a simple scheme delays each retry of the
lock request based on the number of previous failures. A few variations
of this method are discussed, and it is clearly the best of the three.
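
Here is a small C sketch of the three spin-wait strategies, side by
side. Again this is not the paper's code: the paper works with the
machine's xchgb instruction, while here C11's atomic_exchange stands in
for it, and the delay constants in the backoff variant are my own
illustrative choices.

    #include <stdatomic.h>

    typedef atomic_int lock_t;   /* 0 = free, 1 = held */

    /* 1) "spin on xchgb": every retry is an atomic write, so every retry
     *    consumes bus bandwidth even while the lock is still held. */
    void lock_spin_tas(lock_t *l)
    {
        while (atomic_exchange(l, 1) != 0)
            ;
    }

    /* 2) "spin on read": spin on an ordinary cached read and only attempt
     *    the atomic exchange once the lock looks free. Cheap while waiting,
     *    but all waiters hit the bus at once when the lock is released. */
    void lock_spin_on_read(lock_t *l)
    {
        for (;;) {
            while (atomic_load(l) != 0)
                ;                            /* local cached spinning */
            if (atomic_exchange(l, 1) == 0)  /* one bus transaction to acquire */
                return;
        }
    }

    /* 3) "Ethernet-style backoff": after each failed acquire, wait roughly
     *    in proportion to the number of failures before retrying. */
    void lock_backoff(lock_t *l)
    {
        int delay = 1;
        while (atomic_exchange(l, 1) != 0) {
            for (volatile int i = 0; i < delay; i++)
                ;                            /* crude delay loop */
            if (delay < (1 << 10))
                delay <<= 1;                 /* exponential growth, capped */
        }
    }

    void unlock(lock_t *l) { atomic_store(l, 0); }

The backoff variant spreads the retries out in time, which is exactly
what keeps the bus usable for the processor that actually holds the lock.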
The section on analytic modeling provided an alternate way of predicting
the performance of each of the thread management approaches. Once the
model was developed, the authors compared its predictions against the
measured results and found the model fairly accurate. I found this
section somewhat vague and confusing, especially compared with the
earlier parts of the paper, which were very clear. I think this section
could have been excluded.
Finally, the paper included some general comments about designing a
thread system: use per-processor data structures to improve throughput;
keep things simple (a small number of instructions) to ensure good
performance; and recognize that the fundamental tradeoff (among the many
mentioned) is between throughput and latency.