From: Nathan Dire (ndire_at_cs.washington.edu)
Date: Sun Jan 25 2004 - 18:52:16 PST
In "The Performance Implications of Thread Management Alternatives for
Shared-Memory Multiprocessors", Anderson, et al, present performance
results for thread management alternative and for spin lock management
alternatives, showing that small changes in the structures and locking
mechanism can have a significant impact on performance.
For thread management, the paper explores five designs for organizing
the data structures, along with the performance trade-offs that each
approach makes. The first is a central set of data structures protected
by a single lock (item A). This approach does not parallelize well,
since processors are serialized through the single lock, which motivates
the addition of separate locks so that a processor locks only the data
it needs (item B). This allows more throughput through better
parallelization, but increases latency since more locks must be acquired.
To avoid locking, the next approach makes the free lists of stacks and
control blocks local to each processor (item C), trading the space of
per-processor lists for the time spent finding free resources. To exploit
parallelism in creating threads, the approach under (D) adds a central
queue of idle processors, so that thread creation searches for an idle
processor on which to run the thread. But this approach reduces latency
only when idle processors are available. The
approach in item E inverts the previous organization by adding
per-processor ready queues. This leads to a problem of balance, so
processors must be allowed to look for work on, and add work to, any
queue.
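To make the item E organization concrete, here is a minimal sketch of my
own (not code from the paper), using pthread mutexes in place of the
paper's spin locks and a fixed processor count; enqueue targets a chosen
queue (normally the local one), and dequeue scans the other queues when
the local one is empty:

    /* Sketch of per-processor ready queues (item E); structure and names
     * are illustrative, not taken from the paper. */
    #include <pthread.h>
    #include <stddef.h>

    #define NPROC 16

    struct thread_cb {
        struct thread_cb *next;   /* plus register/stack state in a real system */
    };

    struct ready_q {
        pthread_mutex_t lock;     /* one lock per queue, not a single global lock */
        struct thread_cb *head;
    } ready_q[NPROC];

    static void init_queues(void) {
        for (int i = 0; i < NPROC; i++) {
            pthread_mutex_init(&ready_q[i].lock, NULL);
            ready_q[i].head = NULL;
        }
    }

    /* Enqueue onto a chosen queue; any queue may be targeted to balance load. */
    static void enqueue(int q, struct thread_cb *t) {
        pthread_mutex_lock(&ready_q[q].lock);
        t->next = ready_q[q].head;
        ready_q[q].head = t;
        pthread_mutex_unlock(&ready_q[q].lock);
    }

    /* Dequeue: try the local queue first, then look for work on the others. */
    static struct thread_cb *dequeue(int self) {
        for (int i = 0; i < NPROC; i++) {
            int q = (self + i) % NPROC;
            pthread_mutex_lock(&ready_q[q].lock);
            struct thread_cb *t = ready_q[q].head;
            if (t)
                ready_q[q].head = t->next;
            pthread_mutex_unlock(&ready_q[q].lock);
            if (t)
                return t;
        }
        return NULL;    /* no runnable work anywhere */
    }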
To evaluate these approaches, the authors implement each on a Sequent
Symmetry Model A shared-memory multiprocessor. The performance is
compared using a test that measures the time to create, start, and
finish one million "null" threads. While all the approaches perform
similarly on a few processors, the approach listed in (E), local ready
queues, scales significantly better than the other alternatives as the
number of processors exceeds 8.
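For reference, a rough analogue of that measurement using POSIX threads
is below; the paper's test ran on its own user-level thread package on
the Sequent, so this only illustrates what is being timed, not the
reported numbers:

    /* Time N create/start/finish cycles of a "null" thread; scaled down and
     * serialized relative to the paper's test, purely as an illustration. */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define NTHREADS 100000

    static void *null_thread(void *arg) {
        return arg;                  /* starts and immediately finishes */
    }

    int main(void) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NTHREADS; i++) {
            pthread_t t;
            pthread_create(&t, NULL, null_thread, NULL);
            pthread_join(t, NULL);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d null threads in %.2f s (%.1f us each)\n",
               NTHREADS, secs, 1e6 * secs / NTHREADS);
        return 0;
    }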
Given the impact of different data structure organizations on
performance, an investigation of the basic synchronization seems in
order, and the paper includes an evaluation of three different approaches
to spin-waiting. The first approach is looping on the xchgb
instruction, which has the drawback of using bus resources on each test.
An alternative is to spin on an ordinary read of the lock value,
attempting the xchgb only when the lock appears free. This approach
uses no bus resources while the lock
is held, though several processors trying to acquire the same lock will
repeatedly invalidate each other's caches. The final approach is
Ethernet-style backoff, inspired by the algorithm used to avoid transmit
collisions on Ethernet networks. This raises a fairness problem, since a
processor that has just started waiting may acquire the lock before one
that has been waiting longer, and backoff can lengthen the gap between
one processor releasing the lock and another acquiring it. As with many of
the thread management alternatives, this approach seems to trade latency
for throughput.
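To make the three spin-waiting styles concrete, here is a sketch of my
own using C11 atomics in place of the Symmetry's xchgb instruction; the
backoff schedule and constants are illustrative, not the paper's exact
algorithm:

    #include <stdatomic.h>

    static atomic_int lock_word = 0;          /* 0 = free, 1 = held */

    /* (1) Spin on the atomic exchange itself: every attempt generates bus traffic. */
    static void lock_spin_xchg(void) {
        while (atomic_exchange(&lock_word, 1) != 0)
            ;
    }

    /* (2) Spin on an ordinary read, exchanging only when the lock looks free:
     *     quiet while the lock is held, but each release triggers a burst of
     *     invalidations as the waiters all retry at once. */
    static void lock_spin_read(void) {
        for (;;) {
            while (atomic_load(&lock_word) != 0)
                ;                             /* cache-local spin */
            if (atomic_exchange(&lock_word, 1) == 0)
                return;
        }
    }

    /* (3) Ethernet-style backoff: after each failed attempt, delay for an
     *     increasing interval before trying again. */
    static void lock_spin_backoff(void) {
        int delay = 1;
        while (atomic_exchange(&lock_word, 1) != 0) {
            for (volatile int i = 0; i < delay; i++)
                ;                             /* idle for a while */
            if (delay < (1 << 16))
                delay <<= 1;                  /* double the delay, up to a cap */
        }
    }

    static void unlock(void) {
        atomic_store(&lock_word, 0);
    }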
The spin lock approaches were tested by having varying numbers of
processors cooperatively increment and test a shared counter in a
critical section one million times. The backoff approach performs best,
both with respect to the number of processors and the critical
section time. The measurements of spin-xchgb and spin-read reflect the
intuitive observations about the behavior of the approaches, with
spin-xchgb performing at a constant rate with respect to the critical
section time, whereas spin-read improves as the critical sections grow
longer.
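A sketch of that measurement loop, again my own approximation with POSIX
threads standing in for processors and a simple test-and-set lock (any of
the spin-waiting variants above could be substituted to compare them):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define TOTAL 1000000                     /* shared-counter target */
    #define NPROC 8                           /* "processors", here POSIX threads */

    static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
    static long counter = 0;                  /* protected by the spin lock */

    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            while (atomic_flag_test_and_set(&lock_flag))
                ;                             /* spin until the lock is acquired */
            if (counter >= TOTAL) {           /* test the shared counter ...   */
                atomic_flag_clear(&lock_flag);
                return NULL;
            }
            counter++;                        /* ... and increment it          */
            atomic_flag_clear(&lock_flag);
        }
    }

    int main(void) {
        pthread_t t[NPROC];
        for (int i = 0; i < NPROC; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NPROC; i++)
            pthread_join(t[i], NULL);
        printf("final counter = %ld\n", counter);
        return 0;
    }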
The authors also present a queuing model approach to analyzing the
behavior of the approaches. The results largely mirror the empirical
data, showing that this form of modeling is effective for such
complex interactions.
The evaluations presented in this paper seem relatively limited. While
they are designed to focus on the studied effects, I would be interested
to see more diverse benchmarking results, both at the application level
and the multi-processor architecture level. I think the results provide
a compelling argument to developers of threading architectures: low-level
design decisions have a significant impact on performance.