From: Manish Mittal (manishm_at_microsoft.com)
Date: Mon Jan 26 2004 - 12:14:51 PST
This paper explores various thread management alternatives and studies
the performance implications of those alternatives on shared-memory
multiprocessors. On uniprocessors, threads are used as a program
structuring aid or to overlap I/O with processing; on multiprocessors,
they are also used to exploit parallelism. In a shared-memory
environment, access to shared data structures must be serialized to
ensure consistency and correctness. This is done through locking, which
lowers the rate at which threads run to completion (throughput) and
increases the time spent in thread management (latency); the goal is to
maximize throughput while minimizing latency. The majority of the
article discusses thread management techniques, the costs associated
with each technique, and measurement results from tests carried out on
a Sequent Symmetry shared-memory multiprocessor. The paper then
compares blocking with spin-waiting and describes various approaches to
spin-waiting.
Five thread management alternatives are listed along with their
advantages and disadvantages: single lock, multiple locks, local free
list, idle queue, and local ready queue. The single lock approach is
simple, but its throughput is limited. The multiple-locks approach, in
which each data structure (free list, ready queue, and control block
list) has its own lock, improves throughput but increases latency,
since a single thread operation may now acquire several locks.
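To make the contrast concrete, here is a minimal sketch of the two
locking granularities. The structure and field names are mine, and
pthread mutexes stand in for the paper's spinlocks:

    #include <pthread.h>

    struct tcb { struct tcb *next; };      /* thread control block stub */

    /* Single lock: one mutex serializes every package operation. */
    struct thread_pkg_single {
        pthread_mutex_t lock;
        struct tcb *free_list;             /* recycled control blocks   */
        struct tcb *ready_queue;           /* runnable threads          */
    };

    /* Multiple locks: each shared structure has its own mutex, so
       operations on different structures proceed in parallel, but a
       create/finish path may acquire several locks (higher latency). */
    struct thread_pkg_multi {
        pthread_mutex_t free_lock;
        struct tcb *free_list;
        pthread_mutex_t ready_lock;
        struct tcb *ready_queue;
    };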
The tradeoff between latency and throughput can be avoided by giving
each processor a local free list: since each list has a single owner,
no locking is needed in the common case. To keep the lists balanced, a
global pool is used. Local free lists trade space for time, which is
acceptable for control blocks but not practical for stacks.
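A rough sketch of a per-processor free list backed by a global pool
might look like the following; NPROC, LOCAL_MAX, and the function names
are assumptions for illustration, not the paper's code:

    #include <pthread.h>
    #include <stddef.h>

    #define NPROC     8      /* number of processors (assumed)          */
    #define LOCAL_MAX 16     /* spill to the global pool past this size */

    struct tcb { struct tcb *next; };

    static struct tcb *local_free[NPROC];  /* one owner each: no lock   */
    static int         local_count[NPROC];
    static struct tcb *global_pool;        /* shared, so it needs a lock */
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

    struct tcb *alloc_tcb(int cpu) {
        struct tcb *t = local_free[cpu];
        if (t) {                           /* common case: no locking   */
            local_free[cpu] = t->next;
            local_count[cpu]--;
            return t;
        }
        pthread_mutex_lock(&pool_lock);    /* rare case: go to the pool */
        t = global_pool;
        if (t) global_pool = t->next;
        pthread_mutex_unlock(&pool_lock);
        return t;                          /* NULL: allocate a fresh one */
    }

    void free_tcb(int cpu, struct tcb *t) {
        if (local_count[cpu] < LOCAL_MAX) { /* keep it local if possible */
            t->next = local_free[cpu];
            local_free[cpu] = t;
            local_count[cpu]++;
        } else {
            pthread_mutex_lock(&pool_lock); /* rebalance toward the pool */
            t->next = global_pool;
            global_pool = t;
            pthread_mutex_unlock(&pool_lock);
        }
    }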
The idle queue technique exploits parallelism in thread creation. It
introduces a central queue of idle processors alongside the queue of
threads; the principle of this approach is that work searches for
processors instead of processors searching for work. In the local ready
queue algorithm, each processor has its own ready queue, so enqueueing
and dequeueing of threads can occur simultaneously.
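The hand-off could look roughly like the sketch below, where an idle
processor advertises itself and a creator gives it work directly; the
identifiers and the pre-prepared control block are illustrative
assumptions, not details taken from the paper:

    #include <pthread.h>
    #include <stddef.h>

    struct tcb { void (*start)(void *); void *arg; };

    struct idle_slot {                     /* one per idle processor    */
        struct idle_slot *next;
        struct tcb tcb;                    /* prepared while idle       */
        volatile int has_work;             /* set by the creating CPU   */
    };

    static struct idle_slot *idle_queue;   /* central queue of idle CPUs */
    static pthread_mutex_t idle_lock = PTHREAD_MUTEX_INITIALIZER;

    /* A processor with nothing to run advertises itself, then spins
       until some creator hands it a thread directly. */
    void wait_for_work(struct idle_slot *me) {
        me->has_work = 0;
        pthread_mutex_lock(&idle_lock);
        me->next = idle_queue;
        idle_queue = me;
        pthread_mutex_unlock(&idle_lock);
        while (!me->has_work)
            ;                              /* spin; backoff elided      */
        me->tcb.start(me->tcb.arg);        /* run the handed-off thread */
    }

    /* Thread creation: if a processor is idle, give it the work
       directly instead of enqueueing onto a ready queue. */
    int spawn(void (*fn)(void *), void *arg) {
        pthread_mutex_lock(&idle_lock);
        struct idle_slot *idle = idle_queue;
        if (idle) idle_queue = idle->next;
        pthread_mutex_unlock(&idle_lock);
        if (!idle)
            return 0;                      /* caller falls back to the  */
        idle->tcb.start = fn;              /* ready queue (not shown)   */
        idle->tcb.arg   = arg;
        idle->has_work  = 1;               /* wake the spinning CPU     */
        return 1;
    }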
The measurement results show that adding even a single lock to the
thread management paths increases latency significantly; that multiple
locks result in lower performance than a single lock; that using
per-processor data structures to avoid locking is essential to reducing
latency; that bus contention lowers throughput for the local ready
queue alternative even though there is no lock contention; and that the
idle queue is faster when there are idle processors but slower when
there are more runnable threads than processors. Another advantage of
the local ready queue is that cache locality is maintained.
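For reference, a simplified version of per-processor ready queues, with
scanning of the other queues when the local one is empty, might look
like this (NPROC and all names are assumed for illustration):

    #include <pthread.h>
    #include <stddef.h>

    #define NPROC 8                        /* processor count (assumed) */

    struct tcb { struct tcb *next; };

    static struct tcb     *ready[NPROC];   /* one ready queue per CPU   */
    static pthread_mutex_t qlock[NPROC];   /* contention spread NPROC ways */

    void init_queues(void) {
        for (int i = 0; i < NPROC; i++)
            pthread_mutex_init(&qlock[i], NULL);
    }

    void enqueue_local(int cpu, struct tcb *t) {
        pthread_mutex_lock(&qlock[cpu]);   /* almost always uncontended */
        t->next = ready[cpu];
        ready[cpu] = t;
        pthread_mutex_unlock(&qlock[cpu]);
    }

    struct tcb *dequeue(int cpu) {
        for (int i = 0; i < NPROC; i++) {  /* own queue first, then scan */
            int q = (cpu + i) % NPROC;
            pthread_mutex_lock(&qlock[q]);
            struct tcb *t = ready[q];
            if (t) ready[q] = t->next;
            pthread_mutex_unlock(&qlock[q]);
            if (t) return t;
        }
        return NULL;                       /* nothing runnable anywhere */
    }

Even with a lock per queue, the common case touches only the local,
rarely contended lock, which is why bus traffic rather than lock
contention becomes the limiting factor.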
The paper also covers three different approaches to spin-waiting: spin
on xchgb (exchange byte), spin on read, and ethernet-style backoff. In
the spin-on-xchgb approach, the processor loops on the xchgb
instruction; every attempt consumes bus resources and thus slows down
the holder of the lock. With spin on read, the processor instead spins
reading the lock's memory location out of its cache until the lock is
released, and only then attempts the atomic exchange. A problem can
still arise when a number of processors are waiting for a small
critical section, since they all rush the bus at release time; the
effect is insignificant with long critical sections. Ethernet-style
backoff makes each processor wait a period of time that depends on the
number of past failures before retrying. This mechanism can be unfair,
and it may take longer for a spinning processor to acquire a newly
freed lock. When there are large numbers of spinning processors the bus
can be saturated, slowing down the processor executing in the critical
section.
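A small sketch combining spin-on-read with ethernet-style backoff,
written with C11 atomics rather than the Symmetry's xchgb instruction
(so the details here are mine, not the paper's):

    #include <stdatomic.h>
    #include <stdlib.h>   /* rand   */
    #include <unistd.h>   /* usleep */

    static atomic_int lock_word = 0;       /* 0 = free, 1 = held        */

    void acquire(void) {
        unsigned delay = 1;                /* backoff window (usec)     */
        for (;;) {
            /* Spin on read: wait in the cache until the lock looks
               free, generating no bus traffic while the holder works. */
            while (atomic_load_explicit(&lock_word, memory_order_relaxed))
                ;
            /* One atomic exchange (the xchgb analogue). */
            if (!atomic_exchange_explicit(&lock_word, 1,
                                          memory_order_acquire))
                return;                    /* got the lock              */
            /* Failed: back off for a random slice of a growing window,
               as in ethernet collision handling. */
            usleep(rand() % delay + 1);
            if (delay < 1024) delay *= 2;
        }
    }

    void release(void) {
        atomic_store_explicit(&lock_word, 0, memory_order_release);
    }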
Some of the optimization techniques described in the initial section
are worth mentioning, such as copying a thread's arguments into its
control block when the thread is created, so that stack allocation can
be deferred until startup, and storing deallocated control blocks and
stacks on free lists for reuse.
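In rough code, the two optimizations might look like this; the names,
the stack size, and the single free list are illustrative assumptions
(error handling is elided):

    #include <stdlib.h>

    #define STACK_SIZE 4096

    struct tcb {
        struct tcb *next;
        void (*start)(void *);
        void *arg;                       /* argument copied here at create */
        void *stack;                     /* allocated lazily, at startup   */
    };

    static struct tcb *tcb_free;         /* recycled control blocks        */

    struct tcb *create_thread(void (*fn)(void *), void *arg) {
        struct tcb *t = tcb_free;        /* reuse a freed block if we can  */
        if (t) {
            tcb_free = t->next;          /* its old stack stays attached   */
        } else {
            t = malloc(sizeof *t);
            t->stack = NULL;             /* no stack until startup         */
        }
        t->start = fn;
        t->arg   = arg;                  /* args live in the control block */
        return t;
    }

    void startup(struct tcb *t) {
        if (!t->stack)                   /* lazy: allocate at first run    */
            t->stack = malloc(STACK_SIZE);
        /* switching to t->stack and calling t->start(t->arg) is
           machine-dependent and elided here */
    }

    void destroy_thread(struct tcb *t) {
        t->next = tcb_free;              /* block and stack go back on     */
        tcb_free = t;                    /* the free list for reuse        */
    }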
Without these optimizations, it would be difficult to gauge the
benefits of each thread management technique from the performance
statistics. The paper was easy and interesting to read. The techniques
used to test the performance of the various alternatives are explained
very well (with code snippets), and the implementation and analysis
section is thorough. In the end, the authors conclude that simplicity
is crucial to implementing a fast thread package.