From: Sellakumaran (ksella_at_hotmail.com)
Date: Mon Jan 26 2004 - 00:31:47 PST
This paper discusses the performance implications of thread management
alternatives for shared-memory multiprocessors.
The authors started with a thread package an order of magnitude faster than
Presto, an application-level runtime library. Small changes in the
organization of data structures and locks had a significant impact on
performance, and performance depends noticeably on the algorithm used to
queue for locks.
Optimizations: a stack need not be created until the thread actually starts
running (its arguments go into the control block), and control blocks are
recycled through a free list when a thread is deallocated (a sketch of both
follows).
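A minimal sketch of those two optimizations (hypothetical names and layout;
the paper does not give code): the start function and its argument live in
the control block, the stack is allocated only at first dispatch, and
finished control blocks go back on a free list. How access to such shared
structures is serialized is exactly what alternatives A-E below vary.

    #include <stdlib.h>

    #define STACK_SIZE (16 * 1024)

    /* Hypothetical thread control block: the start routine and its
     * argument are stored here, so no stack is needed at creation. */
    struct tcb {
        void      (*func)(void *);  /* thread body */
        void       *arg;            /* argument, kept in the TCB */
        void       *stack;          /* allocated lazily, on first run */
        struct tcb *next;           /* free-list link */
    };

    static struct tcb *tcb_free_list;  /* recycled TCBs */

    struct tcb *thread_create(void (*func)(void *), void *arg)
    {
        struct tcb *t = tcb_free_list;
        if (t != NULL)
            tcb_free_list = t->next;   /* reuse: pop the free list */
        else
            t = malloc(sizeof *t);     /* free list empty: allocate */
        t->func  = func;
        t->arg   = arg;
        t->stack = NULL;               /* defer stack creation */
        return t;
    }

    /* Called when the thread is first dispatched on a processor. */
    void thread_start(struct tcb *t)
    {
        if (t->stack == NULL)
            t->stack = malloc(STACK_SIZE);  /* only now pay for a stack */
        t->func(t->arg);
    }

    void thread_finish(struct tcb *t)
    {
        t->next = tcb_free_list;   /* recycle TCB (stack stays attached) */
        tcb_free_list = t;
    }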
The thread packages use spin locks for serialization.
A. Single lock: central data structures protected by a single lock
- Limited throughput
B. Multiple locks: each central data structure protected by a separate
lock
- Increased throughput, but increased latency as well
C. Local free list: per-processor free lists without locks, plus a
central locked ready queue (a minimal sketch follows this list)
- Balancing is needed because threads can be created on one processor
and finished on another
D. Idle queue: a central queue of idle processors; a newly created thread
can be handed directly to an idle processor instead of going through the
ready queue
E. Local ready queue: per-processor ready queues and per-processor free
lists
- Balancing becomes a bigger issue here
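A minimal sketch of alternative C under stated assumptions (GCC atomic
builtins stand in for the Symmetry's atomic instructions; the names and the
balance threshold are invented): the per-processor free list is touched
with no lock at all, and a locked global pool absorbs the imbalance when
threads are created on one processor and finished on another.

    #include <stddef.h>

    #define NPROCS   16
    #define FREE_MAX 32   /* invented threshold; a tuning knob in practice */

    struct tcb { struct tcb *next; /* func/arg/stack as sketched above */ };

    static void lock(volatile int *l)
    {
        while (__sync_lock_test_and_set(l, 1))  /* atomic test-and-set */
            ;                                   /* naive spin; see below */
    }
    static void unlock(volatile int *l) { __sync_lock_release(l); }

    /* Per-processor free lists: touched only by their owning processor,
     * so the common allocate/free path takes no lock. */
    static struct tcb *free_list[NPROCS];
    static int         free_len[NPROCS];

    /* Locked global pool that absorbs per-processor imbalance. */
    static volatile int pool_lock;
    static struct tcb  *global_pool;

    struct tcb *alloc_tcb(int cpu)
    {
        struct tcb *t = free_list[cpu];
        if (t != NULL) {                 /* fast path: no lock */
            free_list[cpu] = t->next;
            free_len[cpu]--;
            return t;
        }
        lock(&pool_lock);                /* slow path: refill from pool */
        t = global_pool;
        if (t != NULL)
            global_pool = t->next;
        unlock(&pool_lock);
        return t;                        /* NULL: caller allocates fresh */
    }

    void free_tcb(int cpu, struct tcb *t)
    {
        if (free_len[cpu] < FREE_MAX) {  /* fast path: no lock */
            t->next = free_list[cpu];
            free_list[cpu] = t;
            free_len[cpu]++;
            return;
        }
        lock(&pool_lock);                /* local list full: balance out */
        t->next = global_pool;
        global_pool = t;
        unlock(&pool_lock);
    }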
The authors implemented each one of the above approaches on a Sequent
Symmetry Model A shared-memory multiprocessor. All the code was written in
C.
The elapsed time in seconds for each thread management alternative to
create, start, and finish one million null threads was measured for a
varying number of processors. The per-processor queues and free lists
(alternative E) show better throughput than all the other alternatives, but
at the cost of higher latency.
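For flavor, a rough analogue of the null-thread benchmark using POSIX
threads instead of the authors' package (scaled down, since kernel-level
pthreads are far heavier than the paper's user-level threads, and run
serially rather than across processors):

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define NTHREADS 100000   /* scaled down; the paper used one million */

    static void *null_thread(void *arg) { return NULL; }  /* no work */

    int main(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NTHREADS; i++) {
            pthread_t t;
            pthread_create(&t, NULL, null_thread, NULL);  /* create+start */
            pthread_join(t, NULL);                        /* finish */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("elapsed: %.2f s\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }

Build with something like cc -O2 bench.c -lpthread.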
The authors then describe various spin-waiting alternatives.
If a processor finds a lock busy, it can spin-wait until the lock is
released, or it can give up the processor to do other useful work.
Spin-waiting has a hidden cost: a spinning processor consumes bus
bandwidth, slowing down the processors that are doing useful work.
There are three alternatives discussed (specific to the Symmetry Model A),
sketched below:
a) Spin on xchgb (the atomic exchange-byte instruction)
b) Spin on read
c) Ethernet-style backoff
The measurements indicate that Ethernet-style backoff is the best of the
three.
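The flavor of the three alternatives, sketched with GCC atomic builtins
rather than Symmetry assembly (the original spins used the xchgb
instruction; the delay constants here are invented placeholders):

    #include <stdlib.h>   /* rand */
    #include <unistd.h>   /* usleep */

    /* a) Spin on xchgb: every retry is an atomic read-modify-write, so a
     * waiting processor generates bus traffic on every iteration. */
    void lock_xchgb(volatile int *l)
    {
        while (__sync_lock_test_and_set(l, 1))
            ;
    }

    /* b) Spin on read: wait on an ordinary cached read, and only retry
     * the atomic exchange once the lock is observed free. */
    void lock_spin_on_read(volatile int *l)
    {
        while (__sync_lock_test_and_set(l, 1))
            while (*l)
                ;   /* spins in the local cache; no bus traffic meanwhile */
    }

    /* c) Ethernet-style backoff: after each failed attempt, wait a
     * randomized, roughly exponentially growing delay before retrying. */
    void lock_backoff(volatile int *l)
    {
        unsigned delay = 1;
        while (__sync_lock_test_and_set(l, 1)) {
            usleep(rand() % delay + 1);   /* randomized wait */
            if (delay < 1024)
                delay *= 2;               /* grow the backoff, capped */
        }
    }

    void unlock(volatile int *l)
    {
        __sync_lock_release(l);
    }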
The authors then present a simple queueing network model of the thread
package to demonstrate that the combination of processor degradation due to
bus contention and the effect of lock contention can account for their
measurements. The low-level model represents the effect of bus contention
on processor speed; the high-level model represents the effect of lock
contention on throughput and response time.
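As a generic illustration of the high-level model's flavor (not the paper's
exact equations), treat the locked ready queue as a single server with lock
request rate \lambda and critical-section service rate \mu; standard M/M/1
results then show why throughput flattens as the lock saturates:

    % utilization of the lock
    \rho = \frac{\lambda}{\mu}

    % mean time to acquire the lock and finish the critical section
    R = \frac{1}{\mu - \lambda} = \frac{1/\mu}{1 - \rho}

As \rho approaches 1, the time through the critical section grows without
bound, consistent with the contention ceilings the authors measure.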
With this paper and its supporting measurements, the authors clearly drive
home the points they conclude:
a) It is possible to implement a much faster thread package
b) Per-processor data structures can be used to improve throughput
c) Spin-waiting can delay not only the processor waiting for a lock but
also other processors doing useful work
d) The cost of spin-waiting can be reduced by Ethernet-style backoff
I think the paper is interesting to read; it discusses several thread
management alternatives with a focus on throughput and latency, along with
three approaches to spin-waiting.