From: Jeff Duzak (jduzak_at_exchange.microsoft.com)
Date: Sat Jan 24 2004 - 13:51:18 PST
This paper describes five possible implementations of a thread
management package on multiprocessor machines. The paper focuses on
how lock contention affects the throughput and latency of thread
creation, startup, and completion. Empirical performance results for the
five alternatives are provided. Further, models of two of the
alternatives are created in order to further explore how certain
variables affect performance.
The five alternatives start from the simplest and most centralized
(i.e., a single central lock for all threading activities) and progress
toward more distributed schemes (i.e., one queue per processor of
threads ready to execute). Intuitive arguments describe the potential
benefits and pitfalls of each approach. The performance of each
alternative is then measured on a test application designed to use very
fine-grained threading (that is, spawning and destroying threads very
frequently, with each thread doing a very small amount of work). The
results showed that the local ready-queue approach gives the best
performance for such an application.
Next, a discussion is given regarding three alternatives for multiple
processors spin-waiting on a common lock. The choice of spin-waiting
method has a great effect on performance, primarily because of the bus
contention caused by repeatedly querying and setting the state of the
lock. Further, it affects how often processor memory caches are invalidated.
This discussion is especially pertinent to thread management routines,
as they inevitably have to use common locks, but the discussion also
applies to any situation in which multiple processors share locks,
especially when the locks are acquired frequently and contention is
likely. The discussion concludes that an Ethernet-style backoff routine
provides the best performance, although it has the drawback of not
granting all processors fair access to the lock.
Finally, models are created to simulate the effect of certain variables,
including a thread's execution time and bus load, on the
performance of two thread management routines. The models are first
verified against empirical results. Then, the models are used to
predict the behavior of the alternatives in different situations. The
general result is that the local ready-queue method provides better
performance than a central ready queue, and the advantage grows as the
number of processors increases.
This paper is interesting as an exploration of the very fundamentals of
thread management. The situation of interest (that is, fine-grained
threading) is somewhat foreign to me, as I usually think of threads as
long-lived (as, for example, in an application with one UI thread and one
background processing thread). It was also interesting to see how
accurate the formal analysis turned out to be, given how complicated
the actual situation is. Lastly, this paper made me curious about
how hardware support could be provided to facilitate thread management.