From: Raz Mathias (razvanma_at_exchange.microsoft.com)
Date: Mon Jan 26 2004 - 16:44:32 PST
This paper deals with the efficient implementation of threads on
multiprocessor systems. It begins by observing that the purpose of
threads is to separate the execution context from the rest of the
process state (paging, handles to resources, virtual memory, etc.). On
uniprocessor systems threads allow processing to be overlapped
efficiently with I/O; on multiprocessor systems, they allow the
processing itself to be overlapped.
The paper mentioned some general ways of improving throughput which, I
think, are worth keeping in mind when doing any kind of parallel
programming. The first is to check whether a resource is available
before trying to obtain its lock; if the resource is busy, the thread
can continue doing other useful work and try again later. Another is
to divide the work of a single coarse lock amongst many finer-grained
ones, which has the side effect of increasing latency as well as
throughput.
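As a concrete illustration of the first technique, here is a minimal
sketch using the POSIX pthread_mutex_trylock() call. This is my own
example, not the paper's, and do_other_work() is a hypothetical
placeholder for whatever else the thread could be doing:

    #include <pthread.h>

    extern void do_other_work(void);    /* hypothetical placeholder */

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void worker(void)
    {
        for (;;) {
            /* Check availability first: trylock returns EBUSY
               immediately instead of blocking the thread. */
            if (pthread_mutex_trylock(&lock) == 0) {
                /* ... use the shared resource ... */
                pthread_mutex_unlock(&lock);
                return;
            }
            /* Lock was busy: do other useful work, then retry. */
            do_other_work();
        }
    }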
The paper's first major contribution was the description of various
thread management strategies and their performance implications. The
basic implementations were the following:
1. Using a single, global lock - obviously yielding poor latency
and throughput,
2. Using multiple fine-grained locks - increasing throughput, but
adding latency as well
3. Using per-processor free lists - shifting the throughput
bottleneck onto the ready queue. To make this implementation work well,
we need to load-balance the free lists amongst the processors. With
this approach, we run into the problem of having to pre-allocate thread
stacks for every element in the free list, which can get expensive.
4. Keeping a list of idle processors for starting threads -
lowering latency when idle processors are available, but increasing
latency by a large amount when there are none
5. Keeping per-processor ready queues - performance can be poor if
a processor's own ready queue is empty. One solution presented was to
have idle processors search the ready queues of all the other
processors for work to do (see the sketch after this list); the impact
is that time spent searching could have been time spent processing.
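To make the local-ready-queue idea concrete, here is a rough sketch of
how an idle processor might scan the other processors' queues. This is
my own simplification, assuming a fixed NCPUS and one mutex per queue;
the paper's actual implementation differs in its details:

    #include <pthread.h>
    #include <stddef.h>

    #define NCPUS 4

    struct thread_desc {
        struct thread_desc *next;
        /* ... saved registers, stack pointer, etc. ... */
    };

    /* One ready queue (and one lock) per processor: the common case
       touches only local state, so processors rarely contend. */
    struct ready_queue {
        pthread_mutex_t lock;
        struct thread_desc *head;
    } ready[NCPUS];

    void ready_init(void)
    {
        for (int cpu = 0; cpu < NCPUS; cpu++)
            pthread_mutex_init(&ready[cpu].lock, NULL);
    }

    static struct thread_desc *try_dequeue(int cpu)
    {
        struct thread_desc *t = NULL;

        pthread_mutex_lock(&ready[cpu].lock);
        if (ready[cpu].head != NULL) {
            t = ready[cpu].head;
            ready[cpu].head = t->next;
        }
        pthread_mutex_unlock(&ready[cpu].lock);
        return t;
    }

    /* An idle processor checks its own queue first, then scans the
       others; the scan costs cycles that could otherwise be spent
       running threads, which is exactly the trade-off noted above. */
    struct thread_desc *find_work(int self)
    {
        struct thread_desc *t = try_dequeue(self);

        for (int cpu = 0; t == NULL && cpu < NCPUS; cpu++)
            if (cpu != self)
                t = try_dequeue(cpu);
        return t;
    }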
Performance-wise, the local-ready-queues solution scaled the best
(throughput-wise) out of the alternatives, but it also had the largest
amount of latency. The idle-processor solution had the best latency
characteristics (up to the point where there are no idle processors
left); this advantage in latency was offset by its modest throughput
performance.
The paper then argued that blocking locks cannot be used for
multiprocessor synchronization because of their poor performance
characteristics. Three spin-lock alternatives were then presented that
depended only on a test-and-set() operation. The first alternative was
to spin directly on test-and-set(). The second was to spin reading the
value of the lock and attempt test-and-set() only once a read shows
that the lock is free. The third was to use an exponential back-off
between attempts to obtain the lock. I found the discussion of how
executing the test-and-set() operation invalidates CPU caches, hurting
the performance of the other running threads, particularly
interesting.
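Here is a rough sketch of the second and third alternatives combined,
written with C11 atomics; atomic_exchange() stands in for the paper's
test-and-set(), and the back-off constants are my own guesses:

    #include <stdatomic.h>

    atomic_int lock = 0;    /* 0 = free, 1 = held */

    void acquire(void)
    {
        unsigned delay = 1;

        for (;;) {
            /* Spin on a plain read first: this stays in the local
               cache and generates no bus traffic while the lock is
               held by someone else. */
            while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
                ;
            /* The lock looked free; now try the real test-and-set. */
            if (atomic_exchange_explicit(&lock, 1,
                                         memory_order_acquire) == 0)
                return;     /* we won the race */
            /* Lost the race: back off exponentially so the waiters
               do not flood the bus with cache invalidations. */
            for (volatile unsigned i = 0; i < delay; i++)
                ;
            if (delay < (1u << 16))
                delay <<= 1;
        }
    }

    void release(void)
    {
        atomic_store_explicit(&lock, 0, memory_order_release);
    }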
I thought that this paper presented a good overview of the
multiprocessor thread-scheduling alternatives and gave salient
statistics on the performance of each. I would really like to know
whether Linux or Windows uses the exponential back-off algorithm in its
multiprocessor spin-lock implementation.