From: Praveen Rao (psrao_at_windows.microsoft.com)
Date: Mon Jan 26 2004 - 16:55:53 PST
The paper discusses various thread management alternatives and their
implications.
First, a case is made for using threads for parallelism. Processes can
be used for parallelism but are heavyweight: they have a volume of data
structures (page tables, swap images, outstanding I/O, registers) that
makes them expensive to initialize.
Threads, in contrast, need only a limited amount of state and can be
initialized more quickly than processes. Threads have
- program counter
- stack (registers get saved on the stack as well)
- control block (state information needed to manage the
thread, e.g. to put it on queues or synchronize with it)
associated with them. Threads need to be passed arguments at startup.
These would normally be copied to the thread's stack, but to allow the
stack to be initialized lazily, the arguments are kept in the control
block and copied to the stack only when the thread starts to execute.
Another optimization is to reuse stacks and control blocks as threads
start and finish. Threads are put on ready queues when they are ready
to execute and on condition queues if they are waiting for a condition
to be satisfied (e.g. waiting for a lock held by another thread); in the
latter case, when the condition is satisfied, the thread is moved to the
ready queue.
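To make the delayed argument copy and the reuse of control blocks
concrete, here is a minimal sketch in C (my own reconstruction, not code
from the paper; the names tcb, thread_create, thread_start and the
constant STACK_SIZE are hypothetical):

    #include <stdlib.h>

    #define STACK_SIZE 8192

    struct tcb {
        void (*entry)(void *);   /* where the thread starts executing     */
        void *arg;               /* startup argument, held here until run */
        char *stack;             /* allocated lazily, reused when freed   */
        struct tcb *next;        /* link for free list or ready queue     */
    };

    static struct tcb *tcb_free_list;    /* finished threads' blocks, reused */

    struct tcb *thread_create(void (*entry)(void *), void *arg)
    {
        struct tcb *t = tcb_free_list;
        if (t)
            tcb_free_list = t->next;     /* reuse a finished control block */
        else
            t = calloc(1, sizeof *t);
        t->entry = entry;
        t->arg = arg;                    /* not copied to the stack yet */
        return t;
    }

    void thread_start(struct tcb *t)
    {
        if (!t->stack)
            t->stack = malloc(STACK_SIZE);  /* stack set up only when it runs */
        /* ... copy t->arg onto t->stack and jump to t->entry ... */
    }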
Free lists of stacks and control blocks, and the ready queue, are
examples of shared data structures used for thread management. There are
various alternatives for managing them, with different throughput and
latency implications.
Latency is the cost of thread management when there is no contention.
Throughput is the rate at which threads can be created, started, and
finished when there is contention. There is usually a tradeoff between
latency and throughput, though per-processor data structures can be used
in an attempt to improve both.
There are five alternatives discussed in the paper:
        A. Single lock: shared data structures are protected by a single
lock. This serializes the thread management code and severely limits
throughput, though latency is low since only a single lock has to be
acquired.
        B. Multiple locks: each central data structure is protected by its
own lock. This increases throughput compared to A but increases latency,
since more locks have to be acquired.
        C. Local (per-processor) unlocked free lists, central locked ready
queue: free lists can be maintained per processor, obviating the need to
lock them. These lists can become highly unbalanced, though, if one
processor creates the threads and another finishes them; to circumvent
this, the lists are balanced (a sketch of this scheme follows the list).
Local free lists increase the total space taken by the free lists, so
they trade space for time. This works fine for control block free lists,
which are small data structures, but not as well for stacks, which can
be large; local stack free lists are therefore allowed to hold only one
element.
        D. Idle queue: a central queue of idle processors, plus local free
lists as discussed above. There is parallelism that can be exploited:
when a processor creates a thread, it cannot allocate the stack until it
has finished allocating and initializing the control block. An idle
processor can do some of this work in parallel, reducing the cost of
thread creation. An idle processor pre-allocates a stack and control
block and puts itself on the idle queue waiting for work; work finds the
processor in this alternative (instead of the processor finding work, as
with ready queues). This approach trades higher latency when no
processor is idle for lower latency when idle processors are available.
        E. Local ready queues, local free lists: this approach tries to
remove the bottleneck of a single ready/idle queue. Unlike the control
block free lists, unlocked ready queues can be inefficient even if
balanced (e.g. an idle processor could have an empty ready queue while a
runnable thread sits in a busy processor's queue and the global pool is
empty). To avoid this, an idle processor needs to search all queues for
work, starting with its own, and each queue has to be locked (a sketch
of this scan follows the list). In the worst case of a single processor
enqueuing and dequeuing all the work, its queue becomes equivalent to a
central queue, and worse yet, other processors have to find it. To avoid
this, a queue can be chosen at random for each new thread. If each queue
is equally likely to receive a new thread, latency is bad when the
number of runnable threads is close to the number of processors, since
finding a runnable thread then involves scanning many queues. One reason
to have a one-to-one correspondence between processors and ready queues
is locality: migrating an application's thread to a different processor
can cause cache misses. If locality were not important, having more
queues than processors would increase throughput at the cost of latency.
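As an illustration of alternative C, here is a rough sketch in C of
per-processor free lists that fall back to a locked central pool when
they run dry or overflow (my own reconstruction; the paper's actual
balancing policy differs in detail, and the names alloc_tcb, free_tcb,
NPROC and LOCAL_MAX are hypothetical):

    #include <stdlib.h>
    #include <pthread.h>

    #define NPROC     8
    #define LOCAL_MAX 16                     /* bound on each local list */

    struct tcb { struct tcb *next; };

    static struct tcb     *local_free[NPROC];  /* per-processor, unlocked   */
    static int             local_count[NPROC];
    static struct tcb     *global_free;        /* shared pool, lock-protected */
    static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

    struct tcb *alloc_tcb(int cpu)
    {
        struct tcb *t = local_free[cpu];
        if (t) {                             /* common case: no locking at all */
            local_free[cpu] = t->next;
            local_count[cpu]--;
            return t;
        }
        pthread_mutex_lock(&global_lock);    /* slow path: refill from the pool */
        t = global_free;
        if (t)
            global_free = t->next;
        pthread_mutex_unlock(&global_lock);
        return t ? t : calloc(1, sizeof *t);
    }

    void free_tcb(int cpu, struct tcb *t)
    {
        if (local_count[cpu] < LOCAL_MAX) {  /* keep it local if there is room */
            t->next = local_free[cpu];
            local_free[cpu] = t;
            local_count[cpu]++;
            return;
        }
        pthread_mutex_lock(&global_lock);    /* otherwise rebalance to the pool */
        t->next = global_free;
        global_free = t;
        pthread_mutex_unlock(&global_lock);
    }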
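And here is a rough sketch of alternative E's per-processor ready
queues, with an idle processor scanning all the queues starting with its
own (again my own reconstruction, not the paper's code; make_ready,
find_work and NPROC are hypothetical names):

    #include <stddef.h>
    #include <pthread.h>

    #define NPROC 8

    struct tcb { struct tcb *next; };

    static struct {
        pthread_mutex_t lock;
        struct tcb     *head;
    } ready[NPROC];

    void ready_init(void)
    {
        for (int i = 0; i < NPROC; i++)
            pthread_mutex_init(&ready[i].lock, NULL);
    }

    /* Enqueue a runnable thread on this processor's own queue. */
    void make_ready(int cpu, struct tcb *t)
    {
        pthread_mutex_lock(&ready[cpu].lock);
        t->next = ready[cpu].head;
        ready[cpu].head = t;
        pthread_mutex_unlock(&ready[cpu].lock);
    }

    /* An idle processor scans every queue, starting with its own. */
    struct tcb *find_work(int cpu)
    {
        for (int i = 0; i < NPROC; i++) {
            int q = (cpu + i) % NPROC;
            pthread_mutex_lock(&ready[q].lock);
            struct tcb *t = ready[q].head;
            if (t)
                ready[q].head = t->next;
            pthread_mutex_unlock(&ready[q].lock);
            if (t)
                return t;
        }
        return NULL;                         /* no runnable thread anywhere */
    }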
The paper then discusses experiments with these alternatives. As
expected, local ready queues provide the best throughput, and the
advantage becomes more marked as the number of processors increases.
Latency is lowest with the idle queue when the number of runnable
threads is less than the number of processors, but becomes highest when
the number of runnable threads grows beyond the number of processors.
This is because thread creation is faster if there is an idle processor
that can help with it; if there is no idle processor, the scheme only
adds overhead. With a higher number of runnable threads, local free
lists have the lowest latency, as they do not incur the overhead of the
other alternatives.
The experiments reveal several enlightening things:
-Adding even a single lock to the thread management path increases
latency immensely. Per-processor data structures are thus important for
increasing throughput without sacrificing latency.
-Additional complexity in the thread management path increases latency.
Even a small number of additional instructions can have a large adverse
effect.
-A large portion of the thread management path is locked (as this path
mostly modifies shared data), so a single lock severely limits
throughput. Even local free lists are not good enough; high throughput
comes from local ready queues.
-When lock contention is not a problem, the bandwidth of the bus limits
throughput (on any bus-structured shared-memory system).
The paper then discusses spin lock management alternatives, another
important aspect of parallelism. It is highlighted that spinning
processors slow down the working processors because they cause bus
contention as they check the status of the lock. Ethernet-style backoff
in checking the lock state is proposed as a solution to this problem: in
this scheme, idling processors cause less and less interruption to the
working processors the longer they wait.
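A minimal sketch of such a backoff lock, written with C11 atomics (my
own illustration of the idea, not the paper's implementation; the paper
targets hardware test-and-set, and backoff_lock/backoff_unlock are
hypothetical names):

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void backoff_lock(void)
    {
        unsigned delay = 1;
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire)) {
            /* Spin on a local counter, so the bus is not touched while
               waiting, then retest; double the wait each failure. */
            for (volatile unsigned i = 0; i < delay; i++)
                ;
            if (delay < (1u << 16))
                delay *= 2;                  /* Ethernet-style backoff */
        }
    }

    void backoff_unlock(void)
    {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }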
Overall this was a very nice paper that gave a good overview of thread
management and presented and analyzed various alternatives. It did not
touch upon condition queues or the effect of thread synchronization on
the thread management alternatives.