From: Gail Rahn (gail_at_screaminggeek.com)
Date: Mon Jan 26 2004 - 11:27:13 PST
Review of "Performance Implications of Thread Management Alternatives..." by
Anderson, Lazowska and Levy
This paper discusses alternatives for algorithms and data structures used in
thread management in multiprocessor systems, and the performance tradeoffs
associated with changes. The researchers improved the thread-management
system in a 20-processor Intel 386 computer. Data structures previously kept
in global memory were moved to local lists at each processor. Alternative
algorithms to spin-waiting ("busy-wait") were evaluated to reduce processor
degredation.
An initial optimization made to thread-management was for an idle processor
to set-up the next running thread by pre-allocating its control block. At
thread-creation, the control block already exists, so only the stack is left
for allocation. (Stacks cannot be preallocated because of the memory
expense.)
The next optimization targeted lock-handling in the thread-management
subsystem. Should thread-realted data structures be protected by one
centralized lock, one-lock-per-data-structure or some other method? The
researchers found that a single lock limits throughput. With multiple locks
for centralized data structures, gains in throughput occur, but bring with
it matching gains in latency as the processors compete for many locks. The
optimal solution, according to the authors, was to use one global lock for
the central data structures and ALSO unlocked per-processor lists of data.
When processors went looking for data in the structures, it searched its
local list first and then, if necessary, accessed the global structure. This
algorithm improved throughput without increasing latency.
An important note is that especially for thread-management software, which
is called so frequently by the operating system, any increases in
complexity also increase latency. Simplicity and code efficiency are crucial
principles here.
Spin-waiting as a part of thread-management code was also evaluated in the
paper because local queuing is a major thread-management bottleneck. The
improvement made by the researchers modified the algorithm from its basic
form (spin the processor while waiting for the lock - causes bus
contention) into an Ethernet-style backoff algorithm (wait for a variable
period of time based on past experience and try again). Switching to a
blocking-wait was rejected because it's inefficient for thread-management
purposes. A blocking-wait implies an expensive context switch for the
processor and locks are held for only a short time. Each wait would entail a
context-switch and that is too great of a cost.
This paper was clearly written and conveyed. It is interesting to read
because the paper addresses a crucial theme in computer architecture,
managing lightweight processes, displays bottlenecks in the existing
architecture. I also appreciated that both theory and implementation were
addressed: the proposed solutions were implemented, tested, analyzed and
proven to be more efficient in PRACTICE as well as in theory, even
considering simulated user tasks. I "believe" this paper more readily than
I would put faith into an OS paper that described only theory and used
logic to support assertions, instead of an actual implemented, functioning
system.
-------------
Gail Rahn
grahn_at_cs.washington.edu
This archive was generated by hypermail 2.1.6 : Mon Jan 26 2004 - 11:27:25 PST