From: Justin Voskuhl (justinv_at_microsoft.com)
Date: Mon Jan 26 2004 - 13:39:25 PST
This paper discusses alternatives for organizing thread management data
structures in the kernel in a shared memory multiprocessor. There are a
number of design alternatives that are compared, then two key
performance metrics are analyzed for the various designs. The first
performance metric is latency (how long on average we have to wait for
service) and throughput (the number of thread operations that can
complete in a given time-period.) Various models are proposed to
explain the performance dynamics that are observed.
There were five major design alternatives given for the threading
architecture inside the kernel. The first simple design is that of a
global lock for all thread-related data structures in the kernel. Then
second design is to let each major data structure have its own lock.
The third design is to let each processor have its own Free list and
have a centrally locked Ready queue. The next design was more
interesting, which was to maintain queues of idle processors, and on
each processor maintain a Free list. The last design was to move most
of the thread-related data structures into per-processor data
structures.
Various performance metrics are computed for each of the major designs.
A program is written which really exercises the thread primitives by
creating, destroying and synchronizing many threads (a million across
all processors.) We see that for most performance metrics the local
ready queue design tends to be the most performant. The observation is
made that simplicity tends to create a performant thread package -
optimizing special cases tends to not pay off because the code path is
run so frequently that you don't make it back with the savings from the
special case.
Because of the architecture of the machine the tests were conducted on,
the performance of spin-locks were problematic. As implemented they
caused a lot of bus traffic back and forth, and this bus traffic
actually interfered with the thread that was making progress, slowing
the whole system. Some alternatives to spin-locks are presented,
including a back-off algorithm inspired by the design of Ethernet. It's
noted however that the problem with this is that it's unfair - threads
that have waited longer are now less likely to acquire the lock.
A formal model of bus contention is presented. The model is used to
predict the behavior of various locking schemes.
This archive was generated by hypermail 2.1.6 : Mon Jan 26 2004 - 13:39:33 PST