From: Brian Milnes (brianmilnes_at_qwest.net)
Date: Mon Jan 26 2004 - 15:10:49 PST
The Performance Implications of Thread Management Alternatives for Shared
Memory Multiprocessors - Anderson, Lazowska and Levy
The authors review thread management algorithms and build a
thread scheduler that is an order of magnitude faster than any available at
the time. They do this by carefully measuring the candidate algorithms and
by analytically modeling them as a queueing system.
They start with a baseline algorithm that allocates a thread's
stack when it first runs and keeps free lists of stacks and control blocks,
and they focus on the time to create, start, and finish null threads.
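A minimal C sketch of that baseline (locking omitted, and all names mine
rather than the paper's):

    /* Control blocks and stacks are recycled through free lists; a
     * thread's stack is allocated only when it first runs. */

    #include <stdlib.h>

    #define STACK_SIZE (16 * 1024)    /* illustrative size */

    typedef struct tcb {
        struct tcb *next;             /* free-list link */
        void (*func)(void);           /* thread body */
        void *stack;                  /* NULL until first run */
    } tcb_t;

    static tcb_t *tcb_free;           /* recycled control blocks */
    static void  *stack_free;         /* recycled stacks, linked via word 0 */

    tcb_t *thread_create(void (*func)(void)) {
        tcb_t *t = tcb_free;          /* reuse a control block if we can */
        if (t) tcb_free = t->next;
        else   t = malloc(sizeof *t);
        t->func = func;
        t->stack = NULL;              /* defer stack allocation to first run */
        return t;
    }

    void thread_first_run(tcb_t *t) {
        if (stack_free) {             /* reuse a finished thread's stack */
            t->stack = stack_free;
            stack_free = *(void **)stack_free;
        } else {
            t->stack = malloc(STACK_SIZE);
        }
        /* ...switch onto t->stack and call t->func()... */
    }

    void thread_finish(tcb_t *t) {    /* recycle both resources */
        *(void **)t->stack = stack_free;
        stack_free = t->stack;
        t->next = tcb_free;
        tcb_free = t;
    }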
They then work through a sequence of algorithm refinements. First, a
single lock on a ready queue, with idle processors checking that the queue
is non-empty before trying the lock. Second, locking the run queue, control
block free list, and stack free list separately; this trades increased
latency on each operation for increased throughput. Third, a global run
queue with per-processor free lists: each processor keeps its own control
block free list and a single free stack, and the control block free list is
rebalanced against a global list after it hits a threshold. Fourth, idle
processors queue themselves on a free list and take over new thread
creation. Fifth, per-processor ready queues, with processors locking and
scanning the other queues; this complicates many issues, including cache
locality.
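As a rough illustration of two of these refinements, here is a hedged C
sketch (my reconstruction, not the paper's code) of the check-before-lock
idle loop and a per-processor free list with threshold balancing:

    #include <pthread.h>
    #include <stddef.h>

    #define NPROC 20                /* illustrative: one Sequent's worth */
    #define BALANCE_THRESHOLD 16    /* illustrative, not from the paper */

    typedef struct tcb { struct tcb *next; } tcb_t;
    typedef struct { tcb_t *head; int len; } freelist_t;

    static tcb_t * volatile ready_head;      /* global ready queue */
    static pthread_mutex_t ready_lock = PTHREAD_MUTEX_INITIALIZER;
    static freelist_t local_free[NPROC];     /* per-processor free lists */
    static freelist_t global_free;
    static pthread_mutex_t global_free_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Idle loop: spin on an unlocked read so an empty queue generates
     * no lock traffic; take the lock only once the queue looks
     * non-empty, and recheck under the lock. */
    tcb_t *idle_loop(void) {
        for (;;) {
            if (ready_head == NULL)
                continue;                    /* cheap test, no lock */
            pthread_mutex_lock(&ready_lock);
            tcb_t *t = ready_head;           /* recheck under the lock */
            if (t) ready_head = t->next;
            pthread_mutex_unlock(&ready_lock);
            if (t) return t;
        }
    }

    /* Free a control block to the local list; spill half to the
     * global list when the local list passes the threshold. */
    void tcb_release(int cpu, tcb_t *t) {
        freelist_t *f = &local_free[cpu];
        t->next = f->head;
        f->head = t;
        if (++f->len > BALANCE_THRESHOLD) {
            pthread_mutex_lock(&global_free_lock);
            while (f->len > BALANCE_THRESHOLD / 2) {
                tcb_t *m = f->head;
                f->head = m->next;
                f->len--;
                m->next = global_free.head;
                global_free.head = m;
                global_free.len++;
            }
            pthread_mutex_unlock(&global_free_lock);
        }
    }

The point of the unlocked test is that an idle processor spinning on its
own cached copy of the queue head costs nothing on the bus until work
actually appears.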
They benchmarked these alternatives on a 20-processor x86
Sequent, running 1,000,000 create/start/finish cycles with a warmed cache.
Threads come out only 10 times more expensive than a procedure call and 500
times cheaper than a process. Local ready queues have better throughput and
scalability, by up to a factor of eight.
They then study lock contention and produce a faster but unfair
collision-based lock back-off scheme. They test it by measuring the time to
count to 1M under a shared lock while varying the number of processors; the
collision-based back-off gives about eight times more throughput. They also
vary the size of the critical section and show that the collision-based
back-off remains almost four times faster for work units of up to 300 ms.
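For concreteness, here is a minimal C11 sketch of a test-and-test-and-set
lock with collision back-off. The constants and names are illustrative,
not the paper's; the idea is to spin on a plain read and, after each
failed attempt to seize the lock, delay for a randomized, growing
interval:

    #include <stdatomic.h>
    #include <stdlib.h>

    static atomic_int the_lock;           /* 0 = free, 1 = held */

    void backoff_lock(void) {
        int window = 2;                   /* current back-off window */
        for (;;) {
            while (atomic_load(&the_lock) != 0)
                ;                         /* spin on a cached read */
            int expected = 0;
            if (atomic_compare_exchange_strong(&the_lock, &expected, 1))
                return;                   /* we won the race */
            /* collision: back off a random amount, grow the window */
            for (volatile int i = rand() % window; i > 0; i--)
                ;
            if (window < 1024)
                window *= 2;
        }
    }

    void backoff_unlock(void) {
        atomic_store(&the_lock, 0);
    }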
The unfairness of these collision-based locks does seem problematic, and I
wonder if there might be some way to make them fairer. Perhaps you could
keep a global counter that each new arrival reads to decide whether it
should wait a long time or not at all; when the lock is grabbed, the winner
could update the counter with its number of retries.
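A purely speculative C sketch of that idea (my guess at a mechanism,
nothing from the paper): the winner publishes how many retries it
suffered, and each new arrival seeds its initial delay from that, so
latecomers do not outrace threads that have already waited.

    #include <stdatomic.h>

    static atomic_int the_lock;           /* 0 = free, 1 = held */
    static atomic_int recent_retries;     /* published by the last winner */

    void fairer_lock(void) {
        int my_retries = 0;
        /* arrivals at a contended lock start with a longer delay */
        int delay = 1 + atomic_load(&recent_retries);
        for (;;) {
            int expected = 0;
            if (atomic_compare_exchange_strong(&the_lock, &expected, 1)) {
                atomic_store(&recent_retries, my_retries);
                return;
            }
            my_retries++;
            for (volatile int i = delay; i > 0; i--)
                ;                         /* busy-wait delay */
            delay *= 2;
        }
    }

    void fairer_unlock(void) {
        atomic_store(&the_lock, 0);
    }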
Scheduler Activations: Effective Kernel Support for the User-Level
Management of Parallelism - Anderson, Bershad, Lazowska and Levy
User-level threads can be fast: they treat each process as a virtual
processor. But multiprogramming, I/O, and page faults distort this model.
Kernel-level threads are too slow, so user-level threads have been
multiplexed on top of them.
Scheduler activations fix this. When a thread does not need the kernel, it
performs like a user-level thread; when it needs the kernel for I/O or
processor reallocation, it gets the effect of a kernel thread. No processor
idles in the presence of ready threads, no high-priority thread waits while
a low-priority thread runs, and when a thread blocks its processor can be
reallocated. User-level threads keep customized scheduling.
Each application gets a virtual multiprocessor. The OS controls processor
allocation among address spaces, and the kernel informs each process's
user-level thread scheduler of changes such as the withdrawal of CPUs. They
built this with small changes to FastThreads and the Topaz kernel.
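As a rough sketch of that kernel-to-user interface, here are the four
upcall points in C. The events are the ones the paper names; the C
signatures are my invention. On each event the kernel creates a fresh
scheduler activation and runs the handler on it:

    typedef int activation_id;

    /* This processor was added to the address space: the user-level
     * scheduler should pick a ready thread and run it. */
    void upcall_add_processor(void);

    /* Activation a blocked in the kernel (I/O, page fault): its
     * processor is now free, so dispatch another user-level thread. */
    void upcall_activation_blocked(activation_id a);

    /* Activation a unblocked: its thread's saved user state is handed
     * back so the scheduler can mark the thread ready again. */
    void upcall_activation_unblocked(activation_id a);

    /* The kernel preempted this processor: the preempted thread's
     * state is returned for requeueing on the user-level ready list. */
    void upcall_processor_preempted(activation_id a);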