Fast Threads Review

From: Brian Milnes (brianmilnes_at_qwest.net)
Date: Mon Jan 26 2004 - 15:10:49 PST


    The Performance Implications of Thread Management Alternatives for Shared
    Memory Multiprocessors - Anderson, Lazowska and Levy

                The authors review thread management algorithms and build a
    thread scheduler that is an order of magnitude faster than any other at
    the time. They do this by carefully measuring the candidate algorithms
    and by modeling them analytically as a queueing system.

                They start with an algorithm that allocates a thread's stack
    when it first runs and keeps free lists of stacks and control blocks, and
    they focus on the time to create, start up, and finish null threads.
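
    A minimal sketch of what free-listed control blocks plus lazy stack
    allocation might look like in C; the names (tcb_t, STACK_SIZE) are
    illustrative assumptions, locking is omitted, and this is not the
    paper's actual code:

        /* Control blocks and stacks are recycled through free lists, and a
         * stack is bound to a thread only when it first runs, so creating
         * and finishing a never-run thread touches no stack at all. */
        #include <stdlib.h>

        #define STACK_SIZE (16 * 1024)      /* illustrative size */

        typedef struct tcb {
            struct tcb *next;               /* free-list / queue link */
            void (*entry)(void *);
            void *arg;
            char *stack;                    /* NULL until first run */
        } tcb_t;

        static tcb_t *tcb_free_list;        /* recycled control blocks */
        static char  *stack_free_list;      /* recycled stacks, linked via word 0 */

        tcb_t *thread_create(void (*entry)(void *), void *arg) {
            tcb_t *t = tcb_free_list;
            if (t) tcb_free_list = t->next;
            else   t = malloc(sizeof *t);
            t->entry = entry;
            t->arg   = arg;
            t->stack = NULL;                /* stack allocation is deferred */
            return t;
        }

        void thread_first_run(tcb_t *t) {
            if (!t->stack) {
                char *s = stack_free_list;
                if (s) stack_free_list = *(char **)s;
                else   s = malloc(STACK_SIZE);
                t->stack = s;
            }
            /* ... switch onto t->stack and call t->entry(t->arg) ... */
        }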

                They then evaluate a sequence of algorithm refinements.
    First, a single lock protects the ready queue, with idle processors
    checking that the queue is non-empty before trying the lock. Second, the
    run queue, the control block free list, and the stack free list are
    locked separately. This trades increased latency (more lock acquisitions
    per operation) for increased throughput.
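
    A rough sketch, using C11 atomics (which postdate the paper), of the
    check-before-locking idle loop on a shared ready queue; thread_t and the
    spinlock are illustrative:

        #include <stdatomic.h>
        #include <stddef.h>

        typedef struct thread { struct thread *next; } thread_t;

        static _Atomic(thread_t *) ready_head;            /* shared ready queue */
        static atomic_flag ready_lock = ATOMIC_FLAG_INIT;

        thread_t *try_dequeue(void) {
            /* Cheap read-only peek: idle processors avoid bouncing the
             * lock's cache line around when there is no work anyway. */
            if (atomic_load_explicit(&ready_head, memory_order_acquire) == NULL)
                return NULL;
            while (atomic_flag_test_and_set_explicit(&ready_lock,
                                                     memory_order_acquire))
                ;                                          /* spin */
            thread_t *t = atomic_load_explicit(&ready_head,
                                               memory_order_relaxed);
            if (t)
                atomic_store_explicit(&ready_head, t->next,
                                      memory_order_relaxed);
            atomic_flag_clear_explicit(&ready_lock, memory_order_release);
            return t;
        }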

    Third, they combine a global run queue and global free lists with a
    per-processor control block free list and a single per-processor free
    stack; the local control block free list is balanced against the global
    one after it passes a threshold. Fourth, they add idle processors queuing
    themselves on an idle list and taking over new thread creation. Fifth,
    each processor gets its own locked ready queue, and processors scan the
    queues for work. This complicates many issues, including cache locality.
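
    A hedged sketch of the third refinement's local free lists; the constants
    and the drain-everything balancing policy are assumptions, not the
    paper's actual parameters:

        #include <stdatomic.h>

        typedef struct tcb { struct tcb *next; } tcb_t;

        #define NPROC     20
        #define THRESHOLD 8

        static struct {
            tcb_t *head;                       /* one owner, so no lock */
            int    len;
        } local_free[NPROC];

        static tcb_t *global_free;             /* shared, lock protected */
        static atomic_flag global_lock = ATOMIC_FLAG_INIT;

        void free_tcb(int cpu, tcb_t *t) {
            t->next = local_free[cpu].head;
            local_free[cpu].head = t;
            if (++local_free[cpu].len < THRESHOLD)
                return;                        /* common case: no locking */

            /* Balance: push the surplus onto the global list in one shot. */
            tcb_t *tail = local_free[cpu].head;
            while (tail->next)
                tail = tail->next;
            while (atomic_flag_test_and_set_explicit(&global_lock,
                                                     memory_order_acquire))
                ;                              /* spin */
            tail->next  = global_free;
            global_free = local_free[cpu].head;
            atomic_flag_clear_explicit(&global_lock, memory_order_release);
            local_free[cpu].head = NULL;
            local_free[cpu].len  = 0;
        }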

    They benchmarked this on a 20-processor x86 Sequent, running 1,000,000
    create/start/finish cycles with a warmed cache. Threads are only about
    10 times more expensive than a procedure call and 500 times cheaper than
    a process. Local ready queues have better throughput and scalability, by
    up to a factor of eight.

    They then study lock contention and produce a faster but unfair
    collision-based back-off lock. They test it by measuring the time to
    count to 1,000,000 under a shared lock while varying the number of
    processors; the collision-based back-off gives about eight times more
    throughput. They also vary the size of the critical section and show
    that the collision-based back-off is still almost four times faster for
    work units of up to 300 ms.
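
    A minimal sketch of an Ethernet-style collision back-off spinlock: each
    failed test-and-set counts as a collision and doubles the delay. The
    constants, the delay loop, and the cap are illustrative assumptions:

        #include <stdatomic.h>

        static atomic_flag the_lock = ATOMIC_FLAG_INIT;

        static void delay(unsigned n) {
            while (n--)
                atomic_signal_fence(memory_order_seq_cst);  /* keep the loop */
        }

        void backoff_lock(void) {
            unsigned backoff = 1;
            while (atomic_flag_test_and_set_explicit(&the_lock,
                                                     memory_order_acquire)) {
                delay(backoff);                 /* collision: back off */
                if (backoff < (1u << 16))
                    backoff <<= 1;              /* double, up to a cap */
            }
        }

        void backoff_unlock(void) {
            atomic_flag_clear_explicit(&the_lock, memory_order_release);
        }

    The unfairness follows directly: a newly arriving processor retries with
    a small delay while a long-waiting one sleeps through a large delay, so
    the newcomer tends to win the next acquisition.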

     The unfairness of these collision-based locks does seem problematic. I
    wonder if there might be some better way to make them fairer. Perhaps
    each new arrival could read a global counter to decide whether it should
    wait a long time or not at all, and on grabbing the lock the acquirer
    could update the counter with its number of retries.
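
    One concrete reading of that idea, as a speculative sketch (this is the
    review's proposal, not anything from the paper): the acquirer publishes
    its retry count, and new arrivals seed their initial back-off from it,
    so latecomers start out waiting roughly as long as current contenders:

        #include <stdatomic.h>

        static atomic_flag the_lock = ATOMIC_FLAG_INIT;
        static atomic_uint contention_hint;     /* retries of last acquirer */

        static void delay(unsigned n) {
            while (n--)
                atomic_signal_fence(memory_order_seq_cst);
        }

        void hinted_lock(void) {
            unsigned retries = 0;
            /* Seed from the published hint instead of starting at 1. */
            unsigned backoff = 1 + atomic_load_explicit(&contention_hint,
                                                        memory_order_relaxed);
            while (atomic_flag_test_and_set_explicit(&the_lock,
                                                     memory_order_acquire)) {
                retries++;
                delay(backoff);
                if (backoff < (1u << 16))
                    backoff <<= 1;
            }
            /* On acquisition, record how contended the lock looked. */
            atomic_store_explicit(&contention_hint, retries,
                                  memory_order_relaxed);
        }

        void hinted_unlock(void) {
            atomic_flag_clear_explicit(&the_lock, memory_order_release);
        }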

    Scheduler Activations: Effective Kernel Support for the User-Level
    Management of Parallelism - Anderson, Bershad, Lazowska and Levy

    User-level threads can be fast. User-level packages treat each process as
    a virtual processor, but multiprogramming, I/O, and page faults distort
    this model. Kernel-level threads are too slow, so user-level threads have
    been multiplexed on top of them.

     They fix this. If a thread does not need the kernel, it performs like a
    user-level thread; if it needs the kernel, for I/O or processor
    reallocation, it gets the effect of a kernel thread. No processor idles
    in the presence of ready threads, no high-priority thread waits for a
    low-priority one, a thread's processor can be reallocated when it blocks,
    and user-level threads can have customized scheduling.

    Each application gets a virtual multiprocessor. The OS controls processor
    allocation among address spaces, and the kernel informs each process's
    thread scheduler of changes such as the withdrawal of CPUs. They built
    this with small changes to FastThreads and the Topaz kernel.
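
    A hedged sketch of the scheduler activations contract, paraphrasing
    rather than reproducing the paper's interface; thread_t, the ready
    queue, and the stubbed downcalls are assumptions:

        #include <stddef.h>

        typedef struct thread { struct thread *next; } thread_t;
        static thread_t *ready_q;               /* user-level ready queue */

        /* Downcalls into the kernel; stubs here, system calls for real. */
        static void kernel_add_more_processors(int n) { (void)n; }
        static void kernel_processor_is_idle(void)    { }

        /* Upcall: a user thread blocked in the kernel (e.g. on I/O). The
         * kernel hands us a fresh activation so this CPU keeps working. */
        void upcall_blocked(void) {
            if (ready_q) {
                thread_t *t = ready_q;
                ready_q = t->next;
                /* ... resume t on this processor ... */
            } else {
                kernel_processor_is_idle();     /* no work: return the CPU */
            }
        }

        /* Upcall: a previously blocked thread is runnable again. */
        void upcall_unblocked(thread_t *t) {
            t->next = ready_q;
            ready_q = t;
            kernel_add_more_processors(1);      /* ready work needs a CPU */
        }

    This keeps the stated invariants: the user level never holds a processor
    without work, and it asks for one whenever ready threads appear.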

