Review: The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors.

From: Sellakumaran (ksella_at_hotmail.com)
Date: Mon Jan 26 2004 - 00:31:47 PST


     

    This paper discusses the performance implications of thread management
    alternatives for shared-memory multiprocessors.

    The authors started with a thread package that was another order of
    magnitude faster than Presto, an application-level runtime thread library.
    Small changes in the organization of data structures and locks had a
    significant impact on performance, and performance depended noticeably on
    the algorithm used to queue for locks.

     

    Optimizations: a thread's stack need not be created until the thread
    starts running (its arguments go into the control block), and a free list
    can be used to reuse control blocks when threads are deallocated.
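
    A minimal C sketch of these two optimizations follows; the struct layout
    and function names here are my own illustration, not the authors' code.

        #include <stdlib.h>

        #define STACK_SIZE (16 * 1024)

        /* Hypothetical thread control block: the argument is saved here at
           creation time, so the stack can be allocated lazily. */
        struct tcb {
            void (*func)(void *);    /* thread body */
            void *arg;               /* argument stored in the control block */
            char *stack;             /* allocated only when the thread runs */
            struct tcb *next_free;   /* link for the control-block free list */
        };

        /* Free list of deallocated control blocks; how (or whether) this
           list is locked is exactly what the alternatives below vary. */
        static struct tcb *free_list;

        struct tcb *thread_create(void (*func)(void *), void *arg)
        {
            struct tcb *t = free_list;
            if (t != NULL)
                free_list = t->next_free;   /* reuse a finished thread's block */
            else
                t = malloc(sizeof *t);
            t->func = func;
            t->arg = arg;
            t->stack = NULL;                /* defer stack allocation */
            return t;
        }

        void thread_start(struct tcb *t)
        {
            if (t->stack == NULL)           /* allocate the stack only now */
                t->stack = malloc(STACK_SIZE);
            /* ... switch to t->stack and call t->func(t->arg) ... */
        }

        void thread_destroy(struct tcb *t)
        {
            t->next_free = free_list;       /* back on the free list */
            free_list = t;
        }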

    The authors' thread packages use spin locks for serialization.

    A. Single lock: Central data structures protected by a single lock

    - Limited throughput

    B. Multiple locks: Central data structures each protected by a separate
    lock

    - Increased throughput, but increased latency as well

    C. Local free list: Per-processor free lists without locks and a central
    locked ready queue

    - Balancing is needed because threads may be created on one processor and
    finish on another (see the sketch after this list)

    D. Idle Queue: A central queue of idle processors, so that a newly created
    thread can be handed directly to an idle processor
    E. Local Ready Queue: Per-processor ready queues and per-processor free
    lists

    - Balancing becomes a bigger issue here
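
    As an illustration of alternative C, here is a rough C sketch of
    per-processor free lists combined with a single locked central ready
    queue. The names, and the use of GCC atomic builtins in place of the
    Symmetry's xchgb instruction, are my own assumptions, not the paper's
    code.

        #include <stdlib.h>

        #define NPROCS 20

        struct thread {
            struct thread *next;
        };

        /* Per-processor free list: touched only by its own processor, so it
           needs no lock. */
        static struct thread *local_free[NPROCS];

        /* Central ready queue, protected by one spin lock. */
        static volatile int ready_lock;
        static struct thread *ready_head, *ready_tail;

        static void spin_lock(volatile int *l)
        {
            while (__sync_lock_test_and_set(l, 1))
                ;                                /* spin until acquired */
        }

        static void spin_unlock(volatile int *l)
        {
            __sync_lock_release(l);
        }

        struct thread *alloc_thread(int cpu)
        {
            struct thread *t = local_free[cpu];  /* no lock needed */
            if (t != NULL) {
                local_free[cpu] = t->next;
                return t;
            }
            return malloc(sizeof *t);
        }

        void free_thread(int cpu, struct thread *t)
        {
            /* A thread may finish on a different processor than the one that
               created it, so the per-processor lists drift out of balance;
               that is the balancing problem noted above. */
            t->next = local_free[cpu];
            local_free[cpu] = t;
        }

        void enqueue_ready(struct thread *t)
        {
            spin_lock(&ready_lock);              /* the central queue is still
                                                    a point of contention */
            t->next = NULL;
            if (ready_tail != NULL)
                ready_tail->next = t;
            else
                ready_head = t;
            ready_tail = t;
            spin_unlock(&ready_lock);
        }

    Alternative E goes one step further and makes the ready queue itself
    per-processor, which removes the central lock but makes load balancing
    across the per-processor queues the main issue.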

     

    The authors implemented each of the above alternatives on a Sequent
    Symmetry Model A shared-memory multiprocessor. All the code was written in
    C.

    The elapsed time in seconds for each thread management alternative to
    create, start, and finish one million null threads was measured for
    varying numbers of processors. The per-processor queues and lists
    (alternative E) show better throughput than all the other alternatives,
    but higher latency than the others.
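
    The shape of that measurement is easy to picture; a rough sketch of such a
    benchmark is below, using POSIX threads only as a stand-in for the
    user-level package (the paper's point is precisely that a tuned user-level
    package makes this loop far cheaper than kernel-level threads).

        #include <pthread.h>
        #include <stdio.h>
        #include <time.h>

        #define NTHREADS 1000000

        static void *null_thread(void *arg)
        {
            return arg;                      /* a null thread does no work */
        }

        int main(void)
        {
            time_t start = time(NULL);

            for (int i = 0; i < NTHREADS; i++) {
                pthread_t t;
                pthread_create(&t, NULL, null_thread, NULL);  /* create, start */
                pthread_join(t, NULL);                        /* finish */
            }

            printf("elapsed: %ld s for %d null threads\n",
                   (long)(time(NULL) - start), NTHREADS);
            return 0;
        }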

     

    The authors then describe various spin-waiting alternatives.

    If a processor finds a lock busy, it can spin-wait until the lock is
    released, or it can switch to other useful work. Spin-waiting has a hidden
    cost: the spinning processor's bus traffic can slow down other processors
    that are doing useful work.

    Three alternatives are discussed (specific to the Symmetry Model A):

    a) Spin on Xchgb

    b) Spin on read

    c) Ethernet-style back-off

    The measurements indicate that Ethernet-style back-off is the best of the
    three.
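
    To make the three alternatives concrete, here is a rough C sketch; the GCC
    atomic builtins stand in for the Symmetry's xchgb instruction and the
    delay constants are made up, so this is only an approximation of the
    paper's versions.

        #include <unistd.h>   /* usleep(), a stand-in for a calibrated delay */

        /* (a) Spin on xchgb: every iteration performs an atomic exchange, so
           the waiting processor generates bus traffic the whole time the lock
           is held. */
        void lock_spin_xchg(volatile int *l)
        {
            while (__sync_lock_test_and_set(l, 1))
                ;
        }

        /* (b) Spin on read: wait on an ordinary read, which can be satisfied
           from the cache, and attempt the atomic exchange only when the lock
           appears free. */
        void lock_spin_read(volatile int *l)
        {
            for (;;) {
                while (*l != 0)
                    ;                        /* read-only spin */
                if (!__sync_lock_test_and_set(l, 1))
                    return;
            }
        }

        /* (c) Ethernet-style back-off: after each failed attempt, wait for a
           delay that grows with the number of failures before retrying. */
        void lock_backoff(volatile int *l)
        {
            unsigned delay = 1;
            while (__sync_lock_test_and_set(l, 1)) {
                usleep(delay);
                if (delay < 1024)
                    delay *= 2;              /* exponential back-off, capped */
            }
        }

        void unlock(volatile int *l)
        {
            __sync_lock_release(l);
        }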

     

    The authors then present a simple queueing network model of the thread
    package to demonstrate that the combination of processor degradation due
    to bus contention and the effect of lock contention can account for their
    measurements. The low-level model represents the effect of bus contention
    on processor speed; the high-level model represents the effect of lock
    contention on throughput and response time.

     

    With this paper and its supporting measurements, the authors clearly drive
    home the points they conclude:

    a) It is possible to implement a faster thread package

    b) Per-processor data structures can be used to improve throughput

    c) Spin-waiting can delay not only the processor waiting for a lock but
    also other processors doing work

    d) The cost of spin-waiting can be reduced by Ethernet-style back-off

     

    I think the paper is interesting to read; it discusses various thread
    management alternatives, focusing on throughput and latency, along with
    three approaches to spin-waiting.

     


