Review of 'The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors' paper

From: Praveen Rao (psrao_at_windows.microsoft.com)
Date: Mon Jan 26 2004 - 16:55:53 PST


    The paper discusses various thread management alternatives and their
    implications.

     

    First, a case is made for using threads for parallelism. Processes can
    be used for parallelism but are heavyweight: each process carries a
    volume of data structures (page tables, swap images, outstanding I/O,
    registers) that makes it expensive to initialize.

     

    Threads, in contrast, need only a limited amount of data and can be
    initialized much more quickly than processes. Threads have

    - program counter

    - stack (registers get saved on the stack as well)

    - control block (state information needed to manage the thread, e.g.
    to put it on queues or synchronize with it)

     

    associated with them. A thread must be passed arguments at startup;
    these would normally be copied onto the thread's stack, but to delay
    initializing the stack, the arguments are kept in the control block
    and copied over to the stack only when the thread starts to execute.
    Another optimization is to reuse stacks and control blocks as threads
    start and finish. Threads are put on ready queues when they are ready
    to be executed, and on condition queues if they are waiting for a
    condition to be satisfied (e.g. waiting for a lock held by another
    thread). In the latter case, when the condition is satisfied, the
    thread is moved over to the ready queue.
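
    To make this concrete, here is a minimal C sketch of a control block
    with deferred argument copying and free-list reuse. The structure and
    names (tcb, thread_create, thread_dispatch, MAX_ARGS) are my own
    illustration of these ideas, not code from the paper.

        #include <stdlib.h>

        #define MAX_ARGS   4
        #define STACK_SIZE 8192

        struct tcb {
            void       (*start)(void **args); /* entry point: the thread's "program counter" */
            void        *args[MAX_ARGS];      /* arguments parked here at creation           */
            int          nargs;
            char        *stack;               /* left NULL until the thread actually runs    */
            struct tcb  *next;                /* link for the free list / ready queue        */
        };

        static struct tcb *tcb_free_list;     /* control blocks reused as threads finish */
        static struct tcb *ready_queue;       /* threads that are ready to execute       */

        struct tcb *thread_create(void (*start)(void **), void **args, int nargs)
        {
            struct tcb *t = tcb_free_list;    /* reuse a finished thread's control block ... */
            if (t) tcb_free_list = t->next;
            else   t = malloc(sizeof *t);     /* ... or allocate a fresh one                 */

            t->start = start;
            t->nargs = nargs;
            for (int i = 0; i < nargs; i++)
                t->args[i] = args[i];         /* args go into the control block, not the stack */
            t->stack = NULL;                  /* stack initialization is deferred              */

            t->next = ready_queue;            /* put the thread on the ready queue             */
            ready_queue = t;
            return t;
        }

        /* Only when a processor dispatches the thread is the stack allocated
         * (in a real system, popped from a stack free list) and the
         * arguments moved onto it. */
        void thread_dispatch(struct tcb *t)
        {
            t->stack = malloc(STACK_SIZE);
            t->start(t->args);                /* args would be copied onto t->stack here */
        }

    The point of the sketch is simply that creation touches only the small
    control block; the large stack is touched only when the thread is
    actually dispatched.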

     

    Free lists of stacks and control blocks, and the ready queue, are
    examples of shared data structures used for thread management. There
    are various alternatives for managing them, with different throughput
    and latency implications.

     

    Latency is the cost of thread management when there is no contention.
    Throughput is the rate at which threads can be created, started, and
    finished when there is contention. There is usually a tradeoff between
    latency and throughput, though per-processor data structures can be
    used to try to improve both.

     

    There are five alternatives discussed in the paper:

    A. Single lock: shared data structures are protected by a single
    lock. This serializes the thread management code and limits throughput
    severely, though latency is low as there is only a single lock to
    be acquired.
    B. Multiple locks: each central data structure is protected by a
    different lock. This increases throughput compared to A but
    increases latency, as there are more locks to acquire.
    C. Local (per-processor) unlocked free lists, central locked ready
    queue: free lists can be maintained per processor, obviating the need
    to lock them. These lists can become highly unbalanced, though, if one
    processor starts the threads and another finishes them; to circumvent
    this, the lists are rebalanced. Local free lists can increase the
    total space taken by the free lists, thus trading space for time. This
    works fine for control blocks, which are small data structures, but
    not as well for stacks, which can be large, so local stack free lists
    are allowed to hold only one element.
    D. Idle queue: a central queue of idle processors, with local free
    lists as discussed above. There is parallelism that can be exploited:
    when a processor creates a thread, it cannot allocate the stack until
    it has finished allocating and initializing the control block. An idle
    processor can do some of this work in parallel, reducing the cost of
    thread creation: it pre-allocates a stack and control block and puts
    itself on the idle queue waiting for work. In this alternative, work
    finds the processor (instead of the processor finding work, as with
    ready queues). The approach trades higher latency when no processor is
    idle for lower latency when idle processors are available.
    E. Local ready queues, local free lists: this approach tries to
    remove the bottleneck of a single ready/idle queue (a sketch of the
    scheme follows this list). Unlike control block free lists, unlocked
    ready queues can be inefficient even if balanced (e.g. an idle
    processor could have an empty ready queue while a runnable thread sits
    in a busy processor's queue and the global pool is empty). To avoid
    this, an idle processor needs to search all queues for work, starting
    with its own, and each queue has to be locked. In the worst case of a
    single processor enqueuing and dequeuing work, its queue becomes
    equivalent to a central queue, and worse yet, the other processors
    have to find it. To avoid this, a queue can be chosen at random for
    each new thread. If each queue is equally likely to receive a new
    thread, latency is poor when the number of runnable threads is close
    to the number of processors, since the cost of finding a runnable
    thread then includes scanning many queues. One reason to have a
    one-to-one correspondence between processors and ready queues is
    locality: migrating an application's thread to a different processor
    can cause cache misses. If locality is not important, then more queues
    increase throughput but also increase latency.
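
    As an illustration of alternative E, below is a rough C sketch of an
    idle processor searching the per-processor ready queues for work,
    starting with its own; each queue carries its own lock. The names
    (NPROC, ready_q, find_work) and the use of pthread mutexes are my
    assumptions for the sketch, not details from the paper.

        #include <pthread.h>
        #include <stddef.h>

        #define NPROC 16                  /* number of processors (illustrative) */

        struct tcb {                      /* minimal control block, as in the earlier sketch */
            struct tcb *next;
        };

        struct ready_q {
            pthread_mutex_t lock;         /* each local ready queue has its own lock;
                                             assumed initialized at startup with
                                             pthread_mutex_init */
            struct tcb     *head;
        };

        static struct ready_q ready_q[NPROC];

        /* An idle processor scans every ready queue, starting with its own,
         * and must lock each queue before inspecting it. */
        struct tcb *find_work(int my_cpu)
        {
            for (int i = 0; i < NPROC; i++) {
                struct ready_q *q = &ready_q[(my_cpu + i) % NPROC];

                pthread_mutex_lock(&q->lock);
                struct tcb *t = q->head;
                if (t)
                    q->head = t->next;    /* dequeue a runnable thread */
                pthread_mutex_unlock(&q->lock);

                if (t)
                    return t;             /* found work on some processor's queue */
            }
            return NULL;                  /* every queue was empty: remain idle */
        }

    In the common case the loop exits on the first iteration (the local
    queue), which is what preserves low latency and locality; the scan of
    the other queues is the price paid to keep idle processors from
    starving while work sits elsewhere.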

     

    The paper then discusses experiments with these alternatives. As
    expected, local ready queues provide the best throughput, and the
    advantage becomes more marked as the number of processors increases.
    Latency is lowest with the idle queue when the number of runnable
    threads is less than the number of processors, but becomes highest
    when the number of runnable threads grows beyond the number of
    processors. This is because thread creation is faster if there is an
    idle processor that can help with it; if there is no idle processor,
    the mechanism only adds overhead. With a larger number of runnable
    threads, local free lists have the lowest latency, as they do not
    incur the overhead of the other alternatives.

     

    The experiments reveal several enlightening things:

    - Adding even a single lock to the thread management path increases
    latency immensely. Per-processor data structures are thus important
    for increasing throughput without sacrificing latency.

    - Additional complexity in the thread management path increases
    latency. Even a small number of additional instructions can have a
    large adverse effect.

    - A large portion of the thread management path is locked (as this
    path mostly modifies shared data), hence a single lock severely limits
    throughput. Even local free lists are not good enough; large
    throughput results only from local ready queues.

    - When lock contention is not a problem, the bandwidth of the bus
    limits throughput (on any bus-structured shared-memory system).

     

    The paper then discusses spin lock management alternatives, another
    important aspect of parallelism. It highlights that spinning
    processors slow down the working processors because they cause bus
    contention as they check the status of the lock. Ethernet-style
    backoff in checking the lock state is proposed as a solution to this
    problem: a processor that repeatedly finds the lock busy waits longer
    and longer between checks, so idling processors cause less and less
    interruption to the working processors.
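
    To illustrate the backoff idea, here is a hedged sketch of a spin lock
    with Ethernet-style exponential backoff, written with C11 atomics. The
    constants, names, and the test-then-test-and-set structure are my
    choices for the sketch, not the paper's exact scheme.

        #include <stdatomic.h>

        static atomic_int lock_word = 0;  /* 0 = free, 1 = held */

        /* Crude busy-wait pause between probes of the lock. */
        static void delay(unsigned iters)
        {
            for (volatile unsigned i = 0; i < iters; i++)
                ;
        }

        void spin_lock(void)
        {
            unsigned backoff = 1;

            for (;;) {
                /* probe with an ordinary read first; only attempt the
                 * atomic grab when the lock looks free */
                if (atomic_load_explicit(&lock_word, memory_order_relaxed) == 0 &&
                    atomic_exchange_explicit(&lock_word, 1, memory_order_acquire) == 0)
                    return;

                /* lock was busy: back off for an exponentially growing
                 * (capped) interval, so waiting processors put less and
                 * less traffic on the bus */
                delay(backoff);
                if (backoff < 1024)
                    backoff *= 2;
            }
        }

        void spin_unlock(void)
        {
            atomic_store_explicit(&lock_word, 0, memory_order_release);
        }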

     

    Overall this was a very nice paper that gave a good overview of
    thread management and presented and analyzed various thread management
    alternatives. It did not, however, touch upon condition queues or the
    effect of thread synchronization on the thread management
    alternatives.

