Performance-Correlated Architectural Events in Concurrent Programs
Modern computer architectures implement a variety of special registers called
"performance counters" that track the frequency of certain architectural
events. For example, on Intel processors, a register recording the number of
L2 cache misses is available. As architects, we are interested in identifying
where we stand to gain performance (or reduce power, or increase reliability).
An interesting project might be to monitor many different performance counters
over a program's execution and to look for correlations between high or low
performance and high or low values of some of the counters. Doing so would
better guide us to the operations we can optimize for performance. Of
particular interest to this course, Intel's Nehalem architecture exposes new
performance counters enabling introspection on multiprocessing-related events,
such as cache coherence activity -- correlations between these events and
performance behavior might be of particular interest. An especially
interesting project would be to use the collected correlations to direct a
static or dynamic program optimization.
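For concreteness, a minimal sketch of reading one such counter follows; it
assumes a Linux system exposing the perf_event_open interface, and the
measured workload loop is only a stand-in for the program region of interest.

    /* Minimal sketch: count last-level cache misses around a code region
     * using Linux's perf_event_open interface. The workload loop below is a
     * placeholder; a real project would read many counters and correlate
     * them with per-phase performance. */
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* There is no glibc wrapper for this system call. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* last-level cache misses */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* Count events for this process, on whichever CPU it runs on. */
        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* Placeholder workload: stride through a large array to cause misses. */
        enum { N = 1 << 24 };
        static volatile char buf[N];
        for (long i = 0; i < N; i += 64)
            buf[i]++;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
            count = 0;
        printf("cache misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }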
Identifying Consequential Compiler Reorderings
Compilers frequently reorder memory accesses in programs during the
optimization process. For example, an access to some variable inside a loop
may be hoisted above the loop if its value doesn't change during the loop's
execution. This is valid for single-threaded programs because if the
variable's value is unchanged in the loop, then it is effectively constant. In
multi-threaded programs, however, some other thread could be concurrently
accessing that variable, so the access can no longer safely be hoisted.
Typically this is handled conservatively by disabling such reordering
optimizations in multithreaded code accessing shared variables. An interesting
project would be to identify snippets of code subject to such reorderings
(simple enough by analyzing single-threaded compilations). Programs with such
reorderings can then be made multithreaded (preserving the reorderings from
the single-threaded compilation). Running the programs with concurrent
accesses to the variables involved in the reordered accesses could result in
violations of Sequential Consistency. It would be interesting to determine how
often this happens: How frequently are the values produced by reordered stores
observed by other threads before they should be? How frequently do values read
by reordered reads fail to reflect the state of memory at the point in the
execution when the read would actually have executed? This would characterize
the impact of compiler optimizations on memory consistency. It might also
reveal some semi-safe optimization opportunities: reorderings that are unsafe
in theory, but that in practice rarely (or never) result in violations of SC.
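The kind of reordering in question can be made concrete with a small,
deliberately racy example (all names invented): a consumer spins on a plain
shared flag, and a single-threaded compilation is free to hoist the flag's
load out of the loop.

    /* Deliberately racy illustration of the hoisting described above: `done`
     * is a plain shared int, so when the consumer is compiled as
     * single-threaded code with optimization, the compiler may hoist the
     * load of `done` out of the spin loop and the loop may never terminate;
     * the payload store may likewise be reordered. Build with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    static int done = 0;     /* plain (non-atomic, non-volatile) shared flag */
    static int payload = 0;  /* written before `done` is set by the producer */

    static void *producer(void *arg)
    {
        (void)arg;
        payload = 42;
        done = 1;            /* may be reordered with the payload store */
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (!done)        /* load of `done` is a candidate for hoisting */
            ;
        printf("payload = %d\n", payload);  /* may print 0 under reordering */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

Comparing the loop the compiler emits at -O0 against -O2, or against a
version with the flag declared volatile or _Atomic, exposes exactly the
reordering such a project would need to catalogue.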
Implementing and Characterizing a Lock-Free Data Structure
Lock-free data structures are high-performance containers designed to be
accessed by multiple threads without forcing any of the accessing threads to
be blocked from making progress (i.e., to wait). Typically, the
implementations of these data structures use atomic compare-and-swap (CAS)
operations to perform atomic read-and-update accesses. These atomic operations
are available on most modern multiprocessor architectures, either as a CAS
instruction that atomically reads and updates a word, or as a
load-linked/store-conditional (LL/SC) instruction pair that makes a read and
a subsequent update appear atomic.
An interesting project would be to implement a lock-free data structure (such
as a lock-free double-ended queue) using the hardware primitives available on,
for instance, Intel processors. Then it would be interesting to compare your
lock-free version of the data structure against a version implemented using
standard synchronization techniques (mutexes, etc.) to identify the
performance and power-consumption benefits (or costs?) of using a lock-free
structure. Additionally, since lock-free structures tend to be more complex to
implement, a simple comparison of the implementation complexity of each might
be interesting as well.
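To make the CAS retry pattern concrete, the sketch below implements a
Treiber-style lock-free stack with C11 atomics; it is only an illustration of
how the primitive is used, not the deque suggested above, and it sidesteps
the ABA problem and safe memory reclamation.

    /* Minimal sketch of a Treiber-style lock-free stack built on
     * compare-and-swap (C11 atomics). A serious implementation must also
     * address ABA and memory reclamation, which are ignored here. */
    #include <stdatomic.h>
    #include <stdlib.h>

    struct node {
        int value;
        struct node *next;
    };

    static _Atomic(struct node *) head = NULL;

    void push(int value)
    {
        struct node *n = malloc(sizeof(*n));
        if (!n)
            return;                    /* allocation failure: drop the push */
        n->value = value;
        n->next = atomic_load(&head);
        /* Retry until our node is installed as the new head; on failure the
         * CAS reloads n->next with the current head. */
        while (!atomic_compare_exchange_weak(&head, &n->next, n))
            ;
    }

    int pop(int *out)
    {
        struct node *n = atomic_load(&head);
        /* On failure the CAS reloads n with the current head. */
        while (n != NULL && !atomic_compare_exchange_weak(&head, &n, n->next))
            ;
        if (n == NULL)
            return 0;                  /* stack was empty */
        *out = n->value;
        free(n);   /* unsafe under concurrent pops without reclamation */
        return 1;
    }

Benchmarking this against an equivalent structure protected by a pthread
mutex, across varying thread counts, is one way to get at the performance
comparison described above.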
Identifying the Costs/Benefits of Cavalier Synchronization Elimination
Multi-threaded programs impose order on memory accesses made from different
threads using synchronization operations. The goal of this is to ensure
sequentially consistent execution. Sometimes, however, these synchronization
operations may be completely or mostly unnecessary. An example of this would
be a program in which the acquisition of some lock is implied by the
acquisition of some other lock (that is, the two locks are always acquired
together). In this example, as long as one lock's acquisition is truly implied
by the other's, the implied lock acquisition can be eliminated, and the two can
be replaced with one. Another example is when some lock is acquired but no
concurrent accesses to the data it protects ever occur, eliminating the need
for the ordering constraint. An interesting investigation would be to see how
much synchronization is actually necessary. To do this, one might begin by
simply replacing all synchronization operations with no-ops. This will likely
result in program crashes, but working backwards, one could re-introduce
synchronization until the program was able to reliably (or unreliably?)
execute again. If some synchronization is identified that is rarely or never
useful, the performance gains of its elimination can be measured. In addition,
because the program without the synchronization is likely broken, it might be
interesting to quantify *how* broken it is, by evaluating its mean time to
failure (MTTF), or the frequency with which it crashes or produces incorrect
results.
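As one possible starting point for the no-op experiment, the shim sketched
below uses LD_PRELOAD interposition to turn the pthread mutex operations into
immediate successes. It assumes a dynamically linked Linux/glibc target, and
the file names are arbitrary; condition variables, reader-writer locks, and
atomic operations would need similar or more careful treatment, and
selectively re-introducing individual locks would require filtering, e.g. by
mutex address or call site.

    /* Sketch of an LD_PRELOAD shim that "no-ops" the pthread mutex calls
     * of an unmodified, dynamically linked target program.
     *
     * Build:  gcc -shared -fPIC -o nolocks.so nolocks.c
     * Run:    LD_PRELOAD=./nolocks.so ./target_program
     */
    #include <pthread.h>

    int pthread_mutex_lock(pthread_mutex_t *m)    { (void)m; return 0; }
    int pthread_mutex_trylock(pthread_mutex_t *m) { (void)m; return 0; }
    int pthread_mutex_unlock(pthread_mutex_t *m)  { (void)m; return 0; }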