University of Washington Computer Science & Engineering
 CSE 548: Computer Architecture - Winter 2010


Class Project


Project Milestones

Milestone  Topic                   Due date
1          Project Meeting Signup  Done by 1/13
2          Project Proposal        1/20
3          Project Presentation    3/19, 8:30 AM - 1:00 PM
4          Project Reports         3/21

Project Groups

Group Name  Group Members
cse548a     Punya
cse548b     Tony, DJ
cse548c     Emily, Adrian, Ben
cse548d     Nick, Paramjit
cse548e     Mohammad
cse548f     Owen
Your project group has a workspace directory on the department servers. It is located at /projects/instr/10wi/cse548/[a-f], depending on your group. For instance, group cse548b can work in /projects/instr/10wi/cse548/b.

Project Resources

TraceGen: A Pintool that collects traces of program events
AVBugBench: A collection of programs with a variety of atomicity violation bugs in them
MultiCacheSim: A MESI coherent multiprocessor cache simulator
Pin: A Binary Instrumentation Infrastructure
SuperESCalar Simulator
PowerTOP: A Linux Power Consumption Monitor
GCC Atomic Operation Documentation
PARSEC: A Multi-Threaded Benchmark Suite
SPEC: The Standard Performance Evaluation Benchmark Suite

Project Ideas

Here are a couple of project ideas discussed in some detail, to give you a sense of how involved your project for this course should be. Feel free to use any of these ideas, or any of the ideas in the "Other Project Suggestions" section below.
Performance-Correlated Architectural Events in Concurrent Programs
Modern computer architectures implement a variety of special registers called "performance counters" that track the frequency of certain architectural events. For example, Intel processors provide a register recording the number of L2 cache misses. As architects, we are interested in identifying where we stand to gain performance (or reduce power, or increase reliability). An interesting project would be to monitor many different performance counters over a program's execution and find correlations between high/low performance and high/low values of some of the counters; such correlations point toward the operations most worth optimizing. Of particular interest to this course, Intel's Nehalem architecture exposes new performance counters that enable introspection on multiprocessing-related events, such as cache coherence activity; correlations between these events and performance behavior might be of particular interest. An especially interesting project would use the collected correlations to direct a static or dynamic program optimization.
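As a starting point, here is a minimal sketch of reading one hardware counter on Linux via the perf_event_open system call. The event selection below is generic; Nehalem's coherence-related events would require their architecture-specific raw event codes (PERF_TYPE_RAW), and existing tools such as perf or PAPI could be used instead.

    /* Minimal sketch: count one hardware event around a region of interest.
     * Uses Linux's perf_event_open syscall; the event choice is generic. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static long perf_open(struct perf_event_attr *attr)
    {
        /* There is no glibc wrapper for this syscall; invoke it directly. */
        return syscall(__NR_perf_event_open, attr,
                       0 /* this process */, -1 /* any cpu */,
                       -1 /* no group */, 0);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* last-level cache misses */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = perf_open(&attr);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... region of the program whose behavior you want to correlate ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count;
        read(fd, &count, sizeof(count));
        printf("cache misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }

Sampling many such counters periodically, and aligning them with a measure of progress (for example, instructions retired per interval), gives the data needed to look for correlations.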
Identifying Consequential Compiler Reorderings
Compilers frequently re-order memory accesses during optimization. For example, an access to a variable inside a loop may be hoisted above the loop if its value doesn't change during the loop's execution. This is valid for single-threaded programs because, if the variable's value is unchanged in the loop, it is effectively constant. In multi-threaded programs, however, some other thread could be concurrently accessing that variable, so the access can no longer be hoisted. Typically this is handled conservatively by disabling such re-ordering optimizations in multithreaded code that accesses shared variables. An interesting project would be to identify snippets of code subject to such re-orderings (simple enough by analyzing single-threaded compilations). Programs with such reorderings can then be made multithreaded while preserving the reorderings from the single-threaded compilation. Running these programs with concurrent accesses to the variables involved in the re-ordered accesses could result in violations of Sequential Consistency (SC). It would be interesting to determine how often this happens: how frequently are the values produced by re-ordered stores observed by other threads before they should be? How frequently do values read by re-ordered loads fail to reflect the state of memory at the point in the execution where the read would actually have executed? This would characterize the impact of compiler optimizations on memory consistency. It might also reveal some semi-safe optimization opportunities: re-orderings that are unsafe in theory, but that rarely (or never) result in SC violations in practice.
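To see the kind of reordering in question, consider this small, hypothetical example: an optimizing compiler may treat the load of the flag as loop-invariant and hoist it above the loop, which is invisible in single-threaded code but can leave the waiting thread spinning forever once a second thread sets the flag.

    /* Hypothetical example: a load the compiler may hoist out of a loop.
     * Compile with optimization (e.g., gcc -O2 -pthread) to see the effect. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    int done = 0;   /* deliberately NOT volatile and NOT lock-protected */

    static void *waiter(void *arg)
    {
        (void)arg;
        /* With optimization, the compiler may read `done` once and hoist the
         * load above the loop, effectively compiling this to an infinite
         * loop whenever `done` is initially 0. */
        while (!done)
            ;  /* spin */
        printf("observed done\n");
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, waiter, NULL);
        sleep(1);
        done = 1;              /* the waiter may never observe this store */
        pthread_join(t, NULL);
        return 0;
    }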
Implementing and Characterizing a Lock-Free Data Structure
Lock-free data structures are high-performance containers designed to be accessed by multiple threads without forcing any accessing thread to wait for another before making progress. Typically, implementations of these data structures use atomic compare-and-swap (CAS) operations to perform atomic reads and updates. These atomic operations are available on most modern multiprocessor architectures, either as a CAS instruction that atomically reads and updates a word, or as a Load-Linked/Store-Conditional instruction pair that ensures the atomicity of a read and an update performed between them. An interesting project would be to implement a lock-free data structure (such as a lock-free double-ended queue) using the hardware primitives available on, for instance, Intel processors. Then, by comparing your lock-free version of the data structure against a version implemented with standard synchronization techniques (mutexes, etc.), it would be interesting to identify the performance and power-consumption benefits (or costs) of the lock-free structure. Additionally, since lock-free structures tend to be more complex to implement, a simple comparison of the implementation complexity of each might be interesting as well.
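As a sketch of what a CAS-based implementation looks like, here is a simple Treiber-style lock-free stack built on the GCC atomic builtins listed in the resources above. It is illustrative only: it ignores the ABA problem and memory reclamation, both of which a full project (and certainly a double-ended queue) would have to address.

    /* Treiber-style lock-free stack sketch using GCC __sync builtins. */
    #include <stdlib.h>

    typedef struct node {
        void        *value;
        struct node *next;
    } node_t;

    typedef struct {
        node_t *top;
    } lfstack_t;

    void lfstack_push(lfstack_t *s, void *value)
    {
        node_t *n = malloc(sizeof(*n));
        n->value = value;
        do {
            n->next = s->top;
            /* Retry if another thread changed `top` between the read and the CAS. */
        } while (!__sync_bool_compare_and_swap(&s->top, n->next, n));
    }

    void *lfstack_pop(lfstack_t *s)
    {
        node_t *old;
        do {
            old = s->top;
            if (old == NULL)
                return NULL;                   /* stack is empty */
            /* Unsafe without a reclamation scheme: `old` may be freed and its
             * address reused by another thread (the ABA problem). */
        } while (!__sync_bool_compare_and_swap(&s->top, old, old->next));
        return old->value;
    }

A pthread_mutex-protected version of the same structure, run under varying thread counts, is the natural baseline for the performance and power measurements.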
Identifying the Costs/Benefits of Cavalier Synchronization Elimination
Multi-threaded programs impose order on memory accesses made from different threads using synchronization operations, with the goal of ensuring sequentially consistent execution. Sometimes, however, these synchronization operations are completely or mostly unnecessary. One example is a program in which the acquisition of some lock is implied by the acquisition of another lock (that is, the two locks are always acquired together); as long as one lock's acquisition is truly implied by the other's, the implied acquisition can be eliminated and the two replaced with one. Another example is a lock that is acquired even though no concurrent accesses ever occur, eliminating the need for the ordering constraint. An interesting investigation would be to see how much synchronization is actually necessary. One might begin by simply replacing all synchronization operations with no-ops. This will likely result in program crashes, but, working backwards, one could re-introduce synchronization until the program is able to execute reliably (or unreliably?) again. If some synchronization is identified that is rarely or never useful, the performance gains of eliminating it can be measured. In addition, because the program without the synchronization is likely broken, it might be interesting to quantify *how* broken it is, by evaluating its MTTF, or the frequency with which it crashes or produces incorrect results.
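One cheap way to start the "replace all synchronization with no-ops" experiment, assuming the benchmarks link against pthreads dynamically, is an LD_PRELOAD shim that overrides the lock routines. Everything here beyond the standard pthreads names is an assumption about how one might run the experiment; selectively re-introducing synchronization at individual call sites would then require instrumentation, for example with Pin.

    /* Hypothetical no-op synchronization shim.
     * Build:  gcc -shared -fPIC -o nosync.so nosync.c
     * Run:    LD_PRELOAD=./nosync.so ./benchmark
     * Code that relies on condition variables will likely misbehave, which is
     * part of what the experiment is measuring. */
    #include <pthread.h>

    int pthread_mutex_lock(pthread_mutex_t *m)    { (void)m; return 0; }
    int pthread_mutex_unlock(pthread_mutex_t *m)  { (void)m; return 0; }
    int pthread_mutex_trylock(pthread_mutex_t *m) { (void)m; return 0; }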

Other Project Suggestions

  • Can hardware make software bugs less likely to manifest?
  • What are applications that tolerate computing errors (image manipulation, video, search?)
  • What can the architecture provide to make modern languages faster?
    • e.g., dynamic languages like JavaScript, Perl, Ruby, ...
  • Are there architecture opportunities for desktop data mining?
    • text manipulation? graph analysis?
  • Architecture and Security
    • Are current multiprocessor architectures insecure? Are there issues regarding isolation of concurrent computation? Can side-channel information be exploited?
      • Timing of memory accesses, coherence, etc
    • Simplifying dynamic information flow tracking
  • Can we design compilers to generate more power efficient code?
  • Can we profile the behavior of mobile applications to learn about their power or performance characteristics in a way that we can take advantage of?
  • Can we use parallelism to reduce power consumption? How can we measure this?
  • Is there a way we can memoize computation across cores in a multiprocessor?
    • For instance, we call some function and see that another core has already executed that function with the same input memory state; we can then just use the result of that computation instead of actually performing it. (A software sketch of this idea appears after this list.)
  • How can we reduce the complexity of key components?
    • As you will see, some processor components are very ugly
  • How unreliable can you make a processor while keeping it usable (perhaps only statistically correct)?
    • Allowing more unreliability might let the processor be built more cheaply.
  • How much data redundancy exists in the caches of a CMP? Can we take advantage of it somehow?
  • Can we more efficiently use the cores in a multiprocessor by "moving computation" between cores, instead of moving data between cores?
  • How well can branches and memory dependences be predicted using machine learning? Can looking at the accuracies of different techniques lend insight into the problems? Could a tiered approach be developed, in which simple branches are predicted by simple predictors and hard ones with fancier techniques?
  • Fast, Cheap, non-volatile storage
    • How would the memory hierarchy be different using only non-volatile storage?
    • How would the OS be different?
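
To make the cross-core memoization suggestion above concrete, here is a purely illustrative software sketch (names and structure are assumptions; a hardware study would more likely be done in a simulator such as SESC, or by instrumenting programs with Pin). Threads share a table keyed on a pure function's input, and a wrapper reuses any result another thread has already computed.

    /* Hypothetical sketch of cross-thread memoization in software: a shared
     * table maps a pure function's input to its previously computed result.
     * A hardware version would perform the lookup transparently across cores. */
    #include <pthread.h>
    #include <stdint.h>

    #define TABLE_SIZE 1024

    typedef struct { int valid; uint64_t input; uint64_t result; } memo_entry_t;

    static memo_entry_t    table[TABLE_SIZE];
    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-in for an expensive, pure function we want to memoize. */
    static uint64_t expensive(uint64_t x)
    {
        uint64_t r = 0;
        for (uint64_t i = 0; i < 1000000; i++)
            r += (x ^ i) * 2654435761u;
        return r;
    }

    /* Wrapper: reuse a result computed by any thread/core, if present. */
    uint64_t memoized_expensive(uint64_t x)
    {
        size_t slot = x % TABLE_SIZE;

        pthread_mutex_lock(&table_lock);
        if (table[slot].valid && table[slot].input == x) {
            uint64_t r = table[slot].result;
            pthread_mutex_unlock(&table_lock);
            return r;                      /* hit: skip the computation */
        }
        pthread_mutex_unlock(&table_lock);

        uint64_t r = expensive(x);         /* miss: compute as usual */

        pthread_mutex_lock(&table_lock);
        table[slot].valid  = 1;
        table[slot].input  = x;
        table[slot].result = r;
        pthread_mutex_unlock(&table_lock);
        return r;
    }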

