Due: May 18th via Canvas.
Questions about this assignment should be posted on the Discussion Board.
Turn in an electronic copy in PDF format via Canvas.
Shared memory and message passing are fundamentally different models for multiprocessing. Each implies a different hardware implementation as well as a different view from the programmer’s perspective. In about two paragraphs, discuss the differences between shared memory and message passing. Consider the aspects of programmability (ease and expressiveness of programming) and performance. (There’s no need to argue that one model is “better” than the other; just discuss the trade-offs involved.)
Shared-memory multiprocessors and message-passing computers are implementations of the above concurrency models. In addition, vector machines are computers that support “single instruction, multiple data” (SIMD) programs. For each of these three classes of machines, give one example of an application that is well-suited to the architecture. For each, briefly argue why the architecture in question is better-suited to the application than the other two architectures.
The barrier synchronization primitive is a simple construct for coordinating multiple threads of execution. Describe a situation in which a barrier is an appropriate form of synchronization (i.e., more appropriate than other primitives such as mutual exclusion locks). Then describe a scenario in which the use of a barrier can lead to poor performance.
For the first part below, assume an architecture with a Weak Ordering memory consistency model:
Lazy Initialization & Double-Checked Locking
There is a common idiomatic programming technique called “lazy initialization” of an object. A pointer is initially NULL, and remains NULL until its first use, at which point a new object is created, and the pointer is set to point to the new object. This obviously requires a NULL check before every access to this pointer—if it is NULL, then the object must be created, and if not, then the object can be used.
As an optimization for lazy initialization with multiple threads, some clever person invented “double-checked locking.” In double-checked locking, the NULL check is made efficient by accessing the pointer outside of a critical region (i.e., without holding the lock associated with the pointer). If this check indicates the pointer is NULL, then the lock is acquired, and a second NULL check is performed (to be sure nothing changed between the first NULL check and the lock acquire). Then, under the protection of the lock, the object’s constructor is called, and the pointer is set to point to the new object. The relevant code is listed below:
// Initially x = NULL and x is shared.
// Mutual exclusion is enforced with lock L.
...
if (x == NULL) {
    LOCK(L);
    if (x == NULL) {
        x = new X();
    }
    UNLOCK(L);
}
print x.a + " " + x.b + " " + x.c;
...

class X {
    ...
    X() {
        this.a = 0;
        this.b = 0;
        this.c = 0;
    }
}
Is this optimization safe? If not, what could possibly go wrong?
Identify the Outputs
For this problem, consider a weakened version of the weak ordering memory model: weaker ordering. Weaker ordering permits the reorderings permitted by weak ordering, but does not provide write atomicity. Recall that write atomicity ensures that if any processor reads the result of a write, then all other processors must also read that value.
Look at the code for Threads 1, 2, 3, and 4. Threads 3 and 4 read the shared variables X and Y into their registers r1, r2, r3, and r4. Initially, X and Y are 0. Fill in the Weaker Ordering column in the table below to indicate which combinations of output register values are valid under the weaker ordering memory model: if an outcome is valid, mark the cell with a check; if it is invalid, leave it empty.
Next, do the same assuming Sequential Consistency instead of Weaker Ordering, filling in the Sequential Consistency column.
Thread 1 | Thread 2 | Thread 3 | Thread 4 |
---|---|---|---|
X = 1 | Y = 1 | r1 = X; r2 = Y | r3 = Y; r4 = X |
r1 | r2 | r3 | r4 | Weaker Ordering | Sequential Consistency |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | ||
0 | 0 | 0 | 1 | ||
0 | 0 | 1 | 0 | ||
0 | 0 | 1 | 1 | ||
0 | 1 | 0 | 0 | ||
0 | 1 | 0 | 1 | ||
0 | 1 | 1 | 0 | ||
0 | 1 | 1 | 1 | ||
1 | 0 | 0 | 0 | ||
1 | 0 | 0 | 1 | ||
1 | 0 | 1 | 0 | ||
1 | 0 | 1 | 1 | ||
1 | 1 | 0 | 0 | ||
1 | 1 | 0 | 1 | ||
1 | 1 | 1 | 0 | ||
1 | 1 | 1 | 1 |
You are TAing a class, and your students are writing their very first parallel program. It creates a couple of objects and a couple of threads, and each thread manipulates one of the objects. One student's code looks like this (in C-like code):
struct Foo {
    byte b[10];
};
...
main() {
    Foo f;
    Foo g;
    Thread t1 = createThread(doStuff, &f /* f is an arg to doStuff in t1 */);
    Thread t2 = createThread(doStuff, &g /* g is an arg to doStuff in t2 */);
    t1.start();
    t2.start();
}
...
doStuff(Foo *foo) {
    for (i in 1 to 1000) {
        for (j in 0 to 9) {
            foo->b[j]++;
        }
    }
}
The code is pretty simple. However, because of the budget cuts at UW this year, you're forced to compile your code with GCC--, a cheap knockoff of the GCC compiler written by a bunch of undergrads at Rural Ontario Regional Mining Institute (RORMI). You happen to have a real copy of GCC on your personal computer, and you notice that when the application is compiled with GCC, its performance is significantly better than with GCC--. Explain a possible reason for this, and describe how, if you were stuck with GCC--, you could change your code to eliminate the poor performance. Note: it is possible for a compiler to flatten the outer loop into a single add for each array element, but that's not what we're after here.
Suppose you have a multiprocessor that guarantees sequential consistency, all the way down to main memory. Describe how a programmer could still end up with violations of sequential consistency. Include a code example if you can, and discuss the cycle in the happens-before graph. Hint: Think about memory ordering at all levels of the system stack (hardware and software).