Homework Assignment 3

Writing Portion

Summarize the arguments for and against providing sequential consistency at the ISA level. What memory model do you believe should be provided at the hardware level? The language level?

You should use these papers to motivate your discussion:

"Shared Memory Consistency Models: A Tutorial", Sarita Adve, Kourosh Gharachorloo, DEC WRL Techreport 95/7, September 1995.
"Multiprocessors should support simple memory-consistency models", Mark Hill, IEEE Computer, August 1998

Snoopy Cache Problems

Cache coherence protocols require a total ordering on protocol messages for correctness. Provide a short scenario where having only a partial ordering leads to incorrect and unrecoverable behavior.
The MESI protocol extends the base MSI protocol by adding an Exclusive state. Complete these tables to describe the new protocol.
The actions read, write, replace refer to local cache actions.
The bus transactions are:
- CR for a coherent read request (on read-miss)
- CRI coherent read and invalidate (write allocate on write-miss)
- CI coherent invalidate (write hit on shared block)
- WR write back of block
- CCI cache-to-cache-interference is used to satisfy another cache's request.
Complete 4.6 from the Book

Directory Coherence Problems

Complete 4.16 parts c,e-h
Complete 4.22
Consider the case where the directory sends a getM message to a processor using the protocol from the previous two problems. This means that the directory believes that the modified data resides at that node.
1. what states (other than M) can the cache controller be in when the Forwarded_getM arrives?
2. use a sequence of events to demonstrate how this situation can arise.
The protocol assumes FIFO message passing between any two points. Assume the network is Non-FIFO.
Give a scenario that shows how the protocol can fail.
This protocol uses "silent replacement" of shared data. In other words, nodes may replace shared blocks without notifying the directory.
What are the advantages and disadvantages of this decision?
Consider the directory protocol used in the past 5 problems. Under this scheme, a write miss by P_i causes the following sequence of actions:
1. P_i sends address A as part of a getM to directory
2. directory looks up A, finds owner processor P_m
3. directory forwards A as part of getM to P_m, marks P_i as owner
4. P_m receives getM, sends a Data message to directory, and invalidates the line
5. directory sends data to P_i
6. P_i copies data into cache, enters Modified state, and performs the write
In essence this scheme results in 2s+2 short messages (inv/ack/getM) where s is the number of sharers and 2 data messages. Now consider the following optimization.
1. P_i sends address A as part of a getM to directory
2. directory looks up A, finds owner processor P_m
3. directory forwards (A,P_i) as part of getM to P_m, marks P_i as owner
4. P_m receives getM, sends a Data message to P_i, and invalidates the line
5. P_i copies data into cache, enters Modified state, and performs the write
Under this potential optimization we have eliminated a data message, at the expense of leaving the directory without a copy of the current data.

Is this scheme correct? If so, compare the latency of 4.22c to value calculated above. Otherwise, provide a short example demonstrating a flaw.

Memory Consistency Problems (Thanks to Krste Asanovic)

Cy D. Fect wants to run the following code sequences on processors P1 and P2, which are part of a two-processor MIPS64 machine. The sequences operate on memory values located at addresses A, B, C, D, E, F, and G, which are all sequentially located in memory (e.g. B is 8 bytes after A).

Initially, M[A], M[B], M[C], M[D], and M[E] are all 0. M[F] is 1, and M[G] is 2. For each processor, R1 contains the address A (all of the addresses are located in a shared region of memory). Also, remember that for a MIPS processor, R0 is hardwired to 0. Note: a semicolon is used to indicate the start of a comment.

        P1                        P2
        ADDI R2,R0,#1   ;R2=1     ADDI R2,R0,#1   ;R2=1
        SD   R2,0(R1)   ;A =R2    SD   R2,8(R1)   ;B =R2
        LD   R3,40(R1)  ;R3=F     LD   R3,48(R1)  ;R3=G
        SD   R3,16(R1)  ;C =R3    SD   R3,16(R1)  ;C =R3
        LD   R4,8(R1)   ;R4=B     LD   R4,0(R1)   ;R4=A
        SD   R4,24(R1)  ;D =R4    SD   R4,32(R1)  ;E =R4

If Cy’s code runs on a system with a sequentially consistent memory model, what are the possible results of execution?

List all possible results in terms of values of M[C], M[D], and M[E] (All other locations will be the same across all possible execution paths).
Assume now that Cy’s code is run on a system that does not guarantee sequential consistency, but that memory dependencies are not violated for the accesses made by any individual processor. The system has a MEMBAR memory barrier instruction that guarantees the effects of all memory instructions executed before the MEMBAR will be made globally visible before any memory instruction after the MEMBAR is executed.

Add MEMBAR instructions to Cy’s code sequences to give the same results as if the system were sequentially consistent. Use the minimum number of MEMBAR instructions.

Now consider a machine that uses finer-grain memory barrier instructions. The following instructions are available:

MEMBAR_rr guarantees that all read operations initiated before the MEMBAR_rr will be seen before any read operation initiated after it.
MEMBAR_rwguarantees that all read operations initiated before the MEMBAR_rw will be seen before any write operation initiated after it.
MEMBAR_wrguarantees that all write operations initiated before the MEMBAR_wr will be seen before any read operation initiated after it.
MEMBAR_wwguarantees that all write operations initiated before the MEMBAR_ww will be seen before any write operation initiated after it

There is no generalized MEMBAR instruction as in the previous problem.

In total store ordering (TSO), a read may complete before a write that is earlier in program order if the read and write are to different addresses and there are no data dependencies. For a machine using TSO, insert the minimum number of memory barrier instructions into the code sequences for P1 and P2 so that sequential consistency is preserved.
In partial store ordering (PSO), a read or a write may complete before a write that is earlier in program order if they are to different addresses and there are no data dependencies. For a machine using PSO, insert the minimum number of memory barrier instructions into the code sequences for P1 and P2 so that sequential consistency is preserved.
In weak ordering (WO), a read or a write may complete before a read or a write that is earlier in program order if they are to different addresses and there are no data dependencies. For a machine using WO, insert the minimum number of memory barrier instructions into the code sequences for P1 and P2 so that sequential consistency is preserved.

Release consistency (RC) distinguishes between acquire and release synchronization operations. An acquire must complete before any reads or writes following it in program order, while a read or a write before a release must complete before the release. However, reads and writes before an acquire may complete after the acquire, and reads and writes after a release may complete before the release. Consider the following modified versions of the original code sequences.

        P1                                P2
        ADDI R2,R0,#1        ;R2=1        ADDI R2,R0,#1         ;R2=1
        SD   R2,0(R1)        ;A =R2       SD   R2,8(R1)         ;B =R2
        L1:ACQUIRE R2,56(R1) ;SWAP(R2,H)  L1:ACQUIRE R2,56(R1)	;SWAP(R2,H)
        BNEZ R2,L1		          BNEZ R2,L1
        LD   R3,40(R1)       ;R3=F        LD   R3,48(R1)        ;R3=G
        SD   R3,16(R1)       ;C =R3       SD   R3,16(R1)        ;C =R3
        RELEASE R0,56(R1)    ;H=0  	  RELEASE R0,56(R1)	;H=0
        LD   R4,8(R1)        ;R4=B        LD   R4,0(R1)         ;R4=A
        SD   R4,24(R1)       ;D =R4       SD   R4,32(R1)        ;E =R4

In the above sequences, the acquire and release operations modify memory location H, which is located sequentially after G. The acquire operation performs a read and a write, while the release operation performs a write. For a machine using RC, insert the minimum number of memory barrier instructions into the above code sequences for P1 and P2 so that sequential consistency is preserved.

Synchronization

Suppose we have a machine that supports full-empty bits on every word in hardware. This particular machine allows for the following C code functions:

ST_Special(loc, val)
writes val to location loc and sets the full bit. If the full bit was already set, a trap is signaled.

LD_Special(loc)
waits until the data location's full bit is set, reads the data, clears the full bit, and returns the data as the result.

Write a C function swap(i,j) that uses these primitives to atomically swap the contents of two locations A[i] and A[j]. You should allow high concurrency (distinct pairs should be able to be swapped concurrently) and avoid deadlock. You can check if the ST_Special trapped by doing if(sttrap){ }


	Computer Science & Engineering University of Washington Box 352350 Seattle, WA 98195-2350 (206) 543-1695 voice, (206) 543-2969 FAX [comments to Joe]