Retro school children University of Washington Computer Science & Engineering
 CSE 548: Computer Architecture - Winter 2008
  CSE Home   About Us    Search    Contact Info 

 Course Home
   

Homework Assignment 3

Writing Portion

Summarize the arguments for and against providing sequential consistency at the ISA level. What memory model do you believe should be provided at the hardware level? The language level?

You should use these papers to motivate your discussion:

Snoopy Cache Problems

  1. Cache coherence protocols require a total ordering on protocol messages for correctness.  Provide a short scenario where having only a partial ordering leads to incorrect and unrecoverable behavior.
  2. The MESI protocol extends the base MSI protocol by adding an Exclusive state.  Complete these tables to describe the new protocol.  

    The actions read, write, replace refer to local cache actions. 

    The bus transactions are:

    • CR for a coherent read request (on read-miss)
    • CRI coherent read and invalidate (write allocate on write-miss)
    • CI coherent invalidate  (write hit on shared block)
    • WR write back of block
    • CCI cache-to-cache-interference is used to satisfy another cache's request.
  3. Complete 4.6 from the Book

Directory Coherence Problems

  1. Complete 4.16 parts c,e-h
  2. Complete 4.22
  3. Consider the case where the directory sends a getM message to a processor using the protocol from the previous two problems. This means that the directory believes that the modified data resides at that node.
    1. what states (other than M) can the cache controller be in when the Forwarded_getM arrives?
    2. use a sequence of events to demonstrate how this situation can arise.
  4. The protocol assumes FIFO message passing between any two points. Assume the network is Non-FIFO.
    Give a scenario that shows how the protocol can fail.
  5. This protocol uses "silent replacement" of shared data. In other words, nodes may replace shared blocks without notifying the directory. 
    What are the advantages and disadvantages of this decision?
  6. Consider the directory protocol used in the past 5 problems. Under this scheme, a write miss by Pi causes the following sequence of actions:

    1. Pi sends address A as part of a getM to directory
    2. directory looks up A, finds owner processor Pm
    3. directory forwards A as part of getM to Pm,  marks Pi as owner
    4. Pm receives getM, sends a Data message to directory, and invalidates the line
    5. directory sends data to Pi
    6. Pi copies data into cache, enters Modified state, and performs the write

    In essence this scheme results in 2s+2 short messages (inv/ack/getM) where s is the number of sharers and 2 data messages. Now consider the following optimization.

    1. Pi sends address A as part of a getM to directory
    2. directory looks up A, finds owner processor Pm
    3. directory forwards (A,Pi) as part of getM to Pm,  marks Pi as owner
    4. Pm receives getM, sends a Data message to Pi, and invalidates the line
    5. Pi copies data into cache, enters Modified state, and performs the write

    Under this potential optimization we have eliminated a data message, at the expense of leaving the directory without a copy of the current data. 

    Is this scheme correct? If so, compare the latency of 4.22c to value calculated above. Otherwise, provide a short example demonstrating a flaw.

Memory Consistency Problems (Thanks to Krste Asanovic)

Cy D. Fect wants to run the following code sequences on processors P1 and P2, which are part of a two-processor MIPS64 machine. The sequences operate on memory values located at addresses A, B, C, D, E, F, and G, which are all sequentially located in memory (e.g. B is 8 bytes after A).

Initially, M[A], M[B], M[C], M[D], and M[E] are all 0. M[F] is 1, and M[G] is 2. For each processor, R1 contains the address A (all of the addresses are located in a shared region of memory). Also, remember that for a MIPS processor, R0 is hardwired to 0. Note: a semicolon is used to indicate the start of a comment.

        P1                        P2
        ADDI R2,R0,#1   ;R2=1     ADDI R2,R0,#1   ;R2=1
        SD   R2,0(R1)   ;A =R2    SD   R2,8(R1)   ;B =R2
        LD   R3,40(R1)  ;R3=F     LD   R3,48(R1)  ;R3=G
        SD   R3,16(R1)  ;C =R3    SD   R3,16(R1)  ;C =R3
        LD   R4,8(R1)   ;R4=B     LD   R4,0(R1)   ;R4=A
        SD   R4,24(R1)  ;D =R4    SD   R4,32(R1)  ;E =R4

  1. If Cy’s code runs on a system with a sequentially consistent memory model, what are the possible results of execution?

    List all possible results in terms of values of M[C], M[D], and M[E] (All other locations will be the same across all possible execution paths).

  2. Assume now that Cy’s code is run on a system that does not guarantee sequential consistency, but that memory dependencies are not violated for the accesses made by any individual processor. The system has a MEMBAR memory barrier instruction that guarantees the effects of all memory instructions executed before the MEMBAR will be made globally visible before any memory instruction after the MEMBAR is executed. 

    Add MEMBAR instructions to Cy’s code sequences to give the same results as if the system were sequentially consistent. Use the minimum number of MEMBAR instructions.

Now consider a machine that uses finer-grain memory barrier instructions. The following instructions are available: 

  • MEMBARrr guarantees that all read operations initiated before the MEMBARrr will be seen before any read operation initiated after it.
  • MEMBARrw guarantees that all read operations initiated before the MEMBARrw will be seen before any write operation initiated after it.
  • MEMBARwr guarantees that all write operations initiated before the MEMBARwr will be seen before any read operation initiated after it.
  • MEMBARwwguarantees that all write operations initiated before the MEMBARww will be seen before any write operation initiated after it

There is no generalized MEMBAR instruction as in the previous problem.

  1. In total store ordering (TSO), a read may complete before a write that is earlier in program order if the read and write are to different addresses and there are no data dependencies. For a machine using TSO, insert the minimum number of memory barrier instructions into the code sequences for P1 and P2 so that sequential consistency is preserved.
  2. In partial store ordering (PSO), a read or a write may complete before a write that is earlier in program order if they are to different addresses and there are no data dependencies. For a machine using PSO, insert the minimum number of memory barrier instructions into the code sequences for P1 and P2 so that sequential consistency is preserved.
  3. In weak ordering (WO), a read or a write may complete before a read or a write that is earlier in program order if they are to different addresses and there are no data dependencies. For a machine using WO, insert the minimum number of memory barrier instructions into the code sequences for P1 and P2 so that sequential consistency is preserved.

Release consistency (RC) distinguishes between acquire and release synchronization operations. An acquire must complete before any reads or writes following it in program order, while a read or a write before a release must complete before the release. However, reads and writes before an acquire may complete after the acquire, and reads and writes after a release may complete before the release. Consider the following modified versions of the original code sequences.

        P1                                P2
        ADDI R2,R0,#1        ;R2=1        ADDI R2,R0,#1         ;R2=1
        SD   R2,0(R1)        ;A =R2       SD   R2,8(R1)         ;B =R2
        L1:ACQUIRE R2,56(R1) ;SWAP(R2,H)  L1:ACQUIRE R2,56(R1)	;SWAP(R2,H)
        BNEZ R2,L1		          BNEZ R2,L1
        LD   R3,40(R1)       ;R3=F        LD   R3,48(R1)        ;R3=G
        SD   R3,16(R1)       ;C =R3       SD   R3,16(R1)        ;C =R3
        RELEASE R0,56(R1)    ;H=0  	  RELEASE R0,56(R1)	;H=0
        LD   R4,8(R1)        ;R4=B        LD   R4,0(R1)         ;R4=A
        SD   R4,24(R1)       ;D =R4       SD   R4,32(R1)        ;E =R4

  1. In the above sequences, the acquire and release operations modify memory location H, which is located sequentially after G. The acquire operation performs a read and a write, while the release operation performs a write. For a machine using RC, insert the minimum number of memory barrier instructions into the above code sequences for P1 and P2 so that sequential consistency is preserved.

Synchronization

  1. Suppose we have a machine that supports full-empty bits on every word in hardware. This particular machine allows for the following C code functions:

    ST_Special(loc, val)
    writes val to location loc and sets the full bit. If the full bit was already set, a trap is signaled.
    LD_Special(loc)
    waits until the data location's full bit is set, reads the data, clears the full bit, and returns the data as the result.

    Write a C function swap(i,j) that uses these primitives to atomically swap the contents of two locations A[i] and A[j]. You should allow high concurrency (distinct pairs should be able to be swapped concurrently) and avoid deadlock. You can check if the ST_Special trapped by doing if(sttrap){ }

Turning in the assignment

Produce a hard-copy of your answers and bring it to class.

 


CSE logo Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to Joe]