CSE 586 Spring 2000
Computer Architecture

Assignment #4
 

Array A contains 256 elements of 4 bytes each. Its first element is stored at physical address 4096 ( 4 * 1 K).

Array B contains 512 elements of 4 bytes each. Its first element is stored at physical address 8192 ( 8 * 1 K).

Assume that only arrays A and B can be cached in an initially empty physically addressed, physically tagged (i.e., the usual type of cache), direct-mapped, 2 KB cache with an 8 byte block size.

The following loop is then executed:

         for (i = 0; i < 256; i++)
             A[i] = A[i] + B[2*i];        // Load A[i]; Load B[2i]; 
                                          // Add; Store A[i]
(a) Assume a write-through write-around policy.

What are the contents of the cache at the end of the loop?

How many bytes will be written to memory?

A lot of people were confused by how the cache is addressed in this question.

Note the following :

  1. Since the cache line size is 8 bytes, the smallest unit of data transfer into the cache from the L2 cache or memory is 8 bytes. So if we have a miss for A[0], both A[0] and A[1] get fetched into the cache.

  2. The cache is indexed by the lower bits of the address. However, the address is a byte address, and since a cache line holds 8 bytes, the lowest three bits of the address select a byte inside a cache line. Since the cache is 2K bytes large, it has 2K/8 = 256 cache lines, which are addressed by 8 bits. Hence:
      bits 0-2 form the "offset", which is used to address a byte inside a cache line;
      bits 3-10 of the address form the cache line (index) address;
      bits 11-31 form the TAG (assuming a 32-bit address).
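
   For instance, this split can be written out as a couple of shift-and-mask operations (a small sketch; the function name is mine, not part of the assignment):

    #include <stdint.h>

    /* Decompose a 32-bit byte address for this cache:
       8-byte lines, 256 lines (2 KB, direct-mapped). */
    static void split_address(uint32_t addr,
                              uint32_t *offset, uint32_t *index, uint32_t *tag)
    {
        *offset = addr & 0x7;          /* bits 0-2  : byte within the 8-byte line */
        *index  = (addr >> 3) & 0xFF;  /* bits 3-10 : which of the 256 lines      */
        *tag    = addr >> 11;          /* bits 11-31: tag                         */
    }

    /* e.g. A[0] (addr 4096): offset 0, index 0, tag 2;
            B[0] (addr 8192): offset 0, index 0, tag 4
       -> same cache line, different tags, hence the conflicts traced below. */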

   Now consider the following sequence (the first 2 iterations of the loop):

            load A[0]  -> miss; A[0] and A[1] are brought into cache line 0
            load B[0]  -> *also maps to cache line 0*, so it overwrites A[0] and A[1] above
            store A[0] -> nothing happens to the cache (no write allocate); 8 bytes are written to memory (8 bytes is the unit of transfer)
            load A[1]  -> accesses the SAME cache line as A[0], so we load A[0] and A[1] again into line 0
            load B[2]  -> maps to cache line 1; B[2] and B[3] are loaded into cache line 1
            store A[1] -> again, nothing happens to the cache, as above

       So the pattern should be obvious:
         At every iteration of the loop, B[2*i] accesses a new cache line, while A overwrites a cache line every two iterations. Since the loop runs 256 iterations, B just reaches cache line 255 as the loop finishes. Since A has been erasing B's lines half as fast, A ends up occupying the lower half of the cache and B the upper half. Thus the cache contains A[0]-A[255] in the lower half (lines 0-127) and B[256]-B[511] in the upper half (lines 128-255).

  Also, since the cache is write-through, the A entries in the cache always hold the freshly written (updated) values.
 Since this is write-through, every store goes to the next level (L2 cache or memory): 256 words = 1024 bytes (all of A).
Assuming that the minimum transfer from the cache to the lower level is a cache line, this translates to 2048 bytes written.
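
If you want to check the whole loop rather than just two iterations, the following small simulation of the direct-mapped, write-through/write-around cache (not part of the assignment; variable names are mine) reproduces both the final contents and the byte count:

    #include <stdio.h>

    #define LINES  256      /* 2 KB cache / 8-byte lines */
    #define A_BASE 4096     /* first element of A        */
    #define B_BASE 8192     /* first element of B        */

    int main(void) {
        long tag[LINES];
        int  valid[LINES] = {0};
        long bytes_written = 0;

        for (int i = 0; i < 256; i++) {
            /* load A[i], then load B[2i] */
            long loads[2] = { A_BASE + 4L * i, B_BASE + 8L * i };
            for (int k = 0; k < 2; k++) {
                long blk  = loads[k] / 8;          /* block address            */
                int  line = (int)(blk % LINES);    /* bits 3-10 of the address */
                if (!(valid[line] && tag[line] == blk)) {  /* read miss: fetch */
                    valid[line] = 1;
                    tag[line]   = blk;
                }
            }
            /* store A[i]: write-through / write-around, so it never changes
               which block occupies a line; count 8 bytes per store, the unit
               of transfer assumed above.                                      */
            bytes_written += 8;
        }

        int a_lines = 0, b_lines = 0;
        for (int l = 0; l < LINES; l++) {
            if (!valid[l]) continue;
            if (tag[l] * 8 < B_BASE) a_lines++; else b_lines++;
        }
        /* prints: A lines: 128, B lines: 128, bytes written: 2048 */
        printf("A lines: %d, B lines: %d, bytes written: %ld\n",
               a_lines, b_lines, bytes_written);
        return 0;
    }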
 
 

(b) Assume a write-back write-allocate policy.

What are the contents of the cache at the end of the loop?

How many bytes will be written to memory?
 

With write-back/write-allocate, the pattern above changes in that a store must also allocate a line before writing if it misses. This means that the B[0]-B[1] cache line is replaced when A[0] is stored, rather than when A[1] is read as in the case above. Also, note that a dirty cache line is written back only when it is about to be replaced.
  Since the A[0] store ensures that the A[0]-A[1] line is in the cache, the load of A[1] hits in the cache this time. The same pattern as above then repeats, with A erasing B every two iterations of the loop. But since no dirty cache line of A is ever replaced, the number of bytes written to the lower level is 0.
   The contents of the cache at the end:
            A[0]..A[255] in the lower half: all entries are the *updated* values of A.
            B[256]..B[511] in the upper half.
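
The same kind of small simulation as in part (a), adapted to write-back/write-allocate (again just a sketch with my own names), confirms that no dirty line is ever evicted during the loop:

    #include <stdio.h>

    #define LINES 256                      /* 2 KB cache / 8-byte lines */

    static long tag[LINES];
    static int  valid[LINES], dirty[LINES];
    static long writeback_bytes = 0;

    /* Touch one 8-byte block; a store marks the line dirty (write-allocate). */
    static void cache_access(long addr, int is_store) {
        long blk  = addr / 8;
        int  line = (int)(blk % LINES);
        if (!(valid[line] && tag[line] == blk)) {       /* miss: allocate    */
            if (valid[line] && dirty[line])
                writeback_bytes += 8;                   /* write back victim */
            valid[line] = 1;
            tag[line]   = blk;
            dirty[line] = 0;
        }
        if (is_store)
            dirty[line] = 1;                            /* write only into L1 */
    }

    int main(void) {
        for (int i = 0; i < 256; i++) {
            cache_access(4096 + 4L * i, 0);   /* load A[i]   */
            cache_access(8192 + 8L * i, 0);   /* load B[2i]  */
            cache_access(4096 + 4L * i, 1);   /* store A[i]  */
        }
        /* prints 0: the dirty A lines are all still in the cache at the end */
        printf("bytes written back during the loop: %ld\n", writeback_bytes);
        return 0;
    }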

Problem 2.

As a designer you are asked to evaluate three possible options for an on-chip write-through data cache. Some of the design options (associativity) and performance consequences (miss rate, miss penalty) are described in the table below.

Data cache options               Miss rate   Miss penalty
Cache A: Direct-mapped              8%         4 cycles
Cache B: 2-way set-associative      4%         6 cycles
Cache C: 4-way set-associative      2%         8 cycles

(a) Assume that Load instructions have a CPI of 1.2 if they hit in the cache and a CPI of (1.2 + Miss penalty) otherwise.
All other instructions have a CPI of 1.2 . The instruction mix is such that 20% of the instructions are Loads.

What is the CPI for each configuration (you'll need to keep 3 decimal digits, i.e., compute each CPI as 1.xxx)? Which cache would you choose if CPI is the determining factor?

 The formula for the calculation is :
             CPI = %loads * (miss rate * miss penalty) + 1.2
 

Data cache options               Miss rate   Miss penalty   CPI
Cache A: Direct-mapped              8%         4 cycles     1.264
Cache B: 2-way set-associative      4%         6 cycles     1.248
Cache C: 4-way set-associative      2%         8 cycles     1.232

The 4-way set-associative cache (Cache C) has the best CPI, so it is the choice on this criterion.
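
As a quick sanity check, the table above can be reproduced with a few lines of C (just a sketch of the arithmetic; the numbers are those given in the problem):

    #include <stdio.h>

    int main(void) {
        const char  *name[]    = { "A (direct-mapped)", "B (2-way)", "C (4-way)" };
        const double miss[]    = { 0.08, 0.04, 0.02 };
        const double penalty[] = { 4.0, 6.0, 8.0 };

        for (int i = 0; i < 3; i++) {
            /* CPI = base CPI + %loads * miss rate * miss penalty */
            double cpi = 1.2 + 0.20 * miss[i] * penalty[i];
            printf("Cache %s: CPI = %.3f\n", name[i], cpi);  /* 1.264, 1.248, 1.232 */
        }
        return 0;
    }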

(b) Assume now that if the direct-mapped cache is used, the cycle time is 20 ns. If the 2-way set-associative cache is used, the cycle time is 22 ns. If the 4-way set-associative cache is used, the cycle time is 24 ns.

What is the average time per instruction? If average time per instruction is the determining factor, which cache would you choose?

The cycle time is the clock cycle time. Average time per instruction = CPI * clock cycle time.
 
 

Data cache options               Miss rate   Miss penalty   CPI     Time per instruction
Cache A: Direct-mapped              8%         4 cycles     1.264   1.264 * 20 = 25.28 ns
Cache B: 2-way set-associative      4%         6 cycles     1.248   1.248 * 22 = 27.456 ns
Cache C: 4-way set-associative      2%         8 cycles     1.232   1.232 * 24 = 29.568 ns

The direct-mapped cache (Cache A), if average time per instruction is the criterion.
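
Extending the sketch from part (a) with the cycle times gives the same numbers (nothing here beyond the arithmetic shown in the table):

    #include <stdio.h>

    int main(void) {
        const double cpi[]   = { 1.264, 1.248, 1.232 };   /* from part (a)     */
        const double cycle[] = { 20.0,  22.0,  24.0  };   /* ns, from part (b) */

        for (int i = 0; i < 3; i++)
            printf("Cache %c: %.3f ns per instruction\n", 'A' + i, cpi[i] * cycle[i]);
        /* prints 25.280, 27.456, 29.568 -> the direct-mapped cache wins */
        return 0;
    }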

(c) In the case of the 2-way set-associative cache, the replacement algorithm is LRU. In the case of the 4-way set-associative cache, the replacement algorithm is such that the Most Recently Used (MRU) block (line) is not replaced (the choice of which of the other 3 is replaced is random and is not part of the logic associated with the cache).

Assume further that at context-switch the whole cache is invalidated.

Indicate what bits are needed for each entry, or set of entries, in addition to the usual tag and data components for each of the 3 cache organizations.

For direct-mapped :
              1 valid bit per entry, to say whether the cache line is valid or not.
For 2-way set-associative :
       1 valid bit per entry, plus 1 LRU bit per set.
For 4-way set-associative :
        1 valid bit per entry, plus either 2 bits per set (encoding which of the 4 entries is MRU) or 1 bit per entry to say which is MRU.
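
In C-struct form, the extra state per line / per set could look like this (a sketch only; the field names are mine, and the tag and data fields are omitted):

    /* Direct-mapped: one valid bit per line. */
    struct dm_line {
        unsigned valid : 1;
    };

    /* 2-way set-associative: a valid bit per entry, one LRU bit per set. */
    struct set_2way {
        struct { unsigned valid : 1; } way[2];
        unsigned lru : 1;    /* which of the two ways was least recently used */
    };

    /* 4-way set-associative: a valid bit per entry, plus the MRU way
       (2 bits per set; equivalently, 1 MRU bit per entry could be kept). */
    struct set_4way {
        struct { unsigned valid : 1; } way[4];
        unsigned mru : 2;    /* index of the most recently used way */
    };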

Problem 3.

Consider a two-level cache hierarchy. The data cache at the first level (L1) is 16 KB, 2-way set-associative, with a block size of 32 bytes. The cache at the second level L2 is 512 KB, 4-way set-associative, and has a block size of 64 bytes.

(a) Show the lengths of the various subfields (tag, index, displacement) of a 32-bit address as seen by the L1 cache and by the L2 cache.
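
The worked split follows directly from the cache parameters; a small sketch of the arithmetic (the helper and its output format are mine):

    #include <stdio.h>
    #include <math.h>

    /* displacement = log2(block size), index = log2(number of sets),
       tag = 32 - index - displacement                                 */
    static void fields(const char *name, long size, int assoc, int block) {
        int disp  = (int)log2((double)block);
        int index = (int)log2((double)(size / (block * assoc)));
        printf("%s: tag = %d, index = %d, displacement = %d\n",
               name, 32 - index - disp, index, disp);
    }

    int main(void) {
        fields("L1 (16 KB, 2-way, 32 B blocks)",  16 * 1024L, 2, 32);  /* 19 / 8 / 5  */
        fields("L2 (512 KB, 4-way, 64 B blocks)", 512 * 1024L, 4, 64); /* 15 / 11 / 6 */
        return 0;
    }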

(b) We now introduce a victim cache "behind" L1 and "before" L2. The victim cache is used only on read misses. The victim cache has a capacity of 4 blocks (32 bytes each) and is fully associative.

What is the size of the tag associated with each victim cache entry?

ANS :
 Again split the 32 bits as above. Since the cache is fully associative, we need no index. We again need a displacement of 5 bits, as in the L1 cache; the remaining 27 bits form the tag.

Under these conditions explain why read hits would take 1 cycle for L1 and 2 cycles for the victim cache.
ANS :
You have to detect a read miss in the L1 cache in one cycle before you access the victim cache; hence you need two cycles to access the victim cache.
 

When there is a miss in L1 and a miss in the victim cache, the missing block is brought in from L2 (or main memory) and stored in L1. The block replaced in L1 replaces the LRU entry in the victim cache.

How is the tag for the new entry in the victim cache deduced from that of the replaced block in L1?
ANS :
Since the tag is 27 bits for the victim cache and the displacement bits are the same, the victim cache tag is the concatenation of the tag of the L1 cache entry and the index of that entry in the L1 cache.
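
In other words (a one-line sketch, assuming the L1 split of 19 tag bits and 8 index bits from part (a)):

    #include <stdint.h>

    /* Victim-cache tag = L1 tag of the replaced block, concatenated with the
       L1 set index it was evicted from (19 + 8 = 27 bits).                   */
    static inline uint32_t victim_tag(uint32_t l1_tag, uint32_t l1_index) {
        return (l1_tag << 8) | l1_index;
    }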

What happens to the contents of the replaced (LRU) entry of the victim cache if (i) L1 is write-through; (ii) L1 is write-back?

   ANS : If L1 is write-through, nothing special has to be done when the victim cache entry is evicted, because the write to the L2 cache has already been done.
 


If L1 is write-back, there are two options :
1. Write the block back to L2 when it is put into the victim cache.
2. Write it back only when it is evicted from the victim cache.

If we choose the former, we do not have to do anything special when evicting from the victim cache, since the write-back happened when the block was brought into the victim cache.
 For the latter choice (which may be more efficient, since it batches more writes together), we have to maintain a dirty bit for the victim cache entries, and evicting a victim cache entry then means writing it back to the L2 cache.
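
A sketch of the latter option (all names here, including the L2 write port, are hypothetical placeholders):

    #include <stdint.h>
    #include <stdio.h>

    struct vc_entry {
        unsigned valid : 1, dirty : 1;
        uint32_t tag;              /* 27-bit tag as derived above */
        uint8_t  data[32];         /* one 32-byte block           */
    };

    /* Stand-in for the write port to L2. */
    static void write_block_to_L2(uint32_t tag, const uint8_t *data) {
        (void)data;
        printf("writing back block with tag %u to L2\n", tag);
    }

    /* Option 2: keep a dirty bit per victim-cache entry and write the block
       back to L2 only when the entry itself is evicted.                      */
    static void evict_vc_entry(struct vc_entry *e) {
        if (e->valid && e->dirty)
            write_block_to_L2(e->tag, e->data);
        e->valid = 0;
    }

    int main(void) {
        struct vc_entry e = { .valid = 1, .dirty = 1, .tag = 42 };
        evict_vc_entry(&e);
        return 0;
    }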