CSE 586 Spring 2000
Assignment #4
Array A contains 256 elements of 4 bytes each. Its first element is stored at physical address 4096 (4 * 1K).
Array B contains 512 elements of 4 bytes each. Its first element is stored at physical address 8192 (8 * 1K).
Assume that only arrays A and B can be cached in an initially empty, physically addressed, physically tagged (i.e., the usual type of cache), direct-mapped, 2 KB cache with an 8-byte block size.
The following loop is then executed:
for (i = 0; i < 256; i++)
    A[i] = A[i] + B[2*i];    // Load A[i]; Load B[2*i]; Add; Store A[i]

(a) Assume a write-through write-around policy.
What are the contents of the cache at the end of the loop?
How many bytes will be written to memory?
A lot of people were confused by how the cache is addressed in this question.
Note the following:
1. Since the cache line size is 8 bytes, the smallest unit of data transfer into the cache from the L2 cache or memory is 8 bytes. So if we have a miss for A[0], both A[0] and A[1] (4 bytes each) get fetched into the cache.
2. The cache is addressed by the lower bits of the address. However, the address is a byte address, and since a cache line holds 8 bytes, the lower three bits of the address select the byte inside a cache line. Since the cache is 2K bytes, it has 2K/8 = 256 cache lines, which are addressed by 8 bits. Hence:

   bits 0-2 form the "offset", used to address within a cache line;
   bits 3-10 form the cache line address (index);
   bits 11-31 form the tag (assuming 32-bit addresses).
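As a sketch, this decomposition can be written out in C (32-bit addresses assumed; the example address is illustrative):

    #include <stdio.h>

    int main(void) {
        unsigned addr = 8192;                   /* e.g. the address of B[0] */
        unsigned offset = addr & 0x7;           /* bits 0-2: byte within the line */
        unsigned index  = (addr >> 3) & 0xFF;   /* bits 3-10: one of 256 lines */
        unsigned tag    = addr >> 11;           /* bits 11-31: tag */
        printf("offset=%u index=%u tag=%u\n", offset, index, tag);
        return 0;
    }

For 4096 (A[0]) and 8192 (B[0]) this gives line 0 in both cases, with tags 2 and 4; this is exactly the conflict that the trace below runs into.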
Now consider the first two iterations of the loop:

load A[0]  -> miss; A[0] and A[1] are brought into cache line 0
load B[0]  -> *also maps to cache line 0*, so it overwrites A[0] and A[1]
store A[0] -> nothing happens to the cache (write-around: no allocation);
              8 bytes are written to memory (8 bytes is the unit of transfer)
load A[1]  -> maps to the SAME cache line as A[0], so A[0] and A[1] are
              loaded into line 0 again
load B[2]  -> maps to cache line 1; B[2] and B[3] are loaded into line 1
store A[1] -> again, nothing happens to the cache; 8 bytes are written
So the pattern should be clear: at every iteration of the loop, B[2*i] accesses a new cache line (line i), while the A accesses advance one line every two iterations (A[i] maps to line i/2). Since the loop runs 256 iterations, B just reaches cache line 255 as the loop finishes. Since A has been overwriting B half as fast, A ends up in the lower half of the cache and B in the upper half. Thus the cache contains A[0]-A[255] in lines 0-127 and B[256]-B[511] in lines 128-255.
Also, since the cache is write-through, the cache entries always hold the freshly written values.

Since this is write-through, we have to write 256 words = 1024 bytes (all of A) to the next level (L2 cache or memory). Assuming that the minimum transfer from the cache to the next level is a cache line, this translates to 2048 bytes.
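The byte count and the final contents can be checked with a short simulation. This is a minimal sketch under the assumptions above (write-around never allocates, and each store transfers one full 8-byte line); it is not part of the assignment:

    #include <stdio.h>

    int main(void) {
        long line[256];                       /* block address held by each cache line */
        long written = 0;                     /* bytes sent to the next level */
        for (int j = 0; j < 256; j++) line[j] = -1;

        for (int i = 0; i < 256; i++) {
            long a = 4096 + 4*i;              /* address of A[i]   */
            long b = 8192 + 8*i;              /* address of B[2*i] */
            line[(a >> 3) & 0xFF] = a & ~7L;  /* after load A[i], the line holds A's block */
            line[(b >> 3) & 0xFF] = b & ~7L;  /* after load B[2*i], the line holds B's block */
            written += 8;                     /* store A[i]: write-around, no allocation,
                                                 one 8-byte (line-sized) transfer */
        }
        printf("bytes written to memory: %ld\n", written);   /* prints 2048 */
        return 0;
    }

At the end, lines 0-127 hold A's blocks and lines 128-255 hold B's blocks, matching the answer above.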
(b) Assume a write-back write-allocate policy.
What are the contents of the cache at the end of the loop?
How many bytes will be written to memory?
With write-back/write-allocate, the pattern above changes in that a store that misses must also allocate a line before writing. This means that the B[0]-B[1] cache line is replaced when A[0] is written, rather than when A[1] is read as in the case above. Also, note that a dirty cache line is written back only when it is replaced.

Since the store to A[0] ensures that the A[0]-A[1] line is in the cache, the load of A[1] now hits in the cache. The same pattern as above is then repeated, with A erasing B every two iterations of the loop. But since no dirty cache line of A is ever replaced during the loop, the number of bytes written to the lower level is 0.
The contents of the cache at the end:
A[0]-A[255] in the lower half (all entries are the *updated* values of A);
B[256]-B[511] in the upper half.
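A minimal sketch of the same check for the write-back/write-allocate case (same assumptions; a dirty line writes back 8 bytes when it is replaced):

    #include <stdio.h>

    int main(void) {
        long blk[256];                          /* block address held by each cache line */
        int  dirty[256] = {0};
        long written = 0;                       /* bytes written back to the next level */
        for (int j = 0; j < 256; j++) blk[j] = -1;

        for (int i = 0; i < 256; i++) {
            long addr[3]  = { 4096 + 4*i, 8192 + 8*i, 4096 + 4*i };
            int  store[3] = { 0, 0, 1 };        /* load A[i]; load B[2*i]; store A[i] */
            for (int k = 0; k < 3; k++) {
                int  idx = (addr[k] >> 3) & 0xFF;
                long b   = addr[k] & ~7L;
                if (blk[idx] != b) {            /* miss: replace the line, writing it
                                                   back first if it is dirty */
                    if (dirty[idx]) written += 8;
                    blk[idx]   = b;             /* write-allocate: loads AND stores allocate */
                    dirty[idx] = 0;
                }
                if (store[k]) dirty[idx] = 1;
            }
        }
        printf("bytes written back: %ld\n", written);   /* prints 0 */
        return 0;
    }

The 128 dirty A lines are still resident when the loop ends; they would be written back only when eventually replaced.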
Problem 2.
As a designer you are asked to evaluate three possible options for an on-chip write-through data cache. Some of the design options (associativity) and performance consequences (miss rate, miss penalty) are described in the table below.
Data cache options             | Miss rate | Miss penalty
Cache A: Direct-mapped         |    8%     |   4 cycles
Cache B: 2-way set-associative |    4%     |   6 cycles
Cache C: 4-way set-associative |    2%     |   8 cycles
(a) Assume that Load instructions have a CPI of 1.2 if they hit in the cache and a CPI of (1.2 + miss penalty) otherwise. All other instructions have a CPI of 1.2. The instruction mix is such that 20% of the instructions are Loads.
What is the CPI for each configuration (you'll need to keep 3 decimal digits, i.e., compute each CPI as 1.xxx)? Which cache would you choose if CPI is the determining factor?
The formula for the calculation is:

CPI = 1.2 + %loads * miss rate * miss penalty

Data cache options             | Miss rate | Miss penalty |  CPI
Cache A: Direct-mapped         |    8%     |   4 cycles   | 1.264
Cache B: 2-way set-associative |    4%     |   6 cycles   | 1.248
Cache C: 4-way set-associative |    2%     |   8 cycles   | 1.232
The 4-way set-associative cache (Cache C) has the best CPI.
(b) Assume now that if the direct-mapped cache is used, the cycle time is 20 ns. If the 2-way set-associative cache is used, the cycle time is 22 ns. If the 4-way set-associative cache is used, the cycle time is 24 ns.
What is the average time per instruction? If average time per instruction is the determining factor, which cache would you choose?
The cycle time is the clock cycle time. Average time per instruction = CPI * clock cycle time.

Data cache options             | Miss rate | Miss penalty |  CPI  | Time per instruction
Cache A: Direct-mapped         |    8%     |   4 cycles   | 1.264 | 1.264 * 20 = 25.28 ns
Cache B: 2-way set-associative |    4%     |   6 cycles   | 1.248 | 1.248 * 22 = 27.456 ns
Cache C: 4-way set-associative |    2%     |   8 cycles   | 1.232 | 1.232 * 24 = 29.568 ns
The direct-mapped cache (Cache A), if this is the criterion.
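For completeness, a small C sketch (not part of the assignment) that reproduces the numbers in both tables:

    #include <stdio.h>

    int main(void) {
        const char *name[]    = { "A (direct-mapped)", "B (2-way)", "C (4-way)" };
        double miss_rate[]    = { 0.08, 0.04, 0.02 };
        double miss_penalty[] = { 4.0, 6.0, 8.0 };      /* cycles */
        double cycle_ns[]     = { 20.0, 22.0, 24.0 };
        double loads = 0.20, base_cpi = 1.2;

        for (int i = 0; i < 3; i++) {
            double cpi = base_cpi + loads * miss_rate[i] * miss_penalty[i];
            printf("Cache %s: CPI = %.3f, time/instruction = %.3f ns\n",
                   name[i], cpi, cpi * cycle_ns[i]);
        }
        return 0;
    }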
(c) In the case of the 2-way set-associative cache, the replacement algorithm is LRU. In the case of the 4-way set-associative cache, the replacement algorithm is such that the Most Recently Used (MRU) block (line) is not replaced (the choice of which of the other 3 is replaced is random and is not part of the logic associated with the cache).
Assume further that at context-switch the whole cache is invalidated.
Indicate what bits are needed for each entry, or set of entries, in addition to the usual tag and data components for each of the 3 cache organizations.
For direct-mapped:
1 valid bit per line, to say whether the cache line holds valid data.

For 2-way set-associative:
1 valid bit per entry, plus
1 bit per set for LRU.

For 4-way set-associative:
1 valid bit per entry, plus
2 bits per set (or, alternatively, 1 bit per entry) to identify the MRU block.
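As an illustration (the struct and field names are assumptions, not part of the problem), the extra state could be declared as C bit-fields:

    struct dm_line { unsigned valid : 1; };    /* direct-mapped: 1 valid bit per line */

    struct sa2_set { unsigned valid : 2;       /* 2-way: 1 valid bit per entry */
                     unsigned lru   : 1; };    /*        1 LRU bit per set     */

    struct sa4_set { unsigned valid : 4;       /* 4-way: 1 valid bit per entry */
                     unsigned mru   : 2; };    /*        2 bits per set naming the MRU way */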
Problem 3.
Consider a two-level cache hierarchy. The data cache at the first level (L1) is 16 KB, 2-way set-associative, with a block size of 32 bytes. The cache at the second level L2 is 512 KB, 4-way set-associative, and has a block size of 64 bytes.
(a) Show the lengths of the various subfields (tag, index, displacement) of a 32-bit address as seen:
The offset (or displacement) is determined by the cache line size: for L1 it is 5 bits (log2 32); for L2 it is 6 bits (log2 64).

The index into the cache is determined by the number of cache lines per way:
L1 is 16 KB 2-way set-associative, so each way contains 8 KB; number of lines = 8 KB / 32 bytes = 256 lines.
L2 is 512 KB 4-way set-associative, so each way contains 128 KB; number of lines = 128 KB / 64 bytes = 2048 lines.

Thus the L1 cache is indexed by 8 bits and the L2 cache by 11 bits.

The remaining bits form the tag:
L1 tag = 32 - 8 - 5 = 19 bits
L2 tag = 32 - 11 - 6 = 15 bits
L1 cache addressing:
| 19-bit tag | 8-bit index | 5-bit displacement |

L2 cache addressing:
| 15-bit tag | 11-bit index | 6-bit displacement |
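In C, the two decompositions look like this (a sketch; the helper names are illustrative):

    unsigned l1_disp (unsigned a) { return a & 0x1F; }          /* bits 0-4   (5 bits)  */
    unsigned l1_index(unsigned a) { return (a >> 5) & 0xFF; }   /* bits 5-12  (8 bits)  */
    unsigned l1_tag  (unsigned a) { return a >> 13; }           /* bits 13-31 (19 bits) */

    unsigned l2_disp (unsigned a) { return a & 0x3F; }          /* bits 0-5   (6 bits)  */
    unsigned l2_index(unsigned a) { return (a >> 6) & 0x7FF; }  /* bits 6-16  (11 bits) */
    unsigned l2_tag  (unsigned a) { return a >> 17; }           /* bits 17-31 (15 bits) */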
What is the size of the tag associated with each victim cache entry?
ANS:
Again split the 32 bits as above. Since the victim cache is fully associative, no index is needed. We again need a displacement of 5 bits, as in the L1 cache; the remaining 27 bits form the tag.
Under these conditions, explain why read hits would take 1 cycle for L1 and 2 cycles for the victim cache.

ANS:
A read miss in the L1 cache must be detected (taking one cycle) before the victim cache is accessed; hence a victim cache access takes two cycles.
When there is a miss in L1 and a miss in the victim cache, the missing block is brought in from L2 (or main memory) and stored in L1. The block replaced in L1 replaces the LRU entry in the victim cache.
How is the tag for the new entry in the victim cache deduced from that
of the replaced block in L1?
ANS:
Since the victim cache tag is 27 bits and the displacement bits are the same, the victim cache tag is the concatenation of the L1 tag (19 bits) and the L1 index (8 bits) of the replaced block.
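As a one-line sketch (the function name is illustrative):

    unsigned victim_tag(unsigned l1_tag, unsigned l1_index) {
        return (l1_tag << 8) | l1_index;   /* 19-bit L1 tag followed by 8-bit L1 index = 27 bits */
    }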
What happens to the contents of the replaced (LRU) entry of the victim cache if:
ANS: There are two options:
1. Write the block back (if dirty) when it is moved into the victim cache.
2. Write it back only when it is evicted from the victim cache.

If we choose the former, we do not have to do anything special when evicting from the victim cache, since the write-back happened when the block was brought into the victim cache.

For the latter choice (which may be more efficient, since it batches more writes together), we have to maintain a dirty bit for each victim cache entry, and evicting a dirty victim cache entry means writing it back to the L2 cache.
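A sketch of option 2's eviction path (the struct layout and write_line_to_L2 are hypothetical names for illustration, not a defined interface):

    struct vc_entry {
        unsigned tag   : 27;          /* L1 tag concatenated with L1 index */
        unsigned valid : 1;
        unsigned dirty : 1;           /* needed only under option 2 */
        unsigned char data[32];       /* one 32-byte L1 block */
    };

    void write_line_to_L2(struct vc_entry *e);   /* hypothetical write-back routine */

    void vc_evict(struct vc_entry *e) {
        if (e->valid && e->dirty)
            write_line_to_L2(e);      /* option 2: the write-back happens here;
                                         under option 1 the entry is never dirty */
        e->valid = 0;
    }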