CSE 378 Homework #6 Design Space

HW6 Task 5 - Exporing the Design Space

Supplement to the hw6 main assignment.
Overview
Your goal as architect is to design I- and D-caches that maximize the performance of your processor but keep within some cost budget. In reality, "costs" would be some combination of area, time, and power requirements. Because that level of detail is out of the question for this assignment, we approximate it with the following simplified situation:

For us "cost" is just a unitless quantity.
Beginning from the baseline cache configuration you are allowed to modify any of the cache design decisions. (For example, you can change the blocksize.)
Each such change entails a given cost.
You have a total of 100 credits, so can make any combination of changes that do not cost more than 100 in total.
Given that budget constraint, your goal is to find the set of design decisions that maximize performance.
A little ways below on this page we list some of the design choices you could explore, and the cost of each. You can also consider making any other changes you have reason to believe might be good ideas (e.g., something you've read about in the book but isn't on our list). Send mail describing what you want to do and we'll quote you a cost.
Methodology
This assignment has purposefully tried to make the space of different designs that fit within your budget so large that it is infeasible to try evaluating them all. Instead, we intend that you think about the tradeoffs, and use a sequence of experiments to reduce finding the right cache configuration to a more linear process. A typical approach is to examine each dimension of the design space in isolation (i.e., varying only one thing). Choose the dimension that offers the biggest gain and fix that decision. Now iterate. There is no guarantee that what you end up with by this procedure is anywhere near optimal, though, so it's often the case that some tweaking is required as you apply it (i.e., revisiting old decisions every once in a while).
Note that you do not have to come up with the best cache design of everyone in the class (although that's your goal). Follow a reasonable procedure, where "reasonable" is a combination of how effective your approach is likely to be in finding a good design and how much time it takes to carry it out. (As mentioned earlier, an exhaustive search of the design space is high on accuracy of solution but low on efficient use of time.)
Design Alternatives and Costs
Design decisions are made separately for the I- and D-caches. The costs below apply to each cache individually.

Characteristic Change
(relative to baseline) Description Credit Cost Stall Time Cost

Capacity Factor of 2^N We'll measure cache size by the total number of data bytes that could be stored. A change to 512 bytes results in N==2, for instance. 8 * (2^N-1) 0

Linesize Factor of 2^N The cost functions here apply while keeping the cache capcity constant (so increasing the blocksize by a factor of two means decreasing the number of lines by a factor of 2). 0 See the baseline formula)

Critical word first This applies when lines contain more than one data word. When fetching data from memory (due to a miss), the word containing the address is sent first, rather than simply sending them in address order. (For example, if blocks are 8 bytes and address 0x00000105 is referenced, the word at 0x00000104 would be transfered first and then the word at 0x00000100). This reduces the time penalty for a miss from the baseline equation. 8 -(linesize-1)

Fully Associative A fully associative cache can compare an adress with the tags of all cache lines in parallel. Replacement is done by keeping a counter, r. When replacement needs to be done, line r is replaced and r = r + 1 mod L, where L is the number of lines in the cache. 2*L^1.2 0

Set Associative N-way set associative While keeping cache capacity constant, increase the associativity by a factor of 2^N. We'll assume that each set (row) has a counter, s, associated with it. When a replacement needs to be done in that set, replace the line in set s and s = s + 1 mod N. 3*N 0

LRU replacement
(Set-Associative) NA N-way set associative with L rows.
See text, page 504. N 1

Write-back Write back cache Requires a dirty bit per line. When a dirty line is replaced, it must be written back to memory before the new data can be fetched.
See text, pages 484 and 521. 8 0

Write-buffer N slots, fully associative When a write to memory is required (which depends on other design decisions), the address and data to be written are placed in an emply slot in the write buffer, if there is one. Writes are initiated one at a time, in FIFO order, whenever the bus to memory is free. No stalling is required, unless there are no free write buffer slots when one is needed (in which case stalling occurs until some write completes and a buffer slot is freed).
See the text, page 483 (Section 7.2). 5 * N^1.2 0

Victim cache N slots, fully associative This is a small, fully associative "second chance" cache, used in addition to the main cache.
When performing a lookup, both the main and the victim caches are checked. If there is a hit in the victim cache, (a) the cache stalls one cycle, and (b) uses that cycle to swap the victim cache line with one from the main cache. (When the lookup is performed after the 1 cycle stall it will hit in the main cache.)
On a miss in both caches, the replaced line is put in the victim cache, replacing a line (chosen at random) from it. That costs 1 cycle.
5 * N^1.2 1

Characteristic	Change (relative to baseline)	Description	Credit Cost	Stall Time Cost
Capacity	Factor of 2^N	We'll measure cache size by the total number of data bytes that could be stored. A change to 512 bytes results in N==2, for instance.	`8 * (2^N-1)`	0
Linesize	Factor of 2^N	The cost functions here apply while keeping the cache capcity constant (so increasing the blocksize by a factor of two means decreasing the number of lines by a factor of 2).	0	See the baseline formula)
Critical word first		This applies when lines contain more than one data word. When fetching data from memory (due to a miss), the word containing the address is sent first, rather than simply sending them in address order. (For example, if blocks are 8 bytes and address 0x00000105 is referenced, the word at 0x00000104 would be transfered first and then the word at 0x00000100). This reduces the time penalty for a miss from the baseline equation.	`8`	-(linesize-1)
Fully Associative		A fully associative cache can compare an adress with the tags of all cache lines in parallel. Replacement is done by keeping a counter, `r`. When replacement needs to be done, line `r` is replaced and `r = r + 1 mod L`, where `L` is the number of lines in the cache.	`2*L^1.2`	0
Set Associative	N-way set associative	While keeping cache capacity constant, increase the associativity by a factor of `2^N`. We'll assume that each set (row) has a counter, `s`, associated with it. When a replacement needs to be done in that set, replace the line in set `s` and `s = s + 1 mod N`.	`3*N`	0
LRU replacement (Set-Associative)	NA	`N`-way set associative with `L` rows. See text, page 504.	`N`	`1`
Write-back	Write back cache	Requires a dirty bit per line. When a dirty line is replaced, it must be written back to memory before the new data can be fetched. See text, pages 484 and 521.	8	0
Write-buffer	N slots, fully associative	When a write to memory is required (which depends on other design decisions), the address and data to be written are placed in an emply slot in the write buffer, if there is one. Writes are initiated one at a time, in FIFO order, whenever the bus to memory is free. No stalling is required, unless there are no free write buffer slots when one is needed (in which case stalling occurs until some write completes and a buffer slot is freed). See the text, page 483 (Section 7.2).	`5 * N^1.2`	0
Victim cache	N slots, fully associative	This is a small, fully associative "second chance" cache, used in addition to the main cache. When performing a lookup, both the main and the victim caches are checked. If there is a hit in the victim cache, (a) the cache stalls one cycle, and (b) uses that cycle to swap the victim cache line with one from the main cache. (When the lookup is performed after the 1 cycle stall it will hit in the main cache.) On a miss in both caches, the replaced line is put in the victim cache, replacing a line (chosen at random) from it. That costs 1 cycle.	`5 * N^1.2`	1