



### How does execution time grow with SIZE?

```
int array[SIZE];
int A = 0;

for (int i = 0; i < 2000000; ++i) {
    for (int j = 0; j < SIZE; ++j) {
        A += array[j];
    }
}
```

Plot: TIME versus SIZE.

University of Washington

### **Actual Data**

Autumn 2013



## Making memory accesses fast!

- Cache basics
- Principle of locality
- Memory hierarchies
- Cache organization
- Program optimizations that consider caches

#### **Problem: Processor-Memory Bottleneck**




#### Cache

- English definition: a hidden storage space for provisions, weapons, and/or treasures
- CSE definition: computer memory with short access time used for the storage of frequently or recently used instructions or data (i-cache and d-cache)

- More generally: memory used to optimize data transfers between system elements with different characteristics (network interface cache, I/O cache, etc.)


#### **General Cache Mechanics**



Memory and Caches

## **General Cache Concepts: Hit**




## **General Cache Concepts: Miss**



### **Why Caches Work**

- Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
- Temporal locality:
  - Recently referenced items are likely to be referenced again in the near future



- Spatial locality:
  - Items with nearby addresses tend to be referenced close together in time



How do caches take advantage of this?


### **Example: Locality?**

- Data:
  - Temporal: sum referenced in each iteration
  - Spatial: array `a[]` accessed in stride-1 pattern
- Instructions:
  - Temporal: cycle through loop repeatedly
  - Spatial: reference instructions in sequence
- Being able to assess the locality of code is a crucial skill for a programmer
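The code this slide analyzes did not survive extraction. A minimal sketch of a loop that matches the description above, assuming an `int` array `a` of length `n` (the function name `sum_vector` is hypothetical, not from the slides):

```c
/* Hypothetical example: sums the n ints stored in a. */
int sum_vector(const int *a, int n)
{
    int sum = 0;                    /* data, temporal: sum is touched every iteration */
    for (int i = 0; i < n; i++)     /* instructions, temporal: the loop body repeats */
        sum += a[i];                /* data, spatial: a[] is read with stride 1 */
    return sum;
}
```

Calling `sum_vector(v, 4)` on an array holding 1, 2, 3, 4 returns 10.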

### **Locality Example #1**

```
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```

```
a[0][0] a[0][1] a[0][2] a[0][3]
a[1][0] a[1][1] a[1][2] a[1][3]
a[2][0] a[2][1] a[2][2] a[2][3]

Access order:
 1: a[0][0]   2: a[0][1]   3: a[0][2]   4: a[0][3]
 5: a[1][0]   6: a[1][1]   7: a[1][2]   8: a[1][3]
 9: a[2][0]  10: a[2][1]  11: a[2][2]  12: a[2][3]
```

Access pattern: stride-1


## **Locality Example #2**

```
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

```
a[0][0] a[0][1] a[0][2] a[0][3]
a[1][0] a[1][1] a[1][2] a[1][3]
a[2][0] a[2][1] a[2][2] a[2][3]

Access order:
 1: a[0][0]   2: a[1][0]   3: a[2][0]
 4: a[0][1]   5: a[1][1]   6: a[2][1]
 7: a[0][2]   8: a[1][2]   9: a[2][2]
10: a[0][3]  11: a[1][3]  12: a[2][3]
```

Access pattern: stride-N
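The row-wise and column-wise traversals differ only in the distance between consecutive accesses. A minimal sketch of that distance, assuming the row-major layout C uses for `int a[M][N]`; the helper names `elem_offset`, `row_walk_stride`, and `col_walk_stride` are hypothetical:

```c
#include <stddef.h>

/* Byte offset of a[i][j] from the start of a row-major int array with n_cols columns. */
size_t elem_offset(size_t i, size_t j, size_t n_cols)
{
    return (i * n_cols + j) * sizeof(int);
}

/* Inner loop over j (sum_array_rows): distance between a[i][j] and a[i][j+1]. */
size_t row_walk_stride(size_t n_cols)
{
    return elem_offset(0, 1, n_cols) - elem_offset(0, 0, n_cols);
}

/* Inner loop over i (sum_array_cols): distance between a[i][j] and a[i+1][j]. */
size_t col_walk_stride(size_t n_cols)
{
    return elem_offset(1, 0, n_cols) - elem_offset(0, 0, n_cols);
}
```

For N = 4, the row walk advances by one `int` per access and the column walk by four, which is why the second version is labeled stride-N.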


## **Locality Example #3**

- What is wrong with this code?
- How can it be fixed?



#### **Cache Performance Metrics**

- Miss Rate
  - Fraction of memory references not found in cache (misses / accesses) = 1 - hit rate
  - Typical numbers (in percentages):
    - 3%-10% for L1
    - Can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit Time
  - Time to deliver a line in the cache to the processor
    - Includes time to determine whether the line is in the cache
  - Typical hit times: 1-2 clock cycles for L1; 5-20 clock cycles for L2
- Miss Penalty
  - Additional time required because of a miss
  - Typically 50-200 cycles for L2 (trend: increasing!)

#### **Cost of Cache Misses**

- Huge difference between a hit and a miss
  - Could be 100x, if just L1 and main memory
- Would you believe 99% hits is twice as good as 97%?
  - Consider:
    - Cache hit time of 1 cycle
    - Miss penalty of 100 cycles
    - (cycle = single fixed-time machine step)
- Average access time (the cache is checked on every access, so the hit time is always paid):
  - 97% hits: 1 cycle + 0.03 \* 100 cycles = 4 cycles
  - 99% hits: 1 cycle + 0.01 \* 100 cycles = 2 cycles
- This is why "miss rate" is used instead of "hit rate"
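The arithmetic above follows one formula: average access time = hit time + miss rate × miss penalty. A minimal sketch, with the hypothetical function name `average_access_time` and all times in cycles:

```c
/* Average access time in cycles.
 * The hit time is paid on every access; the miss penalty only on misses. */
double average_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

With a 1-cycle hit time and a 100-cycle miss penalty, a miss rate of 0.03 gives about 4 cycles and a miss rate of 0.01 gives about 2 cycles, matching the slide.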


#### Can we have more than one cache?

Why would we want to do that?


- Faster storage technologies almost always cost more per byte and have lower capacity
- The gaps between memory technology speeds are widening
  - True for: registers ↔ cache, cache ↔ DRAM, DRAM ↔ disk, etc.
- Well-written programs tend to exhibit good locality
- These properties complement each other beautifully
- They suggest an approach for organizing memory and storage systems known as a memory hierarchy


### **An Example Memory Hierarchy**




#### **Memory Hierarchies**

- Fundamental idea of a memory hierarchy:
  - For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
- Why do memory hierarchies work?
  - Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
  - Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
- Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.

## **Intel Core i7 Cache Hierarchy**



#### Where should we put data in the cache?






### Use tags to record which location is cached




## What's a cache block? (or cache line)



## Problems with direct mapped caches?

Direct mapped:

- Each memory address maps to exactly one index in the cache.
- What happens if a program uses addresses 2, 6, 2, 6, 2, ...?
  - Conflict: when 2 and 6 map to the same index, each access evicts the other address.
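The conflict can be reproduced with a small simulation. This is a sketch under stated assumptions, not the slide's own figure: the cache is assumed to have 4 one-byte blocks, so the index is `addr % 4`, and the name `direct_mapped_misses` is hypothetical.

```c
#include <stdbool.h>

#define NUM_LINES 4   /* assumption: 4 one-byte blocks */

/* Counts misses for a sequence of addresses, starting from an empty cache. */
int direct_mapped_misses(const unsigned *addrs, int n)
{
    bool valid[NUM_LINES] = { false };
    unsigned tag[NUM_LINES] = { 0 };
    int misses = 0;

    for (int k = 0; k < n; k++) {
        unsigned index = addrs[k] % NUM_LINES;   /* the only slot this address may use */
        unsigned t = addrs[k] / NUM_LINES;       /* identifies which address is in the slot */
        if (!valid[index] || tag[index] != t) {
            misses++;                            /* miss: replace whatever was there */
            valid[index] = true;
            tag[index] = t;
        }
    }
    return misses;
}
```

Under these assumptions 2 and 6 both map to index 2, so the stream 2, 6, 2, 6, 2 misses on every access, while 2, 7, 2, 7, 2 misses only twice.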



### A puzzle.

- What can you infer from this:
- Cache starts empty
- Access (addr, hit/miss) stream:



#### **Associativity**


- What if we could store data in *any* place in the cache?
- That might slow down caches (more complicated hardware), so we do something in between.
- Each address maps to exactly one set.



## What's a cache block? (or cache line)




## Now how do I know where data goes?



Our example used a 2<sup>2</sup>-block cache with 2<sup>1</sup> bytes per block. Where would 13 (1101) be stored?
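One way to answer the question is to split the address into bit fields. A minimal sketch for the parameters stated above (2<sup>1</sup> bytes per block gives 1 offset bit, 2<sup>2</sup> blocks gives 2 index bits); the names `addr_parts` and `split_address` are hypothetical:

```c
#define OFFSET_BITS 1u   /* 2^1 bytes per block */
#define INDEX_BITS  2u   /* 2^2 blocks in the cache */

struct addr_parts {
    unsigned tag;
    unsigned index;
    unsigned offset;
};

/* Splits an address into tag | index | offset, from high bits to low bits. */
struct addr_parts split_address(unsigned addr)
{
    struct addr_parts p;
    p.offset = addr & ((1u << OFFSET_BITS) - 1u);
    p.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
    p.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return p;
}
```

For address 13 (binary 1101) this yields offset 1, index 2 (binary 10), and tag 1.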



#### Example placement in set-associative caches

- Where would data from address 0x1833 be placed?
  - Block size is 16 bytes.
- 0x1833 in binary is 00...0110000 011 0011.



## **Block replacement**

- Any empty block in the correct set may be used for storing data.
- If there are no empty blocks, which one should we replace?




### Another puzzle.

- What can you infer from this:
- Cache starts *empty*
- Access (addr, hit/miss) stream:
  - (10, miss); (12, miss); (10, miss)
- Inferences:
  - 12 is not in the same block as 10
  - 12's block replaced 10's block
  - So this is a direct-mapped cache

## **Block replacement**

- Replace something, of course, but what?
  - Caches typically use something close to least recently used (LRU)
  - (hardware usually implements "not most recently used")




## General Cache Organization (S, E, B)






## **Example: Direct-Mapped Cache (E = 1)**

Direct-mapped: one line per set.

Assume: cache block size is 8 bytes.



No match: old line is evicted and replaced


## Example (for E = 1)

```
int sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;

    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

int sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;

    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}
```

Assume `sum`, `i`, `j` are in registers.

Address of an aligned element of `a`: aa...ayyyyxxxx000

Assume: cold (empty) cache, 3 bits for set, 5 bits for offset, so the address splits as

aa...ayyy yxx xx000

For example, a[0][0]: aa...a000 000 00000

Each block holds 32 B = 4 doubles.

- `sum_array_rows`: 4 misses per row of the array, 4\*16 = 64 misses
- `sum_array_cols`: every access a miss, 16\*16 = 256 misses

## Example (for E = 1)

```
float dotprod(float x[8], float y[8])
{
    float sum = 0;
    int i;

    for (i = 0; i < 8; i++)
        sum += x[i]*y[i];
    return sum;
}
```

In this example, cache blocks are 16 bytes and there are 8 sets in the cache. How many block offset bits? How many set index bits?

Address bits: ttt....t sss bbbb

- B = 16 = 2<sup>b</sup>: b = 4 offset bits
- S = 8 = 2<sup>s</sup>: s = 3 index bits

Example addresses:

- 0: 000....0 000 0000
- 128: 000....1 000 0000
- 160: 000....1 010 0000

If x and y have aligned starting addresses, e.g., &x[0] = 0, &y[0] = 128, then addresses 0 and 128 share set index 000, so the blocks of x and y evict each other.
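The effect of the starting addresses can be checked with a small simulation. This is a sketch under stated assumptions: a direct-mapped cache with 8 sets and 16-byte blocks as above, 4-byte floats, a cold cache, and accesses in the order x[0], y[0], x[1], y[1], and so on. The name `dotprod_misses_direct` is hypothetical.

```c
#include <stdbool.h>

#define BLOCK_SIZE 16u   /* bytes per block */
#define NUM_SETS   8u    /* direct-mapped: one line per set */

/* Counts the misses dotprod incurs for the given array base addresses. */
int dotprod_misses_direct(unsigned x_base, unsigned y_base)
{
    bool valid[NUM_SETS] = { false };
    unsigned tag[NUM_SETS] = { 0 };
    int misses = 0;

    for (unsigned i = 0; i < 8; i++) {
        /* Each iteration reads x[i], then y[i]; a float is assumed to be 4 bytes. */
        unsigned addrs[2] = { x_base + 4u * i, y_base + 4u * i };
        for (int k = 0; k < 2; k++) {
            unsigned block = addrs[k] / BLOCK_SIZE;
            unsigned set = block % NUM_SETS;
            unsigned t = block / NUM_SETS;
            if (!valid[set] || tag[set] != t) {
                misses++;            /* miss: evict the old line */
                valid[set] = true;
                tag[set] = t;
            }
        }
    }
    return misses;
}
```

With &x[0] = 0 and &y[0] = 128 all 16 accesses miss; with &y[0] = 160 the arrays use different sets and only 4 accesses miss.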



## E-way Set-Associative Cache (Here: E = 2)

E = 2: two lines per set.

Assume: cache block size is 8 bytes.

Address of short int: t tag bits, set index 0...01, block offset 100.

- Find the set using the set index bits.
- Compare the tag against both lines in the set: valid? + match: yes = hit.
- Use the block offset to locate the short int within the block.

#### No match:

- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...


#### **Types of Cache Misses**

- Cold (compulsory) miss
  - Occurs on first access to a block
- Conflict miss
  - Conflict misses occur when the cache is large enough, but multiple data objects all map to the same slot
    - e.g., referencing blocks 0, 8, 0, 8, ... would miss every time
  - Direct-mapped caches have more conflict misses than n-way set-associative caches (where n is a power of 2 and n > 1)
- Capacity miss
  - Occurs when the set of active cache blocks (the working set) is larger than the cache (just won't fit)

## Example (for E = 2)

```
float dotprod(float x[8], float y[8])
{
    float sum = 0;
    int i;

    for (i = 0; i < 8; i++)
        sum += x[i]*y[i];
    return sum;
}
```

If x and y have aligned starting addresses, e.g., &x[0] = 0, &y[0] = 128, the cache can still fit both because there are two lines in each set.

| Set | Line 0 | Line 1 |
|-----|--------|--------|
| 0 | x[0] x[1] x[2] x[3] | y[0] y[1] y[2] y[3] |
| 1 | x[4] x[5] x[6] x[7] | y[4] y[5] y[6] y[7] |
| 2 | | |
| 3 | | |
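The same access stream can be replayed against a 2-way cache. This is a sketch under stated assumptions: 4 sets of 2 lines with 16-byte blocks (matching the four rows of the table), 4-byte floats, a cold cache, and LRU replacement within each set. The name `dotprod_misses_2way` is hypothetical.

```c
#include <stdbool.h>

#define BLOCK_SIZE 16u   /* bytes per block */
#define NUM_SETS   4u    /* assumption: 4 sets */
#define WAYS       2     /* E = 2 lines per set */

/* Counts the misses dotprod incurs in a 2-way set-associative cache with LRU. */
int dotprod_misses_2way(unsigned x_base, unsigned y_base)
{
    bool valid[NUM_SETS][WAYS] = { { false } };
    unsigned tag[NUM_SETS][WAYS] = { { 0 } };
    int lru[NUM_SETS] = { 0 };   /* per set: the way to replace next */
    int misses = 0;

    for (unsigned i = 0; i < 8; i++) {
        unsigned addrs[2] = { x_base + 4u * i, y_base + 4u * i };
        for (int k = 0; k < 2; k++) {
            unsigned block = addrs[k] / BLOCK_SIZE;
            unsigned set = block % NUM_SETS;
            unsigned t = block / NUM_SETS;
            int way = -1;

            for (int w = 0; w < WAYS; w++)       /* compare both lines in the set */
                if (valid[set][w] && tag[set][w] == t)
                    way = w;

            if (way < 0) {                       /* miss: fill the LRU way */
                misses++;
                way = lru[set];
                valid[set][way] = true;
                tag[set][way] = t;
            }
            lru[set] = 1 - way;                  /* the other way is now least recently used */
        }
    }
    return misses;
}
```

With &x[0] = 0 and &y[0] = 128, only the 4 cold misses remain, one per block.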


#### What about writes?

- Multiple copies of data exist:
  - L1, L2, possibly L3, main memory
- What is the main problem with that?

- What to do on a write-hit?
  - Write-through: write immediately to memory and to all caches in between.
  - Write-back: defer the write to memory until the line is evicted (replaced)
    - Needs a *dirty bit* to indicate whether the line differs from memory
- What to do on a write-miss?
  - Write-allocate: load into cache, update line in cache.
    - Good if more writes or reads to the location follow
  - No-write-allocate: just write immediately to memory.
- Typical caches:
  - Write-back + write-allocate, usually (why?)
  - Write-through + no-write-allocate, occasionally

## Write-back, write-allocate example

- `mov 0xFACE, T`
- `mov 0xFEED, T`
- `mov U, %rax`

Cache (initial state): one line holding U = 0xBEEF, dirty bit 0.
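The three instructions can be traced with a toy model. This is a sketch under stated assumptions, not the slide's figure: the cache has a single line, memory holds only the two locations T and U, and the initial value of T in memory (0xCAFE) is made up for illustration. All names (`wb_cache`, `cache_write`, `cache_read`) are hypothetical.

```c
#include <stdbool.h>

enum { ADDR_T = 0, ADDR_U = 1, NUM_ADDRS = 2 };

struct wb_cache {
    unsigned memory[NUM_ADDRS];  /* backing memory */
    int line_addr;               /* which location the single cache line holds */
    unsigned data;               /* cached value */
    bool dirty;                  /* true if the line differs from memory */
    int writebacks;              /* number of lines written back so far */
};

/* Brings addr into the line, writing back the old line first if it is dirty. */
static void fetch_line(struct wb_cache *c, int addr)
{
    if (c->line_addr == addr)
        return;                              /* hit: nothing to do */
    if (c->dirty) {                          /* write-back happens only on eviction */
        c->memory[c->line_addr] = c->data;
        c->writebacks++;
    }
    c->line_addr = addr;
    c->data = c->memory[addr];
    c->dirty = false;
}

void cache_write(struct wb_cache *c, int addr, unsigned value)
{
    fetch_line(c, addr);   /* write-allocate: load the line on a write miss */
    c->data = value;
    c->dirty = true;       /* memory is now stale */
}

unsigned cache_read(struct wb_cache *c, int addr)
{
    fetch_line(c, addr);
    return c->data;
}
```

Starting with U cached and clean, the two writes to T leave memory untouched and set the dirty bit; the read of U then evicts T, writes 0xFEED back to memory once, and returns 0xBEEF.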



### Back to the Core i7 to look at ways



### **Software Caches are More Flexible**

#### Examples

File system buffer caches, web browser caches, etc.

#### Some design differences

- Almost always fully-associative
  - so, no placement restrictions
  - index structures like hash tables are common (for placement)
- Often use complex replacement policies
  - misses are very expensive when disk or network involved
  - worth thousands of cycles to avoid them
- Not necessarily constrained to single "block" transfers
  - may fetch or write-back in larger units, opportunistically

### Where else is caching used?


### **Optimizations for the Memory Hierarchy**

- Write code that has locality!
  - Spatial: access data contiguously
  - Temporal: make sure access to the same data is not too far apart in time
- How can you achieve locality?
  - Proper choice of algorithm
  - Loop transformations


### **Example: Matrix Multiplication**

$$(AB)_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}.$$

#### memory access pattern?
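The slide's code did not survive extraction. A minimal sketch of the straightforward triple loop, assuming n × n matrices of doubles stored row-major in flat arrays and a result matrix `c` that starts at zero; the name `mmm_naive` is hypothetical:

```c
/* c = c + a * b for n x n row-major matrices. */
void mmm_naive(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                /* a is walked along a row (stride 1), b down a column (stride n) */
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

In the inner loop, `a` is read with stride 1 and `b` with stride n, which is the access pattern the miss analysis is about.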


## **Cache Miss Analysis**

#### Assume:

- Matrix elements are doubles
- Cache block = 64 bytes = 8 doubles
- Cache size C << n (much smaller than n, not left-shifted by n)

#### First iteration:

- n/8 + n = 9n/8 misses (omitting matrix c)
  - Row: spatial locality, chunks of 8 items in a row are in the same cache line, so n/8 misses
  - Column: each item in the column is in a different cache line, so n misses
- Afterwards in cache: (schematic)

#### Other iterations:

- Again: n/8 + n = 9n/8 misses (omitting matrix c)

#### Total misses:

- 9n/8 \* n² = (9/8) \* n³ (the 9n/8 misses are incurred once per element of the result)

## **Blocked Matrix Multiplication**
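The slide's code did not survive extraction. A minimal sketch of blocked multiplication, assuming n × n matrices of doubles stored row-major in flat arrays, a result matrix `c` that starts at zero, and a block size `B` chosen by the caller; the name `mmm_blocked` is hypothetical. The extra bounds checks let n be something other than a multiple of B.

```c
/* c = c + a * b for n x n row-major matrices, computed in B x B blocks. */
void mmm_blocked(const double *a, const double *b, double *c, int n, int B)
{
    for (int i = 0; i < n; i += B)
        for (int j = 0; j < n; j += B)
            for (int k = 0; k < n; k += B)
                /* B x B mini matrix multiplication on blocks that fit in cache */
                for (int i1 = i; i1 < i + B && i1 < n; i1++)
                    for (int j1 = j; j1 < j + B && j1 < n; j1++)
                        for (int k1 = k; k1 < k + B && k1 < n; k1++)
                            c[i1 * n + j1] += a[i1 * n + k1] * b[k1 * n + j1];
}
```

The three outer loops pick one block from each matrix; the three inner loops reuse those blocks while they are still in the cache.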



#### **Cache Miss Analysis**

#### Assume:

- Cache block = 64 bytes = 8 doubles
- Cache size C << n (much smaller than n)
- Three blocks fit into cache: 3B<sup>2</sup> < C

#### First (block) iteration:

- A B × B block holds B<sup>2</sup> doubles, so loading it costs B<sup>2</sup>/8 misses
- Each iteration touches 2n/B blocks (n/B blocks from each of the two input matrices)
- 2n/B \* B<sup>2</sup>/8 = nB/4 misses

#### Other (block) iterations:

- Same as first iteration
- 2n/B \* B<sup>2</sup>/8 = nB/4

#### Total misses:

- nB/4 \* (n/B)<sup>2</sup> = n<sup>3</sup>/(4B)

### **Summary**

- No blocking: (9/8) \* n<sup>3</sup>
- Blocking: 1/(4B) \* n<sup>3</sup>
- If B = 8, the difference is 4 \* 8 \* 9 / 8 = 36x
- If B = 16, the difference is 4 \* 16 \* 9 / 8 = 72x
- Suggests the largest possible block size B, subject to the limit 3B<sup>2</sup> < C!
- Reason for dramatic difference:
  - Matrix multiplication has inherent temporal locality:
    - Input data: 3n<sup>2</sup>, computation 2n<sup>3</sup>
    - Every array element used O(n) times!
  - But program has to be written properly

## **Cache-Friendly Code**

#### Programmer can optimize for cache performance

- How data structures are organized
- How data are accessed
  - Nested loop structure
  - Blocking is a general technique

#### All systems favor "cache-friendly code"

- Getting absolute optimum performance is very platform specific
  - Cache sizes, line sizes, associativities, etc.
- Can get most of the advantage with generic code
  - Keep working set reasonably small (temporal locality)
  - Use small strides (spatial locality)
  - Focus on inner loop code


## **Intel Core i7 Cache Hierarchy**



#### L1 i-cache and d-cache:

32 KB, 8-way, Access: 4 cycles

#### L2 unified cache:

256 KB, 8-way, Access: 11 cycles

#### L3 unified cache:

8 MB, 16-way, Access: 30-40 cycles

Block size: 64 bytes for

