## The Hardware/Software Interface

CSE351 Winter 2013

### **Memory and Caches II**

## General Cache Organization (S, E, B)



## **Types of Cache Misses**

#### Cold (compulsory) miss

Occurs on very first access to a block

#### Conflict miss

- Occurs when some block is evicted out of the cache, but then that block is referenced again later
- Conflict misses occur when the cache is large enough, but multiple data blocks all map to the same slot
  - e.g., if blocks 0 and 8 map to the same cache slot, then referencing 0, 8, 0, 8, ... would miss every time
  - Conflict misses may be reduced by increasing the associativity of the cache

#### Capacity miss

 Occurs when the set of active cache blocks (the working set) is larger than the cache (just won't fit)
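The 0, 8, 0, 8, ... pattern above can be checked with a toy cache model. This is a minimal sketch, assuming an 8-line cache, LRU replacement, and a trace given as block numbers; `count_misses`, `MAX_LINES`, and the parameter names are illustrative and not part of the course material.

```c
#include <stddef.h>

#define MAX_LINES 64   /* assumed upper bound: num_sets * ways <= MAX_LINES */

/* Count misses for a trace of (non-negative) block numbers in a cache
 * with num_sets sets of `ways` lines each, using LRU replacement. */
int count_misses(const int *trace, size_t n, int num_sets, int ways)
{
    int block[MAX_LINES] = {0};         /* block number held by each line */
    int valid[MAX_LINES] = {0};
    size_t last_used[MAX_LINES] = {0};  /* 0 means "never used" */
    int misses = 0;

    for (size_t t = 0; t < n; t++) {
        int base = (trace[t] % num_sets) * ways;  /* first line of the set */
        int hit = -1;
        int victim = base;

        for (int w = base; w < base + ways; w++) {
            if (valid[w] && block[w] == trace[t]) { hit = w; break; }
            /* track the least recently used line; empty lines win */
            if (last_used[w] < last_used[victim]) victim = w;
        }
        if (hit < 0) {                  /* miss: replace the victim line */
            misses++;
            valid[victim] = 1;
            block[victim] = trace[t];
            hit = victim;
        }
        last_used[hit] = t + 1;
    }
    return misses;
}
```

With the trace 0, 8, 0, 8, 0, 8, 0, 8, a direct-mapped layout (`num_sets = 8, ways = 1`) misses on every reference, while a 2-way layout of the same size (`num_sets = 4, ways = 2`) takes only the two cold misses.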


## Example: Direct-Mapped Cache (E = 1)

Direct-mapped: one line per set

Assume: cache block size 8 bytes






No match: old line is evicted and replaced




## Example (for E = 1)

Assume sum, i, j in registers and a cold (empty) cache.

```
double sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;

    /* row-major traversal: stride-1 accesses */
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

double sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;

    /* column-major traversal: stride-16 accesses */
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}
```

Address of an aligned element of a: aa...ayyyyxxxx000 (yyyy = row i, xxxx = column j). Blocks are 32 B = 4 doubles, so there are 3 bits for set and 5 bits for offset:

- a[i][j]: aa...ayyy yxx xx000
- a[0][0]: aa...a000 000 00000

Resulting misses:

- sum_array_rows: blocks are loaded as 0,0-0,3, then 0,4-0,7, and so on; 4 misses per row of array, 4 \* 16 = 64 misses
- sum_array_cols: blocks are loaded as 0,0-0,3, then 1,0-1,3, and so on; rows i and i+2 map to the same set, so each block is evicted before its next element is used; every access a miss, 16 \* 16 = 256 misses
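The two miss counts above (64 for the row-wise loop, 256 for the column-wise loop) can be reproduced with a small model. This is a sketch under stated assumptions: the array starts at address 0, the cache is direct-mapped with 8 sets and 32-byte blocks, and doubles are 8 bytes; `count_array_misses` and the constants are illustrative names.

```c
enum { N = 16, ELEM_BYTES = 8, BLOCK_BYTES = 32, NUM_SETS = 8 };

/* Count misses in a cold direct-mapped cache while visiting every element
 * of a 16x16 array of doubles. by_rows != 0 selects the row-wise loop
 * order, by_rows == 0 the column-wise order. */
int count_array_misses(int by_rows)
{
    long tags[NUM_SETS] = {0};
    int valid[NUM_SETS] = {0};
    int misses = 0;

    for (int outer = 0; outer < N; outer++) {
        for (int inner = 0; inner < N; inner++) {
            int i = by_rows ? outer : inner;
            int j = by_rows ? inner : outer;
            long addr = (long)(i * N + j) * ELEM_BYTES;  /* offset of a[i][j] */
            long block = addr / BLOCK_BYTES;
            int set = (int)(block % NUM_SETS);
            long tag = block / NUM_SETS;

            if (!valid[set] || tags[set] != tag) {       /* miss: replace line */
                misses++;
                valid[set] = 1;
                tags[set] = tag;
            }
        }
    }
    return misses;
}
```

`count_array_misses(1)` models `sum_array_rows` and `count_array_misses(0)` models `sum_array_cols`.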

## Example (for E = 1)

```
float dotprod(float x[8], float y[8])
{
    float sum = 0;
    int i;

    for (i = 0; i < 8; i++)
        sum += x[i]*y[i];
    return sum;
}
```

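In this example, cache blocks are 16 bytes and there are 8 sets in the cache. How many block offset bits? How many set index bits?

Address bits: ttt....t sss bbbb

- B = 16 = 2<sup>b</sup>: b = 4 offset bits
- S = 8 = 2<sup>s</sup>: s = 3 index bits

Example addresses:

- 0: 000....0 000 0000
- 128: 000....1 000 0000
- 160: 000....1 010 0000

If x and y have aligned starting addresses, e.g., &x[0] = 0, &y[0] = 128, then x[i] and y[i] always map to the same set, so each access evicts the block that the next access needs and every access misses.

If x and y have unaligned starting addresses, e.g., &x[0] = 0, &y[0] = 160, then x occupies sets 0 and 1 while y occupies sets 2 and 3, so both arrays fit in the cache at once.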
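The tag / set index / block offset split used for addresses 0, 128, and 160 can be written as a few shifts and masks. A minimal sketch; `split_address` and `struct addr_fields` are hypothetical names introduced for illustration, with `s` and `b` the number of index and offset bits.

```c
/* Fields of an address for a cache with 2^s sets and 2^b-byte blocks. */
struct addr_fields {
    unsigned long tag;
    unsigned set;
    unsigned offset;
};

struct addr_fields split_address(unsigned long addr, unsigned s, unsigned b)
{
    struct addr_fields f;
    f.offset = (unsigned)(addr & ((1UL << b) - 1));        /* low b bits */
    f.set    = (unsigned)((addr >> b) & ((1UL << s) - 1)); /* next s bits */
    f.tag    = addr >> (s + b);                            /* remaining bits */
    return f;
}
```

With s = 3 and b = 4, addresses 0 and 128 both land in set 0 (with tags 0 and 1), while address 160 lands in set 2.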




## E-way Set-Associative Cache (Here: E = 2)




#### No match:

- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...


## Example (for E = 2)

```
float dotprod(float x[8], float y[8])
{
    float sum = 0;
    int i;

    for (i = 0; i < 8; i++)
        sum += x[i]*y[i];
    return sum;
}
```

If x and y have aligned starting addresses, e.g., &x[0] = 0, &y[0] = 128, both can still fit because there are two lines in each set:

| x[0] | x[1] | x[2] | x[3] | y[0] | y[1] | y[2] | y[3] |
|------|------|------|------|------|------|------|------|
| x[4] | x[5] | x[6] | x[7] | y[4] | y[5] | y[6] | y[7] |
|      |      |      |      |      |      |      |      |
|      |      |      |      |      |      |      |      |


## **Intel Core i7 Cache Hierarchy**

Figure: processor package with Core 0 through Core 3. Each core has its own registers, L1 d-cache, L1 i-cache, and L2 unified cache. The L3 unified cache is shared by all cores and sits between the cores and main memory.

- L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
- L2 unified cache: 256 KB, 8-way, access: 11 cycles
- L3 unified cache: 8 MB, 16-way, access: 30-40 cycles

**Block size**: 64 bytes for all caches.


## Fully Set-Associative Caches (S = 1)

- Fully-associative caches have all lines in one single set, S = 1
  - E = C / B, where C is total cache size
  - Since S = (C/B)/E, it follows that S = 1
- Direct-mapped caches have E = 1
  - S = (C/B)/E = C/B
- Tag matching is more expensive in associative caches
  - Fully-associative cache needs C / B tag comparators: one for every line!
  - Direct-mapped cache needs just 1 tag comparator
  - In general, an E-way set-associative cache needs E tag comparators
- Tag size, assuming m address bits (m = 32 for IA32):
  - m - log<sub>2</sub>S - log<sub>2</sub>B
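The relations S = (C/B)/E and tag size = m - log<sub>2</sub>S - log<sub>2</sub>B can be worked through in code. A minimal sketch, assuming C, E, and B are powers of two; `describe_cache`, `struct cache_params`, and `log2_pow2` are illustrative names.

```c
/* Derived parameters for a cache of C bytes with E lines per set and
 * B bytes per block, on a machine with m address bits. */
struct cache_params {
    unsigned long sets;  /* S = (C/B)/E */
    unsigned s_bits;     /* log2(S) set index bits */
    unsigned b_bits;     /* log2(B) block offset bits */
    unsigned t_bits;     /* m - log2(S) - log2(B) tag bits */
};

/* log2 of a power of two */
static unsigned log2_pow2(unsigned long x)
{
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

struct cache_params describe_cache(unsigned long C, unsigned long E,
                                   unsigned long B, unsigned m)
{
    struct cache_params p;
    p.sets = (C / B) / E;
    p.s_bits = log2_pow2(p.sets);
    p.b_bits = log2_pow2(B);
    p.t_bits = m - p.s_bits - p.b_bits;
    return p;
}
```

For example, a 32 KB, 8-way cache with 64-byte blocks and m = 32 has 64 sets, 6 index bits, 6 offset bits, and a 20-bit tag. A fully-associative cache of 128 bytes with 16-byte blocks (E = 8) has a single set, no index bits, and a 28-bit tag.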

## What about writes?

- Multiple copies of data exist:
  - L1, L2, possibly L3, main memory
- What to do on a write-hit?
  - Write-through (write immediately to memory)
  - Write-back (defer write to memory until line is evicted)
    - Need a dirty bit to indicate if line is different from memory or not
- What to do on a write-miss?
  - Write-allocate (load into cache, update line in cache)
    - Good if more writes to the location follow
  - No-write-allocate (just write immediately to memory)
- Typical caches:
  - Write-back + Write-allocate, usually
  - Write-through + No-write-allocate, occasionally
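The difference in memory traffic between the two typical combinations can be illustrated with a one-line cache model. This is a sketch only: it tracks which block a single line holds and counts memory operations, not data, and all names (`struct line`, `struct stats`, `write_back_store`, `write_through_store`) are hypothetical.

```c
/* State of one cache line: which block it holds and whether it is dirty. */
struct line {
    int valid;
    int dirty;
    unsigned long block;
};

/* Memory traffic caused by stores. */
struct stats {
    int mem_reads;
    int mem_writes;
};

/* Write-back + write-allocate: memory is written only on eviction. */
void write_back_store(struct line *ln, unsigned long block, struct stats *st)
{
    if (!ln->valid || ln->block != block) {  /* write miss */
        if (ln->valid && ln->dirty)
            st->mem_writes++;                /* flush the evicted dirty line */
        st->mem_reads++;                     /* write-allocate: load the block */
        ln->valid = 1;
        ln->block = block;
    }
    ln->dirty = 1;                           /* line now differs from memory */
}

/* Write-through + no-write-allocate: every store goes to memory, and a
 * write miss does not change which block the line holds. */
void write_through_store(struct line *ln, unsigned long block, struct stats *st)
{
    (void)ln;
    (void)block;
    st->mem_writes++;
}
```

Ten consecutive stores to the same block cost one memory read and no memory writes under write-back + write-allocate (the single write happens later, on eviction), versus ten memory writes under write-through + no-write-allocate.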


## **Software Caches are More Flexible**

#### Examples

- File system buffer caches, web browser caches, etc.

#### Some design differences

- Almost always fully-associative
  - so, no placement restrictions
  - index structures like hash tables are common (for placement)
- Often use complex replacement policies
  - misses are very expensive when disk or network involved
  - worth thousands of cycles to avoid them
- Not necessarily constrained to single "block" transfers
  - may fetch or write-back in larger units, opportunistically


## **Example: Matrix Multiplication**



## **Optimizations for the Memory Hierarchy**

#### Write code that has locality

- Spatial: access data contiguously
- Temporal: make sure accesses to the same data are not too far apart in time

#### How to achieve?

- Proper choice of algorithm
- Loop transformations


## **Cache Miss Analysis**

#### Assume:

- Matrix elements are doubles
- Cache block = 64 bytes = 8 doubles
- Cache size C << n (much smaller than n)

#### First iteration:

n/8 + n = 9n/8 misses (omitting matrix c): n/8 for the row of a (one miss per 8 doubles) plus n for the column of b (one miss per element)



 Afterwards in cache: (schematic)




#### Other iterations:

Again: n/8 + n = 9n/8 misses (omitting matrix c)



#### Total misses:

 $9n/8 * n^2 = (9/8) * n^3$ 
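The analysis above corresponds to a plain triple loop in which each element of c is computed from one row of a and one column of b. A minimal sketch, assuming n x n matrices stored row-major in flat arrays and c initialized to zero by the caller; the name `mmm` and this layout are assumptions for illustration.

```c
/* Multiply n x n matrices a and b, accumulating into c (row-major). */
void mmm(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                /* a is walked along a row (stride 1),
                 * b is walked down a column (stride n) */
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Each of the n<sup>2</sup> (i, j) iterations runs the inner k loop once, which is where the 9n/8 misses per iteration come from.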


## **Blocked Matrix Multiplication**

## **Cache Miss Analysis**

#### Assume:

- Cache block = 64 bytes = 8 doubles
- Cache size C << n (much smaller than n)
- Three B x B matrix blocks fit into cache: 3B<sup>2</sup> < C

#### Other (block) iterations:

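- Same as first iteration
- Each B x B block costs B<sup>2</sup>/8 misses, and each block iteration touches 2n/B blocks (n/B from a and n/B from b, omitting matrix c)
- 2n/B \* B<sup>2</sup>/8 = nB/4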

#### Total misses:

 $nB/4 * (n/B)^2 = n^3/(4B)$ 
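The blocked loop nest that this analysis assumes can be sketched as follows. This is a minimal sketch, not necessarily the exact code from the lecture: it assumes row-major flat arrays, c initialized to zero by the caller, and a block dimension B that divides n; the name `mmm_blocked` is illustrative.

```c
/* Multiply n x n matrices a and b into c, one B x B block at a time.
 * Assumes B divides n. */
void mmm_blocked(const double *a, const double *b, double *c, int n, int B)
{
    for (int i = 0; i < n; i += B)
        for (int j = 0; j < n; j += B)
            for (int k = 0; k < n; k += B)
                /* B x B mini matrix multiplication on blocks that
                 * are meant to stay in the cache together */
                for (int i1 = i; i1 < i + B; i1++)
                    for (int j1 = j; j1 < j + B; j1++)
                        for (int k1 = k; k1 < k + B; k1++)
                            c[i1 * n + j1] += a[i1 * n + k1] * b[k1 * n + j1];
}
```

The three outer loops run (n/B)<sup>2</sup> block iterations of n/B steps each; the three inner loops reuse the same three B x B blocks, which is why the condition 3B<sup>2</sup> < C matters.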


## **Summary**

- No blocking: (9/8) \* n<sup>3</sup>
- Blocking: 1/(4B) \* n<sup>3</sup>
- If B = 8 difference is 4 \* 8 \* 9 / 8 = 36x
- If B = 16 difference is 4 \* 16 \* 9 / 8 = 72x

- Suggests largest possible block size B, but limit 3B<sup>2</sup> < C!
- Reason for dramatic difference:
  - Matrix multiplication has inherent temporal locality:
    - Input data: 3n<sup>2</sup>, computation 2n<sup>3</sup>
    - Every array element used O(n) times!
  - But program has to be written properly




## **Cache-Friendly Code**

#### Programmer can optimize for cache performance

- How data structures are organized
- How data are accessed
  - Nested loop structure
  - Blocking is a general technique

#### All systems favor "cache-friendly code"

- Getting absolute optimum performance is very platform specific
  - Cache sizes, line sizes, associativities, etc.
- Can get most of the advantage with generic code
  - Keep working set reasonably small (temporal locality)
  - Use small strides (spatial locality)
  - Focus on inner loop code
