# Memory & Caches III

CSE 351 Winter 2021

**Instructor:** Mark Wyse

**Teaching Assistants:** Kyrie Dowling, Catherine Guevara, Ian Hsiao, Jim Limprasert, Armin Magness, Allie Pfleger, Cosmo Wang, Ronald Widjaja



http://xkcd.com/908/

# Administrivia

- Lab 3 due Monday 2/22
- hw16 due Monday (2/22)
  - Covers the major cache mechanics
- hw17 due Wednesday (2/24)
  - Preparation for Lab 4
- Cache simulator:
  - https://courses.cs.washington.edu/courses/cse351/cachesim/

# Making memory accesses fast!

- Cache basics
- Principle of locality
- Memory hierarchies
- Cache organization
  - Direct-mapped (sets; index + tag)
  - Associativity (ways)
  - Replacement policy
  - Handling writes
- Program optimizations that consider caches

### **Reading Review**

- Terminology:
  - Associativity: sets, fully-associative cache
  - Replacement policies: least recently used (LRU)
  - Cache line: cache block + management bits (valid, tag)
  - Cache misses: compulsory, conflict, capacity
- Questions from the Reading?

#### **Review: Direct-Mapped Cache**



#### **Direct-Mapped Cache Problem**



# Associativity

- What if we could store data in any place in the cache?
  - More complicated hardware = more power consumed, slower
- So we combine the two ideas:
  - Each address maps to exactly one set
  - Each set can store a block in more than one way




# **Cache Organization (3)**

**Note:** The textbook uses "b" for offset bits

- Associativity (E): # of ways for each set
  - Such a cache is called an "E-way set associative cache"
  - We now index into cache *sets*, of which there are $S = C/K/E$
  - Use lowest $\log_2(C/K/E) = s$ bits of block address (recall $C = S \times E \times K$)
    - <u>Direct-mapped</u>: E = 1, so $s = \log_2(C/K)$ as we saw previously
    - <u>Fully associative</u>: E = C/K, so s = 0 bits
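The relationships above can be sketched in C. This is only an illustrative helper, not part of the course materials: the struct and function names (`cache_fields`, `log2_pow2`, `compute_fields`) are hypothetical, and it assumes C, K, and E are powers of two.

```c
#include <assert.h>

/* Derived cache parameters, named after the slide's variables. */
struct cache_fields {
    unsigned S;  /* number of sets */
    unsigned s;  /* index bits */
    unsigned k;  /* offset bits */
    unsigned t;  /* tag bits */
};

/* log2 of a power of two (hypothetical helper). */
static unsigned log2_pow2(unsigned x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* C = cache size (bytes), K = block size (bytes), E = associativity,
   m = address width (bits). */
struct cache_fields compute_fields(unsigned C, unsigned K, unsigned E, unsigned m) {
    struct cache_fields f;
    f.S = C / K / E;          /* S = C/K/E */
    f.s = log2_pow2(f.S);     /* s = log2(S) */
    f.k = log2_pow2(K);       /* k = log2(K) */
    f.t = m - f.s - f.k;      /* t = m - s - k */
    return f;
}
```

For example, with made-up values C = 256 B, K = 32 B, and m = 12, `compute_fields(256, 32, 1, 12)` gives S = 8 and s = 3 (direct-mapped), while `compute_fields(256, 32, 8, 12)` gives S = 1 and s = 0 (fully associative).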




# **Example Placement**



- Where would data from address 0x1833 be placed?
  - Binary: 0b 0001 1000 0011 0011

*m*-bit address: Tag (*t*) | Index (*s*) | Offset (*k*)

$t = m - s - k$, $s = \log_2(C/K/E)$, $k = \log_2(K) = 4$

| Organization          | Sets in figure | s = ? |
|-----------------------|----------------|-------|
| Direct-mapped         | 0 to 7         | 3     |
| 2-way set associative | 0 to 3         | 2     |
| 4-way set associative | 0 to 1         | 1     |

[Figure: the same 8-block cache drawn three ways, each with Set, Tag, and Data columns.]
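The placement question can be checked with a few bit operations. The sketch below assumes 16-bit addresses and an 8-block cache (as the figure's set numbering suggests); the function names `set_index` and `tag_bits` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Set index: drop the k offset bits, keep the low s bits of the block address. */
unsigned set_index(uint16_t addr, unsigned k, unsigned s) {
    assert(k + s <= 16);
    return ((unsigned)addr >> k) & ((1u << s) - 1u);
}

/* Tag: the remaining high bits above the index and offset. */
unsigned tag_bits(uint16_t addr, unsigned k, unsigned s) {
    assert(k + s <= 16);
    return (unsigned)addr >> (k + s);
}
```

With k = 4, address 0x1833 has block address 0x183 = 0b1 1000 0011, so `set_index(0x1833, 4, s)` returns 3 for s = 3 (direct-mapped), 3 for s = 2 (2-way), and 1 for s = 1 (4-way).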

# **Block Placement and Replacement**

- Any empty block in the correct set may be used to store block
  - Valid bit for each cache block indicates if data is valid (1) or garbage (0)
- If there are no empty blocks, which one should we replace?
  - No choice for direct-mapped caches
  - Caches typically use something close to *least recently used (LRU)* (hardware usually implements "not most recently used")
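As a rough illustration of the placement and replacement rules above, here is a sketch of choosing which line in a set to fill. It models true LRU with a per-line timestamp, which is an assumption for clarity (as noted, real hardware usually approximates LRU); `struct line` and `choose_victim` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Management state for one cache line; the data block itself is omitted. */
struct line {
    int valid;                /* 1 = holds a block, 0 = empty (garbage) */
    unsigned long last_used;  /* time of the most recent access */
};

/* Pick the way to fill within one set of E lines:
   any empty line first, otherwise the least recently used line. */
size_t choose_victim(const struct line *set, size_t E) {
    assert(E > 0);
    size_t victim = 0;
    for (size_t w = 0; w < E; w++) {
        if (!set[w].valid)
            return w;  /* empty line: nothing needs to be evicted */
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    }
    return victim;
}
```

With E = 1 the function always returns way 0, which matches "no choice for direct-mapped caches".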



# **Polling Questions**

- We have a cache of size 2 KiB with block size of 128 B. If our cache has 2 sets, what is its associativity?
  - Vote in Ed Lessons
  - A. 2
  - B. 4
  - C. 8
  - D. 16
  - E. We're lost...
  - $C = S \times E \times K$, so $E = \frac{C}{S \times K} = \frac{2^{11}}{2 \times 2^{7}} = 2^{3} = 8$
- If addresses are 16 bits wide, how wide is the Tag field?
  - $s = \log_2(2) = 1$, so $t = m - s - k = 16 - 1 - 7 = 8$

# **General Cache Organization (*S*, *E*, *K*)**



#### **Notation Review**

- We just introduced a lot of new variable names!
  - Please be mindful of block size notation when you look at past exam questions or are watching videos

| Parameter          | Variable      |
|--------------------|---------------|
| Block size         | K (B in book) |
| Cache size         | C             |
| Associativity      | E             |
| Number of Sets     | S             |
| Address space      | M             |
| Address width      | m             |
| Tag field width    | t             |
| Index field width  | s             |
| Offset field width | k (b in book) |

Formulas:

- $M = 2^m \leftrightarrow m = \log_2 M$
- $S = 2^s \leftrightarrow s = \log_2 S$
- $K = 2^k \leftrightarrow k = \log_2 K$
- $C = K \times E \times S$
- $s = \log_2(C/K/E)$
- $m = t + s + k$

#### **Example Cache Parameters Problem**

- 4 KiB address space, 125 cycles to go to memory. Fill in the following table:
  - $M = 4\ \text{KiB} = 2^2 \times 2^{10} = 2^{12}$ B, so $m = 12$ bits

| Parameter     | Value     | Derivation                                                                 |
|---------------|-----------|----------------------------------------------------------------------------|
| Cache Size    | 256 B     | $C = 2^8$ B                                                                |
| Block Size    | 32 B      | $K = 2^5$ B                                                                |
| Associativity | 2-way     | $E = 2$                                                                    |
| Hit Time      | 3 cycles  | HT                                                                         |
| Miss Rate     | 20%       | MR                                                                         |
| Tag Bits      | 5         | $t = m - s - k = 12 - 2 - 5 = 5$                                           |
| Index Bits    | 2         | $S = \frac{C}{E \times K} = \frac{2^8}{2 \times 2^5} = 2^2 = 4$, so $s = \log_2 4 = 2$ |
| Offset Bits   | 5         | $k = \log_2 32 = 5$                                                        |
| AMAT          | 28 cycles | $\text{HT} + \text{MR} \times \text{MP} = 3 + (0.2)(125) = 28$             |
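The AMAT formula used in this problem (hit time plus miss rate times miss penalty) is simple enough to write as a one-line function. The name `amat` and its parameter names are illustrative only; all times are in cycles.

```c
#include <assert.h>

/* Average memory access time: AMAT = HT + MR * MP (times in cycles). */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

Calling `amat(3, 0.2, 125)` reproduces the 28 cycles from the problem, up to floating-point rounding.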

#### **Cache Read**

1) Locate set
2) Check if any line in the set is valid and has a matching tag (hit)



Direct-mapped: one line per set, block size K = 8 B




No match? Then old line gets evicted and replaced


#### **Example: Set-Associative Cache (*E* = 2)**




Generalizes to any E!

2-way: two lines per set, block size K = 8 B





short int (2 B) is here

No match?

- One line in the set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), round-robin, pseudo-LRU, ...
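To tie the read steps together, here is a minimal sketch of a 2-way set-associative lookup with K = 8 B, as on the slides. The number of sets (4), the use of true LRU, and all names (`struct cache`, `cache_access`) are assumptions for illustration; the data bytes themselves are not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define K 8  /* block size in bytes, as on the slides */
#define S 4  /* number of sets (assumed for this sketch) */
#define E 2  /* lines per set (2-way) */

struct line {
    int valid;
    uint32_t tag;
    unsigned long last_used;
};

struct cache {
    struct line sets[S][E];
    unsigned long clock;  /* counts accesses, used as an LRU timestamp */
};

/* Returns 1 on a hit and 0 on a miss. On a miss the block is brought in,
   filling an empty line if there is one and evicting the LRU line otherwise. */
int cache_access(struct cache *c, uint32_t addr) {
    assert(c != 0);
    uint32_t block = addr / K;             /* strip the offset bits */
    struct line *set = c->sets[block % S]; /* 1) locate set */
    uint32_t tag = block / S;
    c->clock++;

    /* 2) check whether any valid line in the set has a matching tag */
    for (int w = 0; w < E; w++) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].last_used = c->clock;
            return 1;
        }
    }

    /* No match: choose an empty line, else the least recently used one */
    int victim = 0;
    for (int w = 0; w < E; w++) {
        if (!set[w].valid) {
            victim = w;
            break;
        }
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    }
    set[victim].valid = 1;
    set[victim].tag = tag;
    set[victim].last_used = c->clock;
    return 0;
}
```

A zero-initialized `struct cache` is a cold cache. Accessing 0x00 and then 0x06 gives a miss followed by a hit, since both bytes fall in the same 8-byte block.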

# **Types of Cache Misses: 3 C's!**

- Compulsory (cold) miss
  - Occurs on first access to a block
- Conflict miss
  - Conflict misses occur when the cache is large enough, but multiple data objects all map to the same slot
    - e.g., referencing blocks 0, 8, 0, 8, ... could miss every time
  - Direct-mapped caches have more conflict misses than *E*-way set-associative (where *E* > 1)
- Capacity miss
  - Occurs when the set of active cache blocks (the *working set*) is larger than the cache (just won't fit, even if the cache were *fully-associative*)
  - **Note:** *Fully-associative* only has Compulsory and Capacity misses
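The "blocks 0, 8, 0, 8, ..." example can be tried out with a small miss counter. This sketch assumes an 8-block cache and LRU replacement; `count_misses` and the size limits are hypothetical, and the trace holds non-negative block numbers rather than byte addresses.

```c
#include <assert.h>

#define MAX_SETS 8
#define MAX_WAYS 8

/* Count misses for a trace of block numbers in a cache with S sets of
   E ways each, using LRU. Within a set, slot 0 is the most recently used
   block and slot E-1 the least recently used; -1 marks an empty slot. */
int count_misses(const int *trace, int n, int S, int E) {
    assert(S >= 1 && S <= MAX_SETS && E >= 1 && E <= MAX_WAYS);
    int ways[MAX_SETS][MAX_WAYS];
    for (int s = 0; s < S; s++)
        for (int w = 0; w < E; w++)
            ways[s][w] = -1;

    int misses = 0;
    for (int i = 0; i < n; i++) {
        int *set = ways[trace[i] % S];
        int pos = 0;
        while (pos < E && set[pos] != trace[i])
            pos++;
        if (pos == E) {   /* not found: miss, the last (LRU) slot is dropped */
            misses++;
            pos = E - 1;
        }
        for (int w = pos; w > 0; w--)  /* shift more recent blocks back */
            set[w] = set[w - 1];
        set[0] = trace[i];             /* accessed block becomes most recent */
    }
    return misses;
}
```

For the trace 0, 8, 0, 8, 0, 8, a direct-mapped cache (`S = 8, E = 1`) misses all 6 times because both blocks map to set 0, while a 2-way cache of the same size (`S = 4, E = 2`) has only the 2 compulsory misses.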

# Example Code Analysis Problem

- Assuming the cache starts <u>cold</u> (all blocks invalid) and sum, i, and j are stored in registers, calculate the **miss rate**:
  - m = 12 bits, C = 256 B, K = 32 B, E = 2

t = 5 bits, s = 2 bits, k = 5 bits

    #define SIZE 8
    long ar[SIZE][SIZE], sum = 0;  // &ar=0x800
    for (int i = 0; i < SIZE; i++)
      for (int j = 0; j < SIZE; j++)
        sum += ar[i][j];

`ar` is an array of 64 elements arranged in 8 rows and 8 columns, stored in row-major order.

Miss rate = 1/4

Explanation: Each cache block (32B) holds 4 array entries (long = 8B). Thus, access to ar[0][0] misses, but the loaded block will hold ar[0][0], ar[0][1], ar[0][2], and ar[0][3]. Since the loop ordering iterates columns in the inner-loop, the next three accesses to ar[0][1-3] hit. The access to ar[0][4] misses, but loads the next three elements, too. The pattern repeats with a miss followed by three hits, so the miss rate is one miss per four accesses.

Note: For this specific code, because no element is accessed a second time, it doesn't matter where the blocks get placed in the cache.

Challenge Question: What is the miss rate if we switch the ordering of the loops? (hint: where the blocks are placed in the cache matters now!)
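One way to check this kind of analysis is to replay the loop's addresses through a small cache model. The sketch below uses the problem's parameters (C = 256 B, K = 32 B, E = 2, so S = 4, with `ar` at 0x800 and 8-byte longs) and assumes LRU replacement, which the problem does not specify. The function name `miss_rate` and its flag are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SIZE 8
#define AR_BASE 0x800u            /* &ar from the problem */
enum { K = 32, S = 4, E = 2 };    /* C = S * E * K = 256 B */

struct line {
    int valid;
    uint32_t tag;
    unsigned long last_used;
};

/* Miss rate of the nested loop starting from a cold cache.
   i_outer != 0: i is the outer loop (the order shown in the problem);
   i_outer == 0: the loops are switched. */
double miss_rate(int i_outer) {
    struct line cache[S][E] = {0};
    unsigned long clock = 0;
    int misses = 0, accesses = 0;

    for (int outer = 0; outer < SIZE; outer++) {
        for (int inner = 0; inner < SIZE; inner++) {
            int i = i_outer ? outer : inner;
            int j = i_outer ? inner : outer;
            /* row-major address of ar[i][j], assuming sizeof(long) == 8 */
            uint32_t addr = AR_BASE + 8u * (uint32_t)(i * SIZE + j);
            uint32_t block = addr / K;
            struct line *set = cache[block % S];
            uint32_t tag = block / S;
            int hit = 0, victim = 0;

            accesses++;
            clock++;
            for (int w = 0; w < E; w++) {
                if (set[w].valid && set[w].tag == tag) {
                    set[w].last_used = clock;
                    hit = 1;
                }
            }
            if (hit)
                continue;

            misses++;
            for (int w = 0; w < E; w++) {  /* empty line first, else LRU */
                if (!set[w].valid) {
                    victim = w;
                    break;
                }
                if (set[w].last_used < set[victim].last_used)
                    victim = w;
            }
            set[victim].valid = 1;
            set[victim].tag = tag;
            set[victim].last_used = clock;
        }
    }
    return (double)misses / accesses;
}
```

`miss_rate(1)` counts 16 misses in 64 accesses, matching the 1/4 derived above. Calling `miss_rate(0)` is one way to check your answer to the challenge question after working it out by hand.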