# Memory & Caches I

CSE 351 Winter 2021

#### Instructor:

Mark Wyse

#### Teaching Assistants:

Kyrie Dowling, Catherine Guevara, Ian Hsiao, Jim Limprasert, Armin Magness, Allie Pfleger, Cosmo Wang, Ronald Widjaja



# Administrivia

- hw14 due Friday (2/12), hw15 due Wednesday (2/17)
  - UW holiday Monday 2/15
- Lab 3 released today, due Monday (2/22)
  - Make sure to look at section slides for this week
- Mid-quarter Survey out today, until Friday 2/19
  - feedback will help us improve the course!
  - on Canvas
- Study Guide 2 will be released next week (2/15-19)

### Roadmap



# **Aside: Units and Prefixes**

- Here focusing on large numbers (exponents > 0)
- Note that  $10^3 \approx 2^{10}$
- SI prefixes are *ambiguous*: they can mean base 10 or base 2
- IEC prefixes are *unambiguously* base 2

| SI Size          | Prefix | Symbol | IEC Size        | Prefix | Symbol |
|------------------|--------|--------|-----------------|--------|--------|
| 10 <sup>3</sup>  | Kilo-  | K      | 2 <sup>10</sup> | Kibi-  | Ki     |
| 10 <sup>6</sup>  | Mega-  | M      | 2 <sup>20</sup> | Mebi-  | Mi     |
| 10 <sup>9</sup>  | Giga-  | G      | 2 <sup>30</sup> | Gibi-  | Gi     |
| 10 <sup>12</sup> | Tera-  | T      | 2 <sup>40</sup> | Tebi-  | Ti     |
| 10 <sup>15</sup> | Peta-  | P      | 2 <sup>50</sup> | Pebi-  | Pi     |
| 10 <sup>18</sup> | Exa-   | E      | 2 <sup>60</sup> | Exbi-  | Ei     |
| 10 <sup>21</sup> | Zetta- | Z      | 2 <sup>70</sup> | Zebi-  | Zi     |
| 10 <sup>24</sup> | Yotta- | Y      | 2 <sup>80</sup> | Yobi-  | Yi     |

SIZE PREFIXES (10<sup>x</sup> for Disk, Communication; 2<sup>x</sup> for Memory)
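As a rough sketch of how the table can be used, the C helper below splits a power of two into a leading count and an IEC prefix. The name `iec_split` and its interface are assumptions made for this illustration, not part of the course material. It relies on the fact that each IEC prefix step is a factor of $2^{10}$, so the exponent divided by 10 selects the prefix and the remainder gives the leading count.

```c
#include <assert.h>
#include <stddef.h>

/* IEC prefixes from the table above; index k stands for 2^(10*k). */
static const char *const IEC_PREFIXES[] = {
    "", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi", "Yi"
};

/*
 * Hypothetical helper: split 2^exp2 into (leading count) x (IEC prefix).
 * Stores the leading count, 2^(exp2 % 10), in *count and returns the prefix.
 */
const char *iec_split(unsigned exp2, unsigned *count)
{
    assert(count != NULL);
    assert(exp2 < 90);           /* the table stops at Yi = 2^80 */
    *count = 1u << (exp2 % 10);  /* at most 2^9 = 512 */
    return IEC_PREFIXES[exp2 / 10];
}
```

For example, `iec_split(27, &count)` sets `count` to 128 and returns `"Mi"`, since $2^{27} = 2^7 \times 2^{20}$.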

# How to Remember?

- Mnemonics
  - There unfortunately isn't one well-accepted mnemonic
    - But that shouldn't stop you from trying to come up with one!
  - Killer Mechanical Giraffe Teaches Pet, Extinct Zebra to Yodel
  - Kirby Missed Ganondorf Terribly, Potentially Exterminating Zelda and Yoshi
  - xkcd: Karl Marx Gave The Proletariat Eleven Zeppelins, Yo
    - <u>https://xkcd.com/992/</u>
  - Post your best on Ed Discussion!

# **Reading Review**

- Terminology:
  - Caches: cache blocks, cache hit, cache miss
  - Principle of locality: temporal and spatial
  - Average memory access time (AMAT): hit time, miss penalty, hit rate, miss rate
- Questions from the Reading?

#### **Review Questions**

- Convert the following to or from IEC:
  - 512 Ki-books: $512 \text{ Ki} = 2^{9} \times 2^{10} = 2^{19}$ books
  - $2^{27}$ caches: $2^{27} = 2^{7} \times 2^{20} = 128$ Mi-caches
- Compute the average memory access time (AMAT) for the following system properties:
  - Hit time of 1 ns
  - Miss rate of 1%
  - Miss penalty of 100 ns
  - AMAT = HT + MR × MP = 1 ns + 0.01 × 100 ns = 2 ns

### How does execution time grow with SIZE?



#### **Actual Data**



# Making memory accesses fast!

- **Cache basics**
- **Principle of locality**
- Memory hierarchies
- Cache organization
- Program optimizations that consider caches

#### **Processor-Memory Gap**



# **Problem: Processor-Memory Bottleneck**



#### **Problem: lots of waiting on memory**

cycle: single machine step (fixed-time)

### An Analogy – Cooking






#### For each ingredient, need to go to the store...ugh

- slow (it is far away)
- can only get a limited number of things at once




- If I have a fridge, can quickly get my ingredients as needed
- Re-stock the fridge with fewer trips to the grocery store to fetch a large number of ingredients



- Pronunciation: "cash"
  - We abbreviate this as "\$"
- I\$, D\$, L1\$, L2\$
- <u>English</u>: A hidden storage space for provisions, weapons, and/or treasures
- <u>Computer</u>: Memory with short access time used for the storage of frequently or recently used instructions (i-cache/I\$) or data (d-cache/D\$)
  - More generally: Used to optimize data transfers between any system elements with different characteristics (network interface cache, I/O cache, etc.)

# **General Cache Mechanics**



# **General Cache Concepts: Hit**



Data in block b is needed

Block b is in cache: Hit!

Data is returned to CPU

# **General Cache Concepts: Miss**



Data in block b is needed

Block b is not in cache: Miss!

Block b is fetched from memory

#### Block b is stored in cache

- Placement policy: determines where b goes
- Replacement policy: determines which block gets evicted (victim)

Data is returned to CPU
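The hit and miss steps above can be sketched as a toy cache in C. This is only an illustration under stated assumptions: the names `toy_cache` and `toy_cache_access`, the 4-slot size, and the placement policy (block number modulo the number of slots) are all invented here, and real caches use the organizations covered later in the course.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SLOTS 4  /* assumed toy cache size, in blocks */

typedef struct {
    bool valid[NUM_SLOTS];      /* does the slot hold a block? */
    unsigned block[NUM_SLOTS];  /* block number held by each slot */
    unsigned hits, misses;
} toy_cache;

/*
 * Request block b. Returns true on a hit. On a miss, b is "fetched" and
 * stored in its slot, evicting whatever block was there, and false is returned.
 */
bool toy_cache_access(toy_cache *c, unsigned b)
{
    assert(c != NULL);
    unsigned slot = b % NUM_SLOTS;  /* placement policy (assumed): b mod slots */
    if (c->valid[slot] && c->block[slot] == b) {
        c->hits++;                  /* block b is in cache: hit */
        return true;
    }
    c->misses++;                    /* block b is not in cache: miss */
    c->valid[slot] = true;          /* replacement: old occupant is the victim */
    c->block[slot] = b;
    return false;
}
```

Starting from a zero-initialized `toy_cache`, the access sequence 14, 14, 10, 14 gives miss, hit, miss, miss: blocks 14 and 10 both map to slot 2, so loading 10 evicts 14.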

# Why Caches Work

- Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
- *Temporal* locality:
  - Recently referenced items are *likely* to be referenced again in the near future
- Spatial locality:
  - Items with nearby addresses *tend* to be referenced close together in time
- How do caches take advantage of this?

# **Example: Any Locality?**



Spatial: consecutive elements of array a[] accessed

#### Instructions:

- Temporal: cycle through loop repeatedly
- Spatial: reference instructions in sequence

```
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)        /* rows */
        for (j = 0; j < N; j++)    /* columns */
            sum += a[i][j];
    return sum;
}
```



```
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)        /* columns */
        for (i = 0; i < M; i++)    /* rows */
            sum += a[i][j];
    return sum;
}
```


#### Layout in Memory





Access Pattern: stride = ?
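One way to reason about the stride is to compute element offsets directly. C stores multidimensional arrays in row-major order, so element `a[i][j]` of an array with `ncols` columns sits `i * ncols + j` elements past `a[0][0]`. The helper name `row_major_offset` below is hypothetical and only serves this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Offset (in elements, not bytes) of a[i][j] from a[0][0] for a
 * row-major 2D array with ncols columns.
 */
size_t row_major_offset(size_t i, size_t j, size_t ncols)
{
    assert(j < ncols);
    return i * ncols + j;
}
```

Moving from `a[i][j]` to `a[i][j+1]` changes the offset by 1 (stride-1, as in a row-wise traversal), while moving from `a[i][j]` to `a[i+1][j]` changes it by `ncols` (stride-N, as in a column-wise traversal).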






# **Locality Example #3**


```
int sum_array_3D(int a[M][N][L])
{
    int i, j, k, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < L; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}
```

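- What is wrong with this code?
  - The innermost loop varies the first index (k), so consecutive accesses are N\*L elements apart (stride = N\*L)
- How can it be fixed?
  - Reorder the loops to match the layout in memory: inner loop over the last dimension (L), middle loop over N, outer loop over M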

Layout in Memory (M = ?, N = 3, L = 4)
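A possible stride-1 version is sketched below. The function name `sum_array_3D_fixed` is made up for this example, and `M = 2` is an assumed value because the slide leaves M unspecified; N and L follow the layout caption. The loops are ordered so that the innermost loop walks the last dimension, which is contiguous in memory.

```c
#include <assert.h>
#include <stddef.h>

#define M 2  /* assumed for illustration; the slide leaves M unspecified */
#define N 3
#define L 4

/* Sum all elements, visiting them in the order they are laid out in memory. */
int sum_array_3D_fixed(int a[M][N][L])
{
    assert(a != NULL);
    int i, j, k, sum = 0;
    for (i = 0; i < M; i++)              /* outer: first dimension */
        for (j = 0; j < N; j++)          /* middle: second dimension */
            for (k = 0; k < L; k++)      /* inner: last dimension, stride-1 */
                sum += a[i][j][k];
    return sum;
}
```

The result is the same as the original function's; only the order of accesses changes.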



# **Cache Performance Metrics**

- Huge difference between a cache hit and a cache miss
  - Could be 100x speed difference between accessing cache and main memory (measured in *clock cycles*)
- Miss Rate (MR)
  - Fraction of memory references not found in cache (misses / accesses) = 1 − Hit Rate
- Hit Time (HT)
  - Time to deliver a block in the cache to the processor
    - Includes time to determine whether the block is in the cache
- Miss Penalty (MP)
  - Additional time required because of a miss

# **Cache Performance**

- Two things hurt the performance of a cache:
  - Miss rate and miss penalty
- Average Memory Access Time (AMAT): average time to access memory considering both hits and misses
  - AMAT = Hit time + Miss rate × Miss penalty (abbreviated AMAT = HT + MR × MP)
  - Derivation: $\text{AMAT} = \text{HT}(1 - \text{MR}) + (\text{HT} + \text{MP})\,\text{MR} = \text{HT} - \text{HT} \cdot \text{MR} + \text{HT} \cdot \text{MR} + \text{MR} \cdot \text{MP} = \text{HT} + \text{MR} \cdot \text{MP}$
- 99% hit rate twice as good as 97% hit rate!
  - Assume HT of 1 clock cycle and MP of 100 clock cycles
  - 97%: AMAT = 1 + 0.03 × 100 = 4 clock cycles
  - 99%: AMAT = 1 + 0.01 × 100 = 2 clock cycles
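The AMAT formula translates directly into a one-line C function. The name `amat` is chosen for this sketch; the result is in whatever time unit the hit time and miss penalty share (clock cycles, ns, and so on).

```c
#include <assert.h>

/* AMAT = HT + MR * MP, in the same unit as hit_time and miss_penalty. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);  /* a rate is a fraction */
    return hit_time + miss_rate * miss_penalty;
}
```

With HT = 1 and MP = 100 clock cycles, `amat(1.0, 0.03, 100.0)` is about 4 and `amat(1.0, 0.01, 100.0)` is about 2, matching the 97% versus 99% comparison.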

# **Practice Question**

- Processor specs: 200 ps clock, MP of 50 clock cycles, MR of 0.02, and HT of 1 clock cycle
  - AMAT = HT + MR × MP = 1 + 0.02 × 50 = 2 clock cycles = 2 × 200 ps = 400 ps
- Which improvement would be best?
  - A. 190 ps clock: 2 × 190 ps = 380 ps
  - B. Miss penalty of 40 clock cycles: 1 + 0.02 × 40 = 1.8 clock cycles = 360 ps
  - **C.** MR of 0.015 misses/instruction: 1 + 0.015 × 50 = 1.75 clock cycles = 1.75 × 200 ps = 350 ps (best)
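To compare options that change different parameters, it can help to convert everything to the same time unit. The sketch below assumes a hypothetical helper `amat_ps` that takes the clock period in picoseconds and the hit time and miss penalty in clock cycles.

```c
#include <assert.h>

/*
 * Hypothetical helper: AMAT in picoseconds, given the clock period in ps
 * and the hit time and miss penalty in clock cycles.
 */
double amat_ps(double clock_ps, double ht_cycles, double miss_rate,
               double mp_cycles)
{
    assert(clock_ps > 0.0);
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    double cycles = ht_cycles + miss_rate * mp_cycles;  /* AMAT in cycles */
    return cycles * clock_ps;                           /* convert to ps */
}
```

Calling it with the baseline specs (200, 1, 0.02, 50) gives about 400 ps; options A, B, and C give about 380 ps, 360 ps, and 350 ps, so lowering the miss rate helps most here.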

# Can we have more than one cache?

- Why would we want to do that?
  - Avoid going to memory!
- Typical performance numbers:
  - Miss Rate
    - L1 MR = 3-10%
    - L2 MR = Quite small (*e.g.*, < 1%), depending on parameters, etc.
  - Hit Time
    - L1 HT = 4 clock cycles
    - L2 HT = 10 clock cycles
  - Miss Penalty
    - MP = 50-200 cycles for missing in L2 & going to main memory
    - Trend: increasing!
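With two cache levels, one common way to extend AMAT is to treat the L1 miss penalty as the average time to access L2. The slides do not state this formula, so the sketch below is an assumption: it uses AMAT = HT(L1) + MR(L1) × (HT(L2) + MR(L2) × MP(L2)), where MR(L2) is the fraction of L2 accesses that miss (the local miss rate). The function name `amat_two_level` is invented for the example.

```c
#include <assert.h>

/*
 * Sketch of a two-level AMAT, assuming mr_l2 is the local L2 miss rate
 * (misses in L2 divided by accesses that reach L2). All times share one unit.
 */
double amat_two_level(double ht_l1, double mr_l1,
                      double ht_l2, double mr_l2, double mp_l2)
{
    assert(mr_l1 >= 0.0 && mr_l1 <= 1.0);
    assert(mr_l2 >= 0.0 && mr_l2 <= 1.0);
    double mp_l1 = ht_l2 + mr_l2 * mp_l2;  /* cost of an L1 miss */
    return ht_l1 + mr_l1 * mp_l1;
}
```

Picking sample values from the ranges above (L1 HT = 4 cycles, L1 MR = 5%, L2 HT = 10 cycles, L2 MR = 1%, MP = 100 cycles) gives 4 + 0.05 × (10 + 0.01 × 100) = 4.55 clock cycles.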

# Summary

- Memory Hierarchy
  - Successively higher levels contain "most used" data from lower levels
  - Exploits temporal and spatial locality
  - Caches are intermediate storage levels used to optimize data transfers between any system elements with different characteristics
- Cache Performance
  - Ideal case: found in cache (hit)
  - Bad case: not found in cache (miss), search in next level
  - Average Memory Access Time (AMAT) = HT + MR × MP
    - Hurt by Miss Rate and Miss Penalty