Memory & Caches I
CSE 351 Autumn 2022

Instructor:
Justin Hsia

Teaching Assistants:
Angela Xu
Arjun Narendra
Armin Magness
Assaf Vayner
Carrie Hu
Clare Edmonds
David Dai
Dominick Ta
Effie Zheng
James Froelich
Jenny Peng
Kristina Lansang
Paul Stevans
Renee Ruan
Vincent Xiao
Relevant Course Information

- HW15 due Monday, HW16 due Wednesday
  - Veterans Day next Friday (11/11); no lecture

- Lab 3 due next Friday (11/11)
  - Make sure to look at HW15 before starting

- Midterm starts tomorrow (11/3-5)
  - Private posts on Ed Discussion, please!
The Hardware/Software Interface

- **Topic Group 1: Data**
  - Memory, Data, Integers, Floating Point, Arrays, Structs

- **Topic Group 2: Programs**
  - x86-64 Assembly, Procedures, Stacks, Executables

- **Topic Group 3: Scale & Coherence**
  - Caches, Processes, Virtual Memory, Memory Allocation
The Hardware/Software Interface

- Topic Group 3: **Scale & Coherence**
  - **Caches**, Processes, Virtual Memory, Memory Allocation

- How do we maintain logical consistency in the face of more data and more processes?
  - How do we support control flow both within many processes and things external to the computer?
  - How do we support data access, including dynamic requests, across multiple processes?
Aside: Units and Prefixes (Review)

- Here focusing on large numbers (exponents > 0)
- Note that $10^3 \approx 2^{10}$
- SI prefixes are *ambiguous*: they can mean base 10 or base 2
- IEC prefixes are *unambiguously* base 2

<table>
<thead>
<tr>
<th>SI Size</th>
<th>Prefix</th>
<th>Symbol</th>
<th>IEC Size</th>
<th>Prefix</th>
<th>Symbol</th>
</tr>
</thead>
<tbody>
<tr>
<td>$10^3$</td>
<td>Kilo-</td>
<td>K</td>
<td>$2^{10}$</td>
<td>Kibi-</td>
<td>Ki</td>
</tr>
<tr>
<td>$10^6$</td>
<td>Mega-</td>
<td>M</td>
<td>$2^{20}$</td>
<td>Mebi-</td>
<td>Mi</td>
</tr>
<tr>
<td>$10^9$</td>
<td>Giga-</td>
<td>G</td>
<td>$2^{30}$</td>
<td>Gibi-</td>
<td>Gi</td>
</tr>
<tr>
<td>$10^{12}$</td>
<td>Tera-</td>
<td>T</td>
<td>$2^{40}$</td>
<td>Tebi-</td>
<td>Ti</td>
</tr>
<tr>
<td>$10^{15}$</td>
<td>Peta-</td>
<td>P</td>
<td>$2^{50}$</td>
<td>Pebi-</td>
<td>Pi</td>
</tr>
<tr>
<td>$10^{18}$</td>
<td>Exa-</td>
<td>E</td>
<td>$2^{60}$</td>
<td>Exbi-</td>
<td>Ei</td>
</tr>
<tr>
<td>$10^{21}$</td>
<td>Zetta-</td>
<td>Z</td>
<td>$2^{70}$</td>
<td>Zebi-</td>
<td>Zi</td>
</tr>
<tr>
<td>$10^{24}$</td>
<td>Yotta-</td>
<td>Y</td>
<td>$2^{80}$</td>
<td>Yobi-</td>
<td>Yi</td>
</tr>
</tbody>
</table>
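
As a small illustration of the table above, the sketch below rewrites a power of two, $2^n$, as a count with an IEC prefix by splitting $n$ into a multiple of 10 (the prefix) and a remainder (the count). The names `struct iec_value`, `to_iec`, and `print_iec` are assumptions made for this example, not part of the course materials.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper type: a count paired with an IEC symbol. */
struct iec_value {
    unsigned count;      /* leftover factor, 1 to 512 */
    const char *symbol;  /* "" (no prefix), "Ki", "Mi", ... "Yi" */
};

/* Rewrites 2^exp as count * (IEC prefix), for exponents 0 to 89. */
struct iec_value to_iec(unsigned exp)
{
    static const char *symbols[] = {"", "Ki", "Mi", "Gi", "Ti",
                                    "Pi", "Ei", "Zi", "Yi"};
    assert(exp < 90);
    struct iec_value v;
    v.symbol = symbols[exp / 10];  /* each prefix step is a factor of 2^10 */
    v.count = 1u << (exp % 10);    /* what is left after removing the prefix */
    return v;
}

/* Prints a line such as "2^19 = 512 Ki". */
void print_iec(unsigned exp)
{
    struct iec_value v = to_iec(exp);
    printf("2^%u = %u %s\n", exp, v.count, v.symbol);
}
```

For instance, `print_iec(19)` prints `2^19 = 512 Ki`, since $2^{19} = 2^9 \times 2^{10}$.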
How to Remember?

❖ Will be given to you on Final reference sheet

❖ Mnemonics
  ▪ There unfortunately isn’t one well-accepted mnemonic
    • But that shouldn’t stop you from trying to come up with one!
  ▪ **Killer Mechanical Giraffe Teaches Pet, Extinct Zebra to Yodel**
  ▪ **Kirby Missed Ganondorf Terribly, Potentially Exterminating Zelda and Yoshi**
  ▪ xkcd: **Karl Marx Gave The Proletariat Eleven Zeppelins, Yo**
    • [https://xkcd.com/992/](https://xkcd.com/992/)
  ▪ Post your best on Ed Discussion!
Reading Review

❖ Terminology:
  ▪ Caches: cache blocks, cache hit, cache miss
  ▪ Principle of locality: temporal and spatial
  ▪ Average memory access time (AMAT): hit time, miss penalty, hit rate, miss rate

❖ Questions from the Reading?
Review Questions

- Convert the following to or from IEC:
  - $2^9 \times 2^{10}$ books = 512 Ki-books
  - $2^{27}$ caches = $2^7 \times 2^{20}$ caches = 128 Mi-caches

- Compute the average memory access time (AMAT) for the following system properties:
  - Hit time of 1 ns
  - Miss rate of 1%
  - Miss penalty of 100 ns

\[
AMAT = HT + MR \times MP = 1\text{ ns} + 0.01(100\text{ ns}) = 1\text{ ns} + 1\text{ ns} = 2\text{ ns}
\]
How does execution time grow with SIZE?

```c
int array[SIZE];
int sum = 0;

for (int i = 0; i < 200000; i++) {
    for (int j = 0; j < SIZE; j++) {
        sum += array[j]; // execute SIZE x 200000 times
    }
}
```

Plot:

- Expect linear relationship with SIZE.
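
One way to try this experiment yourself is sketched below. It wraps the loop above in a timing function; the name `time_array_sum`, the use of `clock()` from `<time.h>`, and the heap allocation are all assumptions made for this sketch, not the setup used to produce the lecture's data.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Sketch: time `reps` passes of summing an int array of `size` elements.
 * Returns elapsed CPU time in seconds, or -1.0 if allocation fails.
 * The final sum is stored in *sum_out. */
double time_array_sum(size_t size, int reps, long *sum_out)
{
    assert(size > 0 && reps > 0 && sum_out != NULL);

    int *array = malloc(size * sizeof *array);
    if (array == NULL)
        return -1.0;
    for (size_t j = 0; j < size; j++)
        array[j] = 1;

    /* volatile keeps the compiler from optimizing the loops away */
    volatile long sum = 0;

    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        for (size_t j = 0; j < size; j++)
            sum += array[j];  /* executes size x reps times */
    clock_t end = clock();

    *sum_out = sum;
    free(array);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling this for a range of `size` values (for example, doubling each time) and plotting the returned times against `size` gives a rough version of the plot. The exact shape will depend on the machine and compiler settings.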
Actual Data

Plot of execution time versus SIZE: time grows linearly with SIZE, but there is a kink past which it increases more rapidly. A likely cause is that the array no longer fits in the cache.
Making memory accesses fast!

- Cache basics
- Principle of locality
- Memory hierarchies
- Cache organization
- Program optimizations that consider caches
Processor-Memory Gap

- μProc (“Moore’s Law”): 55%/year (2X/1.5 yr)
- DRAM: 7%/year (2X/10 yrs)
- Processor-Memory Performance Gap: grows 50%/year
- 1989: first Intel CPU with cache on chip
- 1998: Pentium III has two cache levels on chip
Problem: Processor-Memory Bottleneck

- Processor performance doubled about every 18 months
- Bus latency / bandwidth evolved much slower
- CPU (Core 2 Duo): can process at least 256 Bytes/cycle
- Main Memory (Core 2 Duo): bandwidth 2 Bytes/cycle, latency 100-200 cycles (30-60 ns)
- Problem: lots of waiting on memory
- cycle: single machine step (fixed-time)
Problem: Processor-Memory Bottleneck

Solution: caches

Cache 💰

- **Pronunciation**: “cash”
  - We abbreviate this as “$”

- **English**: A hidden storage space for provisions, weapons, and/or treasures

- **Computer**: Memory with short access time used for the storage of frequently or recently used instructions (i-cache/I$) or data (d-cache/D$)
  - *More generally*: Used to optimize data transfers between any system elements with different characteristics (network interface cache, I/O cache, etc.)
General Cache Mechanics (Review)

- **Cache**: Smaller, faster, more expensive memory. Caches a subset of the blocks.
- Data is copied between cache and memory in block-sized transfer units.
- **Memory**: Larger, slower, cheaper memory. Viewed as partitioned into “blocks”.
General Cache Concepts: Hit (Review)

Data in block b is needed

Block b is in cache: Hit!

Data is returned to CPU
General Cache Concepts: **Miss** (Review)

1. **Data in block b is needed**

2. **Block b is fetched from memory**

3. **Block b is stored in cache**
   - Placement policy: determines where b goes
   - Replacement policy: determines which block gets evicted (victim)

4. **Data is returned to CPU**
Why Caches Work (Review)

❖ **Locality**: Programs tend to use data and instructions with addresses near or equal to those they have used recently

❖ **Temporal locality**:  
  - Recently referenced items are *likely* to be referenced again in the near future

❖ **Spatial locality**:  
  - Items with nearby addresses *tend* to be referenced close together in time

❖ How do caches take advantage of this?
Example: Any Locality?

```c
sum = 0;
for (i = 0; i < n; i++)
{
    sum += a[i];
}
return sum;
```

- **Data:**
  - **Temporal:** `sum` referenced in each iteration
  - **Spatial:** consecutive elements of array `a[]` accessed

- **Instructions:**
  - **Temporal:** cycle through loop repeatedly
  - **Spatial:** reference instructions in sequence
Locality Example #1

```c
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];

    return sum;
}
```

**Access Pattern:**

- stride = 1: the inner loop walks along a row, and consecutive elements of a row are adjacent in memory

**M = 3, N = 4**

<table>
<tbody>
<tr>
<td>a[0][0]</td>
<td>a[0][1]</td>
<td>a[0][2]</td>
<td>a[0][3]</td>
</tr>
<tr>
<td>a[1][0]</td>
<td>a[1][1]</td>
<td>a[1][2]</td>
<td>a[1][3]</td>
</tr>
<tr>
<td>a[2][0]</td>
<td>a[2][1]</td>
<td>a[2][2]</td>
<td>a[2][3]</td>
</tr>
</tbody>
</table>

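**Layout in Memory** (row-major order)

```
76  92  108
```

**Note:** 76, 92, and 108 are the addresses where rows `a[0]`, `a[1]`, and `a[2]` start (each row holds 4 `int`s = 16 bytes); 76 is just one possible starting address of array `a`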
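
The row-major layout can be checked with a short address calculation. This is a minimal sketch; the function name `elem_addr` and its parameters are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Byte address of a[i][j] for an int array with n_cols columns stored in
 * row-major order, given the address of a[0][0] in base. */
size_t elem_addr(size_t base, size_t n_cols, size_t i, size_t j)
{
    assert(j < n_cols);
    /* Skip i full rows, then j elements within the row. */
    return base + (i * n_cols + j) * sizeof(int);
}
```

Assuming 4-byte `int`s, `elem_addr(76, 4, 1, 0)` is 92 and `elem_addr(76, 4, 2, 0)` is 108, and moving from `a[i][j]` to `a[i][j+1]` always advances by a single `int`.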
Locality Example #2

```c
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

**Access Pattern:**
- stride = N: the inner loop walks down a column, so each access jumps one full row of N elements
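
To see the two strides concretely, the sketch below measures the distance in elements between consecutive accesses for each traversal order. The helper names (`elems_between`, `row_wise_stride`, `col_wise_stride`) and the use of `intptr_t` casts to compare addresses are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define M 3
#define N 4

/* Distance from one int to another, measured in int-sized elements. */
static ptrdiff_t elems_between(const int *from, const int *to)
{
    assert(from != NULL && to != NULL);
    return (ptrdiff_t)(((intptr_t)to - (intptr_t)from) / (intptr_t)sizeof(int));
}

/* Row-wise traversal: the next access after a[0][0] is a[0][1]. */
ptrdiff_t row_wise_stride(int a[M][N])
{
    return elems_between(&a[0][0], &a[0][1]);
}

/* Column-wise traversal: the next access after a[0][0] is a[1][0]. */
ptrdiff_t col_wise_stride(int a[M][N])
{
    return elems_between(&a[0][0], &a[1][0]);
}
```

For any `int` array declared as `[M][N]`, the row-wise stride comes out as 1 and the column-wise stride as `N`.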
Locality Example #3

```c
int sum_array_3D(int a[M][N][L])
{
    int i, j, k, sum = 0;

    for (i = 0; i < N; i++)
        for (j = 0; j < L; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];

    return sum;
}
```

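❖ What is wrong with this code?
  ▪ The innermost loop steps through `k`, the first index, so consecutive accesses have stride = `N*L`

❖ How can it be fixed?
  ▪ Stride when each index is the innermost loop variable: `i` → stride `L`, `j` → stride 1, `k` → stride `N*L`
  ▪ Reorder the loops so that `j` (stride 1) is innermost and `k` (stride `N*L`) is outermost

Layout in Memory (M = ?, N = 3, L = 4)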
Cache Performance Metrics (Review)

- Huge difference between a cache hit and a cache miss
  - Could be 100x speed difference between accessing cache and main memory (measured in clock cycles)

- Miss Rate (MR)
  - Fraction of memory references not found in cache (misses / accesses) = 1 - Hit Rate

- Hit Time (HT)
  - Time to deliver a block in the cache to the processor
    - Includes time to determine whether the block is in the cache

- Miss Penalty (MP)
  - Additional time required because of a miss
Cache Performance (Review)

- Two things hurt the performance of a cache:
  - Miss rate and miss penalty

- Average Memory Access Time (AMAT): average time to access memory considering both hits and misses

\[
\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
\]

(abbreviated \(\text{AMAT} = \text{HT} + \text{MR} \times \text{MP}\))

- 99% hit rate twice as good as 97% hit rate!
  - Assume HT of 1 clock cycle and MP of 100 clock cycles
  - 97%: \(\text{AMAT} = 1 + 0.03 \times 100 = 4\) clock cycles
  - 99%: \(\text{AMAT} = 1 + 0.01 \times 100 = 2\) clock cycles
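
The AMAT formula translates directly into a small function. This is a minimal sketch; the name `amat` and the choice of `double` for every quantity are assumptions of this example.

```c
#include <assert.h>

/* AMAT = HT + MR * MP.
 * hit_time and miss_penalty must use the same unit (e.g., clock cycles);
 * miss_rate is a fraction between 0 and 1. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

With the numbers above, `amat(1, 0.03, 100)` evaluates to about 4 clock cycles and `amat(1, 0.01, 100)` to about 2, up to floating-point rounding.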
Practice Question

- **Processor specs:** 200 ps clock, MP of 50 clock cycles, MR of 0.02 misses/instruction, and HT of 1 clock cycle

  \[
  \text{AMAT} = \text{HT} + \text{MR} \times \text{MP} = 1 + 0.02 \times 50 = 2 \text{ clock cycles} = 400 \text{ ps}
  \]

- Which improvement would be best?
  
  A. **190 ps clock**  
     (overclocking, faster CPU)  
     
     \[2 \text{ clock cycles} \approx 380 \text{ ps}\]
  
  B. **Miss penalty of 40 clock cycles**  
     (reduced Mem size)  
     
     \[1 + 0.02 \times 40 = 1.8 \text{ clock cycles} \approx 360 \text{ ps}\]
  
  C. **MR of 0.015 misses/instruction**  
     (write better code)  
     
     \[1 + 0.015 \times 50 = 1.75 \text{ clock cycles} \approx 350 \text{ ps}\]
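
The comparison above mixes two units, clock cycles and picoseconds. A hedged sketch of the conversion is below; the function name `amat_ps` and its parameter order are assumptions of this example.

```c
#include <assert.h>

/* AMAT in picoseconds: (HT + MR * MP) clock cycles times the clock period.
 * ht_cycles and mp_cycles are in clock cycles; clock_ps is in picoseconds. */
double amat_ps(double clock_ps, double ht_cycles, double mr, double mp_cycles)
{
    assert(clock_ps > 0.0);
    assert(mr >= 0.0 && mr <= 1.0);
    return (ht_cycles + mr * mp_cycles) * clock_ps;
}
```

Under these definitions, the baseline is `amat_ps(200, 1, 0.02, 50)`, and options A, B, and C correspond to `amat_ps(190, 1, 0.02, 50)`, `amat_ps(200, 1, 0.02, 40)`, and `amat_ps(200, 1, 0.015, 50)`, which gives option C the lowest time.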
Can we have more than one cache?

❖ Why would we want to do that?
  ▪ Avoid going to memory!

❖ Typical performance numbers:
  ▪ Miss Rate
    • L1 MR = 3-10%
    • L2 MR = Quite small (e.g., < 1%), depending on parameters, etc.
  ▪ Hit Time
    • L1 HT = 4 clock cycles
    • L2 HT = 10 clock cycles
  ▪ Miss Penalty
    • MP = 50-200 cycles for missing in L2 & going to main memory
    • Trend: increasing!
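
To see why a second cache level helps, AMAT can be applied recursively: the miss penalty of L1 is taken to be the AMAT of L2. The sketch below assumes that model, and assumes the L2 miss rate is measured only over accesses that reach L2. The function name `amat_two_level` and the sample values in the note are assumptions picked from the ranges above, not measurements.

```c
#include <assert.h>

/* Two-level AMAT under the assumption that an L1 miss costs the AMAT of L2:
 *   AMAT = HT_L1 + MR_L1 * (HT_L2 + MR_L2 * MP_mem)
 * All times are in clock cycles; miss rates are fractions between 0 and 1. */
double amat_two_level(double l1_ht, double l1_mr,
                      double l2_ht, double l2_mr, double mem_mp)
{
    assert(l1_mr >= 0.0 && l1_mr <= 1.0);
    assert(l2_mr >= 0.0 && l2_mr <= 1.0);
    double l1_miss_penalty = l2_ht + l2_mr * mem_mp;  /* AMAT of the L2 level */
    return l1_ht + l1_mr * l1_miss_penalty;
}
```

For example, with an L1 HT of 4 cycles, L1 MR of 5%, L2 HT of 10 cycles, L2 MR of 1%, and a 100-cycle penalty to main memory, this gives 4 + 0.05 × (10 + 0.01 × 100) = 4.55 cycles, compared with 4 + 0.05 × 100 = 9 cycles if every L1 miss went straight to main memory.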
An Example Memory Hierarchy

- **Registers**: <1 ns
- **On-chip L1 cache (SRAM)**: 1 ns
- **Off-chip L2 cache (SRAM)**: 5-10 ns
- **Main memory (DRAM)**: 100 ns
- **SSD**: 150,000 ns
- **Disk**: 10,000,000 ns (10 ms)
- **Remote secondary storage (distributed file systems, web servers)**: 1-150 ms

**Storage Characteristics**:
- Toward the top of the hierarchy: **smaller, faster, costlier per byte**
- Toward the bottom of the hierarchy: **larger, slower, cheaper per byte**

**Latency Times** (the access times above scaled up to human time):
- L2 cache: 1-2 min
- Main memory: 15-30 min
- SSD: 31 days
- Disk: 66 months = 5.5 years
- Remote storage: 1-15 years
Summary

❖ Memory Hierarchy
  ▪ Successively higher levels contain “most used” data from lower levels
  ▪ Exploits *temporal and spatial locality*
  ▪ Caches are intermediate storage levels used to optimize data transfers between any system elements with different characteristics

❖ Cache Performance
  ▪ Ideal case: found in cache (hit)
  ▪ Bad case: not found in cache (miss), search in next level
  ▪ Average Memory Access Time (AMAT) = HT + MR × MP
    • Hurt by Miss Rate and Miss Penalty