The Hardware/Software Interface

Memory and Caches I

Themes of CSE 351

- Interfaces and abstractions
  - So far: data type abstractions in C; x86 instruction set architecture (interface to hardware)
  - Today: abstractions of memory
  - Soon: process and virtual memory abstractions

- Representation
  - Integers, floats, addresses, arrays, structs

- Translation
  - Understand the assembly code that will be generated from C code

- Control flow
  - Procedures and stacks; buffer overflows

Making memory accesses fast!

- Cache basics
- Principle of locality
- Memory hierarchies
- Cache organization
- Program optimizations that consider caches

How does execution time grow with SIZE?

```c
int array[SIZE];
int A = 0;

for (int i = 0; i < 200000; ++i) {
    for (int j = 0; j < SIZE; ++j) {
        A += array[j];
    }
}
```
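
One way to explore the question is to time the loop for several array sizes. The following is a minimal sketch, not the course's measurement harness: the helper names `sum_repeated` and `time_sum` are hypothetical, the accumulator is widened to `long` to avoid overflow, and `clock()` gives only coarse CPU-time resolution.

```c
#include <stddef.h>
#include <time.h>

/* Sum the first `size` ints of `array`, repeated `reps` times.
 * `volatile` keeps the compiler from optimizing the loops away. */
long sum_repeated(const int *array, size_t size, int reps)
{
    volatile long total = 0;
    for (int i = 0; i < reps; ++i) {
        for (size_t j = 0; j < size; ++j) {
            total += array[j];
        }
    }
    return total;
}

/* Return the CPU time, in seconds, spent in one call to sum_repeated(). */
double time_sum(const int *array, size_t size, int reps)
{
    clock_t start = clock();
    (void)sum_repeated(array, size, reps);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_sum` with `reps` fixed at 200000 and a range of sizes, then plotting time against size, should make it visible where the growth stops being a straight line.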

Problem: Processor-Memory Bottleneck

Processor performance doubled about every 18 months

Bus bandwidth evolved much more slowly

Core 2 Duo (CPU):
- Can process at least 256 Bytes/cycle

Core 2 Duo (main memory):
- Bandwidth: 2 Bytes/cycle
- Latency: 100 cycles

Problem: lots of waiting on memory

Solution: caches

Cache

- **English definition:** a hidden storage space for provisions, weapons, and/or treasures

- **CSE definition:** computer memory with short access time used for the storage of frequently or recently used instructions or data (i-cache and d-cache)

  More generally: storage used to optimize data transfers between system elements with different characteristics (network interface cache, I/O cache, etc.)

---

General Cache Mechanics

**Cache**

- Smaller, faster, more expensive memory caches a subset of the blocks
- Data is copied in block-sized transfer units

**Memory**

- Larger, slower, cheaper memory viewed as partitioned into "blocks"

---

General Cache Concepts: **Hit**

**Request:** 14

**Data in block b is needed**

**Block b is in cache:** *Hit!*

- **Cache**
  
  8  9  14  3

- **Memory**
  
  0  1  2  3
  4  5  6  7
  8  9  10  11
  12  13  14  15

---

General Cache Concepts: **Miss**

**Request:** 12

**Data in block b is needed**

**Block b is not in cache:** *Miss!*

**Block b is fetched from memory**

- **Cache**
  
  8  12  14  3

- **Memory**
  
  0  1  2  3
  4  5  6  7
  8  9  10  11
  12  13  14  15
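
The hit and miss cases above can be sketched as a lookup over a four-slot cache of block numbers. This is an illustration only: the function name `cache_access` is hypothetical, and because the slides do not say how the replaced block is chosen, the caller passes the victim slot explicitly.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_SLOTS 4  /* the cache in the diagrams holds four blocks */

/* Look for `block` in the cache. On a hit, return true and leave the cache
 * unchanged. On a miss, overwrite slot `victim` with `block` (modeling the
 * fetch from memory) and return false. The replacement choice is left to
 * the caller because the slides do not specify a policy. */
bool cache_access(int cache[NUM_SLOTS], int block, size_t victim)
{
    for (size_t s = 0; s < NUM_SLOTS; ++s) {
        if (cache[s] == block) {
            return true;                    /* hit */
        }
    }
    cache[victim % NUM_SLOTS] = block;      /* miss: fetch and replace */
    return false;
}
```

Starting from a cache holding blocks 8, 9, 14, 3, a request for block 14 is a hit, while a request for block 12 with victim slot 1 is a miss that leaves the cache holding 8, 12, 14, 3, as in the diagrams.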

Cost of Cache Misses

- Huge difference between a hit and a miss
  - Could be 100x, if just L1 and main memory

- Would you believe 99% hits is twice as good as 97%?
  - Consider:
    - Cache hit time of 1 cycle
    - Miss penalty of 100 cycles
  - Average access time:
    - 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
    - 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles

- This is why “miss rate” is used instead of “hit rate”
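
The arithmetic above follows the pattern hit time + miss rate × miss penalty. A small sketch of that formula, using the hypothetical function name `average_access_time` and measuring everything in cycles:

```c
/* Average memory access time in cycles:
 * hit_time + miss_rate * miss_penalty.
 * miss_rate is a fraction between 0 and 1, not a percentage. */
double average_access_time(double hit_time, double miss_rate,
                           double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

With a hit time of 1 cycle and a miss penalty of 100 cycles, miss rates of 0.03 and 0.01 reproduce the 4-cycle and 2-cycle figures from the example.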

Why Caches Work

- Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
  - Temporal locality:
    - Recently referenced items are likely to be referenced again in the near future
    - Why is this important?
  - Spatial locality:
    - Items with nearby addresses tend to be referenced close together in time
    - How do caches take advantage of this?

Example: Locality?

```c
sum = 0;
for (i = 0; i < n; i++)
  sum += a[i];
return sum;
```

- Data:
  - Temporal: sum referenced in each iteration
  - Spatial: array a[] accessed in stride-1 pattern

- Instructions:
  - Temporal: cycle through loop repeatedly
  - Spatial: reference instructions in sequence

- Being able to assess the locality of code is a crucial skill for a programmer

Locality Example #1

```c
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```

Locality Example #2

```c
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```
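
To compare the two traversal orders, it can help to count how often consecutive accesses land in different cache blocks. The sketch below rests on several assumptions not stated in the examples: 64-byte blocks, an array that starts on a block boundary, and C's row-major layout, in which `a[i][j]` sits `i*N + j` ints past the start. The name `count_block_switches` is hypothetical, and a block switch is only a rough proxy for a miss, since a real cache may still hold a block touched earlier.

```c
#include <stddef.h>

#define BLOCK_BYTES 64  /* assumed cache block size */

/* Count the accesses to an M x N int array that touch a different block
 * than the previous access (the first access counts as a switch).
 * row_order != 0: i is the outer loop, as in sum_array_rows.
 * row_order == 0: j is the outer loop, as in sum_array_cols. */
size_t count_block_switches(size_t M, size_t N, int row_order)
{
    size_t switches = 0;
    size_t prev_block = (size_t)-1;          /* no block touched yet */
    size_t outer = row_order ? M : N;
    size_t inner = row_order ? N : M;

    for (size_t x = 0; x < outer; ++x) {
        for (size_t y = 0; y < inner; ++y) {
            size_t i = row_order ? x : y;
            size_t j = row_order ? y : x;
            /* Row-major layout: a[i][j] is (i*N + j) ints past the start. */
            size_t block = (i * N + j) * sizeof(int) / BLOCK_BYTES;
            if (block != prev_block) {
                ++switches;
                prev_block = block;
            }
        }
    }
    return switches;
}
```

For a 16 x 16 array of 4-byte ints, the row order switches blocks once per 16 accesses, while the column order switches on every access because each step jumps a full row (64 bytes) ahead.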

Memory Hierarchies

- Some fundamental and enduring properties of hardware and software systems:
  - Faster storage technologies almost always cost more per byte and have lower capacity
  - The gaps between memory technology speeds are widening
    - True for: registers ↔ cache, cache ↔ DRAM, DRAM ↔ disk, etc.
  - Well-written programs tend to exhibit good locality

- These properties complement each other beautifully

- They suggest an approach for organizing memory and storage systems known as a memory hierarchy

**An Example Memory Hierarchy**

- Fundamental idea of a memory hierarchy:
  - For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.

- Why do memory hierarchies work?
  - Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
  - Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.

- Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.

Cache Performance Metrics

- **Miss Rate**
  - Fraction of memory references not found in cache (misses / accesses)
    - $= 1 - \text{hit rate}$
  - Typical numbers (in percentages):
    - 3% - 10% for L1
    - Can be quite small (e.g., < 1%) for L2, depending on size, etc.

- **Hit Time**
  - Time to deliver a line in the cache to the processor
    - Includes time to determine whether the line is in the cache
  - Typical hit times: 1 - 2 clock cycles for L1; 5 - 20 clock cycles for L2

- **Miss Penalty**
  - Additional time required because of a miss
  - Typically 50 - 200 cycles for an L2 miss that goes to main memory (trend: increasing!)
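
The three metrics combine across levels: the miss penalty of L1 is itself the average access time of L2. The sketch below assumes a two-level cache in front of main memory and treats the L2 miss rate as a local rate (the fraction of L2 accesses that miss). The function name `two_level_access_time` and its parameter names are hypothetical.

```c
/* Average access time, in cycles, for an L1 cache backed by an L2 cache
 * backed by main memory. Miss rates are fractions between 0 and 1;
 * l2_miss_rate is measured over the accesses that reach L2. */
double two_level_access_time(double l1_hit_time, double l1_miss_rate,
                             double l2_hit_time, double l2_miss_rate,
                             double memory_penalty)
{
    /* An L1 miss costs an L2 lookup, plus a memory access if L2 misses. */
    double l1_miss_penalty = l2_hit_time + l2_miss_rate * memory_penalty;
    return l1_hit_time + l1_miss_rate * l1_miss_penalty;
}
```

As an example with values picked from the typical ranges above, `two_level_access_time(1.0, 0.05, 10.0, 0.01, 100.0)` models a 1-cycle L1 with a 5% miss rate, a 10-cycle L2 with a 1% miss rate, and a 100-cycle memory penalty.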

Examples of Caching in the Hierarchy

<table>
<thead>
<tr>
<th>Cache Type</th>
<th>What is Cached?</th>
<th>Where is it Cached?</th>
<th>Latency (cycles)</th>
<th>Managed By</th>
</tr>
</thead>
<tbody>
<tr>
<td>Registers</td>
<td>4/8-byte words</td>
<td>CPU core</td>
<td>0</td>
<td>Compiler</td>
</tr>
<tr>
<td>TLB</td>
<td>Address translations</td>
<td>On-Chip TLB</td>
<td>0</td>
<td>Hardware</td>
</tr>
<tr>
<td>L1 cache</td>
<td>64-byte block</td>
<td>On-Chip L1</td>
<td>1</td>
<td>Hardware</td>
</tr>
<tr>
<td>L2 cache</td>
<td>64-byte block</td>
<td>Off-Chip L2</td>
<td>10</td>
<td>Hardware</td>
</tr>
<tr>
<td>Virtual Memory</td>
<td>4-KB page</td>
<td>Main memory</td>
<td>100</td>
<td>Hardware+OS</td>
</tr>
<tr>
<td>Buffer cache</td>
<td>Parts of files</td>
<td>Main memory</td>
<td>100</td>
<td>OS</td>
</tr>
<tr>
<td>Network cache</td>
<td>Parts of files</td>
<td>Local disk</td>
<td>10,000,000</td>
<td>File system client</td>
</tr>
<tr>
<td>Browser cache</td>
<td>Web pages</td>
<td>Local disk</td>
<td>10,000,000</td>
<td>Web browser</td>
</tr>
<tr>
<td>Web cache</td>
<td>Web pages</td>
<td>Remote server disks</td>
<td>1,000,000,000</td>
<td>Web server</td>
</tr>
</tbody>
</table>

Memory Hierarchy: Core 2 Duo

L1/L2 cache: 64 B blocks

<table>
<thead>
<tr>
<th>Level</th>
<th>Capacity</th>
<th>Throughput</th>
<th>Latency</th>
</tr>
</thead>
<tbody>
<tr>
<td>L1 cache (i-cache and d-cache)</td>
<td>32 KB</td>
<td>16 B/cycle</td>
<td>3 cycles</td>
</tr>
<tr>
<td>L2 cache</td>
<td>~4 MB</td>
<td>8 B/cycle</td>
<td>14 cycles</td>
</tr>
<tr>
<td>Main memory</td>
<td>~4 GB</td>
<td>2 B/cycle</td>
<td>100 cycles</td>
</tr>
<tr>
<td>Disk</td>
<td>500 GB</td>
<td>1 B/30 cycles</td>
<td>Millions of cycles</td>
</tr>
</tbody>
</table>

Figure (CPU registers at the top, disk at the bottom) not drawn to scale.
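
As a rough way to read the throughput and latency figures together, one can estimate how long it takes to bring in one 64 B block from each level. The model below is a deliberate simplification and an assumption of this sketch, not something the slide states: it charges the full latency once and then streams the block at the listed throughput. The function name `block_fetch_cycles` is hypothetical.

```c
/* Estimated cycles to fetch one block from a level of the hierarchy:
 * the level's latency plus the time to stream the block at the level's
 * throughput. A simplified model; real hardware overlaps these costs. */
double block_fetch_cycles(double latency_cycles, double bytes_per_cycle,
                          double block_bytes)
{
    return latency_cycles + block_bytes / bytes_per_cycle;
}
```

Under this model a 64 B block costs about 3 + 64/16 = 7 cycles from L1, 14 + 64/8 = 22 cycles from L2, and 100 + 64/2 = 132 cycles from main memory.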