## More \$ (caches, yes)

- Trick or treat!
- Midterm questions?
  - Note: practice midterms posted
- HW 2 due today
- Lab 3 will be released soon
  - You will implement a buffer overflow attack! Huahuahua!

# Deja-vu

```
int array[SIZE];
int A = 0;
for (int i = 0; i < 200000; ++i) {
    for (int j = 0; j < SIZE; ++j) {
        A += array[j];
    }
}
```

[Plot: runtime as a function of SIZE]

# Not to forget...



## **General Cache Mechanics**



## **General Cache Concepts: Hit**



## **General Cache Concepts: Miss**



## **Cache Performance Metrics**

#### Miss Rate

- Fraction of memory references not found in cache (misses / accesses) = 1 - hit rate
- Typical numbers (in percentages):
  - 3-10% for L1
  - can be quite small (e.g., < 1%) for L2, depending on size, etc.

#### Hit Time

- Time to deliver a line in the cache to the processor
  - includes time to determine whether the line is in the cache
- Typical numbers:
  - 1-2 clock cycles for L1
  - 5-20 clock cycles for L2

#### Miss Penalty

- Additional time required because of a miss
  - typically 50-200 cycles for main memory (<u>trend: increasing!</u>)



## Let's think about those numbers

- Huge difference between a hit and a miss
  - Could be 100x, if just L1 and main memory
- Would you believe 99% hits is twice as good as 97%?
  - Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
  - Average access time:

```
97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
```

- This is why "miss rate" is used instead of "hit rate"
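The arithmetic above can be sketched as a one-line function. The name `average_access_time` and its parameters are illustrative, not from the slides; all quantities are in cycles.

```c
/* Average access time = hit time + miss rate * miss penalty.
   The function name and parameter names are assumptions for illustration. */
double average_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

With a hit time of 1 and a miss penalty of 100, a miss rate of 0.03 gives about 4 cycles and 0.01 gives about 2 cycles, matching the two lines above.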

# Why do caches work?

- Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently

### Temporal locality:

Recently referenced items are *likely* to be referenced again in the near future

Why is this important?

### Spatial locality:

Items with nearby addresses tend to be referenced close together in time

How do caches take advantage of this?

#### Data:

- Temporal: sum referenced in each iteration
- Spatial: array a[] accessed in stride-1 pattern

#### Instructions:

- Temporal: cycle through loop repeatedly
- Spatial: reference instructions in sequence
- Being able to assess the locality of code is a crucial skill for a programmer
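The bullets above appear to describe a loop that accumulates the elements of an array `a` into `sum`; the slide's own code did not survive extraction, so the following is an assumed reconstruction (the function name `sum_array` and the parameter `n` are hypothetical).

```c
/* Assumed reconstruction of the loop the bullets describe.
   Data:         sum is touched every iteration (temporal);
                 a[0], a[1], ... are read in order (spatial, stride-1).
   Instructions: the loop body repeats (temporal) and its
                 instructions are fetched in sequence (spatial). */
int sum_array(const int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```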

```
/* Sum all elements of a, visiting them row by row. */
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```

```
a[0][0] a[0][1] a[0][2] a[0][3]
a[1][0] a[1][1] a[1][2] a[1][3]
a[2][0] a[2][1] a[2][2] a[2][3]
```

```
Access order:
 1: a[0][0]   2: a[0][1]   3: a[0][2]   4: a[0][3]
 5: a[1][0]   6: a[1][1]   7: a[1][2]   8: a[1][3]
 9: a[2][0]  10: a[2][1]  11: a[2][2]  12: a[2][3]
```

#### stride-1

```
/* Sum all elements of a, visiting them column by column. */
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

```
a[0][0] a[0][1] a[0][2] a[0][3]
a[1][0] a[1][1] a[1][2] a[1][3]
a[2][0] a[2][1] a[2][2] a[2][3]
```

```
Access order:
 1: a[0][0]   2: a[1][0]   3: a[2][0]
 4: a[0][1]   5: a[1][1]   6: a[2][1]
 7: a[0][2]   8: a[1][2]   9: a[2][2]
10: a[0][3]  11: a[1][3]  12: a[2][3]
```

#### stride-N

```
int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}
```

- What is wrong with this code?
- How can it be fixed?
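One possible answer, as a sketch: the innermost loop above varies `k`, the first index, so consecutive accesses are N\*N elements apart. Reordering the loops so the last index varies fastest gives a stride-1 pattern. The values of `M` and `N` and the name `sum_array_3d_fixed` are placeholders chosen for illustration.

```c
#define M 2   /* placeholder sizes, not from the slides */
#define N 3

/* Same sum as sum_array_3d, but the loop order now matches the
   array layout: k outermost, j innermost, so accesses are stride-1. */
int sum_array_3d_fixed(int a[M][N][N])
{
    int sum = 0;

    for (int k = 0; k < M; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}
```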

## Important questions about \$

- When we copy a block of data from main memory to the cache, where exactly should we put it?
- How can we tell if a word is already in the cache, or if it has to be fetched from main memory first?
- Eventually, the small cache memory might fill up. To load a new block from main RAM, we'd have to replace one of the existing blocks in the cache... which one?
- How can write operations be handled by the memory system?

# Where should we put data in the cache?

- A direct-mapped cache is the simplest approach: each main memory address maps to exactly one cache block.
- For example, on the right is a 16-byte main memory and a 4-byte cache (four 1-byte blocks).
- Memory locations 0, 4, 8 and 12 all map to cache block 0.
- Addresses 1, 5, 9 and 13 map to cache block 1, etc.
- How can we compute this mapping?
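A minimal sketch of the mapping for the example above (four 1-byte blocks): the block index is the address modulo the number of blocks. The names `NUM_BLOCKS`, `block_index`, and `block_index_mask` are illustrative.

```c
#define NUM_BLOCKS 4u   /* four 1-byte blocks, as in the example */

/* Cache block index for a byte address: address mod number of blocks. */
unsigned block_index(unsigned address)
{
    return address % NUM_BLOCKS;
}

/* Same result using a bit mask (the low two address bits);
   this shortcut only works because NUM_BLOCKS is a power of 2. */
unsigned block_index_mask(unsigned address)
{
    return address & (NUM_BLOCKS - 1u);
}
```

Addresses 0, 4, 8 and 12 all give index 0, and 1, 5, 9 and 13 give index 1.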



# Ok, we know where to look for data...

But how do we know if the data is what we want??

## Adding tags

We need to add tags to the cache, which supply the rest of the address bits to let us distinguish between different memory locations that map to the same cache block.



## What's a cache block? (or cache line)



# **Direct Mapped Caches**

Any problems with them?

## Disadvantage of direct mapping

- The direct-mapped cache is easy: indices and offsets can be computed with bit operators or simple arithmetic, because each memory address belongs in exactly one block.
- But, what happens if a program uses addresses 2, 6, 2, 6, 2, ...?
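To see the problem concretely, here is a small simulation sketch of the 4-block direct-mapped cache from the earlier example (1-byte blocks, one valid bit and tag per block). The names `DM_BLOCKS` and `count_misses` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define DM_BLOCKS 4u   /* four 1-byte blocks, as in the earlier example */

/* Count misses for a stream of byte addresses in a direct-mapped
   cache that starts empty. */
int count_misses(const unsigned *addresses, size_t n)
{
    bool valid[DM_BLOCKS] = { false };
    unsigned tag[DM_BLOCKS] = { 0 };
    int misses = 0;

    for (size_t k = 0; k < n; k++) {
        unsigned index = addresses[k] % DM_BLOCKS;  /* which block  */
        unsigned t = addresses[k] / DM_BLOCKS;      /* rest of addr */
        if (!valid[index] || tag[index] != t) {
            misses++;              /* miss: load and overwrite the block */
            valid[index] = true;
            tag[index] = t;
        }
    }
    return misses;
}
```

Addresses 2 and 6 both map to block 2, so the stream 2, 6, 2, 6, 2 misses on every access even though three of the four blocks stay empty.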



## **Associativity**

- What if we could store data in any place in the cache?
- But that might slow down caches... so we do something in between.



# But now how do I know where data goes?



Our example used a 2<sup>2</sup>-block cache with 2<sup>1</sup> bytes per block. Where would 13 (1101) be stored?
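One way to work this out, sketched in C and assuming the cache in question is direct-mapped: the low 1 bit is the block offset, the next 2 bits are the block index, and the remaining bits are the tag. The macro and function names are illustrative.

```c
#define OFFSET_BITS 1u   /* 2^1 bytes per block */
#define INDEX_BITS  2u   /* 2^2 blocks          */

/* Byte position within the block. */
unsigned addr_offset(unsigned addr)
{
    return addr & ((1u << OFFSET_BITS) - 1u);
}

/* Which cache block the address maps to. */
unsigned addr_index(unsigned addr)
{
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
}

/* Remaining high bits, stored as the tag. */
unsigned addr_tag(unsigned addr)
{
    return addr >> (OFFSET_BITS + INDEX_BITS);
}
```

For 13 (1101): offset 1, index 10 (block 2), tag 1.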

# A puzzle.

- What can you infer from this:
  - Cache starts empty
  - Access (addr, hit/miss) stream:
    - (10, miss), (11, hit), (12, miss)
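One way to approach the puzzle is to test candidate block sizes. The sketch below assumes aligned blocks whose size is a power of 2 and a cache that starts empty, so the hit on 11 means it shares a block with 10, and the miss on 12 means it does not. The function name `explains_stream` is hypothetical.

```c
#include <stdbool.h>

/* True if aligned blocks of block_size bytes are consistent with
   the stream (10, miss), (11, hit), (12, miss) on an empty cache. */
bool explains_stream(unsigned block_size)
{
    bool same_block_10_11 = (10u / block_size) == (11u / block_size);
    bool same_block_10_12 = (10u / block_size) == (12u / block_size);

    return same_block_10_11 && !same_block_10_12;
}
```

Under these assumptions, block sizes of 2 and 4 bytes fit the stream, while 1, 8 and 16 do not.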

# General Cache Organization (S, E, B)





# **Example: Direct-Mapped Cache (E = 1)**

Direct-mapped: One line per set

Assume: cache block size 8 bytes



No match: old line is evicted and replaced

E = 2: Two lines per set

Assume: cache block size 8 bytes

#### Address of short int:

| tag    | set index | block offset |
|--------|-----------|--------------|
| t bits | 0...01    | 100          |


### No match:

- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...

# Example placement in set-associative caches

- Where would data from memory byte address 6195 be placed, assuming the eight-block cache designs below, with 16 bytes per block?
- 6195 in binary is 00...0110000 011 0011.
- Each block has 16 bytes, so the lowest 4 bits are the block offset.
- For the 1-way cache, the next three bits (011) are the set index. For the 2-way cache, the next two bits (11) are the set index. For the 4-way cache, the next one bit (1) is the set index.
- The data may go in any block, shown in green, within the correct set.
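The three set indices can be checked with a short sketch: divide the address by the block size to get the block number, then take it modulo the number of sets. The names `BLOCK_BYTES`, `TOTAL_BLOCKS`, and `set_index` are illustrative.

```c
#define BLOCK_BYTES  16u   /* 16 bytes per block     */
#define TOTAL_BLOCKS 8u    /* eight blocks in total  */

/* Set index of a byte address when the cache is organized
   as `ways` blocks per set (ways = 1, 2 or 4 here). */
unsigned set_index(unsigned address, unsigned ways)
{
    unsigned num_sets = TOTAL_BLOCKS / ways;
    unsigned block_number = address / BLOCK_BYTES;  /* drop the 4 offset bits */

    return block_number % num_sets;
}
```

For address 6195 this gives set 3 (011) for 1-way, set 3 (11) for 2-way, and set 1 for 4-way.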



# Block replacement

- Any empty block in the correct set may be used for storing data.
- If there are no empty blocks, the cache controller will attempt to replace the least recently used block, just like before.
- For highly associative caches, it's expensive to keep track of what's really the least recently used block, so some approximations are used. We won't get into the details.
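A minimal sketch of exact LRU for a single 2-way set, using a per-line timestamp. This is only one way to track recency and is not taken from the slides; real hardware typically uses cheaper approximations, as noted above. All type and function names are hypothetical.

```c
#include <stdbool.h>

#define WAYS 2

struct line { bool valid; unsigned tag; unsigned long last_used; };
struct cache_set { struct line lines[WAYS]; unsigned long clock; };

/* Access `tag` in the set. Returns true on a hit. On a miss, fills an
   empty line if one exists, otherwise replaces the least recently used. */
bool access_set(struct cache_set *set, unsigned tag)
{
    int victim = 0;

    set->clock++;
    for (int w = 0; w < WAYS; w++) {
        if (set->lines[w].valid && set->lines[w].tag == tag) {
            set->lines[w].last_used = set->clock;   /* hit: refresh */
            return true;
        }
    }
    for (int w = 0; w < WAYS; w++) {
        if (!set->lines[w].valid) {                 /* prefer an empty line */
            victim = w;
            break;
        }
        if (set->lines[w].last_used < set->lines[victim].last_used)
            victim = w;                             /* older than current pick */
    }
    set->lines[victim].valid = true;
    set->lines[victim].tag = tag;
    set->lines[victim].last_used = set->clock;
    return false;
}
```

Starting from a zero-initialized set, the tag stream 1, 2, 1, 3, 1, 2 gives miss, miss, hit, miss (tag 2 is evicted), hit, miss.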

[Figure: the three eight-block designs: 1-way associativity (8 sets, 1 block each), 2-way associativity (4 sets, 2 blocks each), 4-way associativity (2 sets, 4 blocks each)]

# Another puzzle.

- What can you infer from this:
  - Cache starts empty
  - Access (addr, hit/miss) stream:
    - (10, miss), (12, miss), (10, miss)

### **Types of Cache Misses**

### Cold (compulsory) miss

Occurs on first access to a block

### Conflict miss

- Most hardware caches limit blocks to a small subset (sometimes just one) of the available cache slots
  - if one (e.g., block i must be placed in slot (i mod size)), <u>direct-mapped</u>
  - if more than one, n-way <u>set-associative</u> (where n is a power of 2)
- Conflict misses occur when the cache is large enough, but multiple data objects all map to the same slot
  - e.g., referencing blocks 0, 8, 0, 8, ... would miss every time

### Capacity miss

Occurs when the set of active cache blocks (the working set) is larger than the cache (just won't fit)

### What about writes?

### Multiple copies of data exist:

L1, L2, Main Memory, Disk

### What to do on a write-hit?

- Write-through (write immediately to memory)
- Write-back (defer write to memory until replacement of line)
  - How do we know when to write?

### What to do on a write-miss?

- Write-allocate (load into cache, then write)
  - When is this useful?
- No-write-allocate (writes immediately to memory)

### Typical

- Write-through + No-write-allocate
- Write-back + Write-allocate
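A common answer to "How do we know when to write?" under write-back is a per-line dirty bit: a write hit marks the line dirty, and memory is updated only when a dirty line is evicted. The sketch below models one line holding a single word; the struct and function names are hypothetical.

```c
#include <stdbool.h>

struct wb_line { bool valid; bool dirty; unsigned tag; int data; };

/* Write hit under write-back: update only the cached copy
   and remember that it now differs from memory. */
void write_hit(struct wb_line *line, int value)
{
    line->data = value;
    line->dirty = true;
}

/* Evict a line: copy it back to memory only if it was modified.
   Returns true if a write to memory actually happened. */
bool evict(struct wb_line *line, int *memory_word)
{
    bool wrote = false;

    if (line->valid && line->dirty) {
        *memory_word = line->data;
        wrote = true;
    }
    line->valid = false;
    line->dirty = false;
    return wrote;
}
```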

# **Memory Hierarchies**

- Some fundamental and enduring properties of hardware and software systems:
  - Faster storage technologies almost always cost more per byte and have lower capacity
  - The gaps between memory technology speeds are widening
    - True for: registers  $\leftrightarrow$  cache, cache  $\leftrightarrow$  DRAM, DRAM  $\leftrightarrow$  disk, etc.
  - Well-written programs tend to exhibit good locality
- These properties complement each other beautifully
- They suggest an approach for organizing memory and storage systems known as a memory hierarchy

### **An Example Memory Hierarchy**



# **Typical Memory Hierarchy (Intel Core i7)**



# **Examples of Caching in the Hierarchy**

| Cache Type     | What is Cached?      | Where is it Cached? | Latency (cycles) | Managed By         |
|----------------|----------------------|---------------------|------------------|--------------------|
| Registers      | 4-byte words         | CPU core            | 0                | Compiler           |
| TLB            | Address translations | On-Chip TLB         | 0                | Hardware           |
| L1 cache       | 64-byte block        | On-Chip L1          | 1                | Hardware           |
| L2 cache       | 64-byte block        | Off-Chip L2         | 10               | Hardware           |
| Virtual Memory | 4-KB page            | Main memory         | 100              | Hardware+OS        |
| Buffer cache   | Parts of files       | Main memory         | 100              | OS                 |
| Network cache  | Parts of files       | Local disk          | 10,000,000       | File system client |
| Browser cache  | Web pages            | Local disk          | 10,000,000       | Web browser        |
| Web cache      | Web pages            | Remote server disks | 1,000,000,000    | Web server         |

# **Memory Hierarchy: Core 2 Duo**

Not drawn to scale

L1/L2 cache: 64 B blocks



# Where else is caching used?

### **Software Caches are More Flexible**

### Examples

File system buffer caches, web browser caches, etc.

### Some design differences

- Almost always fully-associative
  - so, no placement restrictions
  - index structures like hash tables are common (for placement)
- Often use complex replacement policies
  - misses are very expensive when disk or network involved
  - worth thousands of cycles to avoid them
- Not necessarily constrained to single "block" transfers
  - may fetch or write-back in larger units, opportunistically

# **Optimizations for the Memory Hierarchy**

### Write code that has locality

- Spatial: access data contiguously
- Temporal: make sure access to the same data is not too far apart in time

### How to achieve?

- Proper choice of algorithm
- Loop transformations

### Cache versus register-level optimization:

- In both cases locality is desirable
- Register space is much smaller
  - requires scalar replacement to exploit temporal locality
- Register-level optimizations include exhibiting instruction-level parallelism (conflicts with locality)

# **Example: Matrix Multiplication**



# **Cache Miss Analysis**

#### Assume:

- Matrix elements are doubles
- Cache block = 8 doubles
- Cache size C << n (much smaller than n)

### First iteration:

n/8 + n = 9n/8 misses (omitting matrix c)



Afterwards in cache: (schematic)



### Other iterations:

Again: n/8 + n = 9n/8 misses (omitting matrix c)



### Total misses:

9n/8 \* n<sup>2</sup> = (9/8) \* n<sup>3</sup>
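The per-iteration count of n/8 + n misses is consistent with the usual triple-loop multiply over row-major arrays: the inner loop reads a row of `a` with stride 1 (n/8 misses) and a column of `b` with stride n (n misses). The slide's own code is not in the text, so the following is an assumed sketch; the name `mmm` and the flat-array layout are assumptions.

```c
/* c += a * b for n x n matrices stored row-major in flat arrays.
   The caller is expected to zero c beforehand. */
void mmm(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                /* a: row i, stride 1;  b: column j, stride n */
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```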

# **Blocked Matrix Multiplication**



n/B blocks
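A sketch of the blocked version, assuming row-major flat arrays: the three outer loops step over tiles, and the three inner loops do a small multiply on one tile from each matrix. The name `mmm_blocked` and the parameter `bs` (the tile size, called B in the analysis) are illustrative; the extra bound checks let `n` be any size, not only a multiple of `bs`.

```c
/* c += a * b for n x n row-major matrices, processed in bs x bs tiles.
   The caller is expected to zero c beforehand. */
void mmm_blocked(const double *a, const double *b, double *c, int n, int bs)
{
    for (int i = 0; i < n; i += bs)
        for (int j = 0; j < n; j += bs)
            for (int k = 0; k < n; k += bs)
                /* bs x bs mini matrix multiplication on one tile triple */
                for (int i1 = i; i1 < i + bs && i1 < n; i1++)
                    for (int j1 = j; j1 < j + bs && j1 < n; j1++)
                        for (int k1 = k; k1 < k + bs && k1 < n; k1++)
                            c[i1 * n + j1] += a[i1 * n + k1] * b[k1 * n + j1];
}
```

The result is the same as the unblocked multiply; only the order in which elements are visited changes.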

# **Cache Miss Analysis**

#### Assume:

- Cache block = 8 doubles
- Cache size C << n (much smaller than n)
- Four blocks fit into cache: 4B<sup>2</sup> < C

### First (block) iteration:

- B<sup>2</sup>/8 misses for each block
- 2n/B \* B<sup>2</sup>/8 = nB/4 (omitting matrix c)





### Other (block) iterations:

- Same as first iteration
- 2n/B \* B<sup>2</sup>/8 = nB/4



### Total misses:

- nB/4 \* (n/B)<sup>2</sup> = n<sup>3</sup>/(4B)

### Summary

- No blocking: (9/8) \* n<sup>3</sup>
- Blocking: 1/(4B) \* n<sup>3</sup>
- If B = 8, the difference is 4 \* 8 \* 9 / 8 = 36x
- If B = 16, the difference is 4 \* 16 \* 9 / 8 = 72x
- Suggests the largest possible block size B, but limited by 4B<sup>2</sup> < C! (can possibly be relaxed a bit, but there is a limit for B)
- Reason for dramatic difference:
  - Matrix multiplication has inherent temporal locality:
    - Input data: 3n<sup>2</sup>, computation: 2n<sup>3</sup>
    - Every array element is used O(n) times!
  - But the program has to be written properly
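As a check on the two ratios in the summary, dividing the unblocked miss count by the blocked one gives a closed form in B (this only restates the numbers above):

$$
\frac{\frac{9}{8}\,n^3}{\frac{1}{4B}\,n^3} = \frac{9B}{2},
\qquad B = 8 \Rightarrow 36, \quad B = 16 \Rightarrow 72
$$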