### Lecture 19 (Mon 11/10/2008)

- Lab #3 Software Only Due Fri Nov 14 at 5pm
- HW #3 Cache Simulator & code optimization Due Mon Nov 24 at 5pm

#### Caches: Writing & Performance



















- The advantage of write-back caches is that, unlike with write-through caches, not every write operation needs to access main memory.
  - If a single address is frequently written to, then it doesn't pay to keep writing that data through to main memory.
  - If several bytes within the same cache block are modified, they will only force one memory write operation at write-back time.
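
The two points above can be illustrated by counting main-memory writes for a sequence of stores. This is only a minimal sketch: the cache geometry (`BLOCK_BYTES`, `NUM_BLOCKS`), the direct-mapped write-allocate policy, and both function names are assumptions made for illustration, not something the lecture specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy parameters, not taken from the lecture. */
#define BLOCK_BYTES 16u   /* bytes per cache block */
#define NUM_BLOCKS  4u    /* direct-mapped cache with 4 blocks */

/* Write-through: every store is also written to main memory. */
size_t write_through_mem_writes(const unsigned *addrs, size_t n)
{
    assert(addrs != NULL || n == 0);
    return n;
}

/* Write-back (assumed write-allocate): main memory is written only when
 * a dirty block is evicted, plus a final flush of blocks still dirty. */
size_t write_back_mem_writes(const unsigned *addrs, size_t n)
{
    unsigned tag[NUM_BLOCKS] = {0};   /* block number held in each slot */
    int valid[NUM_BLOCKS] = {0};
    int dirty[NUM_BLOCKS] = {0};
    size_t writes = 0;

    assert(addrs != NULL || n == 0);
    for (size_t k = 0; k < n; k++) {
        unsigned block = addrs[k] / BLOCK_BYTES;
        unsigned index = block % NUM_BLOCKS;

        if (!valid[index] || tag[index] != block) {
            if (valid[index] && dirty[index])
                writes++;             /* write the evicted dirty block back */
            valid[index] = 1;
            tag[index] = block;
        }
        dirty[index] = 1;             /* the store modifies the cached copy */
    }
    for (unsigned i = 0; i < NUM_BLOCKS; i++)
        if (valid[i] && dirty[i])
            writes++;                 /* final flush */
    return writes;
}
```

With six stores that all fall in one 16-byte block, the write-through count is six while the write-back count is one, which matches the second bullet above.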





























## Performance example

- Assume that 33% of the instructions in a program are data accesses. The cache hit ratio is 97% and the hit time is one cycle, but the miss penalty is 20 cycles.

$$
\begin{aligned}
\text{Memory stall cycles} &= \text{Memory accesses} \times \text{Miss rate} \times \text{Miss penalty} \\
&= 0.33\,I \times 0.03 \times 20 \text{ cycles} \\
&\approx 0.2\,I \text{ cycles}
\end{aligned}
$$

- If $I$ instructions are executed, then the number of wasted cycles will be $0.2 \times I$.
  - This code is 1.2 times slower than a program with a "perfect" CPI of 1!

## Memory systems are a bottleneck

$$
\text{CPU time} = (\text{CPU execution cycles} + \text{Memory stall cycles}) \times \text{Cycle time}
$$

- Processor performance traditionally outpaces memory performance, so the memory system is often the system bottleneck.
- For example, with a base CPI of 1, the CPU time from the previous example is:
  CPU time = $(I + 0.2 I) \times$ Cycle time
- What if we could *double* the CPU performance so the CPI becomes 0.5, but memory performance remained the same?
  CPU time = $(0.5 I + 0.2 I) \times$ Cycle time
- The overall CPU time improves by just 1.2/0.7 ≈ 1.7 times!
- Speeding up only part of a system has diminishing returns. Refer to Amdahl's Law from textbook page 267.
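
The arithmetic in the two examples above can be checked with a few lines of C. This is a sketch only, and the function names are hypothetical. Note that the exact stall figure is 0.198 cycles per instruction, which the slides round to 0.2, so the computed speedup comes out near 1.72 rather than exactly 1.2/0.7.

```c
#include <assert.h>

/* Memory stall cycles per instruction =
 * (memory accesses per instruction) x (miss rate) x (miss penalty). */
double stall_cycles_per_instr(double accesses_per_instr,
                              double miss_rate,
                              double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return accesses_per_instr * miss_rate * miss_penalty;
}

/* Effective CPI = base CPI + memory stall cycles per instruction. */
double effective_cpi(double base_cpi, double stalls_per_instr)
{
    return base_cpi + stalls_per_instr;
}

/* Speedup from lowering the base CPI while the memory stalls stay the
 * same; the cycle time cancels out of the ratio. */
double cpu_speedup(double old_base_cpi, double new_base_cpi,
                   double stalls_per_instr)
{
    return effective_cpi(old_base_cpi, stalls_per_instr) /
           effective_cpi(new_base_cpi, stalls_per_instr);
}
```

For the numbers in the example, `stall_cycles_per_instr(0.33, 0.03, 20.0)` gives the stall term, and `cpu_speedup(1.0, 0.5, stalls)` gives the overall improvement from halving the base CPI.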

## Basic main memory design

- There are some ways the main memory can be organized to reduce miss penalties and help with caching.
- For some concrete examples, let's assume the following three steps are taken when a cache needs to load data from the main memory.
  - 1. It takes 1 cycle to send an address to the RAM.
  - 2. There is a 15-cycle latency for each RAM access.
  - 3. It takes 1 cycle to return data from the RAM.
- In the setup shown here, the buses from the CPU to the cache and from the cache to RAM are all one word wide.
- If the cache has one-word blocks, then filling a block from RAM (*i.e.*, the miss penalty) would take 17 cycles.

- The cache controller has to send the desired address to the RAM, wait for the access, and receive the data: 1 + 15 + 1 = 17 clock cycles.
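
The three-step cost above can be written as a small helper. This is a sketch with hypothetical function names. The multi-word case is an assumption that goes beyond the text: it supposes that, with one-word-wide buses, each word of a larger block needs its own complete RAM access, one after another.

```c
#include <assert.h>

/* Cycles to fetch one word: send address + RAM latency + return data. */
unsigned word_fetch_cycles(unsigned addr_cycles,
                           unsigned ram_latency,
                           unsigned xfer_cycles)
{
    return addr_cycles + ram_latency + xfer_cycles;
}

/* Miss penalty for a block of block_words words.
 * ASSUMPTION (not stated in the text): words are fetched strictly one
 * after another, each paying the full single-word cost. */
unsigned miss_penalty_cycles(unsigned block_words,
                             unsigned addr_cycles,
                             unsigned ram_latency,
                             unsigned xfer_cycles)
{
    assert(block_words > 0);
    return block_words *
           word_fetch_cycles(addr_cycles, ram_latency, xfer_cycles);
}
```

With the lecture's numbers, `miss_penalty_cycles(1, 1, 15, 1)` reproduces the 17-cycle penalty for one-word blocks.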

[Figure: the CPU connected to the cache, and the cache connected to main memory, by one-word-wide buses.]

















## Writing Cache Friendly Code

- Two major rules:
  - Repeated references to data are good (temporal locality)
  - Stride-1 reference patterns are good (spatial locality)
- Example: cold cache, 4-byte words, 4-word cache blocks

Row-wise traversal (stride 1):

```c
/* Assumes M and N are compile-time constants defined elsewhere. */
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    /* Inner loop walks along a row: consecutive addresses. */
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```

Miss rate = 1/4 = 25%

Column-wise traversal (stride N):

```c
/* Assumes M and N are compile-time constants defined elsewhere. */
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;

    /* Inner loop walks down a column: addresses N words apart. */
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

Miss rate = 100% (when the array is too large for a block to stay cached until the next column needs it)

Adapted from Randy Bryant
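
The two miss rates can be reproduced with a tiny cache model. This is a sketch under stated assumptions, not a full simulator. The array size (`ROWS`, `COLS`), the 16-line direct-mapped cache (`NUM_LINES`), the block-aligned array starting at address 0, and all function names are hypothetical choices made for illustration. Only the 4-byte words and 4-word blocks come from the example.

```c
#include <assert.h>

#define ROWS        64      /* hypothetical M */
#define COLS        64      /* hypothetical N */
#define WORD_BYTES  4u      /* 4-byte words (from the example) */
#define BLOCK_BYTES 16u     /* 4-word blocks (from the example) */
#define NUM_LINES   16u     /* hypothetical direct-mapped cache size */

typedef struct {
    unsigned long tag[NUM_LINES];   /* block number stored in each line */
    int valid[NUM_LINES];
    unsigned long accesses, misses;
} cache_t;

/* Start from a cold (empty) cache. */
static void cache_reset(cache_t *c)
{
    for (unsigned i = 0; i < NUM_LINES; i++) {
        c->tag[i] = 0;
        c->valid[i] = 0;
    }
    c->accesses = 0;
    c->misses = 0;
}

/* Record one read of the word at byte_addr. */
static void cache_access(cache_t *c, unsigned long byte_addr)
{
    unsigned long block = byte_addr / BLOCK_BYTES;
    unsigned long line = block % NUM_LINES;

    c->accesses++;
    if (!c->valid[line] || c->tag[line] != block) {
        c->misses++;
        c->valid[line] = 1;
        c->tag[line] = block;
    }
}

/* Miss rate of reading every element of an int a[ROWS][COLS], row by
 * row when by_rows is nonzero, column by column otherwise. */
double traversal_miss_rate(int by_rows)
{
    cache_t c;
    unsigned long i, j;

    cache_reset(&c);
    if (by_rows) {
        for (i = 0; i < ROWS; i++)
            for (j = 0; j < COLS; j++)
                cache_access(&c, (i * COLS + j) * WORD_BYTES);
    } else {
        for (j = 0; j < COLS; j++)
            for (i = 0; i < ROWS; i++)
                cache_access(&c, (i * COLS + j) * WORD_BYTES);
    }
    assert(c.accesses > 0);
    return (double)c.misses / (double)c.accesses;
}
```

In this model the row-wise walk misses once per four-word block, while every element of a column maps to the same cache line and evicts its predecessor, so no block survives until it is needed again.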