# Caches IV

CSE 351 Spring 2019

| Instructor:   | <b>Teaching Assistants:</b> |                |                |
|---------------|-----------------------------|----------------|----------------|
| Ruth Anderson | Gavin Cai                   | Jack Eggleston | John Feltrup   |
|               | Britt Henderson             | Richard Jiang  | Jack Skalitzky |
|               | Sophie Tian                 | Connie Wang    | Sam Wolfson    |
|               | Casey Xing                  | Chin Yeoh      |                |



http://xkcd.com/908/

## Administrivia

- Lab 3, due TONIGHT, Wednesday (5/15)
- Homework 4, due Wed (5/22) (Structs, Caches)
- Lab 4, coming soon!
  - Cache parameter puzzles and code optimizations



[Figure: a tiny cache next to Memory. Label on the tag field: there is only one set in this tiny cache, so the tag is the entire block address!]

In this example we are sort of ignoring block offsets. Here a block holds 2 bytes (16 bits, 4 hex digits).

Normally a block would be much bigger, so there would be multiple items per block. While only one item in that block would be written at a time, the entire line would be brought into the cache.

Instructions executed across the slides, in order:

- mov 0xFACE, F
- mov 0xFEED, F
- mov G, %rax



## **Peer Instruction Question**

- Which of the following cache statements is FALSE?
  - Vote at <u>http://pollev.com/rea</u>
  - A. We can reduce compulsory misses by decreasing our block size
    - FALSE: a smaller block size pulls fewer bytes into the \$ on a miss
  - B. We can reduce conflict misses by increasing associativity
    - True: more options to place blocks before evictions occur
  - C. A write-back cache will save time for code with good temporal locality on writes
    - True: blocks that keep being reused are evicted less often, so fewer write-backs
  - D. A write-through cache will always match data with the memory hierarchy level below it
    - True: yes, its main goal is data consistency
  - E. We're lost...

## **Optimizations for the Memory Hierarchy**

- Write code that has locality!
  - Spatial: access data contiguously
  - <u>Temporal</u>: make sure accesses to the same data are not too far apart in time
- How can you achieve locality?
  - Adjust memory accesses in *code* (software) to improve miss rate (MR)
    - Requires knowledge of *both* how caches work as well as your system's parameters
  - Proper choice of algorithm
  - Loop transformations
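
As a minimal sketch of a loop transformation (the function names and the flat-array layout are illustrative assumptions, not taken from the slides), the two functions below compute the same sum over an n × n row-major matrix. Only the loop order differs: the first has stride 1 in the inner loop, the second has stride n.

```c
#include <assert.h>
#include <stddef.h>

/* Inner loop moves along a row: consecutive accesses are adjacent in
 * memory (stride 1), so each cache block fetched is fully used. */
long sum_by_rows(const int *a, size_t n) {
    assert(a != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}

/* Same sum with the loops interchanged: the inner loop now moves down a
 * column, so consecutive accesses are n elements apart (stride n). */
long sum_by_cols(const int *a, size_t n) {
    assert(a != NULL);
    long sum = 0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * n + j];
    return sum;
}
```

Both return the same value; for a matrix much larger than the cache, the row-order version would be expected to miss far less often because it uses every element of each block it brings in.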

## **Example: Matrix Multiplication**



## **Matrices in Memory**

- How do cache blocks fit into this scheme?
  - Row-major matrix in memory:
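
To make the row-major layout concrete, here is a small sketch. The helper names are hypothetical, and it assumes 8-byte doubles, 64-byte cache blocks, and a matrix that starts on a block boundary. It maps element (i, j) to its position in the flat array and to the cache block that holds it.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_BYTES 64  /* assumed cache block size */

/* Position of element (i, j) in an n x n row-major matrix stored as a
 * flat array: rows are laid out one after another. */
size_t row_major_index(size_t i, size_t j, size_t n) {
    assert(i < n && j < n);
    return i * n + j;
}

/* Cache block (counted from the start of the matrix) that holds element
 * (i, j), assuming the matrix of doubles is block-aligned. */
size_t cache_block_of(size_t i, size_t j, size_t n) {
    return row_major_index(i, j, n) * sizeof(double) / BLOCK_BYTES;
}
```

Under these assumptions, eight neighbors along a row share one block, while neighbors down a column land in different blocks whenever n is at least 8.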



#### **Naïve Matrix Multiply**
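
The code on this slide did not survive extraction. The following is a sketch of the usual triple-loop multiply on flat row-major arrays; the function name and signature are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Naive matrix multiply: c += a * b for n x n row-major matrices.
 * The caller is assumed to zero c first if a plain product is wanted. */
void matmul_naive(const double *a, const double *b, double *c, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)              /* move along rows of a */
        for (size_t j = 0; j < n; j++)          /* move along columns of b */
            for (size_t k = 0; k < n; k++)      /* row of a: stride 1; column of b: stride n */
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Each pass of the innermost loop reads one row of a and one column of b, which is the access pattern the miss analysis counts.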






## Cache Miss Analysis (Naïve)

- Scenario Parameters:
  - Square matrix ($n \times n$), elements are doubles
  - Cache block size K = 64 B = 8 doubles (8 matrix elements per block)
  - Cache size $C \ll n$ (much smaller than n): key assumption!
- Each iteration:
  - $\frac{n}{8} + n = \frac{9n}{8}$ misses
  - $\frac{n}{8}$ from reading a row of A with stride 1 (one miss per 8 elements), $n$ from reading a column of B with stride n (one miss per element)
- Total misses:
  - $\frac{9n}{8} \times n^2 = \frac{9}{8}n^3$



## Linear Algebra to the Rescue (1)



- Can get the same result of a matrix multiplication by splitting the matrices into smaller submatrices (matrix "blocks")
- For example, multiply two 4×4 matrices:

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \text{ with } B \text{ defined similarly.}$$

$$AB = \begin{bmatrix} (A_{11}B_{11} + A_{12}B_{21}) & (A_{11}B_{12} + A_{12}B_{22}) \\ (A_{21}B_{11} + A_{22}B_{21}) & (A_{21}B_{12} + A_{22}B_{22}) \end{bmatrix}$$

## Linear Algebra to the Rescue (2)

This is extra (non-testable) material



Matrices of size  $n \times n$ , split into 4 blocks of size r (n=4r)

$$C_{22} = A_{21}B_{12} + A_{22}B_{22} + A_{23}B_{32} + A_{24}B_{42} = \sum_{k} A_{2k}B_{k2}$$

- Multiplication operates on small "block" matrices
  - Choose size so that they fit in the cache!
  - This technique is called "cache blocking"





- r = block matrix size (assume r divides n evenly)
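
The blocked multiply itself was shown as an image that is missing here. One way to write it, as a sketch with an assumed function name and flat row-major arrays, is to step the outer three loops by r and multiply one pair of r × r blocks in the inner three loops.

```c
#include <assert.h>
#include <stddef.h>

/* Blocked matrix multiply: c += a * b for n x n row-major matrices,
 * working on r x r sub-blocks. Requires r to divide n evenly. */
void matmul_blocked(const double *a, const double *b, double *c,
                    size_t n, size_t r) {
    assert(a != NULL && b != NULL && c != NULL);
    assert(r > 0 && n % r == 0);
    /* Outer loops pick the top-left corner of each r x r block. */
    for (size_t i = 0; i < n; i += r)
        for (size_t j = 0; j < n; j += r)
            for (size_t k = 0; k < n; k += r)
                /* Inner loops: block of c (i, j) += block of a (i, k) * block of b (k, j). */
                for (size_t ib = i; ib < i + r; ib++)
                    for (size_t jb = j; jb < j + r; jb++)
                        for (size_t kb = k; kb < k + r; kb++)
                            c[ib * n + jb] += a[ib * n + kb] * b[kb * n + jb];
}
```

The arithmetic is the same as in the naive version; only the order of the accesses changes, so that the three blocks in use can stay in the cache while they are being reused.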

## **Cache Miss Analysis (Blocked)**

- Scenario Parameters:
  - Cache block size K = 64 B = 8 doubles
  - Cache size $C \ll n$ (much smaller than n)
  - Three blocks $\blacksquare$ ($r \times r$) fit into cache: $3r^2 < C$
    - $r^2$ elements per block, 8 per cache block
- Each block iteration:
  - $r^2/8$ misses per block
  - $2n/r \times r^2/8 = nr/4$ misses ($n/r$ blocks in row and column)
- Afterwards in cache (schematic): shaded areas show blocks stored in the \$
- Total misses:
  - $nr/4 \times (n/r)^2 = n^3/(4r)$ vs. $9n^3/8$ for the naïve version
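
To compare the two estimates numerically, the formulas above can be written as small helper functions. This is only a sketch of the model in these slides (the function names are hypothetical), not a measurement of a real cache.

```c
#include <assert.h>

/* Estimated misses for the naive triple loop: 9n/8 per iteration,
 * n^2 iterations. */
double naive_misses(double n) {
    assert(n > 0);
    return 9.0 * n * n * n / 8.0;
}

/* Estimated misses for the blocked version: nr/4 per block iteration,
 * (n/r)^2 block iterations. */
double blocked_misses(double n, double r) {
    assert(n > 0 && r > 0);
    return n * n * n / (4.0 * r);
}
```

The ratio of the two estimates is $9r/2$, so within this model the benefit grows with the block size r, as long as $3r^2 < C$ still holds.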

## **Matrix Multiply Visualization**

Naïve:



≈ 90,000 cache misses

### **Cache-Friendly Code**

- Programmer can optimize for cache performance
  - How data structures are organized
  - How data are accessed
    - Nested loop structure
    - Blocking is a general technique
- All systems favor "cache-friendly code"
  - Getting absolute optimum performance is very platform specific
    - Cache size, cache block size, associativity, etc.
  - Can get most of the advantage with generic code
    - Keep working set reasonably small (temporal locality)
    - Use small strides (spatial locality)
    - Focus on inner loop code

Great general rules of thumb!



### **Learning About Your Machine**

- Linux:
  - lscpu
  - ls /sys/devices/system/cpu/cpu0/cache/index0/
    - <u>Ex</u>: cat /sys/devices/system/cpu/cpu0/cache/index\*/size
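
The same sysfs file can also be read from a program. The sketch below assumes the `size` file holds text such as `32K`, which is what Linux typically reports; both function names are hypothetical, and the read simply fails with -1 on systems without this path.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Convert a size string such as "32K" or "8M" to bytes.
 * Returns -1 if the string does not start with a number. */
long parse_cache_size(const char *s) {
    assert(s != NULL);
    char *end;
    long value = strtol(s, &end, 10);
    if (end == s)
        return -1;
    if (*end == 'K')
        value *= 1024L;
    else if (*end == 'M')
        value *= 1024L * 1024L;
    return value;
}

/* Read the size in bytes of cache "index<index>" for cpu0.
 * Returns -1 if the file is missing or cannot be parsed. */
long read_cache_size(int index) {
    char path[128];
    char buf[32];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu0/cache/index%d/size", index);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    long size = fgets(buf, sizeof buf, f) ? parse_cache_size(buf) : -1;
    fclose(f);
    return size;
}
```

Calling `read_cache_size(0)` would report the first cache listed for cpu0; which level each index corresponds to can be checked in the `level` and `type` files in the same directory.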

- Windows:
  - wmic memcache get <query> (all values in KB)
  - <u>Ex</u>: wmic memcache get MaxCacheSize
- Modern processor specs: <u>http://www.7-cpu.com/</u>