#### **Caches IV**

CSE 351 Spring 2020

**Instructor:** Ruth Anderson

**Teaching Assistants:** Alex Olshanskyy, Callum Walker, Chin Yeoh, Connie Wang, Diya Joy, Edan Sneh, Eddy (Tianyi) Zhou, Eric Fan, Jeffery Tian, Jonathan Chen, Joseph Schafer, Melissa Birchfield, Millicent Li, Porter Jones, Rehaan Bhimani

L19: Caches IV









http://xkcd.com/908/

#### **Administrivia**

- Lab 3 due this Wednesday (5/13)
- Lab 4 coming soon!
  - Cache parameter puzzles and code optimizations

- You must log on with your @uw Google account to access the lecture Google docs!
  - Google doc for 11:30 Lecture: <a href="https://tinyurl.com/351-05-11A">https://tinyurl.com/351-05-11A</a>
  - Google doc for 2:30 Lecture: <a href="https://tinyurl.com/351-05-11B">https://tinyurl.com/351-05-11B</a>

#### What about writes?

- Multiple copies of data may exist:
  - multiple levels of cache and main memory
- What to do on a write-hit? (block/data already in \$)
  - Write-through: write immediately to next level
  - Write-back: defer write to next level until line is evicted (replaced)
    - Must track which cache lines have been modified ("dirty bit"), which requires extra management
- What to do on a write-miss? (block/data not currently in \$)
  - Write allocate: ("fetch on write") load into cache, then execute the write-hit policy
    - Good if more writes or reads to the location follow
  - No-write allocate: ("write around") just write immediately to next level
- Typical caches:
  - Write-back + Write allocate, usually
  - Write-through + No-write allocate, occasionally

#### Write-back, Write Allocate Example

<u>Note</u>: While unrealistic, this example assumes that all requests have offset 0 and are for a block's worth of data. There is only one set in this tiny cache, so the tag is the entire block number!

The instructions below are not valid x86; they use the block number instead of a full byte address to keep the example simple.

1) mov \$0xFACE, (F): Write Miss!
   - Step 1: Bring **F** into cache
   - Step 2: Write 0xFACE to cache only and set the dirty bit
2) mov \$0xFEED, (F): Write Hit!
   - Step: Write 0xFEED to cache only (and set the dirty bit)
3) mov (G), %ax: Read Miss!
   - Step 1: Write **F** back to memory since it is dirty
   - Step 2: Bring **G** into the cache so that we can copy it into %ax (the newly loaded block is consistent with memory)
#### **Cache Simulator**

- Want to play around with cache parameters and policies? Check out our cache simulator!
  - https://courses.cs.washington.edu/courses/cse351/cachesim/

#### Ways to use:

- Take advantage of "explain mode" and navigable history to test your own hypotheses and answer your own questions
- Self-guided Cache Sim Demo posted along with Section 6
- Will be used in hw17: Lab 4 Preparation

# **Polling Question [Cache IV]**

- Which of the following cache statements is FALSE?
  - Vote at <a href="http://pollev.com/rea">http://pollev.com/rea</a>
  - A. We can reduce compulsory misses by decreasing our block size
    - smaller block size pulls fewer bytes into \$
  - B. We can reduce conflict misses by increasing associativity
    - more options to place blocks before evictions occur
  - C. A write-back cache will save time for code with good temporal locality on writes
    - blocks being written are less likely to get evicted, so fewer write-backs
  - D. A write-through cache will always match data with the memory hierarchy level below it
    - yes, its main goal is data consistency
  - E. We're lost...

#### **Optimizations for the Memory Hierarchy**

- Write code that has locality!
  - Spatial: access data contiguously
  - Temporal: make sure access to the same data is not too far apart in time
- How can you achieve locality?
  - Adjust memory accesses in <u>code</u> (software) to improve miss rate (MR)
    - Requires knowledge of both how caches work and your system's parameters
  - Proper choice of algorithm
  - Loop transformations
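As a small illustration of a loop transformation, the sketch below sums the same row-major matrix in two loop orders. The function names are hypothetical; both return the same result, but the first walks memory with stride 1 while the second jumps n doubles between consecutive accesses.

```c
#include <assert.h>
#include <stddef.h>

/* Stride-1 traversal: consecutive accesses touch adjacent doubles,
 * so each cache block brought in is fully used (good spatial locality). */
double sum_row_order(const double *m, size_t n) {
    assert(m != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += m[i*n + j];
    return sum;
}

/* Stride-n traversal: consecutive accesses are n doubles apart,
 * so for large n each access tends to land in a different cache block. */
double sum_col_order(const double *m, size_t n) {
    assert(m != NULL);
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += m[i*n + j];
    return sum;
}
```

Swapping the two loops changes only the order of memory accesses, which is why this kind of transformation can improve miss rate without changing the answer.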

## **Example: Matrix Multiplication**



#### **Matrices in Memory**

- How do cache blocks fit into this scheme?
  - Row major matrix in memory:



COLUMN of matrix (blue) is spread among cache blocks shown in red
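To make the picture concrete, the sketch below computes which cache block each element falls in. It assumes 64-byte cache blocks, 8-byte doubles, and a matrix that starts at a block-aligned address; the helper names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_BYTES 64  /* assumed cache block size */

/* Index of the cache block holding element (i, j) of an n x n row-major
 * matrix of doubles, counted from the (block-aligned) start of the matrix. */
size_t block_of(size_t i, size_t j, size_t n) {
    assert(i < n && j < n);
    return ((i*n + j) * sizeof(double)) / BLOCK_BYTES;
}

/* Number of distinct cache blocks touched by row i. */
size_t blocks_in_row(size_t i, size_t n) {
    size_t count = 0, prev = (size_t)-1;
    for (size_t j = 0; j < n; j++) {
        size_t b = block_of(i, j, n);
        if (b != prev) { count++; prev = b; }  /* indices never decrease along a row */
    }
    return count;
}

/* Number of distinct cache blocks touched by column j. */
size_t blocks_in_column(size_t j, size_t n) {
    size_t count = 0, prev = (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        size_t b = block_of(i, j, n);
        if (b != prev) { count++; prev = b; }  /* indices never decrease down a column */
    }
    return count;
}
```

For a 16 × 16 matrix under these assumptions, a row spans 2 cache blocks while a column is spread across 16 of them.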

#### **Naïve Matrix Multiply**

Matrices are $n \times n$:

```
// move along rows of A
for (i = 0; i < n; i++)
    // move along columns of B
    for (j = 0; j < n; j++)
        // EACH k loop reads row of A, col of B
        // Also read & write c(i,j) n times
        for (k = 0; k < n; k++)
            c[i*n+j] += a[i*n+k] * b[k*n+j];
```



# Cache Miss Analysis (Naïve)

Ignoring matrix c

- Scenario Parameters:
  - Square matrix $(n \times n)$, elements are doubles
  - Cache block size K = 64 B = 8 doubles
  - Cache size $C \ll n$ (much smaller than n)
- Each iteration: $\frac{n}{8} + n = \frac{9n}{8}$ misses
  - Row of a: $\frac{n}{8}$ misses, since each cache block holds 8 consecutive doubles
  - Column of b: $n$ misses, since every element is in a different cache block
  - By the time we get to n+1, the block has been kicked out of \$
- Total misses: $\frac{9n}{8} \times n^2 = \frac{9}{8}n^3$ (each iteration runs once per product matrix element)
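The $\frac{9}{8}n^3$ total can be checked with a toy simulation. The sketch below is an assumption-heavy model, not the course's simulator: it uses a hypothetical 2-block fully associative LRU cache (so the cache is far smaller than a row), places a and b back to back at block-aligned addresses, and ignores c as in the analysis.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_BYTES  64   /* K = 64 B = 8 doubles */
#define CACHE_BLOCKS 2    /* hypothetical: far too small to hold a row or column */

typedef struct {
    size_t tag[CACHE_BLOCKS];   /* tag[0] is the most recently used block */
    size_t used;                /* number of valid entries */
    size_t misses;
} LruCache;

/* Access one byte address; on a miss, evict the least recently used block. */
static void access_addr(LruCache *c, size_t addr) {
    size_t block = addr / BLOCK_BYTES;
    size_t pos = c->used;                        /* "not found" sentinel */
    for (size_t i = 0; i < c->used; i++)
        if (c->tag[i] == block) { pos = i; break; }
    if (pos == c->used) {                        /* miss */
        c->misses++;
        if (c->used < CACHE_BLOCKS) c->used++;
        pos = c->used - 1;                       /* reuse the LRU slot */
    }
    for (size_t i = pos; i > 0; i--)             /* move block to the MRU slot */
        c->tag[i] = c->tag[i-1];
    c->tag[0] = block;
}

/* Misses on a and b (c is ignored) for the naive i-j-k loop order. */
size_t naive_misses(size_t n) {
    assert(n % 8 == 0 && n >= 16);   /* rows span at least two whole blocks */
    LruCache cache = { {0}, 0, 0 };
    size_t a_base = 0;
    size_t b_base = n * n * sizeof(double);      /* b placed right after a */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++) {
                access_addr(&cache, a_base + (i*n + k) * sizeof(double));
                access_addr(&cache, b_base + (k*n + j) * sizeof(double));
            }
    return cache.misses;
}
```

Under these assumptions, `naive_misses(16)` gives 4608, which equals $\frac{9}{8} \cdot 16^3$. Real caches are larger and set associative, so measured counts will differ from this idealized model.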

#### Linear Algebra to the Rescue (1)

This is extra (non-testable) material

- Can get the same result of a matrix multiplication by splitting the matrices into smaller submatrices (matrix "blocks")
- For example, multiply two 4×4 matrices:

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \text{ with } B \text{ defined similarly.}$$

$$AB = \begin{bmatrix} (A_{11}B_{11} + A_{12}B_{21}) & (A_{11}B_{12} + A_{12}B_{22}) \\ (A_{21}B_{11} + A_{22}B_{21}) & (A_{21}B_{12} + A_{22}B_{22}) \end{bmatrix}$$

# Linear Algebra to the Rescue (2)

This is extra (non-testable) material




Matrices of size  $n \times n$ , split into 4 blocks of size r (n=4r)

$$C_{22} = A_{21}B_{12} + A_{22}B_{22} + A_{23}B_{32} + A_{24}B_{42} = \sum_{k} A_{2k} B_{k2}$$

- Multiplication operates on small "block" matrices
  - Choose size so that they fit in the cache!
  - This technique is called "cache blocking"

Blocked version of the naïve algorithm:

```
// move by r x r BLOCKS now
for (i = 0; i < n; i += r)
    for (j = 0; j < n; j += r)
        for (k = 0; k < n; k += r)
            // block matrix multiplication
            for (ib = i; ib < i+r; ib++)
                for (jb = j; jb < j+r; jb++)
                    for (kb = k; kb < k+r; kb++)
                        c[ib*n+jb] += a[ib*n+kb]*b[kb*n+jb];
```

- r = block matrix size (assume r divides n evenly)
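As a sanity check that blocking changes only the order of the work, the sketch below wraps both loop nests in functions and lets a caller compare their results. The function names and signatures are hypothetical; both expect row-major matrices and a result matrix that the caller has zeroed.

```c
#include <assert.h>
#include <stddef.h>

/* c += a * b for n x n row-major matrices; the caller zeroes c first. */
void matmul_naive(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Same computation, visiting the matrices in r x r blocks. */
void matmul_blocked(const double *a, const double *b, double *c,
                    size_t n, size_t r) {
    assert(r > 0 && n % r == 0);     /* r must divide n evenly */
    for (size_t i = 0; i < n; i += r)
        for (size_t j = 0; j < n; j += r)
            for (size_t k = 0; k < n; k += r)
                /* multiply one r x r block of a by one r x r block of b */
                for (size_t ib = i; ib < i + r; ib++)
                    for (size_t jb = j; jb < j + r; jb++)
                        for (size_t kb = k; kb < k + r; kb++)
                            c[ib*n + jb] += a[ib*n + kb] * b[kb*n + jb];
}
```

Each element of c still accumulates the same n products in both versions; with general floating-point inputs the summation order differs, so results may differ by rounding.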

UNIVERSITY of WASHINGTON

#### Cache Miss Analysis (Blocked)

Ignoring matrix c

- Scenario Parameters:
  - Cache block size K = 64 B = 8 doubles
  - Cache size $C \ll n$ (much smaller than n)
  - Three blocks $(r \times r)$ fit into cache: $3r^2 < C$
- Each block iteration:
  - $r^2/8$ misses per block ($r^2$ elements per block, 8 per cache block)
  - $2n/r \times r^2/8 = nr/4$ misses (n/r blocks in row and column)
- Total misses:
  - $nr/4 \times (n/r)^2 = n^3/(4r)$
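As a rough comparison, dividing the naïve total $\frac{9}{8}n^3$ by the blocked total $n^3/(4r)$ shows how the savings scale with the block size. This is a sketch under the same simplifying assumptions as the two analyses (matrix c ignored, cache much smaller than n, and $3r^2 < C$):

$$\frac{\frac{9}{8}n^3}{\frac{n^3}{4r}} = \frac{9}{8}n^3 \times \frac{4r}{n^3} = \frac{9r}{2}$$

For an illustrative block size of r = 16 (a value chosen here, not taken from the slides), this model predicts about 72 times fewer misses on a and b. The ratio does not depend on n, only on how large a block the cache can hold.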

# **Matrix Multiply Visualization**

- Here n = 100, C = 32 KiB, r = 30
- Naïve: ≈ 1,020,000 cache misses
- Blocked: ≈ 90,000 cache misses

#### **Cache-Friendly Code**

- Programmer can optimize for cache performance
  - How data structures are organized
  - How data are accessed
    - Nested loop structure
    - Blocking is a general technique
- All systems favor "cache-friendly code"
  - Getting absolute optimum performance is very platform specific
    - Cache size, cache block size, associativity, etc.
  - Can get most of the advantage with generic code
    - Keep working set reasonably small (temporal locality)
    - Use small strides (spatial locality)
    - Focus on inner loop code



#### **Learning About Your Machine**

#### Linux:

- lscpu
- ls /sys/devices/system/cpu/cpu0/cache/index0/
  - <u>Example</u>: cat /sys/devices/system/cpu/cpu0/cache/index\*/size

#### Windows:

- wmic memcache get <query> (all values in KB)
- Example: wmic memcache get MaxCacheSize
- Modern processor specs: <a href="http://www.7-cpu.com/">http://www.7-cpu.com/</a>