### **Caches III**

CSE 351 Autumn 2017

#### **Instructor:**

Justin Hsia

#### **Teaching Assistants:**

Lucas Wotton, Michael Zhang, Parker DeWilde, Ryan Wong, Sam Gehman, Sam Wolfson, Savanna Yee, Vinny Palaniappan

https://what-if.xkcd.com/111/

#### **Administrivia**

- Midterm regrade requests due end of tonight
- Lab 3 due Friday
- HW 4 is released, due next Friday (11/17)
- No lecture on Friday: Veterans Day!

# Making memory accesses fast!

- Cache basics
- Principle of locality
- Memory hierarchies
- Cache organization
  - Direct-mapped (sets; index + tag)
  - Associativity (ways)
  - Replacement policy
  - Handling writes
- Program optimizations that consider caches

# **Associativity**

- What if we could store data in any place in the cache?
  - More complicated hardware = more power consumed, slower
- So we combine the two ideas:
  - Each address maps to exactly one set
  - Each set can store a block in more than one way

# Cache Organization (3)

**Note:** The textbook uses "b" for offset bits

- Associativity ($E$): number of ways for each set
  - Such a cache is called an "$E$-way set associative cache"
  - We now index into cache sets, of which there are $C/K/E = S$
  - Use lowest $\log_2(C/K/E) = s$ bits of block address
    - Direct-mapped: $E = 1$, so $s = \log_2(C/K)$ as we saw previously
    - Fully associative: $E = C/K$, so $s = 0$ bits

| Address field | Tag ($t$) | Index ($s$) | Offset ($k$) |
|---------------|-----------|-------------|--------------|
| Purpose | Used for tag comparison | Selects the set | Selects the byte from block |

- Decreasing associativity leads toward direct mapped (only one way)
- The other extreme is fully associative (only one set)

# **Example Placement**

- Block size: 16 B, capacity: 8 blocks, address: 16 bits
- Where would data from address `0x1833` be placed?
  - Binary: `0b 0001 1000 0011 0011`
- An $m$-bit address splits into Tag ($t$), Index ($s$), Offset ($k$):
  - $t = m - s - k$
  - $s = \log_2(C/K/E)$
  - $k = \log_2(K) = 4$

| Organization | Index width | Index bits of `0x1833` | Set | Lines the block may occupy |
|--------------|-------------|------------------------|-----|----------------------------|
| Direct-mapped ($E = 1$) | $s = \log_2(8/1) = 3$ | `011` | 3 | the single line in set 3 |
| 2-way set associative ($E = 2$) | $s = \log_2(8/2) = 2$ | `11` | 3 | either of the 2 lines in set 3 |
| 4-way set associative ($E = 4$) | $s = \log_2(8/4) = 1$ | `1` | 1 | any of the 4 lines in set 1 |

# **Block Replacement**

- Any empty block in the correct set may be used to store block
- If there are no empty blocks, which one should we replace?
  - No choice for direct-mapped caches
  - Caches typically use something close to least recently used (LRU) (hardware usually implements "not most recently used")

Figure: the same 8-block cache drawn three ways, each line having a Tag and a Data field: direct-mapped (sets 0 to 7, one line each), 2-way set associative (sets 0 to 3, two lines each), and 4-way set associative (sets 0 to 1, four lines each).

#### **Peer Instruction Question**

- We have a cache of size 2 KiB with block size of 128 B. If our cache has 2 sets, what is its associativity?
  - Vote at http://PollEv.com/justinh
  - Cache holds $C/K = 2^{11-7} = 2^4 = 16$ blocks, so $E = 16/2 = 8$ ways
- If addresses are 16 bits wide, how wide is the Tag field?
  - $k = \log_2(K) = 7$ bits, $s = \log_2(S) = 1$ bit, so $t = m - s - k = 16 - 1 - 7 = 8$ bits

# General Cache Organization (S, E, K)

#### **Notation Review**

- We just introduced a lot of new variable names!
- Please be mindful of block size notation when you look at past exam questions or are watching videos

| Variable | This Quarter |
|--------------------|---------------------|
| Block size | $K$ ($B$ in book) |
| Cache size | $C$ |
| Associativity | $E$ |
| Number of sets | $S$ |
| Address space | $M$ |
| Address width | $m$ |
| Tag field width | $t$ |
| Index field width | $s$ |
| Offset field width | $k$ ($b$ in book) |

Formulas:

- $M = 2^m \leftrightarrow m = \log_2 M$
- $S = 2^s \leftrightarrow s = \log_2 S$
- $K = 2^k \leftrightarrow k = \log_2 K$
- $C = K \times E \times S$
- $s = \log_2(C/K/E)$
- $m = t + s + k$

#### **Cache Read**

- Locate set
- Check if any line in the set is valid and has a matching tag

# Example: Direct-Mapped Cache (E = 1)

- Direct-mapped: One line per set
- Block Size $K = 8$ B
- No match? Then old line gets evicted and replaced
- No unnecessary extra cache accesses across block boundaries

# Example: Set-Associative Cache (E = 2)

- 2-way: Two lines per set
- Block Size $K = 8$ B
- Address of `short`: $t$ tag bits, then set index `0...01`, then block offset `100`; the index is used to find the set (handwritten note on the address fields: "1 bit shorter")

#### No match?

- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...

# Types of Cache Misses: 3 C's!

- Compulsory (cold) miss
  - Occurs on first access to a block
- Conflict miss
  - Conflict misses occur when the cache is large enough, but multiple data objects all map to the same slot
    - e.g. referencing blocks 0, 8, 0, 8, ... could miss every time
  - Direct-mapped caches have more conflict misses than $E$-way set-associative (where $E > 1$)
- Capacity miss
  - Occurs when the set of active cache blocks (the working set) is larger than the cache (just won't fit, even if cache was fully-associative)
  - Note: Fully-associative only has Compulsory and Capacity misses

#### What about writes?

- Multiple copies of data exist:
  - L1, L2, possibly L3, main memory
- What to do on a write-hit? (block/data already in cache)
  - Write-through: write immediately to next level
  - Write-back: defer write to next level until line is evicted (replaced)
    - Must track which cache lines have been modified ("dirty bit"), an extra management bit for write-back caches
- What to do on a write-miss? (block/data not currently in cache)
  - Write-allocate: ("fetch on write") load into cache, update line in cache
    - Good if more writes or reads to the location follow
  - No-write-allocate: ("write around") just write immediately to memory
- Typical caches:
  - Write-back + Write-allocate, usually
  - Write-through + No-write-allocate, occasionally

Example figure notes: there is only one set in this tiny cache, so the tag is the entire block address! In this example we are sort of ignoring block offsets. Here a block holds 2 bytes (16 bits, 4 hex digits). Normally a block would be much bigger, and thus there would be multiple items per block. While only one item in that block would be written at a time, the entire line would be brought into the cache.

- Memory initially holds `0xCAFE` at `F`; a second entry holds `0xBEEF`
- Instruction sequence:
  1. `mov 0xFACE, F`
  2. `mov 0xFEED, F`: write hit! Write `0xFEED` to cache only
  3. `mov G, %rax`

#### **Peer Instruction Question**

- Which of the following cache statements is FALSE?
  - Vote at http://PollEv.com/justinh
  - A. We can reduce compulsory misses by decreasing our block size (a smaller block size pulls fewer bytes into the cache, so this is the false statement)
  - B. We can reduce conflict misses by increasing associativity (more options to place blocks before evictions occur)
  - C. A write-back cache will save time for code with good temporal locality on writes (lines rarely get evicted, so fewer write-backs)
  - D. A write-through cache will always match data with the memory hierarchy level below it (yes)
  - E. We're lost...

## **Example Cache Parameters Problem**

- 1 MiB address space ($\Rightarrow 2^{20}$ B $\Rightarrow m = 20$ bits), 125 cycles to go to memory. Fill in the following table:

| Parameter | Value | Formula |
|------------------------|--------------------------------|-------------------|
| Cache Size (C) | 4 KiB = $2^{12}$ B | |
| Block Size (K) | 16 B = $2^4$ B | |
| Associativity (E) | 4-way = $2^2$ | |
| Hit Time (HT) | 3 cycles | |
| Miss Rate (MR) | 20% | |
| Write Policy | Write-through | |
| Replacement Policy | LRU | |
| Tag Bits | 10 | $m - s - k$ |
| Index Bits | 6 | $\log_2(C/K/E)$ |
| Offset Bits | 4 | $\log_2(K)$ |
| AMAT | $3 + 0.2 \times 125 = 28$ cycles | HT + MR × MP |

# **Example Code Analysis Problem**

- Assuming the cache starts <u>cold</u> (all blocks invalid), calculate the **miss rate** for the following loop:
  - $m = 20$ bits, $C = 4$ KiB, $K = 16$ B, $E = 4$
  - So $t = 10$, $s = 6$, $k = 4$, and the cache holds $2^{12}$ B of data (half of `int_ar[]`)

```
#define AR_SIZE 2048                      /* 2^11 ints = 2^13 B of data */
int int_ar[AR_SIZE], sum = 0;             // &int_ar = 0x80000
for (int i = 0; i < AR_SIZE; i++)         /* loop 1: ascending */
    sum += int_ar[i];
for (int j = AR_SIZE - 1; j >= 0; j--)    /* loop 2: descending */
    sum += int_ar[j];
```

- `int_ar[0]` accesses the first 4 B (offset 0) of a cache block in set 0.
- A cache block holds 16 B = 4 ints, so the first access to a block is a (compulsory) miss and the next three accesses hit: MR = 1/4.
- Loop 1: never re-visits blocks. The first half of the loop fills the entire cache with data from the lower half of `int_ar[]`. The second half of the loop replaces the entire cache with the upper half of `int_ar[]`. MR = 1/4 throughout.
- Loop 2: the first half of the loop accesses the upper half of `int_ar[]`, which is already in the cache (miss rate of 0). The second half of the loop replaces the entire cache with the lower half of `int_ar[]` (MR = 1/4).
- Overall MR $= \frac{3}{4}\left(\frac{1}{4}\right) + \frac{1}{4}(0) = \frac{3}{16}$