













|            | 21164 Instruction Unit Pipeline |
|------------|---------------------------------|
|            | Fetch & issue |
| <b>S0</b>: | instruction fetch |
|            | branch prediction bits read |
| <b>S1</b>: | opcode decode |
|            | target address calculation |
|            | if predict taken, redirect the fetch |
|            | instruction TLB check |
| <b>S2</b>: | instruction slotting: decide which of the next 4 instructions can be … <ul><li>intra-cycle structural hazard check</li><li>intra-cycle data hazard check</li></ul> |
| <b>S3</b>: | instruction dispatch <ul><li>inter-cycle load-use and WAW data hazard checks</li><li>inter-cycle structural hazard check</li><li>register read</li></ul> |

Fall 2004 · CSE 471











<u>Superscalars</u>

Performance impact:
- increased performance because multiple instructions execute in parallel, not just overlapped
- CPI potentially < 1 (0.5 on our R3000 example)
- IPC (instructions/cycle) potentially > 1 (2 on our R3000 example)
- better functional unit utilization
- need to fetch more instructions: how many?
- need independent instructions (i.e., good ILP): why?
- need a good local mix of instructions: why?
- need more instructions to hide load delays: why?
- need to make better branch predictions: why?
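
The CPI and IPC figures above are reciprocals of each other. A minimal Python sketch of that arithmetic, assuming an ideal 2-way superscalar that issues a full pair of instructions every cycle with no stalls (the function names and the instruction count are illustrative, not from the slides):

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions completed per cycle."""
    return instructions / cycles


def cpi(instructions: int, cycles: int) -> float:
    """Cycles per instruction; the reciprocal of IPC."""
    return cycles / instructions


# Assumption: an ideal 2-issue machine retires 2 instructions every cycle,
# so 1000 instructions take 500 cycles (pipeline fill time is ignored).
n_instructions = 1000
n_cycles = n_instructions // 2

print(ipc(n_instructions, n_cycles))  # 2.0
print(cpi(n_instructions, n_cycles))  # 0.5
```

Any cycle in which the second issue slot stays empty raises the cycle count and pulls CPI back toward 1, which is why the remaining bullets ask about fetch bandwidth, ILP, instruction mix, load delays, and branch prediction.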



| Original code      | Latency-hiding code scheduling |
|--------------------|--------------------------------|
| Loop: lw R1, 0(R5) | Loop: lw R1, 0(R5)             |
| addu R1, R1, R6    | addi R5, R5, -4                |
| sw R1, 0(R5)       | addu R1, R1, R6                |
| addi R5, R5, -4    | sw R1, 4(R5)                   |
| bne R5, R0, Loop   | bne R5, R0, Loop               |

|       | ALU/branch instruction | Data transfer instruction | clock cycle |
|-------|------------------------|---------------------------|-------------|
| Loop: |                        |                           | 1           |
|       |                        |                           | 2           |
|       |                        |                           | 3           |
|       |                        |                           | 4           |
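
One way to check a hand-filled schedule is to pack the latency-hiding code into issue slots mechanically. The following is a rough sketch, not the slides' method: it assumes in-order issue, one ALU/branch slot and one data transfer slot per cycle, a one-cycle load-use delay, and that an instruction may share a cycle with an earlier one only if it does not read that instruction's result. The `Instr` class, the `schedule` function, and the latency values are hypothetical names and assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    text: str               # assembly text, used only for printing
    slot: str               # "alu" (ALU/branch slot) or "mem" (data transfer slot)
    dst: Optional[str]      # register written, if any
    srcs: Tuple[str, ...]   # registers read
    latency: int = 1        # assumption: result is readable `latency` cycles after issue


# The latency-hiding version of the loop body, in program order.
# Assumption: lw has latency 2, i.e. a one-cycle load-use delay.
LOOP = [
    Instr("lw R1, 0(R5)", "mem", "R1", ("R5",), latency=2),
    Instr("addi R5, R5, -4", "alu", "R5", ("R5",)),
    Instr("addu R1, R1, R6", "alu", "R1", ("R1", "R6")),
    Instr("sw R1, 4(R5)", "mem", None, ("R1", "R5")),
    Instr("bne R5, R0, Loop", "alu", None, ("R5", "R0")),
]


def schedule(code):
    """Greedy in-order packing: at most one 'alu' and one 'mem' per cycle."""
    ready = {}                           # register -> first cycle its new value is readable
    cycles = []                          # one {"alu": ..., "mem": ...} dict per cycle
    cycle = 1
    slots = {"alu": None, "mem": None}
    for ins in code:
        # Earliest cycle allowed by read-after-write dependences.
        earliest = max([ready.get(r, 1) for r in ins.srcs] + [cycle])
        # Move to a later cycle until operands are ready and the slot is free.
        while earliest > cycle or slots[ins.slot] is not None:
            cycles.append(slots)
            slots = {"alu": None, "mem": None}
            cycle += 1
        slots[ins.slot] = ins
        if ins.dst is not None:
            ready[ins.dst] = cycle + ins.latency
    cycles.append(slots)
    return cycles


result = schedule(LOOP)
for n, slots in enumerate(result, start=1):
    alu = slots["alu"].text if slots["alu"] else "nop"
    mem = slots["mem"].text if slots["mem"] else "nop"
    print(f"{n}: {alu:<18} | {mem}")
print(len(result))  # 4
```

Under these assumptions the five instructions occupy four cycles (a CPI of 0.8), with one cycle left empty by the load-use delay. The sketch ignores WAW hazards and dependences that cross loop iterations.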

|       | ALU/branch instruction | Data transfer instruction | clock cycle |
|-------|------------------------|---------------------------|-------------|
| Loop: | addi R5, R5, -16       | lw R1, 0(R5)              | 1           |
|       |                        | lw R2, 12(R5)             | 2           |
|       | addu R1, R1, R6        | lw R3, 8(R5)              | 3           |
|       | addu R2, R2, R6        | lw R4, 4(R5)              | 4           |
|       | addu R3, R3, R6        | sw R1, 16(R5)             | 5           |
|       | addu R4, R4, R6        | sw R2, 12(R5)             | 6           |
|       |                        | sw R3, 8(R5)              | 7           |
|       | bne R5, R0, Loop       | sw R4, 4(R5)              | 8           |

What are the cycles per iteration?

Unrolling provides:
+ fewer instructions that cause hazards (i.e., branches)
+ more independent instructions
+ increase in throughput
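
The unrolled schedule can be checked by simulating it. The sketch below is an illustration rather than part of the slides: it assumes both instructions in a cycle read the register values from before that cycle (so `lw R1, 0(R5)` in cycle 1 sees the old `R5`), and that the array length is a multiple of four words. The function names, the tuple encoding of instructions, and the test data are hypothetical.

```python
def run_original(mem, r5, r6):
    """Reference semantics of the original loop: add R6 to each word, walking down."""
    mem = dict(mem)
    while True:
        mem[r5] = mem[r5] + r6   # lw R1, 0(R5) / addu R1, R1, R6 / sw R1, 0(R5)
        r5 -= 4                  # addi R5, R5, -4
        if r5 == 0:              # bne R5, R0, Loop falls through when R5 == 0
            return mem


# One (ALU/branch, data transfer) pair per clock cycle, copied from the table above.
SCHEDULE = [
    (("addi", "R5", "R5", -16),  ("lw", "R1", 0, "R5")),
    (None,                       ("lw", "R2", 12, "R5")),
    (("addu", "R1", "R1", "R6"), ("lw", "R3", 8, "R5")),
    (("addu", "R2", "R2", "R6"), ("lw", "R4", 4, "R5")),
    (("addu", "R3", "R3", "R6"), ("sw", "R1", 16, "R5")),
    (("addu", "R4", "R4", "R6"), ("sw", "R2", 12, "R5")),
    (None,                       ("sw", "R3", 8, "R5")),
    (("bne", "R5", "R0"),        ("sw", "R4", 4, "R5")),
]


def run_unrolled(mem, r5, r6):
    """Execute the 2-issue schedule; return final memory and the cycle count."""
    mem = dict(mem)
    regs = {"R0": 0, "R5": r5, "R6": r6}
    cycles = 0
    taken = True
    while taken:
        taken = False
        for alu, memop in SCHEDULE:
            old = dict(regs)     # both slots read pre-cycle register values
            cycles += 1
            if alu is not None:
                if alu[0] == "addi":
                    regs[alu[1]] = old[alu[2]] + alu[3]
                elif alu[0] == "addu":
                    regs[alu[1]] = old[alu[2]] + old[alu[3]]
                elif alu[0] == "bne":
                    taken = old[alu[1]] != old[alu[2]]
            op, rt, offset, base = memop
            addr = old[base] + offset
            if op == "lw":
                regs[rt] = mem[addr]
            else:                # "sw"
                mem[addr] = old[rt]
    return mem, cycles


# Hypothetical test data: 8 words at byte addresses 4, 8, ..., 32.
words = {addr: addr * 10 for addr in range(4, 36, 4)}
reference = run_original(words, 32, 7)
result, cycles = run_unrolled(words, 32, 7)
assert result == reference
print(cycles / 8)  # 2.0 cycles per original loop iteration
```

Each pass through the unrolled body takes 8 cycles and covers 4 iterations of the original loop, so the simulation reports 2 cycles per iteration.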

<u>Superscalars</u>

- …ers for multiple register access
- … the register file to the additional functional units
- …tion logic
- …tch
- … data cache
- … structural hazards (due to an unbalanced …)
- … instruction types that can be … hardware.

<u>Modern Superscalars</u>

- Alpha 21364: 4 instructions
- Pentium IV: 5 RISC-like operation…
- R12000: 4 instructions
- UltraSPARC-3: 6 instructions dis…