Lecture 26 (Fri 12/05/2008)

• HW #4 (optional) - Due NOW
• Lab #4 Hardware - Due Fri Dec 5 at 5pm

• Final Exam, Monday Dec. 8th, 8:30-10:20 in EEB 037 (here, our regular lecture room). Bring a highlighter.
• Today: Wrap Up!

Final Exam

• The final is comprehensive, but most coverage is on material since midterm (I will provide green card again, could have MIPS programming)
  - Microprogramming
  - Pipelining
    - Overall operation
    - Hazards
  - Caching
    - Overall operation and design space
  - Performance
  - VM/Paging
  - Interrupts/Exceptions
  - I/O
  - Parallelism (Extra Credit)

• Style is similar to MT: some shorter questions, some longer ones
• Reading is on the lectures page. Study the slides, review the reading, exercises in lecture, HW4.

Instant replay

• This quarter was split into roughly four parts.
  - 1st we covered instruction set architectures—the connection between software and hardware.
  - 2nd we discussed processor design. We focused on pipelining, which is one of the most important ways of improving processor performance.
  - 3rd we focused on large and fast memory systems (via caching), virtual memory, and I/O.
  - Finally, we discussed performance tuning (HW3), and exploiting data parallelism via SIMD and Multi-Core processors.
• We also introduced many performance metrics to estimate the actual benefits of all of these fancy designs.

Some recurring themes

• There were several recurring themes throughout the quarter:
  - Instruction set and processor designs are intimately related.
  - Parallel processing can often make systems faster.
  - Performance metrics and Amdahl’s Law quantify performance limitations.
  - Hierarchical designs combine different parts of a system.
  - Hardware and software depend on each other.
Instruction sets and processor designs

- The MIPS instruction set was designed for pipelining.
  - All instructions are the same length, to make instruction fetch and jump and branch address calculations simpler.
  - Opcode and operand fields appear in the same place in each of the three instruction formats, making instruction decoding easier.
  - Only relatively simple arithmetic and data transfer instructions are supported.
- These decisions have multiple advantages.
  - They lead to shorter pipeline stages and higher clock rates.
  - They result in simpler hardware, leaving room for other performance enhancements like forwarding, branch prediction, and on-die caches.
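To illustrate why fixed field positions make decoding easy, here is a minimal sketch in Python. The function name `decode_mips` and the dictionary layout are hypothetical; the bit positions are those of the standard 32-bit MIPS R/I/J formats. Every field is pulled out with a fixed shift and mask, before the decoder even knows which format applies.

```python
def decode_mips(word):
    """Split a 32-bit MIPS instruction word into its fixed-position fields.

    All fields are extracted unconditionally; the opcode then decides
    which of them are meaningful (R, I, or J format).
    """
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31-26, same place in all formats
        "rs":     (word >> 21) & 0x1F,   # bits 25-21 (R and I formats)
        "rt":     (word >> 16) & 0x1F,   # bits 20-16 (R and I formats)
        "rd":     (word >> 11) & 0x1F,   # bits 15-11 (R format)
        "shamt":  (word >> 6) & 0x1F,    # bits 10-6  (R format)
        "funct":  word & 0x3F,           # bits 5-0   (R format)
        "imm":    word & 0xFFFF,         # bits 15-0  (I format)
        "target": word & 0x3FFFFFF,      # bits 25-0  (J format)
    }


add_fields = decode_mips(0x02324020)  # add $t0, $s1, $s2
lw_fields = decode_mips(0x8E680020)   # lw  $t0, 32($s3)
```

In hardware the same idea means the register file can be read using the `rs` and `rt` bits while the opcode is still being decoded.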

x86 Selected History

<table>
<thead>
<tr>
<th>Processor</th>
<th>Intro Year</th>
<th>Intro Clock</th>
<th>Transistors</th>
<th>Features</th>
</tr>
</thead>
<tbody>
<tr>
<td>8086</td>
<td>1978</td>
<td>8 MHz</td>
<td>29 K</td>
<td>16-bit reg., segments</td>
</tr>
<tr>
<td>286</td>
<td>1982</td>
<td>12.5 MHz</td>
<td>134 K</td>
<td>Protect. mode</td>
</tr>
<tr>
<td>386</td>
<td>1985</td>
<td>20 MHz</td>
<td>275 K</td>
<td>32-bit reg., paging</td>
</tr>
<tr>
<td>486</td>
<td>1989</td>
<td>25 MHz</td>
<td>1.2 M</td>
<td>On-board FPU</td>
</tr>
<tr>
<td>Pentium</td>
<td>1993</td>
<td>60 MHz</td>
<td>3.1 M</td>
<td>MMX on late models</td>
</tr>
<tr>
<td>Pentium Pro</td>
<td>1995</td>
<td>200 MHz</td>
<td>5.5 M</td>
<td>P6 core, bigger caches</td>
</tr>
<tr>
<td>Pentium II</td>
<td>1997</td>
<td>266 MHz</td>
<td>7 M</td>
<td>P6 w/MMX</td>
</tr>
<tr>
<td>Pentium III</td>
<td>1999</td>
<td>700 MHz</td>
<td>28 M</td>
<td>SSE (Streaming SIMD)</td>
</tr>
<tr>
<td>Pentium 4</td>
<td>2000</td>
<td>1.5 GHz</td>
<td>42 M</td>
<td>NetBurst core, SSE2</td>
</tr>
<tr>
<td>Xeon</td>
<td>2001</td>
<td>2.2 GHz</td>
<td>55 M</td>
<td>Hyper-Threaded</td>
</tr>
<tr>
<td>Pentium M</td>
<td>2003</td>
<td>1.6 GHz</td>
<td>77 M</td>
<td>Shorter pipelines vs P4</td>
</tr>
</tbody>
</table>

Registers in x86

Addressing Memory

Specifying a memory address:
- Up to two registers and one 32-bit signed constant can be added together to compute a memory address. One of the registers can optionally be pre-multiplied by 2, 4, or 8.
- mov eax, ebx
- mov eax, [ebx]
- mov [var], ebx
- mov eax, [esi -4]
- mov [esi+eax], cl
- mov edx, [esi+4*ebx]
- Incorrect: (why?)
  - mov eax, [ebx - ecx]
  - mov [eax+esi+edi], ebx
  - mov [4*eax+2*ebx], ecx
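The addressing rule above can be sketched as a small Python model. The function `effective_address` and the register values are made up for illustration; this models only the rule as stated on the slide, not the full x86 encoding. Because the model has exactly one base slot, one scaled index slot, and one constant, the three incorrect forms cannot be expressed: one subtracts a register, one needs three registers, and one scales two registers.

```python
def effective_address(regs, base=None, index=None, scale=1, disp=0):
    """Compute base + scale*index + disp, wrapped to 32 bits.

    regs  -- dict of register name -> value (hypothetical contents)
    base  -- optional register added as-is
    index -- optional register multiplied by scale (1, 2, 4, or 8)
    disp  -- 32-bit signed constant
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    if not -2**31 <= disp < 2**31:
        raise ValueError("displacement must fit in 32 signed bits")
    addr = disp
    if base is not None:
        addr += regs[base]
    if index is not None:
        addr += scale * regs[index]
    return addr & 0xFFFFFFFF


regs = {"esi": 0x1000, "eax": 0x10, "ebx": 3}       # made-up register values
a1 = effective_address(regs, base="esi", disp=-4)   # [esi - 4]
a2 = effective_address(regs, base="esi", index="eax")           # [esi + eax]
a3 = effective_address(regs, base="esi", index="ebx", scale=4)  # [esi + 4*ebx]
```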

Figure 1. The x86 register set.

Parallel processing

- One way to improve performance is to do more processing at once.
- There were several examples of this in our CPU designs:
  - Multiple functional units can be included in a datapath to let single instructions execute faster. For example, we can calculate a branch target while reading the register file.
  - Pipelining allows us to overlap the executions of several instructions.
  - SIMD performs operations on multiple data items simultaneously.
  - Multi-core processors enable thread-level parallel processing.
- Memory and I/O systems also provide many good examples:
  - A wider bus can transfer more data per clock cycle.
  - Memory can be split into banks that are accessed simultaneously. Similar ideas may be applied to hard disks, as with RAID systems.
  - A direct memory access (DMA) controller performs I/O operations while the CPU does compute-intensive tasks instead.
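As a rough illustration of the pipelining bullet above, the sketch below counts cycles for n instructions on a k-stage machine. It assumes an ideal pipeline (one cycle per stage, no stalls or flushes) and compares it against a design that finishes each instruction before starting the next; the function names are hypothetical.

```python
def unpipelined_cycles(n, k):
    """Each of the n instructions occupies all k stages before the next starts."""
    return n * k


def pipelined_cycles(n, k):
    """The first instruction takes k cycles; each later one finishes 1 cycle after."""
    return k + (n - 1)


n, k = 1000, 5
speedup = unpipelined_cycles(n, k) / pipelined_cycles(n, k)
```

For large n the speedup approaches k, the number of stages. Hazards push real pipelines below that bound.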

Performance and Amdahl's Law

- First Law of Performance: Make the common case fast!
- But, performance is limited by the slowest component of the system.
- We've seen this in regard to cycle times in our CPU implementations.
  - Single-cycle clock times are limited by the slowest instruction.
  - Pipelined cycle times depend on the slowest individual stage.
- Amdahl's Law also holds true outside the processor itself.
  - Slow memory or bad cache designs can hamper overall performance.
  - I/O bound workloads depend on the I/O system's performance.
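Amdahl's Law can be written as speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time that is improved and s is the speedup of that fraction. A minimal sketch, with a hypothetical function name and example numbers chosen only for illustration:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution time is sped up by s."""
    return 1.0 / ((1.0 - f) + f / s)


# Speeding up 80% of the run time by 4x gives about 2.5x overall.
modest = amdahl_speedup(0.8, 4)

# Even an enormous speedup of that 80% is capped near 1 / (1 - 0.8) = 5x,
# because the untouched 20% becomes the slowest component.
capped = amdahl_speedup(0.8, 1e12)
```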

Hierarchical designs

- Hierarchies separate fast and slow parts of a system, and minimize the interference between them.
  - Caches are fast memories which speed up access to frequently-used data and reduce traffic to slower main memory. (Registers are even faster...)
  - Buses can also be split into several levels, allowing higher-bandwidth devices like the CPU, memory and video card to communicate without affecting or being affected by slower peripherals.
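One common way to quantify the benefit of a memory hierarchy is average memory access time: AMAT = hit time + miss rate × miss penalty. The sketch below uses made-up cycle counts and miss rates. Nesting the call models a second level, where the L1 miss penalty is itself the AMAT of the L2 cache.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same units as hit_time."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical single-level cache: 1-cycle hit, 5% misses, 100-cycle memory.
one_level = amat(1, 0.05, 100)

# Hypothetical two-level hierarchy: an L1 miss goes to a 10-cycle L2
# that misses 20% of the time and then pays the 100-cycle memory access.
l2_time = amat(10, 0.20, 100)
two_level = amat(1, 0.05, l2_time)
```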

Architecture and Software

- Computer architecture plays a vital role in many areas of software.
- Compilers are critical to achieving good performance.
  - They must take full advantage of a CPU's instruction set.
  - Optimizations can reduce stalls and flushes, or arrange code and data accesses for optimal use of system caches.
- Operating systems interact closely with hardware.
  - They should take advantage of CPU features like support for virtual memory and I/O capabilities for device drivers.
  - The OS handles exceptions and interrupts together with the CPU.
Compiler Structure

Instruction Scheduling

- **Goal**: Find a schedule of instructions that is optimal for a pipelined machine
- First, organize the instructions into a DAG (nodes are instructions, edges are dependences). Three kinds of dependence create edges:
  - Flow dependence
  - Anti-dependence
  - Output dependence
- Any topological sort of the graph is legal, but we want one that minimizes overall delay (NP-hard).
- So instead we use heuristics that traverse the graph and reduce the delay as much as possible.
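The steps above can be sketched as a simple list scheduler. This is one possible heuristic (prefer the ready instruction with the longest latency-weighted path to the end of the block), not necessarily the one a particular compiler uses. The six instructions are those of Block 2 in this lecture; the latencies and all function names are assumptions for illustration.

```python
def build_dag(instrs):
    """instrs: list of (dest, sources). Returns successor sets by index."""
    succs = {i: set() for i in range(len(instrs))}
    for j, (dest_j, srcs_j) in enumerate(instrs):
        for i in range(j):
            dest_i, srcs_i = instrs[i]
            flow = dest_i in srcs_j      # j reads what i wrote
            anti = dest_j in srcs_i      # j overwrites what i read
            output = dest_i == dest_j    # both write the same register
            if flow or anti or output:
                succs[i].add(j)
    return succs


def list_schedule(instrs, latency):
    """Return a legal (topological) order chosen by a critical-path heuristic."""
    succs = build_dag(instrs)
    n = len(instrs)
    prio = [0] * n
    for i in reversed(range(n)):         # edges only go from earlier to later
        prio[i] = latency[i] + max((prio[j] for j in succs[i]), default=0)
    npreds = [0] * n
    for i in range(n):
        for j in succs[i]:
            npreds[j] += 1
    ready = [i for i in range(n) if npreds[i] == 0]
    order = []
    while ready:
        ready.sort(key=lambda i: (-prio[i], i))  # highest priority first
        i = ready.pop(0)
        order.append(i)
        for j in succs[i]:
            npreds[j] -= 1
            if npreds[j] == 0:
                ready.append(j)
    return order


block2 = [
    ("v19", {"v13", "v18"}),  # v19 = v13 x v18
    ("v13", {"v19", "v17"}),  # v13 = v19 div v17
    ("v17", {"v17"}),         # v17 = v17 + 1
    ("v28", {"v13"}),         # v28 = v13
    ("v34", {"v34"}),         # v34 = v34 - 4
    ("v18", {"v18"}),         # v18 = v18 - 1
]
latency = [3, 10, 1, 1, 1, 1]  # assumed latencies: multiply 3, divide 10, others 1
dag = build_dag(block2)
order = list_schedule(block2, latency)
```

The result depends on the assumed latencies. A production scheduler would also track the cycle in which each result becomes available.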

Register Allocation

- Equivalent to graph coloring (NP-hard)
  1. Identify virtual registers that CANNOT share a physical register.
  2. Calculate live ranges
  3. Build a graph
  4. Color the graph
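A minimal sketch of these steps, using a greedy coloring heuristic rather than an optimal (NP-hard) solution. The live ranges are made-up half-open intervals of instruction positions, and the function names are hypothetical. Two virtual registers interfere when their live ranges overlap, and interfering registers must receive different colors (physical registers).

```python
def interference_graph(live):
    """live: dict of virtual register -> (start, end), half-open interval."""
    graph = {v: set() for v in live}
    names = list(live)
    for a_pos, a in enumerate(names):
        for b in names[a_pos + 1:]:
            (s1, e1), (s2, e2) = live[a], live[b]
            if s1 < e2 and s2 < e1:      # the two ranges overlap
                graph[a].add(b)
                graph[b].add(a)
    return graph


def greedy_color(graph, k):
    """Assign each node one of k colors; None marks a register to spill."""
    color = {}
    # Color the most constrained (highest-degree) nodes first.
    for v in sorted(graph, key=lambda v: (-len(graph[v]), v)):
        used = {color[n] for n in graph[v] if n in color}
        free = [c for c in range(k) if c not in used]
        color[v] = free[0] if free else None
    return color


live = {"v1": (0, 3), "v2": (1, 4), "v3": (4, 6), "v4": (5, 7)}  # hypothetical
graph = interference_graph(live)
colors = greedy_color(graph, k=2)
```

Here four virtual registers fit in two physical registers because only `v1`/`v2` and `v3`/`v4` are live at the same time.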

Block 2:

<table>
<thead>
<tr>
<th>Block 2</th>
<th>Scheduled:</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. v19 = v13 x v18</td>
<td>v19 = v13 x v18</td>
</tr>
<tr>
<td>2. v13 = v19 div v17</td>
<td>v13 = v19 div v17</td>
</tr>
<tr>
<td>3. v17 = v17 + 1</td>
<td>v17 = v17 + 1</td>
</tr>
<tr>
<td>4. v28 = v13</td>
<td>v28 = v13</td>
</tr>
<tr>
<td>5. v34 = v34 - 4</td>
<td>v34 = v34 - 4</td>
</tr>
<tr>
<td>6. v18 = v18 - 1</td>
<td>v18 = v18 - 1</td>
</tr>
</tbody>
</table>

Block 3:

<table>
<thead>
<tr>
<th>Block 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>v43 = v17 &lt; v42</td>
</tr>
<tr>
<td>if v43 goto Block 2</td>
</tr>
<tr>
<td>else fall through to Block 4</td>
</tr>
</tbody>
</table>
Five things that I hope you will remember

- **Abstraction**: the separation of interface from implementation.
  - ISAs specify what the processor does, not how it does it.

- **Locality**:
  - **Temporal Locality**: “if you used it, you’ll use it again”
  - **Spatial Locality**: “if you used it, you’ll use something near it”

- **Caching**: buffering a subset of something nearby, for quicker access
  - Typically used to exploit locality.

- **Indirection**: adding a flexible mapping from names to things
  - Virtual memory’s page table maps virtual to physical address.

- **Throughput vs. Latency**: (# things/time) vs. (time to do one thing)
  - Improving one does not necessarily improve the other.
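To make the locality and caching points concrete, here is a small direct-mapped cache simulation. The cache geometry (64 lines of 16 bytes) and the access patterns are assumptions chosen for illustration, and `hit_rate` is a hypothetical helper. Sequential word accesses benefit from spatial locality, repeated accesses benefit from temporal locality, and a stride equal to the cache size defeats both.

```python
def hit_rate(addresses, num_lines=64, block_size=16):
    """Fraction of byte addresses that hit in a direct-mapped cache."""
    tags = [None] * num_lines            # one stored tag per cache line
    hits = 0
    for addr in addresses:
        block = addr // block_size       # which memory block holds this byte
        index = block % num_lines        # the one line this block can occupy
        tag = block // num_lines         # identifies the block within that line
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag            # miss: fetch the block, evict the old one
    return hits / len(addresses)


sequential = [4 * i for i in range(1024)]    # consecutive 4-byte words
repeated = [0, 4, 8, 12] * 256               # the same four words, over and over
strided = [1024 * i for i in range(1024)]    # stride equals the cache size

seq_rate = hit_rate(sequential)
rep_rate = hit_rate(repeated)
str_rate = hit_rate(strided)
```

With these parameters the sequential pattern misses once per 16-byte block and hits on the other three words, while every strided access maps to the same line with a different tag and so always misses.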

Where to go from here?

- **CSE 401 Introduction to Compiler Construction (3)**
  Fundamentals of compilers and interpreters; symbol tables; lexical analysis, syntax analysis, semantic analysis, code generation, and optimizations for general purpose programming languages. Prerequisite: CSE 322; CSE 326; CSE 341; CSE 378.

- **CSE 451 Introduction to Operating Systems (4)**
  Principles of operating systems. Process management, memory management, auxiliary storage management, resource allocation. Prerequisite: CSE 326; CSE 378.

- **CSE 471 Computer Design and Organization (4)**
  CPU instruction addressing models, CPU structure and functions, computer arithmetic and logic unit, register transfer level design, hardware and microprogram control, memory hierarchy design and organization, I/O and system components interconnection. Laboratory project involves design and simulation of an instruction set processor. Prerequisite: CSE 370; CSE 378.

- **CSE 466 Software for Embedded Systems (4)**
  Software issues in the design of embedded systems. Microcontroller architectures and peripherals, embedded operating systems and device drivers, compilers and debuggers, timer and interrupt systems, interfacing of devices, communications and networking. Emphasis on practical application of development platforms. Prerequisite: CSE 326; CSE 370; CSE 378.