Lab 3B: Hazards and Forwarding (SW)

Assigned: 4/27/2007
Due: 6:00pm, 5/10/2007

Description

In Lab 3 the pipelined processor was designed to work with independent instructions. This sorely limited the types of programs that could run successfully, or required solutions like inserting nop instructions to separate dependent instructions. Unfortunately such dependencies are common and often unavoidable. This lab will focus on hardware techniques to recognize and deal with dependencies. Not all dependencies can be handled easily, so this lab also introduces the idea of inserting bubbles to deal with branch and load hazards.

Background:

The pipeline from Lab 3 made no concessions for pipeline hazards. A hazard is a case where the next instruction cannot execute on the next clock cycle. There are three broad categories of hazards: structural hazards, control hazards, and data hazards. (See COD:3e 375-383 for more).

Phase 0: Administration

This lab will take place in the same workspace as the previous labs. The files for this lab are provided as an archived design. You will need to restore the design, add it to the workspace, and copy several design files from lab3 into the new design. Download the archived design lab3-debug_dist.zip and follow these steps:

Download lab3-debug_dist.zip.
Start Active-HDL and Open the cse378 workspace used in previous labs.
Select Design > Restore Design from the menu
Browse to the downloaded lab3-debug_dist.zip file
Set the Restore To: directory to the cse378 directory that contains the folders lab1, lab2, lab3 and lib378 and click Finish.

At this point the files are all available. Now we need to tell Active-HDL about the new design.

Right-Click the yellow workspace icon name cse378 from the design browser and choose "Add Existing Design to Workspace".
Find and open the file "lab3-debug_dist.adf" in the newly restored lab2 directory.
Set lab3-debug_dist as the active design
Click and Drag pcaddresscomputer.v, controller.v, piperegisters.v, comparator.v and cpu.bde from lab3 to lab3-debug_dist.

Phase 1: Forwarding

The pipeline registers that we added in Lab 3 made it possible to decrease the clock period because the amount of work in a given cycle decreased. Concomitantly, instructions now take 5 cycles to complete. This means that the values available in the register file may be "stale". This happens because the result of an operation is not available to later instructions until the instruction reaches the Write-Back (WB) stage.

The solution to this problem is to pass the "fresh" values for registers back to earlier stages. These "fresh" values can then replace the "stale" ones from the register file. This practice is called Forwarding or Bypassing (we'll use Forwarding ). (See COD:3e 402-412 for more)

WARNING: The test fixtures are very name-sensitive. Double-check all wire and component names.

Part A. Execution Forwarding

Consider the following code sequence:

add $5,$3,$4
add $6,$5,$6
add $7,$7,$5

Using the processor from Lab 3 this would run into two issues, because the second and third add instructions would read the old value for $r5. When the second instruction is in the Execute stage, the new value of $r5 is stored in the EX/MEM register. So we need a way to use that value for the computation of $r6. When the third add instruction is in the Execute stage, the updated value of $r5 is stored in the MEM/WB register. This means we'll need a multiplexer on the A input to the ALU to choose between the new value of $r5 and the one read from the register file.

Task i: Execution Forwarding Logic

The fundamental idea in forwarding is choosing the most recent value for a specific register. In hardware a choice is synonymous with a multiplexer. In this case the multiplexers used to compute the return address in the JAL and JALR instructions will be replaced by larger versions that allow more choices.

Add an output port name Load. Connect it up (Should be analagous to Store)
Replace the 2:1 multiplexer that chooses between EX_NextPC and EX_RS_IN with a m32w_4_1. Name the control input Fwd_EX_RS(1:0) and use the following connections
1. EX_RS_IN(31:0)
2. EX_NextPC(31:0)
3. Value from MEM Stage (31:0)
4. Value from WB Stage (31:0)
Replace the 2:1 multiplexer that chooses between constant 4 and EX_RT_IN with a m32w_4_1. Name the control input Fwd_EX_RT(1:0) and use the following connections
1. EX_RT_IN(31:0)
2. constant 4
3. Value from MEM Stage (31:0)
4. Value from WB Stage (31:0)

Task ii: Execution Forwarding Detection

The multiplexers allow the most recent value to be used. Determining when this applies is slightly more complex. Use the provided skeleton file forwardingunit.v and add logic to compute the control signals Fwd_EX_RS, and Fwd_EX_RT. Don't forget that the JAL and JALR computation also uses the same multiplexers.

Each instruction that updates the register file specifies its destination register. If that destination register matches one of the source registers of the next two instructions, then there's a potential conflict. Other things to keep in mind:

Don't forward if the destination register is $zero. ( result is always 0 )
Don't forward if the instruction doesn't update the register file.
Forward from the closest instruction. MEM takes precedence over WB.

Task iii: Testing

To verify the correctness of your implementation, use the script phase1a_tf_runtest.do in the Tests directory. Before you do, make the following change:

Open cpu_wrapper.v
Replace the assignment to IF_Stall with this: IF_Stall = 0; // CPU.IDEXReg.Clear
Save, and run the test.

Part B. Branch Forwarding

The dependencies between computational instructions were handled in part A, but branch instructions and the JR and JALR also suffer from hazards. For the branches, it's essential that the correct values are compared to determine the outcome. For JR and JALR, the issue is jumping to the correct destination. In most cases, we can avoid branch hazards by forwarding (the other cases are addressed in Phase 2). The new data must be forwarded to the ID stage.

Note: DO NOT add a forwarding path from EX to ID

Task i: Branch Forwarding Logic

The logic is similar to execution forwarding logic, but less complex. The reason is that the register file is write-first, meaning that values in the WB stage are written to the register file before the outputs from the register file are read in the ID stage. This means that values in the WB stage are available in the ID on the same cycle, so there is no need to forward to the ID stage from the WB stage.

Add a m32w_2_1 multiplexer between the Regfile and Comparator. Call the control signal Fwd_ID_RS.
Add a m32w_2_1 multiplexer between the Regfile and Comparator. Call the control signal Fwd_ID_RT.
Change the input to the PCAddressComputer to use the forwarded value of RS (if necessary)

Task ii: Branch Forwarding Detection

The detection of forwarding in the branch case is virtually identical to the process for the execution forwarding. Specific issues:

WB_WriteReg is available through the regfile
JR and JALR only forward to the RS input

Finish implementing the forwardingunit to produce the Fwd_ID_RS and Fwd_ID_RT control signals.

Task iii: Testing

To verify the correctness of your implementation, use the script phase1b_tf_runtest.do in the Tests directory.

Part C: Integrating the Forwarding Unit

This should be a formality if all the names have been kept correctly.

Open cpu.bde and use the Symbols Toolbox to add an instance of your forwardingunit
Make sure that PCAddressComputer, Controller, and forwardingunit use the same naming convention for JR and Branch
Right-Click the new unit, and choose "Add Stubs" to create wires
Save, Compile, and test using phase1c_tf_runtest.do in the Tests directory.

Phase 2: Hazards and Bubbles

The forwarding logic addresses most of the problems caused by dependencies, but there are still two cases that are two expensive to handle with forwarding. Instead, the dependent instructions waits in the ID stage until the forwarding mechanism is able to take effect. To enforce this delay we introduce a bubble into the pipeline.

Part A: Bubbles

A bubble serves as a place-holder and does not update any visible state. The instruction nop has the same behavior, but there is an essential difference. The difference is that nop is specified by the programmer or compiler, while a bubble is dynamically inserted in the pipeline to preserve proper execution behavior. Specifically, the pipeline inserts a bubble when no forwarding path exists.

Inserting a bubble requires two things:

preventing the PC and IDEXReg from updating
prevent the instruction in ID from changing processor state

Task i: Flushing IDEXReg

The only ESSENTIAL changes to insert a bubble are to disable RegWrite and Store. These signals control the persistent state in the processor. For clarity we will make a more dramatic move and zero the entire IDEXReg

Add an input port named Clear to IDEXReg
Modify the register to zero all signals when Clear is asserted
Save, Compile, and open cpu.bde.
Do "Compare Symbol with Contents" to update the IDEXReg symbol.
Open cpu_wrapper.v
Replace the assignment to IF_Stall with: IF_Stall = CPU.IDEXReg.Clear;
Save and Compile cpu_wrapper.v

Task ii: Stalling Instruction Fetch

The pipeline registers and PC all have a signal called LoadEnable. When this signal is set, the register updates on the rising edge of the clock. In Lab3 these signals were all wired to logical 1.

Delete wire LoadEnable from EXMEMReg, and MEMWBReg
Change the wire named LoadEnable on the VCC to Always
Add new wires named LoadEnable to PC, IFIDReg, and IDEXReg
Change EXMEMReg and MEMWBReg to use Always as the enable signal

Part B: Load Hazards

The logic we added in Phase 1 passes ALU results to later instructions to prevent errors caused by dependencies between instructions. This mechanism eliminates the majority of conflicts between neighboring instructions, but fails when later instructions depend on a load instructions. Consider the following code:

lb $r2, 0($r9)
andi $r3, $r2, 0xFA
...

This code loads a byte from memory, applies a mask (via the andi) and then branches based on the result. The problem is that the standard forwarding path from the MEM to EX stage passes the ALU result. When the load is in the MEM stage, the ALU result is the address of the load, not the resulting value.

The solution is to insert a bubble between the Load instruction and the dependent operation. Inserting a bubble means that the lb is in the WB stage ( with result in the MEM/WB register ) when the andi reaches the EX stage. At that point the standard forwarding mechanism takes over.

The hazardunit.v is a skeleton for a hazard detection unit. If a hazard is detected, the LoadEnable signal goes low, and the IDEXReg_Clear signal is asserted.

Add logic to detect Load / EX hazards.

Part C: Branch Hazards

The branch decision takes place in the ID stage to remove the need for branch prediction. The problem with this decision is that it introduces a hazard. Consider the following code:

add $r3, $r2, $r3
add $r2, $r4, $r6
beq $r3, $r5, success
...
add $r2, $r2, $r3
beq $r3, $r2, success

The first branch will complete properly because the both $r3 can be forward from the MEM stage to the ID stage. The second branch will fail because there is no forwarding path from EX to ID. The solution is to detect the second case and insert a bubble. This example illustrates the most common branch hazard.

Unfortunately, branches are also affected by the load hazards covered in Part B above. Consider the following code:

lw $r4, 0($r6)
bne $r4, $0, loop
...

This sequence of code requires two bubbles to be inserted. The base branch hazard accounts for the first bubble. That leaves the branch in the ID stage, while the Load is in the MEM Stage. The new value of $r4 is still coming from memory, but the bne need $r4 to make the branch decision. The solution is to insert a second bubble and wait an additional cycle. At that point the Load is in the ID stage, and $r4 is available to the branch via through the register file.

Extend hazardunit.v to detect EX/ID hazard
Extend hazardunit.v to detect MEM/ID hazard (Only for Loads)

Part D: Testing

Test hazardunit in isolation using phase2a_tf_runtest.do.
Open cpu.bde and integrate a hazardunit.
Test the integration with test bench through the following steps:
- Download the update package and unpack it to the src directory in your design.
- Compile all the files in the project.
- Set test-board.bde as the top level for simulation.
- Run the script to make a hex file of your program that can be used for simulation by typing 'runscript tests/generateSerialTFInput.do assembly/your_assembly_file.s' in the AHDL console.
- Open the eclab_wave.awf file and have it open for simulation. Add any wires that you are interested in to this file.
- When you start the simulation, it will simulate booting into the BIOS and loading your program into memory across the serial port. This will take quite a bit of time in simulation, so you will probably want to simulate at least 10,000 ns as your program will likely not be fully loaded before this point. When the PC goes to 0x00400000, your program has been loaded and started up by the BIOS.
- Verify that your program works in simulation. You may want to try a variety of programs that use different types of instructions to make sure that all the parts of your processor are operational.

Turnin

To turn in your lab, complete all phases and use the Design -> Archive Design command in ActiveHDL on the final product to produce a .zip file containing all the essential files. Put this somewhere accessible via attu and use "turnin -c cse378 -p lab3b 'your file'" to submit it for grading. Late turnins should be submitted via "turnin -c cse378 -p lab3b_late 'your file'".