Lab 3: Pipelining

Assigned: 3 November 2007
Due: 16 November 2007

Description

The processor constructed in the previous labs implements a significant part of the MIPS instruction set, but does so very slowly. The problem is that the clock speed is constrained by the longest path a signal can take in a single clock cycle. For the first two labs this path started at the register file, went through the ALU, the memory, and finally back into the register file. In this lab, we'll be shrinking the longest path significantly by breaking the datapath into 5 separate pieces or stages. Adding registers between stages will mean that the longest path a signal must traverse in a single cycle will decrease dramatically. This speed-up will come at a cost, because it will now take 5 cycles for an instruction to complete. To offset this slowdown we'll be able to run independent instructions in each stage, for a maximum of 5 instructions "in-flight", leading to a maximum throughput of 1 instruction per cycle.

As usual, errata for this lab can be found at the lab 3 wiki page.

Background: (pages 370-374, 384-402 from the textbook)

Phase 0: Administration

This lab will take place in the same workspace as the previous labs. The files for this lab are provided as an archived design. You will need to restore the design, add it to the workspace, and copy several design files from lab2 into the new design. Download the archived design lab3.zip and follow these steps:

Download lab3.zip.
Start Active-HDL and open the cse378 workspace used in previous labs.
Select Design > Restore Design from the menu.
Browse to the downloaded lab3.zip file.
Set the Restore To: directory to the cse378 directory that contains the folders lab1, lab2 and lib378 and click Finish.

At this point the files are all available. Now we need to tell Active-HDL about the new design.

Right-Click the yellow workspace icon named cse378 in the design browser and choose "Add Existing Design to Workspace".
Find and open the file "lab3.adf" in the newly restored lab3 directory.
Set lab3 as the active design
Click and Drag pcaddresscomputer.v, controller.v and cpu.bde from lab2 to lab3.

Phase 1: Partitioning and Pipeline Registers

The fundamental idea behind pipelining is to separate the datapath into individual stages. Each stage will take one clock cycle to complete, and can contain one instruction. This phase will describe the elements of the stages and the pipeline registers that function as the barrier between stages. A pipeline register saves the values from the previous stage so that each stage can perform an independent instruction. In addition, the pipeline registers must have a reset signal so that their initial state is known, and a load enable signal. The load enable will be used later to ensure that certain registers do not update their values in special cases.
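To make the reset and load enable behavior concrete, here is a minimal sketch of a generic pipeline register. It is not one of the modules in piperegisters.v: the module name, the WIDTH parameter, and all port names are hypothetical, and the choice of a synchronous reset is an assumption.

```verilog
// Hypothetical generic pipeline register. The module name, parameter, and
// port names are illustrative only; they are NOT the names the test
// fixtures expect, and the synchronous reset is an assumption.
module pipe_reg_sketch #(parameter WIDTH = 32) (
    input  wire             clk,
    input  wire             reset,  // forces a known initial state
    input  wire             load,   // load enable: hold the old value when 0
    input  wire [WIDTH-1:0] d,      // value produced by the previous stage
    output reg  [WIDTH-1:0] q       // value seen by the next stage
);
    always @(posedge clk) begin
        if (reset)
            q <= {WIDTH{1'b0}};     // start from all zeros
        else if (load)
            q <= d;                 // capture the previous stage's output
        // otherwise q keeps its value, so the stage holds its instruction
    end
endmodule
```

With load tied high (as the VCC component named LoadEnable does in this lab), the register simply passes each value to the next stage one clock edge later.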

WARNING: The test fixtures are very name-sensitive. Double-check all wire and component names.

Part A. Organizing Stage Logic

The following stage descriptions explain the components that should be in each stage. Please scan the descriptions before making any changes, and be sure to refer back as things progress.

Instruction Fetch (IF) Stage

The first stage is responsible for getting the current instruction from the instruction memory at the address specified by the program counter (PC). The memory access is a slow process so the resulting instruction goes straight into the pipeline register. In parallel with the fetch, this stage must also compute the next PC address (PC+4). This stage stores the instruction into the IF/ID pipeline register, along with the PC+4 value for use in JAL and JALR instructions.

Create a wire from the Inst input and name it IF_Inst. Use this when you need to reference the instruction in the IF stage.
Create a VCC component and name it LoadEnable

Hit ESC to set cursor to the arrow

Hit (P) to change cursor to VCC

Change PC to use LoadEnable

Right-Click the Program Counter and select "Replace Symbol"

If lib378 appears then skip the next step

Otherwise, Right-click the middle window and choose "Select Libraries", Expand the "User Libraries" and choose lib378

Choose register_re and hit OK

Add a wire LoadEnable to the LD pin of the Program Counter

Create an IFIDReg pipeline register and name it IFIDReg

Use the Symbols Toolbox to add an IFIDReg to cpu.bde

Select the new IFIDReg component, and hit ALT+Enter to open the properties dialog

Set the name to IFIDReg ESSENTIAL!!

Connect PC+4 to the input IF_NextPC of IFIDReg

Connect Inst to IF_Inst of IFIDReg

Use the "Add Stubs" menu option to complete signal generation for IFIDReg

Instruction Decode (ID) Stage

This stage is responsible for the majority of the work. It reads an instruction from the IF/ID pipeline register, decodes the instruction, generates control signals, reads values from the register file and performs comparisons for branches (more on that later). This stage stores control signals for all later stages, data values for A and B inputs to the ALU, the result from the extender unit, the PC+4 value, the destination register index (used in lab 4), and control information needed by the ALU into the ID/EX pipeline register.

Execution (EX) Stage

The ALU is the heart of the processor so it gets a stage all of its own. This stage contains the ALU and multiplexers that determine which values are used as the input to the ALU. It stores the output from the ALU, the data value from the RT field of the instruction, and the destination register to the EX/MEM pipeline register along with necessary control signals.

Memory Access (MEM) Stage

This stage provides access to memory for a load or store instruction. The EX/MEM register provides the address and data for the memory as well as control signals. The output from the memory is stored in the MEM/WB register along with the ALU output from the EX/MEM register.

Write-back (WB) Stage

This stage is included to separate memory accesses from the register file. In this stage, the write value for the register file is determined, and the register file is potentially updated based on control signals.

Part B: Instruction Decode (ID) Stage

This part provides a strategy for organizing the ID stage and completing the pipeline register that ends the stage. The file piperegisters.v contains an incomplete IDEXReg module. However, before completing the register there are other issues to address.

Replace the register file with a registerfile2 component. (See Above for more)
Make space between the Regfile and any multiplexers or wires ( Hint: Select wires, hit Shift and drag to separate )
Update buses that include the name Inst to use the name ID_Inst
Disconnect the destination register logic from the register file and name the output bus ID_WriteReg(4:0)
Compose a list of control signals used in the EX, MEM, and WB stages

At this point it's time to address the IDEXReg component defined in piperegisters.v. Your task is to define the input and output ports for the additional data needed in the EX stage and the control signals needed in the EX, MEM, and WB stages. As a starting point, ID_RegWrite is an input control signal and EX_RegWrite is the output of the same signal.

Complete the definition of IDEXReg and compile piperegisters.v.
Return to cpu.bde and use the Symbols Toolbox to add an instance of IDEXReg
Set the name to IDEXReg ESSENTIAL!!
Add buses for data and control inputs to IDEXReg
Use "Add Stubs" to generate buses for the rest of the ports.
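The idea of a control signal travelling along with its instruction can be sketched as follows. Only ID_RegWrite and EX_RegWrite are names taken from this handout. The module name, the clk, reset, and load ports, and the MEM_RegWrite and WB_RegWrite names are assumptions, and in the actual lab each hop lives in its own pipeline register module rather than in one block.

```verilog
// Illustrative only: one control bit moving down the pipeline.
// ID_RegWrite and EX_RegWrite come from the handout; every other name
// here (module, clk, reset, load, MEM_RegWrite, WB_RegWrite) is assumed.
module regwrite_chain_sketch (
    input  wire clk,
    input  wire reset,
    input  wire load,
    input  wire ID_RegWrite,   // produced by the controller in the ID stage
    output reg  EX_RegWrite,   // same bit, one cycle later
    output reg  MEM_RegWrite,  // two cycles later
    output reg  WB_RegWrite    // three cycles later, when the write happens
);
    always @(posedge clk) begin
        if (reset) begin
            EX_RegWrite  <= 1'b0;
            MEM_RegWrite <= 1'b0;
            WB_RegWrite  <= 1'b0;
        end else if (load) begin
            // Nonblocking assignments shift the bit one stage per clock
            EX_RegWrite  <= ID_RegWrite;
            MEM_RegWrite <= EX_RegWrite;
            WB_RegWrite  <= MEM_RegWrite;
        end
    end
endmodule
```

Every control signal on your list needs the same treatment, but only as far as the last stage that uses it.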

Part C: Execute (EX) Stage

This part provides a strategy for organizing the EX stage and completing the pipeline register that ends the stage. The file piperegisters.v contains an incomplete EXMEMReg module. However, before completing the register there are other issues to address.

Change the ALUControl inputs to use EX_Inst
Rename Control Signals to use the versions from the IDEXReg
Compose a list of control signals used in the MEM and WB stages

Your task is to define the input and output ports for the control signals that get passed through the EX stage for use in the MEM and WB stages.

Complete the definition of EXMEMReg and compile piperegisters.v.
Return to cpu.bde and use the Symbols Toolbox to add an instance of EXMEMReg
Set the name to EXMEMReg ESSENTIAL!!
Add buses for data and control inputs to EXMEMReg
Use "Add Stubs" to generate buses for the rest of the ports.

Part D: Memory (MEM) Stage

The memory stage provides the address and data for writing to memory or reading from memory. It sends the value read from memory and the address (ALUOut) on to the WB stage. The file piperegisters.v contains an incomplete MEMWBReg module. First there are some bookkeeping tasks.

Update the output ports that interact with memory to use the values from the EXMEMReg.
Rename the ports that take inputs from Memory to reflect that they occur in this stage.

Once these things are complete it's time to finalize the MEMWBReg from piperegisters.v. You need to add ports and logic for the control signals that get passed through the MEM stage for use in the WB stage.

Complete the definition of MEMWBReg and compile piperegisters.v.
Return to cpu.bde and use the Symbols Toolbox to add an instance of MEMWBReg
Set the name to MEMWBReg ESSENTIAL!!
Add buses for data and control inputs to MEMWBReg
Use "Add Stubs" to generate buses for the rest of the ports.

Part E: Write-Back (WB) Stage

This stage basically just separates the memory access time from the update to the register file. At this point, all the pipeline registers are in place, and the last task is to rename some signals to ensure that the proper data is being used.

Change the Regfile control signals to reflect their source in the WB stage.
Correct the sources of the mux that determines the WriteData input to the Regfile

Part F: Testing

The cpu.bde has a fairly limited number of input and output ports. To make the test fixtures more effective there is a file cpu_wrapper.v that "peeks" inside the CPU and exposes a wide range of signals for the test fixtures.

Test the updated cpu.bde with test fixture phase1_tf.v. This test fixture runs through all of the non-control instructions, and verifies that everything is connected properly. Problems with branch or jump instructions will be addressed in the test fixture for Phase 2.

Phase 2: Branching and Delay Slots

The processor from Lab 2 decided the next PC value in every cycle based on the current instruction. Each instruction took a single cycle, so branch instructions would know the resulting PC before the next clock cycle. After pipelining, the control signals are not available until the cycle after an instruction is fetched. This causes a control hazard for branch instructions because we do not know if the branch occurs until the following instruction has been fetched.

There are different ways to deal with this problem, and the MIPS designers decided to turn the hazard into a feature by defining a delay slot. The delay slot is the instruction directly after a branch or jump, and is always executed, regardless of branch outcome. This means that no instructions are squashed on branches, because the branch result is known for the instruction after the delay slot. (See COD:3e pp 423-424 for more on delay slots)

Part A. Branch Comparisons

In the preceding labs, branch comparisons were performed by forcing a subtract in the ALU. For single-cycle machines this was an efficient reuse of logic. In a pipelined machine we want to make the branch decision in the ID stage, so we need the result of the comparison in the ID stage.

Add a comparator component to the ID stage of your pipeline

Compare the ID_RS_OUT and ID_RT_OUT buses

Name the output signal ID_EQ

Replace uses of Zilch ( from ALU ) with ID_EQ
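For reference, the comparison itself amounts to a single equality test. The sketch below uses the bus and signal names from the steps above, but the module name is hypothetical and the 32-bit width is assumed from the MIPS register size. In the lab you should still place a comparator component in cpu.bde rather than write your own.

```verilog
// Hypothetical equality comparator for the ID stage. The module name is
// an assumption; the port names follow the signal names in this handout.
module eq_compare_sketch (
    input  wire [31:0] ID_RS_OUT,  // RS value read from the register file
    input  wire [31:0] ID_RT_OUT,  // RT value read from the register file
    output wire        ID_EQ       // 1 when the two values are equal
);
    assign ID_EQ = (ID_RS_OUT == ID_RT_OUT);
endmodule
```

Because this is purely combinational, ID_EQ is valid in the same cycle that the register file outputs are, which is what lets the branch decision happen in ID.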

Part B. PC Address Computation

In labs 1 and 2 the branch address was computed by adding the offset to the address of the next instruction. The simplest way to do this computation was to add the branch offset to PC+4. That would ensure that the proper instruction was targeted. Now, the branch decision is delayed one cycle, meaning that PC+4 is actually two instructions after the branch. There are a number of ways to compute the correct branch target address. The deciding factor is the complexity of the logic required.

The minimal solution is to pass both PC and PC+4 (IF_NextPC) to the pcaddresscomputer. Branch and jump address computations use PC, while PC+4 is used as the default, or for branches that aren't taken.

Open pcaddresscomputer.v and add new input port NextPC

Change the logic to use NextPC in non-branch / non-jump cases

Save and compile pcaddresscomputer.v

Open cpu.bde, right-click pcaddresscomputer and select "Compare symbol with contents"

In the new window check the button "Update Symbol inside Block Diagram" and hit OK.

Right-click the pcaddresscomputer and use "Edit" or "Edit Symbol in Other Window" to fix the pins

Make the proper wiring changes
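The selection logic described in this part might look roughly like the sketch below. Only PC and NextPC are names from this handout. The module name and every other port are assumptions, and your pcaddresscomputer.v from lab 2 almost certainly has different ports and control encodings, so treat this as a picture of the idea rather than code to paste in.

```verilog
// Sketch of next-PC selection with a delay slot. Only PC and NextPC are
// names from the handout; all other names are assumed for illustration.
module next_pc_sketch (
    input  wire [31:0] PC,           // address currently being fetched
    input  wire [31:0] NextPC,       // PC + 4
    input  wire [31:0] BranchOffset, // sign-extended 16-bit immediate
    input  wire [25:0] JumpIndex,    // 26-bit target field of J / JAL
    input  wire [31:0] RegTarget,    // RS value for JR / JALR
    input  wire        TakeBranch,   // branch in ID and its condition holds
    input  wire        Jump,         // J or JAL in ID
    input  wire        JumpReg,      // JR or JALR in ID
    output reg  [31:0] NewPC
);
    always @(*) begin
        if (JumpReg)
            NewPC = RegTarget;
        else if (Jump)
            NewPC = {PC[31:28], JumpIndex, 2'b00};
        else if (TakeBranch)
            // PC already points at the delay slot, i.e. branch address + 4
            NewPC = PC + {BranchOffset[29:0], 2'b00};
        else
            NewPC = NextPC;          // default, or branch not taken
    end
endmodule
```

The key point is that while the branch sits in ID, PC holds the address of the delay slot instruction, which is the same value that PC+4 had in the single-cycle design.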

Part C. Delay Slots and Return Addresses

    The JAL and JALR instructions store the address of the "next" instruction to the register file. In Lab 2 the "next" instruction was the immediately following instruction, meaning that PC+4 was the stored return address.

    The addition of the delay slot changes this behavior, because the instruction directly after the JAL or JALR is in the delay slot. Logically, it takes place before the jump into the subroutine. So, the "next" instruction is the instruction after the delay slot. This means that these instructions should store PC+8. Rather than add a separate +8 adder, take advantage of the following trivial equivalence:

        PC + 8 = (PC + 4) + 4 = NextPC + 4

Compute NextPC+4 in the EX stage. The computation should be performed in the ALU; do not add a separate adder.
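One possible way to reuse the ALU is to steer NextPC and the constant 4 onto its inputs for JAL and JALR, with the ALU operation set to add. The sketch below shows only that operand selection. All of the names in it (the module, EX_NextPC, EX_A, EX_B, EX_Link, ALU_A, ALU_B) are assumptions, and your datapath may already have multiplexers that can be extended instead.

```verilog
// Hypothetical ALU operand selection for JAL / JALR. Every name here is
// assumed; the ALU itself must be told to add for link instructions.
module link_operand_sketch (
    input  wire [31:0] EX_NextPC,  // PC+4 of the jump, carried down the pipe
    input  wire [31:0] EX_A,       // normal ALU A operand
    input  wire [31:0] EX_B,       // normal ALU B operand
    input  wire        EX_Link,    // 1 for JAL / JALR
    output wire [31:0] ALU_A,
    output wire [31:0] ALU_B
);
    // For link instructions the ALU computes NextPC + 4, which is PC + 8
    assign ALU_A = EX_Link ? EX_NextPC : EX_A;
    assign ALU_B = EX_Link ? 32'd4     : EX_B;
endmodule
```

The ALU result then flows through the EX/MEM and MEM/WB registers like any other result and is written to the register file in WB.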

Part D. Testing

Test the updated cpu.bde with test fixture phase2_tf.v. This test fixture is designed to exercise the branch and jump instructions to ensure that the correct comparisons were made and that the return addresses are being computed properly.

Turnin
To turn in your lab, complete all phases and use the Design > Archive Design command in Active-HDL on the final product to produce a .zip file containing all the essential files. You will submit this file via the Catalyst dropbox at https://catalysttools.washington.edu/collectit/dropbox/summary/luisceze/786