Assigned: 3 November 2007
Due: 16 November 2007
The processor constructed in the previous labs implements a significant part of the MIPS instruction set, but does so very slowly. The problem is that the clock speed is constrained by the longest path a signal can take in a single clock cycle. For the first two labs this path started at the register file, went through the ALU, the memory, and finally back into the register file. In this lab, we'll be shrinking the longest path significantly by breaking the datapath into 5 separate pieces or stages. Adding registers between stages will mean that the longest path a signal must traverse in a single cycle will decrease dramatically. This speed-up will come at a cost, because it will now take 5 cycles for an instruction to complete. To offset this slowdown we'll be able to run independent instructions in each stage, for a maximum of 5 instructions "in-flight", leading to a maximum throughput of 1 instruction per cycle.
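The trade-off described above can be put into numbers with a short Python sketch. The stage delays below are hypothetical values chosen for illustration, not measurements of the lab design, and the function names are made up for this example. Pipeline register overhead is ignored.

```python
# Hypothetical per-stage delays in nanoseconds (IF, ID, EX, MEM, WB).
# These are illustrative assumptions, not numbers from the lab design.
STAGE_DELAYS_NS = [2, 1, 2, 2, 1]


def single_cycle_total_ns(num_instructions, stage_delays=STAGE_DELAYS_NS):
    """Single-cycle design: the clock period must cover the whole path."""
    return num_instructions * sum(stage_delays)


def pipelined_cycles(num_instructions, num_stages=5):
    """Cycles to finish a run of independent instructions in a full pipeline.

    The first instruction needs num_stages cycles; each later instruction
    completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def pipelined_total_ns(num_instructions, stage_delays=STAGE_DELAYS_NS):
    """Pipelined design: the clock period is set by the slowest stage."""
    cycles = pipelined_cycles(num_instructions, len(stage_delays))
    return cycles * max(stage_delays)
```

With these assumed delays, 100 independent instructions take 800 ns on the single-cycle design and 208 ns when pipelined, even though each individual instruction now needs 5 cycles.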
As usual, errata for this lab can be found at the lab 3 wiki page.
Background: (pages 370-374, 384-402 from the textbook)
This lab will take place in the same workspace as the previous labs. The files for this lab are provided as an archived design. You will need to restore the design, add it to the workspace, and copy several design files from lab2 into the new design. Download the archived design lab3.zip and follow these steps:
At this point the files are all available. Now we need to tell Active-HDL about the new design.
The fundamental idea behind pipelining is to separate the datapath into individual stages. Each stage will take one clock cycle to complete, and can contain one instruction. This phase will describe the elements of the stages and the pipeline registers that function as the barrier between stages. A pipeline register saves the values from the previous stage so that each stage can perform an independent instruction. In addition, the pipeline registers must have a reset signal so that their initial state is known, and a load enable signal. The load enable will be used later to ensure that certain registers do not update their values in special cases.
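The behaviour of a pipeline register with reset and load enable can be modelled in a few lines of Python. This is only a behavioural sketch: the real registers are Verilog modules in piperegisters.v, and the class name, the clear-to-zero reset value, and the choice that reset takes priority over load enable are assumptions made for this example.

```python
class PipelineRegister:
    """Behavioural model of a pipeline register (illustrative only)."""

    def __init__(self, fields):
        self._fields = tuple(fields)
        # Assumption: every field starts out cleared to zero.
        self.q = {name: 0 for name in self._fields}

    def clock(self, d, reset=False, load_enable=True):
        """Model one rising clock edge and return the stored values.

        d maps each field name to the value presented by the previous stage.
        """
        if reset:
            # Assumption: reset wins over load enable and clears all fields.
            self.q = {name: 0 for name in self._fields}
        elif load_enable:
            # Normal operation: capture the previous stage's outputs.
            self.q = {name: d[name] for name in self._fields}
        # With load_enable low and no reset, the old values are held.
        return self.q
```

For example, a register built with `PipelineRegister(["NextPC", "Inst"])` keeps its previous contents when `clock` is called with `load_enable=False`, which is the behaviour the load enable signal exists to provide.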
The following stage descriptions explain the components that should be in each stage. Please scan the descriptions before making any changes, and be sure to refer back as things progress.
Instruction Fetch (IF) Stage
The first stage is responsible for getting the current instruction from the instruction memory at the address specified by the program counter (PC). The memory access is a slow process so the resulting instruction goes straight into the pipeline register. In parallel with the fetch, this stage must also compute the next PC address (PC+4). This stage stores the instruction into the IF/ID pipeline register, along with the PC+4 value for use in JAL and JALR instructions.
- Create a wire from the Inst input and name it IF_Inst. Use this when you need to reference the instruction in the IF stage.
- Create a VCC component and name it LoadEnable
  - Hit ESC to set cursor to the arrow
  - Hit (P) to change cursor to VCC
- Change PC to use LoadEnable
  - Right-Click the Program Counter and select "Replace Symbol"
  - If lib378 appears then skip the next step
  - Otherwise, Right-click the middle window and choose "Select Libraries", Expand the "User Libraries" and choose lib378
  - Choose register_re and hit OK
  - Add a wire LoadEnable to the LD pin of the Program Counter
- Create an IFIDReg pipeline register and name it IFIDReg
  - Use the Symbols Toolbox to add an IFIDReg to cpu.bde
  - Select the new IFIDReg component, and hit ALT+Enter to open the properties dialog
  - Set the name to IFIDReg ESSENTIAL!!
  - Connect PC+4 to the input IF_NextPC of IFIDReg
  - Connect Inst to IF_Inst of IFIDReg
  - Use the "Add Stubs" menu option to complete signal generation for IFIDReg
Instruction Decode (ID) Stage
This stage is responsible for the majority of the work. It reads an instruction from the IF/ID pipeline register, decodes the instruction, generates control signals, reads values from the register file and performs comparisons for branches (more on that later). This stage stores control signals for all later stages, data values for A and B inputs to the ALU, the result from the extender unit, the PC+4 value, the destination register index (used in lab 4), and control information needed by the ALU into the ID/EX pipeline register.
Execution (EX) Stage
The ALU is the heart of the processor so it gets a stage all of its own. This stage contains the ALU and multiplexers that determine which values are used as the input to the ALU. It stores the output from the ALU, the data value from the RT field of the instruction, and the destination register to the EX/MEM pipeline register along with necessary control signals.
Memory Access (MEM) Stage
This stage provides access to memory for a load or store instruction. The EX/MEM register provides the address and data for the memory as well as control signals. The output from the memory is stored in the MEM/WB register along with the ALU output from the EX/MEM register.
Write-back (WB) Stage
This stage is included to separate memory accesses from the register file. In this stage, the value to write to the register file is determined, and the register file is potentially updated based on control signals.
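To see how the five stages overlap, the following Python sketch lists which instruction occupies each stage in each cycle. It assumes a stream of independent instructions with no stalls, and the function name and instruction labels are placeholders invented for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_occupancy(instructions):
    """Return one dict per cycle mapping stage name -> instruction (or None).

    Instruction i enters IF in cycle i and moves one stage per cycle,
    assuming no stalls or hazards.
    """
    if not instructions:
        return []
    total_cycles = len(instructions) + len(STAGES) - 1
    table = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            index = cycle - depth  # which instruction has reached this stage
            in_range = 0 <= index < len(instructions)
            row[stage] = instructions[index] if in_range else None
        table.append(row)
    return table


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_occupancy(["i0", "i1", "i2"])):
        print(cycle, [row[stage] or "--" for stage in STAGES])
```

Each row of the result corresponds to one clock cycle. The pipeline registers are what allow every column to hold a different instruction at the same time.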
This part provides a strategy for organizing the ID stage and completing the pipeline register that ends the stage. The file piperegisters.v contains an incomplete IDEXReg module. However, before completing the register there are other issues to address.
Inst to use the name ID_Inst
ID_WriteReg(4:0)
At this point it's time to address the IDEXReg component defined in piperegisters.v. Your task is to define the input and output ports for the additional data needed in the EX stage and the control signals needed in the EX, MEM, and WB stages. As a starting point, ID_RegWrite is an input control signal and EX_RegWrite is the output of the same signal.
This part provides a strategy for organizing the EX stage and completing the pipeline register that ends the stage. The file piperegisters.v contains an incomplete EXMEMReg module. However, before completing the register there are other issues to address.
EX_Inst
Your task is to define the input and output ports for the control signals that get passed through the EX stage for use in the MEM and WB stages.
The memory stage provides the address and data for writing to memory or reading from memory. It sends the value read from memory and the address (ALUOut) on to the WB stage. The file piperegisters.v contains an incomplete MEMWBReg module. First there are some bookkeeping tasks.
Once these things are complete it's time to finalize the MEMWBReg from piperegisters.v. You need to add ports and logic for the control signals that get passed through the EX stage for use in the WB stage.
This stage basically just separates the memory access time from the update to the register file. At this point, all the pipeline registers are in place, and the last task is to rename some signals to ensure that the proper data is being used.
WriteData input to the Regfile
The cpu.bde has a fairly limited number of input and output ports. To make the test fixtures more effective there is a file cpu_wrapper.v that "peeks" inside the CPU and exposes a wide range of signals for the test fixtures.
Test the updated cpu.bde with test fixture phase1_tf.v. This test fixture runs through all of the non-control instructions, and verifies that everything is connected properly. Problems with branch or jump instructions will be addressed in the test fixture for Phase 2.
The processor from Lab 2 decided the next PC value in every cycle based on the current instruction. Each instruction took a single cycle, so branch instructions would know the resulting PC before the next clock cycle. After pipelining, the control signals are not available until the cycle after an instruction is fetched. This causes a control hazard for branch instructions because we do not know if the branch occurs until the following instruction has been fetched.
There are different ways to deal with this problem, and the MIPS designers decided to turn the hazard into a feature by defining a delay slot. The delay slot is the instruction directly after a branch or jump, and it is always executed, regardless of the branch outcome. This means that no instructions are squashed on branches, because the branch result is known by the time the instruction after the delay slot is fetched. (See COD:3e pp 423-424 for more on delay slots)
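The effect of the delay slot on execution order can be illustrated with a small Python model. It tracks the current and the following instruction index, which is one common way to describe delayed branches; the program encoding and the function name are invented for this sketch, and a branch placed in a delay slot is not handled.

```python
def execution_order(program, max_steps=20):
    """Return the names of the instructions in the order they execute.

    program is a list of (name, taken_target) pairs, where taken_target is
    the index of the branch target for a taken branch or jump, or None for
    any other instruction (including branches that are not taken).
    """
    order = []
    pc, next_pc = 0, 1
    while pc < len(program) and len(order) < max_steps:
        name, taken_target = program[pc]
        order.append(name)
        # The instruction at next_pc (the delay slot) always runs next.
        # A taken branch only changes what runs after the delay slot.
        following = next_pc + 1 if taken_target is None else taken_target
        pc, next_pc = next_pc, following
    return order


PROGRAM = [
    ("branch", 3),      # taken branch to index 3
    ("delay_slot", None),
    ("skipped", None),
    ("target", None),
]
```

Running `execution_order(PROGRAM)` yields `["branch", "delay_slot", "target"]`: the delay slot instruction executes even though the branch is taken, and only the instruction after it is skipped.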
In the preceding labs, branch comparisons were performed by forcing a subtract in the ALU. For single-cycle machines this was an efficient reuse of logic. In a pipelined machine we want to make the branch decision in the ID stage, so we need the result of the comparison in the ID stage.
ID_RS_OUT and ID_RT_OUT buses
ID_EQ
Zilch (from ALU) with ID_EQ
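As a behavioural sketch of the ID-stage comparison, the Python below computes an equality flag from the two register values and uses it to decide a branch. The signal name ID_EQ comes from this lab; the function names, and the assumption that only BEQ and BNE style branches are decided this way, are illustrative.

```python
MASK32 = 0xFFFFFFFF


def id_eq(rs_value, rt_value):
    """Equality of the two 32-bit register values read in the ID stage."""
    return (rs_value & MASK32) == (rt_value & MASK32)


def branch_taken(is_beq, is_bne, rs_value, rt_value):
    """Decide a branch in ID without going through the ALU.

    is_beq / is_bne are hypothetical decoded control flags.
    """
    eq = id_eq(rs_value, rt_value)
    return (is_beq and eq) or (is_bne and not eq)
```

Because the comparison only needs equality, a dedicated comparator is much cheaper than waiting for a subtract in the EX stage.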
In labs 1 and 2 the branch address was computed by adding the offset to the address of the next instruction. The simplest way to do this computation was to add the branch offset to PC+4. That would ensure that the proper instruction was targeted. Now, the branch decision is delayed one cycle, meaning that PC+4 is actually two instructions after the branch. There are a number of ways to compute the correct branch target address. The deciding factor is the complexity of the logic required.
The minimal solution is to pass both PC and PC+4 (IF_NextPC) to the pcaddresscomputer. Branch and jump address computations use PC, while PC+4 is used as the default, or for branches that aren't taken.
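The selection described above can be sketched in Python as follows. Here `pc` is the address currently in the program counter, which is the address of the delay slot instruction while the branch sits in the ID stage. The function and parameter names are hypothetical, the field widths follow the standard MIPS branch and jump formats, and register-based jumps are left out.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a signed value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc, if_next_pc, branch_taken=False, imm16=0,
            jump=False, index26=0):
    """Pick the next PC value (illustrative model of the selection logic)."""
    if jump:
        # Jump: upper 4 bits of PC joined with the 26-bit index shifted by 2.
        return (pc & 0xF0000000) | ((index26 & 0x3FFFFFF) << 2)
    if branch_taken:
        # Branch: word offset is relative to PC (the delay slot address).
        return (pc + (sign_extend16(imm16) << 2)) & MASK32
    # Default, including branches that are not taken: PC + 4.
    return if_next_pc & MASK32
```

For a branch located at 0x100 with an offset of 3 words, PC holds 0x104 when the decision is made, so the taken target is 0x104 + 12 = 0x110, which matches the target the single-cycle design computed from its own PC+4.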
NextPC
NextPC in non-branch / non-jump cases
The JAL and JALR instructions store the address of the "next" instruction to the register file. In Lab 2 the "next" instruction was the immediately following instruction, meaning that PC+4 was the stored return address. The addition of the delay slot changes this behavior, because the instruction directly after the JAL or JALR is in the delay slot. Logically, it takes place before the jump into the subroutine. So, the "next" instruction is the instruction after the delay slot. This means that these instructions should store PC+8. Rather than add a separate +8 adder, take advantage of the following trivial equivalence:
PC + 8 = (PC + 4) + 4 = NextPC + 4
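A quick Python check of this equivalence, assuming the NextPC value carried alongside the JAL or JALR instruction is that instruction's own PC+4. The function name is made up for this sketch.

```python
MASK32 = 0xFFFFFFFF


def link_address(next_pc_of_jump):
    """Return address for JAL/JALR: the instruction after the delay slot.

    next_pc_of_jump is the PC+4 value saved with the jump instruction, so
    adding 4 gives PC+8 without a separate +8 adder.
    """
    return (next_pc_of_jump + 4) & MASK32
```

For a JAL at address 0x200 the saved NextPC is 0x204, the delay slot sits at 0x204, and the stored return address is 0x208.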
Test the updated cpu.bde with test fixture phase2_tf.v. This test fixture is designed to exercise the branch and jump instructions to ensure that the correct comparisons were made and that the return addresses are being computed properly.