CSE 378 - Winter 2002
Machine Organization and Assembly Language Programming
Homework 6: Pipelined Processor Implementation
Due Monday, March 4th
Purpose:
The purpose of this homework is to learn how pipelines are implemented. To
implement your pipeline, you will be creating stages and saving stage state, and
implementing pipeline forwarding and interlocks.
This assignment is to be done in pairs. You must work with a
different person than you did on Homework #5.
Tasks:
Task 1:
Your task is to take the single-cycle implementation from
Homework #5 and pipeline it. Your
pipeline will be a conventional 5-stage MIPS pipeline.
If you are confident of your solution to the last homework, you may use
that as a starting point for this assignment, since you're already
familiar with it. However, if you are uncertain of your implementation,
we don't want you to be at a disadvantage, so you may request a sample
solution to start from. If you want this option (or you just want to
see how Harlan did it), email harlan
with the request.
As was the case in the last assignment, the ultimate resource for your
questions is the R2000 processor (except for changes listed in this handout and
HW5).
Task 2: Write a (main-program) function that counts the number of
1's in the binary representation of a word, and run this program on your
machine.
The Details
- Pipeline:
- The pipeline is:
Fetch -> Register Read -> Execute -> Memory Access -> Register Writeback
See Figure 6.40 in your book for a detailed explanation of the
pipeline stages.
- Data hazards:
-
You should implement data hazard detection and forwarding logic to resolve
data hazards. For load
delays, if a consumer instruction of a loaded value follows the load
instruction,
your pipeline will have to stall until the load has returned from memory
(see figure 6.46).
For arithmetic operations, forwarding logic should be sufficient to
resolve hazards.
NOTE: We have been talking about hardware that writes to the
register file in the first half of the clock cycle, and reads from it
in the second half. Unfortunately, SMOK does not work this way. For this
reason, you will need to have additional forwarding that goes from
the WB stage to the ID stage. Or you may find some other solution to this
problem.
- Branch hazards:
-
You will need to handle branch hazards. The basic problem is that your pipeline
doesn't know it has fetched a branch instruction until the decode stage, doesn't
know the result of the branch condition until the end of the EX stage, and
doesn't change the PC value until the MEM stage. By
then it will already have fetched the following sequential instructions.
Design hardware to handle branches and control hazards. This will involve
deciding in which stage your branch resolution hardware should be and squashing
any wrong-path instructions that are fetched. Depending on where you place your
branch resolution hardware, you may or may not have to extend the forwarding
logic you built to eliminate data hazards.
- Memory layout:
- Use the same memory layout as Homework #5.
- Instructions:
- Support the following instructions:
- Arithmetic:
- ADDU, ADDIU, SUBU
AND, SRL <--------- these are new for HW6
- Load/Store:
- LW, SW
- Branch:
- BEQ, BNE
- Startup code:
-
You can begin with Homework #5's startup code and modify as appropriate, but you
shouldn't need much. You can just plunk your code at address 0, and put some
kind of termination at the end of it.
- The executing program:
- Write a new procedure in MIPS assembly and compile it down to the
instructions you implemented. This procedure is called
countbits(int x). What this procedure does is it counts the number of 1
bits in a word. In C this procedure may look like:
int countbits(int x) {
int v;
v =0;
while( x != 0 ) {
if ( (x & 1) != 0 )
v = v +1;
x = x >> 1;
}
return ( v );
}
Please follow register conventions--put arguments and return values in the
appropriate places so your code is easy to test.
- Performance
- Although your primary goal is to design and implement a correctly
functioning pipeline and to write a correct test program, try to implement
efficient versions of them both. How many cycles does it take to execute your
code on your pipeline?
How many cycles does your pipeline stall when executing this code?
(for both of these questions, you can use the input 0x6139, but feel free
to test cases of your own design as well.)
- Written Report
- Write a short report (one to two pages is sufficient) describing:
- the design decisions you encountered, with your choices justified. Be
sure to mention what parts of the design you thought were tricky or that you
found a partiticularly nice solution for;
- how you tested your machine, and how you debugged it;
- what you learned, and what you had trouble with;
- the distribution of work between partners.
- Turnin
- Turn in your code online, using
this form,
and turn in your written report in class. Please submit only one
set of files per group -- there is no need for you and your partner to both
submit code (or written reports) separately. Just make sure to indicate
both partners' information on the turnin, so you will both get a grade.
Both written and code parts are due by class on 3/4.
I suggest you finish the code part early, and leave time to describe it
in the written report.
EXTRA CREDIT:
YOU CAN ONLY GET CREDIT FOR THESE IF YOUR BASIC PIPELINE ABOVE
WORKS.
- Implement JAL, JR, and the other necessary opcodes to make fibonacci
work, then convert your code and run it. How much faster is it?
- Add a branch predictor. This will be a simple predictor that simply
predicts the last address of the branch. This is nominally called a
"branch target buffer" (BTB). What you do is take the address of the
control instruction in the 2nd pipeline stage and feed the low bits of the
address to a table (use a memory of 16 entries). From this table start
fetching in the next cycle from the outcome of this table. When you
resolve the actual branch target, write this target into the table location
of the branch address. If you predicted the branch target correctly, then
you can keep fetching and executing. However, if you predicted incorrectly,
you will need to squash the instructions that were incorrectly fetched
(but not the delay slot instruction!).
Grading:
You will be graded on:
- functionality of countbits on your machine (does countbits
perform the correct function?)
- correctness of your machine (do all the instructions on your
machine work correctly?)
- design of your machine (is it understandable, are the design
decisions you made justifiable?)
- your written report
Hints:
- The design given in the book and in lecture is not set in stone. You
can move functionality to one side or another of a pipeline latch as you
see fit. Ideally, you want each stage to take the same amount of time,
but for this assignment you dont need to be too worried about that. Just
don't go overboard--no putting all your functionality into one stage and
then having a bunch of empty stages.
- You may find it necessary to initialize your pipeline with a nop instruction,
to avoid unwanted side effects when starting your execution. You can do
this by setting an initial value for the appropriate ControlRegisterRegisters
in the IF/ID latch.
- There is a bug in SMOK that shouldn't affect your functionality, but
may affect the readability of your design: the lines connecting to
ControlRegisterRegisters inside a
ControlRegister (don't shoot us -- we didn't name these things!)
shift around after saving and reloading the file. This
can make things look pretty ugly, since lines no longer terminate at
their connection points. Hopefully this will be fixed soon, but
you may find it safer to group ControlRegisterRegisters with a component
other than a ControlRegister. (It used to be worse -- this used to happen
with all containers that have input/output.)
- To do the SRL operation efficiently, think 'binary'.