Homework 4 - Datapath and Parallelism
Out: Monday February 7
Due: Monday February 14, 11:59PM
Turnin: Online
These are all paper-and-pencil exercises - no programming.
- Basic, Sequential Datapath Operation
Consider the sequential Y86 implementation shown in Figure 4-22 (page 376) of the text
(and elaborated upon in Figure 4-23, page 378).
We're going to develop a similar datapath
by starting with support for only a single instruction, and then introducing instructions
and the components they require one by one.
Each component of the datapath may take a non-zero amount of time to do its
work; this time is called the "latency" of the component. Suppose that
the major components have the following latencies:
Instruction memory        400 picoseconds (ps) (400 * 10^-12 seconds)
PC increment              150 ps
Register file             250 ps
ALU                       220 ps
Condition codes/signals   100 ps
Data memory               500 ps
Assume that other components of the processor have negligible latency.
For instance, it takes no time to choose the next PC based on the instruction
and/or jump target. Also remember that the hardware can perform independent
operations in parallel, i.e., at once.
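(As a purely illustrative calculation, using made-up latencies rather than the ones in the table above: if an instruction had to pass through two components in sequence, one taking 300 ps and the other 200 ps, each instruction would take 300 + 200 = 500 ps, so the processor could execute 1 / (500 * 10^-12 s) = 2 * 10^9, i.e., 2 billion, instructions per second.)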
- Let's start by designing a super-lightweight Y86 processor, capable
of executing only the JMP instruction (unconditional jump). Of the major
components above, which ones do we need?
How many instructions per second can this processor execute?
- Now suppose we also want to execute NOP instructions. How long would
each instruction take to execute?
How many instructions per second can this processor execute?
- Suppose we now add support for ADDL. What components do we need?
How many instructions per second can this processor execute?
- Finally, suppose we now add support for POPL. What components do we need?
How many instructions per second can this processor execute?
- Pipelined Implementation
Now consider a pipelined implementation. (Figure 4-52, page 419 of the text,
may be useful.) Each of the main components in the pipelined implementation has
the same latency as in question 1. Writing the pipeline registers requires 20 ps.
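(Again as a purely illustrative calculation with made-up numbers: the clock period must accommodate the slowest stage plus the pipeline register write, so a hypothetical slowest stage of 300 ps would give a period of 300 + 20 = 320 ps and a steady-state throughput of 1 / (320 * 10^-12 s), roughly 3.1 * 10^9 instructions per second.)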
- What is the maximum number of instructions per second that
the pipelined implementation can execute?
- Pipelining is a form of parallel execution, and so its performance
is affected by dependences among the instructions available for execution at any moment.
Give an example of an instruction sequence
for which the pipelined implementation cannot achieve
its maximum theoretical instruction execution rate (even if you let
someone very clever fill in the details that are missing from
our coarse description of the pipeline).
- Compilers and Pipelining
Although the pipelined implementation gets the right results no
matter what instruction sequence it is asked to execute, some instruction sequences execute more quickly
than others. This can motivate the compiler to optimize the instruction sequence it emits
by ordering instructions for efficient pipelined processing.
Imagine that the compiler does this using a two-step process. First, it emits the "natural" instruction
sequence for the code it has compiled; then, as a second pass, it goes over that instruction sequence and
re-orders instructions for efficient pipelined execution.
- How can the compiler determine which re-orderings are correct, i.e.,
get the same results as the original
sequence? Briefly describe the criterion that ensures correctness.
- Give a correct, more efficient re-ordering for this instruction sequence:
addl r2, r3
subl r3, r4
mrmovl -16(%ebp), r2
mrmovl 8(r2), r3
addl r3, r4
- Register Renaming
The compiler has to encode instructions according to the rules of the ISA. The x86 ISA, which dates back to
a couple of years after the Magna Carta, was designed
when hardware was expensive to build, and so has only 8 registers. The compiler therefore has to generate code
that specifies only 8 registers.
Hardware is cheaper to build now, and implementations can easily afford to build many more than 8 registers.
Register renaming is a way to use additional registers, even though the compiler has generated code that names
only 8 of them. For instance, the hardware might have 32 physical registers.
When it fetches an instruction, it is free
to change the register names used in the instruction (which are necessarily limited to the range 0 to 7) so that
the instruction actually uses registers chosen by the hardware (which can be in the range 0 to 31).
Register renaming allows the hardware to remove false (WAW and WAR) dependences, which in turn allows it to reorder
instructions in ways that otherwise wouldn't have been correct. In the following, assume that the hardware fetches
instructions in order, possibly renames registers and re-orders the instructions, and then inserts them into
the pipeline.
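To illustrate with a made-up two-instruction sequence (unrelated to the code in question 3b):
addl r1, r2
rrmovl r5, r1
The second instruction writes r1, which the first instruction reads, so there is a WAR dependence that forbids re-ordering them. If the hardware renames the second instruction's destination to an otherwise unused physical register, say register 17, the two instructions no longer share any register and can safely be re-ordered.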
- Apply register renaming and instruction reordering
to the code sequence from question 3b
so that it executes perfectly efficiently in a pipelined implementation. Assume that
when these instructions start execution the data they want to work on is in the registers they name. (For example,
the current value of r2 is what that code sequence expects.) Further, assume that none of the "hidden registers"
(8-31) have anything useful in them, but all of the ISA registers (0-7) do.
Note: For reasons that may be clearer after answering the next question, the results of the code sequence
do not need to end up in the registers named by the original code sequence (i.e., they do not need to end
up in registers 2, 3, and 4). They can end up anywhere you like.
- Register renaming requires that the hardware be prepared to re-write
both the source and the destination register numbers in instructions it fetches, as well
as to figure out a beneficial re-ordering of instructions. In this question, we're concerned
only with re-writing source registers; i.e., we're ignoring how the hardware decides whether/how to re-write the
destination register number and how to re-order instructions, and we'll just assume that it can (and does) do both.
Assuming that the hardware fetches instructions "in order" (i.e., in the order dictated by the
simple, sequential execution model of the ISA), does re-naming and re-ordering, and then inserts
the re-named/re-ordered instruction stream into the pipeline,
describe a procedure and data structure the hardware can maintain
to determine how to re-write source register names. You should assume that the unspecified process of
choosing destination register names knows about and updates the data structure you describe. You should also
assume that the pipeline executing the instructions with renaming is able to resolve any hazards, exactly as the
simple five-stage pipeline (without renaming) in earlier questions did.
Note: I think it will be easier to figure this out yourself than to try to get Google to answer it for you,
because the online descriptions consider more complicated situations than the one in this problem and also address
renaming of destination registers. Because of that, if you want to try to answer this problem by "researching" known
techniques, I'm all for it -- you'll learn a ton. (To repeat, though, I think that's more work than just thinking
about it and answering it, and of course my opinion is that the thinking part of that is worth doing.)
Search keywords it might take you a while to find: Tomasulo's Algorithm.
- Bonus Question [Optional]
In question 3 the compiler did instruction re-ordering. In question 4, the hardware did both register renaming
and instruction re-ordering. If register re-naming isn't implemented in the hardware, is there any point to
implementing re-ordering there? (That is, can everything that the hardware achieves by re-ordering be done by
the compiler instead, when there is no register re-naming?)