CSE 351: The Hardware/Software Interface, Winter 2011
  CSE Home   About Us   Search   Contact Info 
 
Course Home
  Home
Administation
  Overview
  Course email
  Anonymous feedback
  View feedback
 
Assignment Utilities
  Home Virtual Machines
  Homework Turnin
  Class GoPost Forum
  Gradebook
 
Most Everything
  Schedule
    Homework 4 - Datapath and Parallelism
Out: Monday February 7
Due: Monday February 14, 11:59PM
Turnin: Online

These are all paper-and-pencil exercises - no programming.
  1. Basic, Sequential Datapath Operation
    Consider the sequential Y86 implementation shown in Figure 4-22 (page 376) of the text (and elaborated upon in Figure 4-23, page 378). We're going to develop a similar datapath by starting with support for only a single instruction, and then introducing instructions and the components they require one by one.

    Each component of the datapath may take a non-zero amount of time to do its work; this time is called the "latency" of the component. Suppose that the major components have the following latencies:

    Instruction memory400 picoseconds (ps) (400 * 10-12 seconds)
    PC increment150 ps
    Register file250 ps
    ALU220 ps
    Condition codes/signals100 ps
    Data memory500 ps
    Assume that other components of the processor have negligible latency. For instance, it takes no time to choose the next PC based on the instruction and/or jump target. Also remember that the hardware can perform independent operations in parallel, i.e., at once.

    1. Let's start by designing a super-lightweight Y86 processor, capable of executing only the JMP instruction (unconditional jump). Of the major components above, which ones do we need? How many instructions per second can this processor execute?

    2. Now suppose we want also to execute NOP instructions. How long would each instruction take to execute? How many instructions per second can this processor execute?

    3. Suppose we now add support for ADDL. What components do we need? How many instructions per second can this processor execute?

    4. Finally, suppose we now add support for POPL. What components do we need? How many instructions per second can this processor execute?

  2. Pipelined Implementation
    Now consider a pipelined implementation. (Figure 4-52, page 419 of the text, may be useful.) Each of the main components in the pipelined implementation have latencies the same as in question 1. Writing the pipeline registers requires 20 ps.

    1. What is the maximum number of instructions per second that the pipelined implementation can execute?

    2. Pipelining is a form of parallel execution, and so its performance is affected by dependences among the instructions available for execution at any moment. Give an example of an instruction sequence for which the pipeline implementation cannot achieve its maximum theoretical instruction execution rate (even if you let someone very clever fill in the details that are missing from our coarse description of the pipeline).

  3. Compilers and Pipelining
    Although the pipelined implementation gets the right results no matter what instruction sequence it is asked to execute, some instruction sequences execute more quickly than others. This can motivate the compiler to optimize the instruction sequence it emits by ordering instructions for efficient pipelined processing.

    Imagine that the compiler does this using a two-step process. First, it emits "the natural" instruction sequence for the code it has compiled, then as a "second-pass" goes over that instruction sequence and re-orders instructions for efficient pipelined execution.

    1. How can the compiler determine which re-orderings are correct, i.e., get the same results as the original sequence? Briefly describe the criterion that ensures correctness.

    2. Give a correct, more efficient re-ordering for this instruction sequence:
        addl    r2, r3
        subl    r3, r4
        mrmovl  -16(%ebp), r2
        mrmovl  8(r2), r3
        addl    r3, r4
      

  4. Register Renaming
    The compiler has to encode instructions according to the rules of the ISA. The x86 ISA, which dates back to a couple of years after the Magna Carta, was designed when hardware was expensive to build, and so has only 8 registers. The compiler therefore has to generate code that specifies only 8 registers.

    Hardware is cheaper to build now, and implementations can easily afford building many more than 8 registers. Register renaming is a way to use additional registers, even though the compiler has generated code that names only 8 of them. For instance, the hardware might have 32 physical registers. When it fetches an instruction, it is free to change the register names used in the instruction (which are necessarily limited to the range 0 to 7) so that the instruction actually uses registers chosen by the hardware (which can be in the range 0 to 31).

    Register renaming allows the hardware to remove false (WAW and WAR) dependences, which in turn allows it to reorder instructions in ways that otherwise wouldn't have been correct. In the following, assume that the hardware fetches instructions in order, possibly renames registers and re-orders the instructions, and then inserts them into the pipeline.

    1. Apply register renaming and instruction reordering to the code sequence from question 3b so that it executes perfectly efficiently in a pipelined implementation. Assume that when these instructions start execution the data they want to work on is in the registers they name. (For example, the current value of r2 is what that code sequence expects.) Further, assume that none of the "hidden registers" (8-31) have anything useful in them, but all of the ISA registers (0-7) do.

      Note: For reasons that may be clearer after answering the next question, the results of the code sequence do not need to end up in the registers named by the original code sequence (i.e., they do not need to end up in registers 2, 3, and 4). They can end up anywhere you like.

    2. Register renaming requires that the hardware be prepared to re-write both the source and the destination register numbers in instructions it fetches, as well as to figure out a beneficial re-ordering of instructions. In this question, we're concerned only with re-writing source registers, i.e., we're ignoring how the hardware decides whether/how to re-write the destination register number and how to re-order instructions, we'll just assume that it can (and does) do it.

      Assuming that the hardware fetches instructions "in order" (i.e., in the order dictated by the simple, sequential execution model of the ISA), does re-naming and re-ordering, and then inserts the re-named/re-ordered instruction stream into the pipeline, describe a procedure and data structure the hardware can maintain to determine how to re-write source register names. You should assume that the unspecified process of choosing destination register names knows about and updates the data structure you describe. You should also assume that pipeline executing the instructions with renaming is able to resolve any hazards, exactly as the simple five-stage pipeline (without renaming) in earlier questions did.

      Note: I think it will be easier to figure this out yourself than to try to get Google to answer it for you, because the online descriptions consider more complicated situations than the one in this problem and address also renaming of destination registers. Because of that, if you want to try to answer this problem by "researching" known techniques, I'm all for it -- you'll learn a ton. (To repeat, though, I think that's more work than just thinking about it and answering it, and of course my opinion is that the thinking part of that is worth doing.) Search keywords it might take you a while to find: Tomasulo's Algorithm.

  5. Bonus Question [Optional]
    In question 3 the compiler did instruction re-ordering. In question 4, the hardware did both register renaming and instruction re-ordering. If register re-naming isn't implemented in the hardware, is there any point to implementing re-ordering there? (That is, can everything that the hardware achieve by re-ordering be done by the compiler instead, when there is no register re-naming?)

Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to zahorjan at cs.washington.edu]