Homework 5 Solutions

(updated 11/19/03)

(62 points)

2.1 - 2 points

Program 1: M2 2x faster
Program 2: M1 1.33 times faster

2.2 - 2 points

Instructions per second for M1 = 200 000 000 instructions / 10 seconds = 20 000 000 instrs/sec

Instructions per second for M1 = 160 000 000 instructions / 5 seconds = 32 000 000 instrs/sec

2.3 - 2 points

M1 CPI = (200 000 000 cycles/s) / (20 000 000 instrs/s) = 10 cycles/instruction

M2 CPI = (300 000 000 cycles/s) / (32 000 000 instrs/s) = 9.375 cycles/instruction

2.4 - 4 points

M1 executed a total number of (200 000 000 cycles/sec * 3 sec) = 600 million cycles.  With 10 cycles/instr, it executed 60 million instructions

M2 executed a total number of (300 000 000 cycles/sec * 4 sec) = 1.2 billion cycles.  With 9.375 cycles/instr, it executed 1 200 000 000 / 9.375 = 128 000 000 instructions

2.5 - 2 points

Buy M2.  M2 costs only 1.5 more than M1, but is twice as fast.  This is a good deal.

2.6 - 4 points

 The cost divided by execution time metric is unusable.  Example: consider two machines A and B.  Machine A costs $100 and executes program 1 in 1 second; machine B costs $500 and executes program 1 in 5 seconds.  Clearly, machine A is better (5x faster plus 5x cheaper).  Yet by this metric, A and B have the same performance, since 100/1 = 500/5.

On the other hand, the cost times execution time could be a reasonably good metric, with less meaning a better machine.  In previous example, we get (correctly) that machine A is 25x better than machine B.  Intuitively, both cost and execution time must be minimized, so they should be multiplied and not divided.  This metric, however, must be used carefully: for example we'll have a $100 machine A that executes program 1 in 5 seconds be equivalent to a $500 machine B that executes program 1 in 1 second, but the buying decision depends on what kind of behavior we want/need.  Suppose we have $500 to spend.  We can buy five machine A's that would do five program 1's in 5 seconds in parallel, or we could buy one machine B that could execute one program 1 every second (and thus also complete 5 programs in 5 seconds).  There are reasons one might prefer either one, and our metric doesn't really help with that decision.  (For example, if program 1 must execute sequentially (each execution depends on results of previous one), then we'd probably pick machine B).  Also note that the metric doesn't take throughput into consideration(e.g. maybe a machine can execute a program in 5 seconds, but it has 16 CPUs and can execute 16 programs in the same 5 seconds), so if we cared about it, we'd have to include it into the metric.

In general, you had to specify the first metric was unusuable and give an example why.  Grading on the second metric was lenient, and both "acceptable" and "unusuable" answers were accepted as long as you gave reasonable explanations.

2.7 - 2 points

Acceptable metrics here: performance / cost = 1/(execution_time * cost), or throughput / cost, or (instructions per second)/cost.

Not acceptable: cost/execution time, see 2.6 (same example applies here)

2.15 - 4 points

For MFP:  
CPI = 6*.1 + 4*.15 + 20*.05 + 2*.7 = 3.6
MIPS = (IC/exec_time * 106) = (IC * (clock rate) / (clock cycles) * 106) = (clock rate) / (CPI * 106) = 1000*106 / (3.4 * 106) = 277.78

For MFNP (trick question):
CPI = 2 since there are only integer instructions on MFNP and they all take 2 cycles!
MIPS = 1000*106 / (2 * 106) = 500

Many of you tried the following approach:
CPI = (30*2)*.1 + (20*2)*.15 + (50*2)*.05 + 2*0.7 = 18.4
MIPS = (clock rate) / (CPI * 1000000) = 1000*106 / (18.4 * 106) = 54.35
It would make sense to calculate MIPS in this way to compare to MIPS rating of the MFP machine, but fundamentally, this is not the native MIPS rating for MFNP!  In MFNP we only have integer instructions executing, which constitute 100% of all possible instructions (because f.p. instructions are converted to integer ones).  Each one takes 2 cycles.  Thus, CPI for MFNP is always 2!  So you didn't have to use the table with the number of instructions for emulation of f.p., which was to be used in 2.16 instead...

2.16 - 4 points

Of the 300,000,000 instructions from MFP, 70% of them (210,000,000) remain for MFNP.  Others add up as follows:
Mult: 300 * 106 * .10 * 30 = 900 million
Add: 300 * 106 * .15 * 20 = 900 million
Div: 300 * 106 * .05 * 50 = 750 million

Total: int instrs + Mult instrs + Add instrs + Div instrs = 2,760,000,000 instructions (2.76 billion)

2.17- 4 points

Execution time on MFP: IC*CPI*(Clock cycle time) = (300*106 instructions) * 3.6 * (1 cycle / 109 sec) = 1.08 sec

Execution time on MFNP: IC*CPI*(Clock cycle time) = (2760 * 106 instructions) * 2 * (1 cycle / 109 sec) = 5.52 sec

Alternatively, you could also use the MIPS ratings you obtained from 2.15 to get to this, as exec_time = (# of million instrs)/MIPS.  Then, you had to be careful about what you used for your instruction counts.  For MFP, it's 300 M instructions.  For MFNP, if you came up with a MIPS of 500, then you had to use the new instruction count from 2.16, because you first have to convert program P to all integer ops before executing it.  If you got an "incorrect" MIPS of 54.35, then you could use it with 300M instructions and get the correct answer here (intuitively, this works because you've already "converted" f.p. to int in your MIPS rating).

5.1 - 8 points

RegDst selects a destination register from rt (when it is 0) and rd (when it is 1).  Having it at 0 means only those instructions that write to rt (or don't write to register file at all) still function, while all the instructions writing to rd fail (i.e. R-type instructions).

ALUSrc selects between the register data coming from register file (this data was read for rt register field) and the immediate field.  ALUSrc = 0 selects the register output, and ALUSrc = 1 selects the immediate.  So, ALUSrc=0 means all I-format instructions except branches fail, since they use the immediate.  Branches are excluded because they don't use the main ALU - they add the immediate to PC+4 using the other ALU (above main ALU on fig 5.19).

MemToReg controls whether we take the memory output or the ALU output to write back to register file.  MemToReg=0 always selects the ALU output, so all the loads (only instructions that use memory output) now fail (this includes lw,lb,lh,lhu,lbu).

Zero = 0.  Zero is the zero flag from the ALU used for evaluating branch conditions.  Having it at 0 forces PCSrc to always be 0, and thus forces the PC to always be updated with PC+4 (and not the branch address for which PCSrc=1).  Thus, all branches that are taken break.  All branches that are not taken are fine, since they still select PC+4 for the next PC.  Since the datapath in fig 5.19 didn't have jumps, we cannot give a definitive answer about what happens to them.

For grading, you should have explained why the break occurs (i.e. for RegDst, you said an instruction that use rd as destination breaks, e.g. add).  If you didn't, you should have either clearly specified which instruction types break (easier to do), or list all instruction types that still work.  I allowed you to use the list of instructions supported by the single-cycle implementation only (which are lw, sw, beq, add, sub, slt, and, or). 

5.5 - 4 points

There are no additions to be made to the datapath.  All the logic is there and all we need to modify are the control signals.

For addi, set the control signals as
RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp
0 1 0 1 0 0 0 00 (add)

Note: first version of solutions accidentally had a wrong value for RegDst (1 and not 0).

Note: the value 10 for the ALUop is wrong here, because that tells the ALU to figure out the operation from the function field of the instruction, which doesn't exist for addi, because it's in immediate instruction format!

5.8- 4 points

Let's define this instruction as lwr rd, rs, rt  which will have meaning rd = mem[rs+rt].

Again, datapath doesn't need to be changed.  This is an R-type instruction and we have everything necessary to implement it.

The control signals are

RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp
1 0 1 1 1 0 0 00 (add)

Note: just as above, many of you wanted to put the value 10 for the ALUop.  In this case, the instruction does have room for the function code, but you needed to specify what to set this function field to if you wanted to set ALUop=10 (e.g. to value 100000 which is addition).  Otherwise, you are saying ALU should figure out the operation based on function field of instruction, and you never define what this field is for the new load instruction (this was -1, and you weren't penalized again if you made the same mistake in 5.5). 

5.10- 5 points

For single-cycle implementation, this is a valid modification that always works.  MemtoReg is set to 1 for all loads and to 0 for all other instructions.  MemRead is likewise set to 1 for all loads and 0 for all other instructions.  So we can use MemRead whenever MemtoReg should be 1.  Also, all instructions except loads will have MemRead and MemtoReg set to 0, so we can use MemRead whenever MemtoReg should be 0.  Intuitively, we only read memory if we're using the result (by storing it in reg. file), and if we don't read memory, we don't read the memory output.  Thus, MemRead and MemtoReg controls are equivalent.  This was the main answer we were looking for.

For multi-cycle implementation, this is again a valid modification, but only if we set MemRead to correct values (corresponding to MemtoReg) in correct states.  We only are worried about the case where we actually write data to register file, i.e. when RegWrite is 1 (it doesn't matter what gets selected as destination if it doesn't actually get written).  Looking at the state diagram in fig. 5.42, RegWrite is asserted only in states 4 and 7.  In state 4, MemtoReg is 1, and we can add MemRead=1 to this state without any harm (since we've already had MemRead=1 in previous state 3).  In state 7, MemtoReg is 0, and we can again define MemRead=0 here without doing any harm to completion of execution. 

In pipelined implementation, this modification breaks, since the controls now belong to different stages with different instructions in them.  We could be reading memory in stage 4, but writing a result of R-type instruction in stage 5, and the controls would have opposite values.

The question was a little ambiguous about which of the datapaths you were supposed to consider.  The important thing was to explain results for single-cycle, and explaining others was optional.

5.14 - 6 points

a) Longest execution time is still for a load instruction: instr_mem + register_read + ALU_op + data_mem + reg_write = 2 + 1 + 2 + 2 + 1 = 8 ns, so clock cycle time is 8 ns.

Note that branches would now take max(PC calculation, getting comparison result) = max(3 + 3, 2+1+2) = 6 ns

b) Longest execution time is now for a branch.PC_increment_ALU + Branch_address_ALU = 5+5 = 10 ns (the comparison part of a branch still takes 5 ns to go through IF, register fetch, and regular ALU).  So new clock cycle time = 10 ns.

c) Branch is still the longest instruction: we perform the PC+4 in 1 ns, but that result cannot be used in the branch address calculation yet, because the branch hasn't been fetched yet (and we need the instruction to extract the immediate),  The instruction fetch takes 2 ns.  Then, we need to form the branch address, and that takes 8 ns, for a total delay of 10 ns.  As before, the comparison in the main ALU is done in 5 ns and ready before the branch address, so we don't need to account for it.  Thus the clock cycle time is 10 ns.

For some reason, many of you gave variable clock-length answers here, using some instruction mix to get the average clock cycle time.  This is not what was meant here.  Cycle time by itself means find the clock cycle time with which the processor would have maximum performance.  That means, take the lowest clock cycle time with which all instructions have time to finish.  To calculate this, you needed to see how long each instruction type takes to propagate through, and take the worst case time to be your cycle time.  Calculating the average cycle time for a variable-length clock is a different problem.  However, given that problem didn't exactly specify to use a fixed-length clock type and not a variable-length, and since your book given an example involving the latter following the timing data, the problem was graded leniently.

5.15 - 5 points 

addi is similar to add, except the destination register is rt and not rd, and the ALU has to take sign-extended immediate as its second operand.

Introduce two new states to state diagram on p.416 (fig 5.50): state A and state B.

State 1 transfers to state A if op = "addi".  State A goes to State B unconditionally.  State B goes to state 0 unconditionally.  Thus, addi will take 4 cycles (states 0,1,A,B).

In state A, set the controls ALUsrca=1 (use rs as ALU source), ALUsrcB=10 (use sign-extended immediate as ALU source 2), and ALUOp = 00 (select ALU function to be "add").

In state B, set the controls RegDst=0 (use rt as destination register), RegWrite=1 (enable writing ALU result to register file), MemtoReg=0 (select ALU output to be written and not the memory output).

A diagram of this is coming soon...

Another valid solution here was to use state 2 (third state for lw/sw instructions) instead of new state A, and add transition on "addi" from state 1 to state 2.  Then from state 2, you could add state B as above and transition to it if opcode is "addi".