Out: Thursday, 10 February
Due: Wednesday, 16 February, 11:59 pm
Latest Possible Submission Time: Saturday, 19 February, 11:59pm
Turnin: Gradescope
All parts of this question are about code the compiler produces for this bit of C (whose meaning should be clear to you because it's the same as in Java). Both the scalar (integer) variable i and the integer array A are local variables.
for (i=1; i<100; i++) { A[i] += A[i-1]; }
For each snippet of assembler shown, corresponding to different optimization levels, indicate how many cycles are needed to issue all the instructions on a 5 stage pipeline with forwarding. The differences among the number of cycles reflect both the limitations of pipelining and the distinctions in the code that might be produced by the compiler.
.L3: lw a5,-20(s0) slli a5,a5,2 addi a4,s0,-16 add a5,a4,a5 lw a4,-404(a5) lw a5,-20(s0) addi a5,a5,-1 slli a5,a5,2 addi a3,s0,-16 add a5,a3,a5 lw a5,-404(a5) add a4,a4,a5 lw a5,-20(s0) slli a5,a5,2 addi a3,s0,-16 add a5,a3,a5 sw a4,-404(a5) lw a5,-20(s0) addi a5,a5,1 sw a5,-20(s0) .L2: lw a4,-20(s0) li a5,99 ble a4,a5,.L3
How many cycles are required to issue all the instructions in this version of the code?
In this version the compiler doesn't bother to update local variable i as it goes through the loop. Instead, it recognizes that it can have a register pointing at the next array element and just increment that pointer by 4 each iteration. (a5 is that pointer. a2 is a pointer to the end of array. Both are initialized before entering the loop.)
.L7: lw a4,0(a5) lw a3,-4(a5) add a4,a4,a3 sw a4,0(a5) addi a5,a5,4 bne a5,a2,.L7
How many cycles are required to issue all the instructions in this version of the code?
We ask the compiler to try harder to optimize.
.L4: lw a4,0(a5) lw a3,-4(a5) addi a5,a5,4 add a4,a4,a3 sw a4,-4(a5) bne a5,a2,.L4
(A) How many cycles are required to issue all the instructions in this version of the code?
(B) What optimization did the compiler make here, relative to the -O1 case?
.L4: lw a3,0(a5) addi a5,a5,4 add a4,a4,a3 sw a4,-4(a5) bne a5,a2,.L4
(A) How many cycles are required to issue all the instructions in this version of the code?
(B) What optimization did the compiler make here, relative to the -O2 case?
(A) Draw a dependence diagram of the instruction stream from Question 2 (-O1). Label each dependence with its type: RAW (read-after-write), WAR (write-after-read), or WAW (write-after-write).
(B) Assuming that the CPU has enough hardware resources (e.g., ALU's) that instructions are never delayed due to hardware contention but only due to dependences, and assuming that each instruction can be completed in one cycle once it is eligible to run, how lmany cycles would such a processor take to execute the instruction sequence?
Turnin a pdf file with your answers on Canvas.