CSEP 548 - Autumn 2017 - Homework 1

Issued: Oct 4, 2017

Due: Oct 18, 11:59pm.

Last Updated: Oct 4, 2017

This homework should be able to be done after doing the readings assigned for the first lecture. This assignment is borrowed from Milo Martin's class at UPenn (source).

Instructions:

This is an individual work assignment. Sharing of answers or code is strictly prohibited. For the quantitative questions, show your work to document how you came up with your answers.

Submission:

Turn in an electronic copy in PDF format to the Catalyst dropbox

Problems:

  1. Performance and CPI. Assume a typical program has the following instruction type breakdown:

    Assume the current-generation processor has the following instruction latencies:

    If for the next-generation design you could pick one type of instruction to make twice as fast (half the latency), which instruction type would you pick? Why?

  2. Averaging. Assume that a program executes one branch every 4 instructions for the first 1M instructions and a branch every 8 instructions for the next 2M instructions. What is the average number instructions per branch for the entire program?

  3. Amdahl's Law. A program has 10% divide instructions. All non-divide instructions take one cycle. All divide instructions take 20 cycles.

    1. What is the CPI of this program on this processor?
    2. What percent of time is spent just doing divides?
    3. What would the speedup be if we sped up divide by 2x?
    4. What would the speedup be if we sped up divide by 5x?
    5. What would the speedup be if we sped up divide by 20x?
    6. What would the speedup be if divide instructions were infinitely fast (zero cycles)?
  4. Performance and ISA. Chip A executes the ARM ISA and has a 2.5Ghz clock frequency. Chip B executes the x86 and has a 3Ghz clock frequency. Assume that on average, programs execute 1.5 times as many ARM instructions than x86 instructions.

    1. For Program P1, Chip A has a CPI of 2 and Chip B has a CPI of 3. Which chip is faster for P1? What is the speedup for Program P1?
    2. For Program P2, Chip A has a CPI of 1 and Chip B has a CPI of 2. Which chip is faster for P2? What is the speedup for Program P2?
    3. Assuming that Programs P1 and P2 are equally important workloads for the target market of this chip, which chip is "faster"? Calculate the average speedup.
  5. ISA Modification and Performance Metrics. You are the architect of a new line of processors, and you have the choice to add new instructions to the ISA if you wish. You consider adding a fused multiply/add instruction to the ISA (which is helpful in many signal processing and image processing applications). The advantage of this new instruction is that it can reduce the overall number of instructions in the program by converting some pairs of instructions into a single instruction. The disadvantage is that its implementation requires extending the clock period by 5% and increases the CPI by 10%.

    1. Calculate under what conditions adding this instruction would lead to an overall performance increase.
    2. Discuss qualitatively the impact this proposed change could have on energy consumption, complexity of processor design, and code generation by compilers.

Textbook problems

The following problems come from Appendix A in the Hennessy/Patterson Computer Architecture, 5th Edition.

  1. (A.8: a. only) For the following we consider instruction encoding for instruction set architectures.
  1. Consider the case of a processor with an instruction length of 12 bits and with 32 general-purpose register so the size of the address fields is 5 bits. Is it possible to have instruction encodings for the following?
  1. <skip>
  2. <skip>
  1. (A.9) For the following assume that values A, B, C, D, E, and F reside in memory. Also assume that instruction operation codes are represented in 8 bits, memory addresses are 64 bits, and register addresses are 6 bits, and data values are 32-bit integers (4 bytes each).

    Stack

    Accumulator

    Register (register-memory)

    Register (load-store)

    Push A

    Push B

    Add

    Pop C

    Load A

    Add B

    Store C

    Load R1,A

    Add R3,R1,B

    Store R3,C

    Load R1,A

    Load R2,B

    Add R3,R1,R2

    Store R3,C

    Figure A.2 The code sequence for ``C = A + B`` for four classes of instruction sets. Note that the Add instruction has implicit operands for stack and accumulator architectures and explicit operands for register architectures. It is assumed that A, B, and C all belong in memory and that the values of A and B cannot be destroyed. Figure A.1 shows the Add operation for each class of architecture.

    1. For each instruction set architecture shown in Figure A.2, how many addresses, or names, appear in each instruction for the code to compute C = A + B, and what is the total code size?
    2. Some of the instruction set architectures in Figure A.2 destroy operands in the course of computation. This loss of data values from processor internal storage has performance consequences. For each architecture in Figure A.2, write the code sequence to compute:
    C = A + B
    D = A - E
    F = C + D
    

    In your code, mark each operand that is destroyed during execution and mark each "overhead" instruction that is included just to overcome this loss of data from processor internal storage. What is the total code size, the number of bytes of instructions and data moved to or from memory, the number of overhead instructions, and the number of overhead data bytes for each of your code sequences?

Compiler activity

Related to <A.8> in Hennessy/Peterson.

In this exercise, you will compile a couple small pieces of code and compare the assembly code generated. We will use an online tool named "gcc-explorer", by Matt Godbolt, to take a look at the generated assembly code and compare the output with several different configurations.

To use this tool, navigate to: http://gcc.godbolt.org. To begin, insert some code into the "Code editor" text area on the left and watch as the assembly is generated on the right. Try experimenting with the compiler options at the top. For these exercises, we recommend having all of the "filters" enabled except "Intel syntax". This should result in a fairly minimal amount of assembly code, with lines colored to show (roughly) how it corresponds to the code on the left.

Two useful references for understanding the generated assembly code:

  1. GCC Optimization levels: Compiler optimizations may result in improvements to code size and/or performance.
  1. Select x86-64 gcc 4.7.1 for the compiler, clear out any options, and copy the following code into the code editor:
int testFunction(int* input, int length) {
  int sum = 0;
  for (int i = 0; i < length; ++i) {
    sum += input[i];
  }
  return sum;
}

Copy/paste the assembly output into your solution.

  1. The default is to compile with -O0. Try adding the -O1 option in the "compiler options" field. Notice how the generated assembly changed. What is the percent change in number of instructions?
  2. Now change the optimization flag to -O3. What happens to the amount of generated code? Notice the new registers being used: %xmm{0,1} and new instructions beginning with p. Explain briefly what these instructions do differently than the less-optimized version.
  3. Vector extensions: The latest generations of Intel and AMD x86 architectures have a new set of instructions called "Advanced Vector Extensions" (AVX) which allow programs to make use of wider vector units. With -O3 still enabled, try compiling with -mavx. What do you notice has changed? And for a bit more fun, now try compiling with -mavx2 to enable the second generation of AVX instructions (to be included in future architectures). Find one new instruction introduced in AVX2 (don't worry about figuring out what it does).
  1. x86 vs. ARM Assembly: Using the same code as the last problem, change the compiler option to ARM gcc 4.6.4, with no compiler options specified. Compare the number of instructions. Given that x86 is (roughly) a CISC ISA and ARM is (roughly) a RISC ISA, would you expect this difference? Give an example of how x86 could use fewer instructions than ARM to do the same operation.
    1. Bonus: find an example of this in the code provided or with your own code (hint: use the colorized output to isolate the effect for specific lines of C code).