Homework 2: Solution

Textbook problems

The following problems come from Appendix A in the Hennessy/Patterson Computer Architecture, 5th Edition.

(A.8: a. only) For the following we consider instruction encoding for instruction set architectures.
1. Consider the case of a processor with an instruction length of 12 bits and with 32 general-purpose register so the size of the address fields is 5 bits. Is it possible to have instruction encodings for the following?
  - 3 two-address instructions
  - 30 one-address instructions
  - 45 zero-address instructions
    Solution:
    
    Apologies for the ambiguity of the question. If you answered it either of the following ways correctly, you got the points.
    
    (if you interpreted it as all 3 bullets at once)
    
    All of the instructions can fit together with their needed operands if variable-length opcodes are used. We can discover this without coming up with the encoding by enumeration:
```
3 2-addr. inst: 3 * 2^5 * 2^5 possible combinations = 3072
30 1-addr inst: 30 * 2^5 = 960
45 1-addr inst: 45
total = 3072 + 960 + 45 = 4077
2^12 == 4096, and 4077 < 4096, so it should be possible.
```
    If we want to break down how to actually encode these instructions:
```
3 inst = 2 bits with 1 extra encoding
  + 5 bits * 2 addresses = 12 bits
  So:
  00 + 2 * 5 bit addr.
  01      "
  10      "

Next set will have to use the 4th (11) value to differentiate from the first 3.

30 inst. fits in 2^5 (32), so we have 2 remaining

2 bits + 5-bit opcode + 5-bit addr = 12 bits

11 + 00000 + 5 addr bits
...
11 + 11101 + 5 addr bits

Use the two remaining encodings (in this example: 11+11110 and 11+11111) plus the remaining bits to represent the zero-address instructions.

45 inst < 2^6 (64), so 6 unique bits for opcode

11 + 1111 + 6 bits
```
    So yes, all of these instructions can be encoded with 13 bits.
    
    (if you interpreted it as 3 separate sets)
    - 3 two-address instructions:
      
      3 inst = 2 bit opcode (e.g. 00,01,10) 2 addresses = 5 bits * 2 = 10 bits total = 2 + 5*2 = 12
      
      (yes)
    - 30 one-address instructions:
      
      opcode: 2^5 = 32, so need 5 bits (log30/log2 == 4.9 and round up) total = 5 + 1*(5 bits) = 10
      
      (yes)
    - 45 zero-address instructions:
      
      opcode: 2^6 = 64 so 6 bits total = 6 + 0*(5 bits)
      
      (yes)
2. <skip>
3. <skip>

(A.9) For the following assume that values A, B, C, D, E, and F reside in memory. Also assume that instruction operation codes are represented in 8 bits, memory addresses are 64 bits, and register addresses are 6 bits, and data values are 32-bit integers (4 bytes each).

Stack	Accumulator	Register (register-memory)	Register (load-store)
`Push A` `Push B` `Add` `Pop C`	`Load A` `Add B` `Store C`	`Load R1,A` `Add R3,R1,B` `Store R3,C`	`Load R1,A` `Load R2,B` `Add R3,R1,R2` `Store R3,C`

Figure A.2 The code sequence for ``C = A + B`` for four classes of instruction sets. Note that the Add instruction has implicit operands for stack and accumulator architectures and explicit operands for register architectures. It is assumed that A, B, and C all belong in memory and that the values of A and B cannot be destroyed. Figure A.1 shows the Add operation for each class of architecture.

For each instruction set architecture shown in Figure A.2, how many addresses, or names, appear in each instruction for the code to compute C = A + B, and what is the total code size?

Arch. type

Stack

Accumulator

Register (register-memory)

Register (load-store)

# Addresses

3

3

7 (3 mem. + 4 reg.)

9 (3 mem. + 6 reg.)

Code size

224 bits

216 bits

240 bits

260 bits

Arch. type	Stack	Accumulator	Register (register-memory)	Register (load-store)
# Addresses	3	3	7 (3 mem. + 4 reg.)	9 (3 mem. + 6 reg.)
Code size	224 bits	216 bits	240 bits	260 bits

Some of the instruction set architectures in Figure A.2 destroy operands in the course of computation. This loss of data values from processor internal storage has performance consequences. For each architecture in Figure A.2, write the code sequence to compute:

C = A + B
D = A - E
F = C + D

In your code, mark each operand that is destroyed during execution and mark each "overhead" instruction that is included just to overcome this loss of data from processor internal storage. What is the total code size, the number of bytes of instructions and data moved to or from memory, the number of overhead instructions, and the number of overhead data bytes for each of your code sequences?

Destroyed operands.

Overhead instruction (or load).

Data moved assumes 4-byte (32-bit) words.

Arch. type:

Stack

bits

Accumulator

bits

Register (reg-mem)

bits

Register (load-store)

bits

Code:

Push A

Push B

Add (A,B)

Pop C (C)

Push A

Push E

Sub (A,E)

Pop D (D)

Push C

Push D

Add (C,D)

Pop F (F)

8+64 8+64 8 8+64 8+64 8+64 8 8+64 8+64 8+64 8 8+64

Load A

Add B (A)

Store C

Load A (C)

Sub E (A)

Store D

Add C

Store F

8+64 8+64 8+64 8+64 8+64 8+64 8+64 8+64

Load R1, A

Add R2, R1, B

Store R2, C

Sub R3, R1, E

Store R3, D

Add R4, R3, C

Store R4, F

8+6+64 8+6+6+64 8+6+64 8+6+6+64 8+6+64 8+6+6+64 8+6+64

Load R1, A

Load R2, B

Add R3, R1, R2

Store R3, C

Load R4, E

Sub R5, R1, R4

Store R5, D

Add R6, R3, R5

Store R6, F

8+6+64 8+6+64 8+6+6+6 8+6+64 8+6+64 8+6+6+6 8+6+64 8+6+6+6 8+6+64

Code size:

672 bits

576 bits

564 bits

546 bits

Data moved:

9*4B = 36 B (288 b)

8*4B = 32 B (256 b)

7*4B = 28 B (224 b)

6*4 = 24 B (192 b)

Overhead inst.:

3

1 [2 if you count 'Add C']

0 [1 if you count 'Add C']

0

Overhead data bytes:

3*4 = 12 bytes

2*4 = 8 bytes

1*4 = 4 bytes

0 bytes

For Reg-Mem, also accepted answers that assumed a 3-register version of Add existed.

Arch. type:	Stack	bits	Accumulator	bits	Register (reg-mem)	bits	Register (load-store)	bits
Code:	Push A Push B Add (A,B) Pop C (C) Push A Push E Sub (A,E) Pop D (D) Push C Push D Add (C,D) Pop F (F)	8+64 8+64 8 8+64 8+64 8+64 8 8+64 8+64 8+64 8 8+64	Load A Add B (A) Store C Load A (C) Sub E (A) Store D Add C Store F	8+64 8+64 8+64 8+64 8+64 8+64 8+64 8+64	Load R1, A Add R2, R1, B Store R2, C Sub R3, R1, E Store R3, D Add R4, R3, C Store R4, F	8+6+64 8+6+6+64 8+6+64 8+6+6+64 8+6+64 8+6+6+64 8+6+64	Load R1, A Load R2, B Add R3, R1, R2 Store R3, C Load R4, E Sub R5, R1, R4 Store R5, D Add R6, R3, R5 Store R6, F	8+6+64 8+6+64 8+6+6+6 8+6+64 8+6+64 8+6+6+6 8+6+64 8+6+6+6 8+6+64
Code size:	672 bits	576 bits	564 bits	546 bits
Data moved:	9*4B = 36 B (288 b)	8*4B = 32 B (256 b)	7*4B = 28 B (224 b)	6*4 = 24 B (192 b)
Overhead inst.:	3	1 [2 if you count 'Add C']	0 [1 if you count 'Add C']	0
Overhead data bytes:	3*4 = 12 bytes	2*4 = 8 bytes	1*4 = 4 bytes	0 bytes

Compiler activity

Related to <A.8> in Hennessy/Peterson.

In this exercise, you will compile a couple small pieces of code and compare the assembly code generated. We will use an online tool named "gcc-explorer", by Matt Godbolt, to take a look at the generated assembly code and compare the output with several different configurations.

To use this tool, navigate to: http://gcc.godbolt.org. To begin, insert some code into the "Code editor" text area on the left and watch as the assembly is generated on the right. Try experimenting with the compiler options at the top. For these exercises, we recommend having all of the "filters" enabled except "Intel syntax". This should result in a fairly minimal amount of assembly code, with lines colored to show (roughly) how it corresponds to the code on the left.

Two useful references for understanding the generated assembly code:

x86 Instruction Set Documentation (in particular, if you've never seen x86 assembly before, look at pages 17-19 for a quick rundown on the syntax of instructions and operands)
ARM Instruction Set Quick Reference

GCC Optimization levels: Compiler optimizations may result in improvements to code size and/or performance.
1. Select g++-4.7 (Ubuntu/Linaro...) for the compiler, clear out any options, and copy the following code into the code editor:
```
int testFunction(int* input, int length) {
  int sum = 0;
  for (int i = 0; i < length; ++i) {
    sum += input[i];
  }
  return sum;
}
```
  Copy/paste the assembly output into your solution.
  Solution:
```
testFunction(int*, int):
  pushq       %rbp
  movq        %rsp, %rbp
  movq        %rdi, -24(%rbp)
  movl        %esi, -28(%rbp)
  movl        $0, -8(%rbp)
  movl        $0, -4(%rbp)
  jmp .L2
.L3:
  movl        -4(%rbp), %eax
  cltq
  leaq        0(,%rax,4), %rdx
  movq        -24(%rbp), %rax
  addq        %rdx, %rax
  movl        (%rax), %eax
  addl        %eax, -8(%rbp)
  addl        $1, -4(%rbp)
.L2:
  movl        -4(%rbp), %eax
  cmpl        -28(%rbp), %eax
  setl        %al
  testb       %al, %al
  jne .L3
  movl        -8(%rbp), %eax
  popq        %rbp
  ret
```
2. The default is to compile with -O0. Try adding the -O1 option in the "compiler options" field. Notice how the generated assembly changed. What is the percent change in number of instructions?
  - -O0: 26 lines (23 inst)
  - -O1: 15 lines (12 inst)
  ~57% reduction in code size (labels count)
3. Now change the optimization flag to -O3. What happens to the amount of generated code? Notice the new registers being used: %xmm{0,1} and new instructions beginning with p. Explain briefly what these instructions do differently than the less-optimized version.
  
  Code size increases greatly (now 71 lines, 5.9x larger than -O1). Operations starting with p operate on "packed" data, which allows for vector-like functionality with integer values, which may be more efficient, but requires more instructions to pack and unpack, as well as guards to check if there are enough iterations to do a full packed instruction (at the end, especially, if length is not evenly divisible by the packed size, it must do them independently.
4. Vector extensions: The latest generations of Intel and AMD x86 architectures have a new set of instructions called "Advanced Vector Extensions" (AVX) which allow programs to make use of wider vector units. With -O3 still enabled, try compiling with -mavx. What do you notice has changed? And for a bit more fun, now try compiling with -mavx2 to enable the second generation of AVX instructions (to be included in future architectures). Find one new instruction introduced in AVX2 (don't worry about figuring out what it does).
  
  The flag -mavx enables generation of AVX (Advanced Vector Extensions) instructions. These enable using wider vector units than SSE and supports a more rich set of operations on vector units. AVX instructions are prefixed with v so with this option enabled, many instructions generated for the loop have v's. AVX2, enabled by -mavx2 enables second-generation AVX instructions. An example of one of the new AVX2 instructions is vextracti128 which is able to pull smaller integer values out of the new 256-bit-wide vector registers in yet-to-be made "Haswell" chips.
x86 vs. ARM Assembly: Using the same code as the last problem, change the compiler option to arm-linux-gnueabi-g++-4.6, with no compiler options specified. Compare the number of instructions. Given that x86 is (roughly) a CISC ISA and ARM is (roughly) a RISC ISA, would you expect this difference? Give an example of how x86 could use fewer instructions than ARM to do the same operation.

Yes, I know this question was kind of a setup, but you should have noticed that gcc generated 39 lines of code for ARM, which took 26 lines for x86. This is to be expected because ARM does not allow "indirect" operands.
1. Bonus: find an example of this in the code provided or with your own code (hint: use the colorized output to isolate the effect for specific lines of C code).
```
C:
  int sum = 0;

x86:
  movl $0 -8(%rbp)

ARM:
  mov r3, #0
  str r3, [r7, #8]
```
  You can see that {%rbp,r7} holds the base pointer, which, with an 8-byte offset addresses the value where 'sum' is stored. Just storing the value '0' to it takes 2 instructions in ARM.