CSE 378 Final

Ben Dugan, Winter 2001


3/14/01                                   Name: _________________________


Instructions:

This test has 10 questions, totaling 90 points.

The test is closed book, closed notes, closed calculator, open mind, etc. Remember, don't spend all your time on one question. You should budget roughly one minute per point. Please show your work for partial credit.

Question    Possible    Score
Page 2          6
Page 3         10
Page 4         10
Page 5         10
Page 6         10
Page 7          8
Page 8         17
Page 9         19
Total          90

Question 1. Data Dependences and Data Hazards (6 points)

Consider the following fragment of a program:
add    $1, $2, $3       # 1
lw     $6, 8($1)        # 2
or     $5, $6, $6       # 3
sub    $7, $2, $4       # 4
sw     $7, 4($1)        # 5
add    $7, $5, $6       # 6
1. Given a 5-stage pipeline that resolves data hazards through stalling, how many cycles would the above sequence take to execute? Assume that the results of an ALU operation are available 3 cycles later, and the results of loads are available 2 cycles later. You should ignore the "startup cost" of the pipeline; in other words, don't count the cycles before the pipeline is at full capacity.







2. Given a 5-stage pipeline that resolves data hazards through forwarding, how many cycles would the above sequence take to execute? Again, ignore the "startup cost" of the pipeline.







3. Can you reorder the above sequence to further improve the forwarding pipeline? If yes, show the new sequence (you can just show the numbers of the instructions). If no, explain your answer.



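The kind of stall counting asked for above can be illustrated with a toy model of an in-order pipeline. This is only a sketch, not the exam's pipeline: the result latencies, register names, and example sequence below are all hypothetical.

```python
# Toy model of stall counting in an in-order issue pipeline.
# All latencies and the example sequence are hypothetical, not the exam's.

def count_cycles(instrs):
    """Each instruction is (dest, sources, result_latency), where
    result_latency is the number of cycles after issue until dest
    is readable by a later instruction. Returns total issue cycles
    including stalls (startup cost ignored)."""
    ready = {}      # register name -> cycle at which its value is available
    cycle = 0
    for dest, sources, latency in instrs:
        # stall until every source register is ready
        start = max([cycle] + [ready.get(s, 0) for s in sources])
        cycle = start + 1                  # one cycle to issue
        if dest is not None:
            ready[dest] = start + latency  # when dest becomes readable
    return cycle

# Hypothetical sequence: a load feeding an add (load latency 2, ALU latency 1)
seq = [("r1", [], 2),        # lw  r1, ...
       ("r2", ["r1"], 1),    # add r2, r1, r0  (stalls 1 cycle on r1)
       ("r3", ["r2"], 1)]    # add r3, r2, r0
print(count_cycles(seq))     # 4 cycles: 3 instructions + 1 stall
```

Changing the latencies to model a pipeline without forwarding makes the dependent instructions stall longer, which is exactly the comparison the question is after.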
Question 2. Control Hazards (10 points)

Consider the following fragment of a program:
top:   sw     $0, 0($1)         # 1
       addi   $1, $1, 4         # 2
       addi   $2, $2, 1         # 3
       bne    $2, $3, top       # 4
Assume that the loop iterates 10 times and that our pipeline has a branch delay of 2 cycles. That is, the branch is resolved at the end of the Execute stage (the third stage). The pipeline uses forwarding to resolve data hazards to the extent possible.

1. Suppose the pipeline resolves branch hazards by always stalling the pipeline for 2 cycles. How many cycles does it take to execute the above fragment? Again, ignore the "startup cost" of the pipeline.






2. Suppose the pipeline uses a predict-not-taken scheme, where every branch is predicted as not taken. Correct predictions cost nothing, mispredictions cost 2 cycles. How many cycles does it take to execute the above fragment? Again, ignore the "startup cost" of the pipeline.





3. Suppose the pipeline uses a branch prediction table. The penalty for a misprediction is 2 cycles. How many cycles does it take to execute the above fragment? (Assume mispredictions on the first and last iteration of the loop.)





4. Suppose the pipeline uses delayed branches (with 2 delay slots). Rewrite the code fragment (you can just show the numbers) to take advantage of the delay slots. Insert nops if you can't fill all of the slots. How many cycles does it take to execute the new code?

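The cycle accounting in the branch questions above follows a simple pattern, sketched below with hypothetical numbers (3 instructions per iteration, 20 iterations, 2-cycle penalty — not this exam's fragment):

```python
# Branch-cost arithmetic for a simple counted loop.
# All numbers are hypothetical, not the exam's.
instrs_per_iter = 3     # instructions executed each iteration
iters = 20              # loop trip count
penalty = 2             # cycles lost per stall or misprediction

# Always stall: every execution of the branch costs the full penalty.
always_stall = instrs_per_iter * iters + penalty * iters

# Predict not taken: the loop branch is taken on every iteration except
# the last, so all but one execution mispredicts.
predict_not_taken = instrs_per_iter * iters + penalty * (iters - 1)

print(always_stall, predict_not_taken)   # 100 98
```

A prediction table or filled delay slots just changes how many of those branch executions pay the penalty.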
Question 3. MIPS Assembly Coding (10 points)

Please translate the following C fragments into MIPS assembly code. Assume the following declarations have already been translated and that $t0 holds the value of x, $t1 holds the value of y, and $t2 holds the base address of the array foo. (You don't have to worry about updating the respective memory locations for variables x and y!)
int x, y, foo[100];

x = foo[17];







foo[x] = y;







y = y + x + 1;







if (x < 0) {
  x = -x;
}


Question 4. Machine Organization (10 points)

1. Which of the following best describes the impact of going from a single-cycle implementation to a multi-cycle implementation?
  1. decrease cycle time, increase CPI
  2. decrease cycle time
  3. increase cycle time, increase CPI
  4. increase cycle time, decrease CPI
  5. increase CPI
2. Which of the following best describes the impact of going from a single-cycle implementation to a pipelined implementation?
  1. decrease cycle time, increase instruction count, increase transistor count
  2. increase cycle time, increase CPI, decrease instruction count
  3. decrease cycle time, increase transistor count, usually increase CPI
  4. decrease cycle time, increase transistor count
  5. increase cycle time, decrease transistor count
3. Which of the following best describes the impact of increasing the number of stages in a pipeline (superpipelining)?
  1. increase cycle time, increase data dependences, increase branch delays
  2. decrease cycle time, increase data dependences
  3. decrease cycle time, decrease CPI, increase data dependences
  4. decrease cycle time, increase data dependences, increase branch delays
  5. decrease cycle time, decrease data dependences, increase branch delays
4. Which of the following best describes the impact of building a machine that can issue multiple instructions simultaneously (superscalar) versus the 5-stage pipeline we studied in class?
  1. decrease CPI, decrease instruction count, increase data dependences
  2. decrease CPI, increase instruction count, increase data dependences
  3. decrease CPI, increase data dependences
  4. increase CPI, increase data dependences
  5. increase instruction count, decrease cycle time
5. Which of the following best describes the impact of improving compiler technology (for a given ISA and machine organization)?
  1. decrease instruction count, minimize loads and stores, improved instruction scheduling
  2. decrease instruction count, improved cycle time
  3. decrease instruction count, minimize loads and stores,
  4. increase instruction count, improved cycle time
  5. increase instruction count, improved instruction scheduling

Question 5. Caching (10 points)

Suppose a benchmark program executes 10,000,000 instructions. On a processor with one level of cache, it has a cache miss rate of 10% and the penalty for a miss is 50 cycles.

1. Suppose 10% of the instructions are stores and 20% are loads. What is the average number of memory references per instruction?






2. What is the average number of stall cycles per instruction?





3. In a perfect world (no cache misses) assume the program has a CPI of 2.0. What is the relative performance of the program in the perfect world versus the real world (with misses)?







4. Suppose we add an L2 cache (a second cache between the L1 cache and the main memory) that has a miss rate of 1%. If a miss occurs in the L1 cache, the L2 cache is checked for a possible hit. Misses in the L1 cache that hit in the L2 cache incur a penalty of only 5 cycles, while misses in both caches cost 50 cycles. How much faster is the new organization?



Question 6. Memory Hierarchy (8 points)

1. Which of the following best describes the behavior of a write-back, write-around cache on a read miss?
  1. The block is updated only in memory
  2. The block is updated only in the cache
  3. The block is fetched from memory into the cache
  4. Dirty words in the existing block are written back to memory, and the new block is brought in from memory
  5. Dirty words in the existing block are written back to memory
2. Which of the following best describes the behavior of a write-through, write-allocate cache on a write miss?
  1. The block is fetched from memory, updated in the cache, and updated in memory.
  2. The block is updated only in memory.
  3. The new block is fetched from memory and updated in the cache.
  4. The block is updated only in the cache.
5. Dirty words in the existing block are written back to memory, then the new block is fetched from memory, updated in the cache, and updated in memory.
3. Which of the following best describes the behavior of the OS during a TLB miss on a MIPS machine?
  1. Nothing, the miss is handled in hardware.
  2. The OS loads up the TLB with the correct page table entry, and context switches to another program.
  3. The program has performed an illegal operation, so it is killed.
  4. The OS loads up the TLB with the correct page table entry, and restarts the program.
  5. The OS brings the right page in from disk.
4. Which of the following best describes the behavior of the OS during a page fault?
  1. Nothing, the page fault is handled in hardware.
  2. The OS initiates a disk I/O to bring in the right page, and context switches to another program.
  3. The OS loads up the TLB with the correct page table entry, and context switches to another program.
  4. The OS brings in the right page from disk.
  5. The OS initiates a disk I/O to bring in the right page, and restarts the faulting program.

Question 7. Cache Organization (5 points)

Suppose we have a cache with block (line) size of 8 words (32 bytes) and that this cache has a total of 1024 (1K) lines.

1. What is the total capacity of the cache (in bytes)?





2. Show how the 32-bit address below is broken into its 3 component parts (tag, index, byte offset).
+----------------------+
|                      |
+----------------------+

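The address split Question 7 asks about can be computed mechanically from the cache geometry. The sketch below uses hypothetical parameters (16-byte lines, 256 lines), not the cache described above.

```python
# Splitting an address for a direct-mapped cache.
# Parameters here are hypothetical: 16-byte lines, 256 lines.
from math import log2

def field_widths(addr_bits, line_bytes, num_lines):
    """Return (tag, index, offset) field widths in bits."""
    offset = int(log2(line_bytes))    # byte offset within a line
    index = int(log2(num_lines))      # which line to look in
    tag = addr_bits - index - offset  # remaining high-order bits
    return tag, index, offset

print(field_widths(32, 16, 256))   # (20, 8, 4)
```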
Question 8. Virtual Memory & Paging (12 points)

Suppose we have a page-based virtual memory system with 8KB pages, and that we wish to have a 4GB virtual address space.

1. How many entries are in the page table?



2. If each page table entry is 4 bytes, how large is the page table (in bytes)?


3. Show how the 32-bit address below is broken into its 2 component parts (virtual page number and offset).
+----------------------+
|                      |
+----------------------+
4. How can the OS allow process P1 to share a region of memory with process P2 in a page-based VM system?




5. How can the OS guarantee that process P1 doesn't get to read or write memory belonging to process P3 in a page-based VM system?




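The page-table sizing in Question 8 follows directly from the page size and the address-space size. The same arithmetic is sketched below with hypothetical parameters (4KB pages, 32-bit virtual addresses), not the question's.

```python
# Page-table sizing sketch (hypothetical: 4KB pages, 32-bit virtual
# addresses, 4-byte page-table entries -- not the question's parameters).
page_bytes = 4 * 1024          # 2^12-byte pages -> 12-bit page offset
va_bytes = 2 ** 32             # 32-bit virtual address space
pte_bytes = 4                  # assumed size of one page-table entry

entries = va_bytes // page_bytes   # one entry per virtual page: 2^20
table_bytes = entries * pte_bytes  # flat page-table size: 4 MB
print(entries, table_bytes)
```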
Question 9. TRUE/FALSE (15 points)

  1. TRUE / FALSE Paging systems usually use a write-through policy in order to keep the disk coherent with memory.
  2. TRUE / FALSE TLBs are used to cache data on pages.
  3. TRUE / FALSE Page faults are generally handled in hardware.
  4. TRUE / FALSE Higher associativity reduces conflict misses.
  5. TRUE / FALSE Larger lines (blocks) in a cache may increase conflict misses.
  6. TRUE / FALSE Larger caches decrease both the overall miss rate and access time.
  7. TRUE / FALSE An advantage of smaller disks (in diameter) is that the seek time decreases.
  8. TRUE / FALSE By spinning disks faster we can decrease the rotational latency of a disk I/O.
  9. TRUE / FALSE The ethernet is an example of an interconnection network that uses centralized arbitration.
  10. TRUE / FALSE In the MIPS procedure call convention, leaf procedures (those that don't call any other procedures) often don't have to save any registers.
  11. TRUE / FALSE The VAX implementation is complicated by the many addressing modes provided by the ISA.
  12. TRUE / FALSE In the BEQ instruction, a branch target may not be further than 2^15 bytes away.
  13. TRUE / FALSE Compilers for the MIPS will typically place long-lived values into caller-saved registers ($T0-$T9) in order to minimize loads and stores to/from the stack.
  14. TRUE / FALSE The cycle time of a pipelined machine is determined by the maximum time required to complete a single stage of the pipeline.
  15. TRUE / FALSE This statement is false.

Question 10. (4 points)

Explain why TLBs are typically fully associative (or highly associative) rather than direct-mapped.