3/14/01 Name: _________________________
This test has 10 questions, totaling 90 points.
The test is closed book, closed notes, closed calculator, open mind, etc. Remember, don't spend all your time on one question. You should budget roughly one minute per point. Please show your work for partial credit.
| Question | Possible | Score |
|---|---|---|
| Page 2 | 6 | |
| Page 3 | 10 | |
| Page 4 | 10 | |
| Page 5 | 10 | |
| Page 6 | 10 | |
| Page 7 | 8 | |
| Page 8 | 17 | |
| Page 9 | 19 | |
| Total | 90 | |
    add $1, $2, $3    # 1
    lw  $6, 8($1)     # 2
    or  $5, $6, $6    # 3
    sub $7, $2, $4    # 4
    sw  $7, 4($1)     # 5
    add $7, $5, $6    # 6

1. Given a 5-stage pipeline that resolves data hazards through stalling, how many cycles would the above sequence take to execute? Assume that the results from an ALU operation are available 3 cycles later, and results from loads are available 2 cycles later. You should ignore the "startup cost" of the pipeline. In other words, don't count the cycles before the pipeline is at full capacity.
Basically, the 6 instructions take a cycle each; then there are three stall cycles between 1 and 2, two stalls between 2 and 3, and three more stalls between 4 and 5 (lots of people missed this one). That gives a total of 6 + 3 + 2 + 3 = 14 cycles.

2. Given a 5-stage pipeline that resolves data hazards through forwarding, how many cycles would the above sequence take to execute? Again, ignore the "startup cost" of the pipeline.
Again, the 6 instructions take a cycle each, but there is one stall cycle between 2 and 3 (the load-use hazard). Total of 7.

3. Can you reorder the above sequence to further improve the forwarding pipeline? If yes, show the new sequence (you can just show the numbers of the instructions). If no, explain your answer.
Many reorderings are possible, but mostly you needed to fit an instruction in between 2 and 3. Some people did not maintain the dependence between 3 and 6...
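The forwarding-pipeline count above can be checked with a small model. This is a sketch of my own (not part of the exam): with full forwarding, the only remaining stall is the load-use hazard, where the instruction immediately after a load reads the loaded register. The `cycles_with_forwarding` helper and the tuple encoding of the sequence are my own invention.

```python
# Sketch (my own model, not from the exam): count cycles on a 5-stage
# forwarding pipeline where the only remaining hazard is a load-use
# hazard, which costs one stall cycle.

def cycles_with_forwarding(instrs):
    """instrs: list of (op, dest, sources) tuples, in program order."""
    cycles = 0
    for i, (op, dest, srcs) in enumerate(instrs):
        cycles += 1  # every instruction occupies one issue slot
        # load-use hazard: the next instruction reads the loaded register
        if op == "lw" and i + 1 < len(instrs):
            _, _, next_srcs = instrs[i + 1]
            if dest in next_srcs:
                cycles += 1  # one stall bubble
    return cycles

seq = [
    ("add", "$1", ("$2", "$3")),  # 1
    ("lw",  "$6", ("$1",)),       # 2
    ("or",  "$5", ("$6", "$6")),  # 3
    ("sub", "$7", ("$2", "$4")),  # 4
    ("sw",  None, ("$7", "$1")),  # 5
    ("add", "$7", ("$5", "$6")),  # 6
]

print(cycles_with_forwarding(seq))  # 7: six instructions + one load-use stall
```

Moving instruction 4 between 2 and 3 (order 1, 2, 4, 3, 5, 6) removes the load-use stall, and the model reports 6 cycles for the reordered sequence.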
top: sw   $0, 0($1)    # 1
     addi $1, $1, 4    # 2
     addi $2, $2, 1    # 3
     bne  $2, $3, top  # 4

Assume that the loop iterates 10 times and that our pipeline has a branch delay of 2 cycles. That is, the branch is resolved at the end of the Execute stage (the third stage). The pipeline uses forwarding to resolve data hazards to the extent possible.
1. Suppose the pipeline resolves branch hazards by always stalling the pipeline for 2 cycles. How many cycles does it take to execute the above fragment? Again, ignore the "startup cost" of the pipeline.
4 instructions (4 cycles) + 2 stall cycles per iteration. 10 iterations * 6 cycles = 60 cycles.

2. Suppose the pipeline uses a predict-not-taken scheme, where every branch is predicted as not taken. Correct predictions cost nothing, mispredictions cost 2 cycles. How many cycles does it take to execute the above fragment? Again, ignore the "startup cost" of the pipeline.
A mispredicted iteration costs 4 instructions (4 cycles) + 2 stall cycles; a correctly predicted iteration costs just the 4 instructions (4 cycles). There are 9 mispredictions (9 * 6 = 54 cycles) + 1 correct prediction (4 cycles), giving us a total of 58 cycles.

3. Suppose the pipeline uses a branch prediction table. The penalty for a misprediction is 2 cycles. How many cycles does it take to execute the above fragment? (Assume mispredictions on the first and last iteration of the loop.)
8 correct predictions @ 4 cycles each + 2 mispredictions @ 6 cycles each gives us a total of 32 + 12 = 44 cycles.

4. Suppose the pipeline uses delayed branches (with 2 delay slots). Rewrite the code fragment (you can just show the numbers) to take advantage of the delay slots. Insert nops if you can't fill all of the slots. How many cycles does it take to execute the new code?
A new order: 3, 4, 1, 2. With both delay slots filled by instructions 1 and 2, each iteration takes 4 cycles, so the loop takes 40 cycles.
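The four branch-handling schemes above can be checked with a few lines of arithmetic. This is a sketch of my own (the constant names are mine, not the exam's); it reproduces the cycle counts for the 4-instruction loop over 10 iterations.

```python
# Sketch (my own arithmetic check, not from the exam): cycle counts for
# the 4-instruction loop above, run for 10 iterations under each scheme.

ITERATIONS = 10
BODY = 4            # instructions per iteration
BRANCH_PENALTY = 2  # cycles lost on a stall or misprediction

# 1. Always stall for the 2-cycle branch delay.
always_stall = ITERATIONS * (BODY + BRANCH_PENALTY)

# 2. Predict not-taken: the branch is taken on the first 9 iterations
#    (mispredicted) and falls through on the last (correct).
predict_not_taken = 9 * (BODY + BRANCH_PENALTY) + 1 * BODY

# 3. Prediction table: mispredictions on the first and last iteration only.
prediction_table = 8 * BODY + 2 * (BODY + BRANCH_PENALTY)

# 4. Delayed branches with both delay slots filled: no penalty at all.
delayed = ITERATIONS * BODY

print(always_stall, predict_not_taken, prediction_table, delayed)
# 60 58 44 40
```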
1. Which of the following best describes the impact of going from a single-cycle implementation to a multi-cycle implementation?
1. Suppose 10% of the instructions are stores and 20% are loads. What is the average number of memory references per instruction?
Each instruction costs at least one memory reference (it must be fetched). So the right answer was: 1 + .2 + .1 = 1.3 references/instruction.

2. What is the average number of stall cycles per instruction?
At 1.3 references/instruction: 1.3 * .1 * 50 = 6.5 stall cycles/instruction.

3. In a perfect world (no cache misses) assume the program has a CPI of 2.0. What is the relative performance of the program in the perfect world versus the real world (with misses)?
Perf(perfect)   CPI(cache)     2.0 + 6.5
------------- = ------------ = --------- = 4.25 times faster
Perf(cache)     CPI(perfect)      2.0

4. Suppose we add an L2 cache (a second cache between the L1 cache and the main memory) that has a miss rate of 1%. If a miss occurs in the L1 cache, the L2 cache is checked for a possible hit. Misses in the L1 cache that hit in the L2 cache incur a penalty of only 5 cycles, while misses in both caches cost 50 cycles. How much faster is the new organization?
Lots of trouble with this one. There are many ways to solve it. It is important to note that the miss rate of the L2 cache is 1%, but only 10% of the total memory references ever get there!

Stall cycles = stalls due to misses in L1 that hit in L2
             + stalls due to misses in L1 that miss in L2
             = (1.3 * .099 * 5) + (1.3 * .001 * 50)
             = .7085 stalls/instruction

So the total CPI is 2.7085 versus 8.5, so it's about 3 times faster!
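The whole cache-CPI chain above can be recomputed in a few lines. This is a sketch of my own (constant names are mine); it reproduces the 6.5 stall cycles, the 4.25x perfect-world ratio, and the roughly 3x speedup from adding the L2.

```python
# Sketch (my own arithmetic check, not from the exam): CPI and speedup
# for the one-level and two-level cache organizations above.

REFS_PER_INS = 1 + 0.2 + 0.1  # fetch + loads + stores = 1.3
L1_MISS = 0.10
BASE_CPI = 2.0

# One-level cache: every L1 miss costs 50 cycles.
stalls_l1 = REFS_PER_INS * L1_MISS * 50  # 6.5 stall cycles/instruction
cpi_l1 = BASE_CPI + stalls_l1            # 8.5

# Two-level cache: L1 misses that hit in L2 cost 5, misses in both cost 50.
# Only 10% of references ever reach the L2, so its 1% miss rate applies
# to that 10%.
L2_MISS = 0.01
stalls_l2 = (REFS_PER_INS * L1_MISS * (1 - L2_MISS) * 5
             + REFS_PER_INS * L1_MISS * L2_MISS * 50)  # 0.7085
cpi_l2 = BASE_CPI + stalls_l2            # 2.7085

print(cpi_l1 / BASE_CPI)  # 4.25: perfect world vs. one-level cache
print(cpi_l1 / cpi_l2)    # about 3.1: speedup from adding the L2
```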
1. Which of the following best describes the behavior of a write-back, write-around cache on a read miss?
1. What is the total capacity of the cache (in bytes)?
32KB (32-byte lines * 1K lines).

2. Show how the below 32-bit address is broken into its 3 component parts (tag, index, discard).
 31               15 14          5 4     0
+------------------+-------------+-------+
|       tag        |    index    |  dis  |   (not to scale)
+------------------+-------------+-------+
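The bit-field sizes follow directly from the cache geometry: 32-byte lines give 5 discard (offset) bits, 1K lines give 10 index bits, and the remaining 17 bits are the tag. This sketch (the helper name is my own) splits an address accordingly.

```python
# Sketch (my own helper, not from the exam): split a 32-bit address for
# a direct-mapped cache with 32-byte lines and 1K lines.

LINE_BITS = 5    # 32-byte lines -> 5 offset ("discard") bits
INDEX_BITS = 10  # 1K lines      -> 10 index bits

def split_address(addr):
    offset = addr & ((1 << LINE_BITS) - 1)                 # bits 4..0
    index = (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)  # bits 14..5
    tag = addr >> (LINE_BITS + INDEX_BITS)                 # bits 31..15
    return tag, index, offset

# Capacity check: lines * line size = 32KB.
print((1 << INDEX_BITS) * (1 << LINE_BITS))  # 32768

tag, index, offset = split_address(0xDEADBEEF)
print(hex(tag), index, offset)
```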
1. How many entries are in the page table?
VM size / page size = 2^32 / 2^13 = 2^19 pages. We need one entry per page!

2. If each page table entry is 4 bytes, how large is the page table (in bytes)?
2^19 entries * 2^2 bytes/entry = 2^21 bytes.

3. Show how the below 32-bit address is broken into its 2 component parts (virtual page number and offset).
 31                  13 12            0
+---------------------+--------------+
|         VPN         |    Offset    |   (not to scale)
+---------------------+--------------+
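With 2^13-byte (8 KB) pages, the low 13 bits are the page offset and the high 19 bits are the VPN. This sketch (the helper name is my own) splits a virtual address and rechecks the page-table size from questions 1 and 2.

```python
# Sketch (my own helper, not from the exam): split a 32-bit virtual
# address for 2^13-byte (8 KB) pages.

PAGE_BITS = 13  # 8 KB pages -> 13 offset bits, 19 VPN bits

def split_vaddr(vaddr):
    offset = vaddr & ((1 << PAGE_BITS) - 1)  # bits 12..0
    vpn = vaddr >> PAGE_BITS                 # bits 31..13
    return vpn, offset

# The page table needs one 4-byte entry per virtual page.
entries = 1 << (32 - PAGE_BITS)  # 2^19 entries
table_bytes = entries * 4        # 2^21 bytes
print(entries, table_bytes)      # 524288 2097152
```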
The OS must make one or more PTEs in P1's page table point at (that is, map to) the same physical frame(s) pointed to by PTEs in P2's page table. The OS may set protection bits if the shared data is read-only (like a library), but shared memory may also be used for inter-process communication.

5. How can the OS guarantee that process P1 doesn't get to read or write memory belonging to process P3 in a page-based VM system?
P1 and P3 only ever generate virtual addresses. They don't have the power to generate physical addresses, because every address they issue is translated through the PTEs cached in the TLB. We must ensure that no process is allowed to modify PTEs, either in its page table or in the TLB. We must also ensure that no PTEs in one process's page table ever point to physical pages belonging to another process (unless we want them to share).
To get full credit, there were 3 things to mention (almost no one did):
1. Many processes share the TLB, and many of those processes spend lots of time in similar regions of virtual memory (e.g., their text segments live in the same place in VM). This means that there will be an unusually high number of conflict misses.
2. High associativity reduces conflict misses.
3. High associativity is costly, but this is OK because TLBs are usually very small.