3/14/01 Name: _________________________
This test has 10 questions, totaling 90 points.
The test is closed book, closed notes, closed calculator, open mind, etc. Remember, don't spend all your time on one question. You should budget roughly one minute per point. Please show your work for partial credit.
| Question | Possible | Score |
|---|---|---|
| Page 2 | 6 | |
| Page 3 | 10 | |
| Page 4 | 10 | |
| Page 5 | 10 | |
| Page 6 | 10 | |
| Page 7 | 8 | |
| Page 8 | 17 | |
| Page 9 | 19 | |
| Total | 90 | |
    add $1, $2, $3    # 1
    lw  $6, 8($1)     # 2
    or  $5, $6, $6    # 3
    sub $7, $2, $4    # 4
    sw  $7, 4($1)     # 5
    add $7, $5, $6    # 6

1. Given a 5-stage pipeline that resolves data hazards through stalling, how many cycles would the above sequence take to execute? Assume that the results from an ALU operation are available 3 cycles later, and results from loads are available 2 cycles later. You should ignore the "startup cost" of the pipeline. In other words, don't count the cycles before the pipeline is at full capacity.
Basically, the 6 instructions take a cycle each; then there are three stall cycles between 1 and 2, two stalls between 2 and 3, and three more stalls between 4 and 5 (lots of people missed this one). That gives a total of 6 + 3 + 2 + 3 = 14 cycles.

2. Given a 5-stage pipeline that resolves data hazards through forwarding, how many cycles would the above sequence take to execute? Again, ignore the "startup cost" of the pipeline.
Again, the 6 instructions take a cycle each, but there is one stall cycle between 2 and 3 (the load-use hazard). Total of 7.

3. Can you reorder the above sequence to further improve the forwarding pipeline? If yes, show the new sequence (you can just show the numbers of the instructions). If no, explain your answer.
Many reorderings are possible, but mostly you needed to fit an instruction in between 2 and 3. Some people did not maintain the dependence between 3 and 6...
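The forwarding-pipeline count above can be checked with a small model. This is a sketch of my own (not part of the exam): with full forwarding, the only remaining stall is the load-use hazard, where the instruction immediately after a load reads the loaded register. The `cycles_with_forwarding` helper and the tuple encoding of the sequence are my own invention.

```python
# Sketch (my own model, not from the exam): count cycles on a 5-stage
# forwarding pipeline where the only remaining hazard is a load-use
# hazard, which costs one stall cycle.

def cycles_with_forwarding(instrs):
    """instrs: list of (op, dest, sources) tuples, in program order."""
    cycles = 0
    for i, (op, dest, srcs) in enumerate(instrs):
        cycles += 1  # every instruction occupies one issue slot
        # load-use hazard: the next instruction reads the loaded register
        if op == "lw" and i + 1 < len(instrs):
            _, _, next_srcs = instrs[i + 1]
            if dest in next_srcs:
                cycles += 1  # one stall bubble
    return cycles

seq = [
    ("add", "$1", ("$2", "$3")),  # 1
    ("lw",  "$6", ("$1",)),       # 2
    ("or",  "$5", ("$6", "$6")),  # 3
    ("sub", "$7", ("$2", "$4")),  # 4
    ("sw",  None, ("$7", "$1")),  # 5
    ("add", "$7", ("$5", "$6")),  # 6
]

print(cycles_with_forwarding(seq))  # 7: six instructions + one load-use stall
```

Moving instruction 4 between 2 and 3 (order 1, 2, 4, 3, 5, 6) removes the load-use stall, and the model reports 6 cycles for the reordered sequence.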
top: sw   $0, 0($1)    # 1
     addi $1, $1, 4    # 2
     addi $2, $2, 1    # 3
     bne  $2, $3, top  # 4

Assume that the loop iterates 10 times and that our pipeline has a branch delay of 2 cycles. That is, the branch is resolved at the end of the Execute stage (the third stage). The pipeline uses forwarding to resolve data hazards to the extent possible.
1. Suppose the pipeline resolves branch hazards by always stalling the pipeline for 2 cycles. How many cycles does it take to execute the above fragment? Again, ignore the "startup cost" of the pipeline.
4 instructions (4 cycles) + 2 stall cycles per iteration. 10 iterations * 6 cycles = 60 cycles.

2. Suppose the pipeline uses a predict-not-taken scheme, where every branch is predicted as not taken. Correct predictions cost nothing, mispredictions cost 2 cycles. How many cycles does it take to execute the above fragment? Again, ignore the "startup cost" of the pipeline.
A mispredicted iteration costs 4 instructions (4 cycles) + 2 stall cycles; a correctly predicted iteration costs just the 4 instructions (4 cycles). There are 9 mispredictions (9 * 6 = 54 cycles) + 1 correct prediction (4 cycles), giving us a total of 58 cycles.

3. Suppose the pipeline uses a branch prediction table. The penalty for a misprediction is 2 cycles. How many cycles does it take to execute the above fragment? (Assume mispredictions on the first and last iteration of the loop.)
8 correct predictions @ 4 cycles each + 2 mispredictions @ 6 cycles each gives us a total of 32 + 12 = 44 cycles.

4. Suppose the pipeline uses delayed branches (with 2 delay slots). Rewrite the code fragment (you can just show the numbers) to take advantage of the delay slots. Insert nops if you can't fill all of the slots. How many cycles does it take to execute the new code?
A new order: 3, 4, 1, 2. With both delay slots filled by instructions 1 and 2, each iteration takes 4 cycles, so the loop takes 40 cycles.
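The four branch-handling schemes above can be checked with a few lines of arithmetic. This is a sketch of my own (the constant names are mine, not the exam's); it reproduces the cycle counts for the 4-instruction loop over 10 iterations.

```python
# Sketch (my own arithmetic check, not from the exam): cycle counts for
# the 4-instruction loop above, run for 10 iterations under each scheme.

ITERATIONS = 10
BODY = 4            # instructions per iteration
BRANCH_PENALTY = 2  # cycles lost on a stall or misprediction

# 1. Always stall for the 2-cycle branch delay.
always_stall = ITERATIONS * (BODY + BRANCH_PENALTY)

# 2. Predict not-taken: the branch is taken on the first 9 iterations
#    (mispredicted) and falls through on the last (correct).
predict_not_taken = 9 * (BODY + BRANCH_PENALTY) + 1 * BODY

# 3. Prediction table: mispredictions on the first and last iteration only.
prediction_table = 8 * BODY + 2 * (BODY + BRANCH_PENALTY)

# 4. Delayed branches with both delay slots filled: no penalty at all.
delayed = ITERATIONS * BODY

print(always_stall, predict_not_taken, prediction_table, delayed)
# 60 58 44 40
```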
1. Which of the following best describes the impact of going from a single-cycle implementation to a multi-cycle implementation?
1. Suppose 10% of the instructions are stores and 20% are loads. What is the average number of memory references per instruction?
Each instruction costs at least one memory reference (it must be fetched). So the right answer was: 1 + .2 + .1 = 1.3 references/instruction.

2. What is the average number of stall cycles per instruction?
At 1.3 references/instruction: 1.3 * .1 * 50 = 6.5 stall cycles/instruction.

3. In a perfect world (no cache misses) assume the program has a CPI of 2.0. What is the relative performance of the program in the perfect world versus the real world (with misses)?
Perf(perfect)   CPI(cache)     2.0 + 6.5
------------- = ------------ = --------- = 4.25 times faster
Perf(cache)     CPI(perfect)      2.0

4. Suppose we add an L2 cache (a second cache between the L1 cache and the main memory) that has a miss rate of 1%. If a miss occurs in the L1 cache, the L2 cache is checked for a possible hit. Misses in the L1 cache that hit in the L2 cache incur a penalty of only 5 cycles, while misses in both caches cost 50 cycles. How much faster is the new organization?
Lots of trouble with this one. There are many ways to solve it. It is important to note that the miss rate of the L2 cache is 1%, but only 10% of the total memory references ever get there!

Stall cycles = stalls due to misses in L1 that hit in L2
             + stalls due to misses in L1 that miss in L2
             = (1.3 * .099 * 5) + (1.3 * .001 * 50)
             = .7085 stalls/instruction

So the total CPI is 2.7085 versus 8.5, so it's about 3 times faster!
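The whole cache-CPI chain above can be recomputed in a few lines. This is a sketch of my own (constant names are mine); it reproduces the 6.5 stall cycles, the 4.25x perfect-world ratio, and the roughly 3x speedup from adding the L2.

```python
# Sketch (my own arithmetic check, not from the exam): CPI and speedup
# for the one-level and two-level cache organizations above.

REFS_PER_INS = 1 + 0.2 + 0.1  # fetch + loads + stores = 1.3
L1_MISS = 0.10
BASE_CPI = 2.0

# One-level cache: every L1 miss costs 50 cycles.
stalls_l1 = REFS_PER_INS * L1_MISS * 50  # 6.5 stall cycles/instruction
cpi_l1 = BASE_CPI + stalls_l1            # 8.5

# Two-level cache: L1 misses that hit in L2 cost 5, misses in both cost 50.
# Only 10% of references ever reach the L2, so its 1% miss rate applies
# to that 10%.
L2_MISS = 0.01
stalls_l2 = (REFS_PER_INS * L1_MISS * (1 - L2_MISS) * 5
             + REFS_PER_INS * L1_MISS * L2_MISS * 50)  # 0.7085
cpi_l2 = BASE_CPI + stalls_l2            # 2.7085

print(cpi_l1 / BASE_CPI)  # 4.25: perfect world vs. one-level cache
print(cpi_l1 / cpi_l2)    # about 3.1: speedup from adding the L2
```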
1. Which of the following best describes the behavior of a write-back, write-around cache on a read miss?
1. What is the total capacity of the cache (in bytes)?
32KB (32-byte lines * 1K lines).

2. Show how the below 32-bit address is broken into its 3 component parts (tag, index, discard).
 31               15 14          5 4     0
+------------------+-------------+-------+
|       tag        |    index    |  dis  |   (not to scale)
+------------------+-------------+-------+
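The bit-field sizes follow directly from the cache geometry: 32-byte lines give 5 discard (offset) bits, 1K lines give 10 index bits, and the remaining 17 bits are the tag. This sketch (the helper name is my own) splits an address accordingly.

```python
# Sketch (my own helper, not from the exam): split a 32-bit address for
# a direct-mapped cache with 32-byte lines and 1K lines.

LINE_BITS = 5    # 32-byte lines -> 5 offset ("discard") bits
INDEX_BITS = 10  # 1K lines      -> 10 index bits

def split_address(addr):
    offset = addr & ((1 << LINE_BITS) - 1)                 # bits 4..0
    index = (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)  # bits 14..5
    tag = addr >> (LINE_BITS + INDEX_BITS)                 # bits 31..15
    return tag, index, offset

# Capacity check: lines * line size = 32KB.
print((1 << INDEX_BITS) * (1 << LINE_BITS))  # 32768

tag, index, offset = split_address(0xDEADBEEF)
print(hex(tag), index, offset)
```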
1. How many entries are in the page table?
VM size / page size = 2^32 / 2^13 = 2^19 pages. We need one entry per page!

2. If each page table entry is 4 bytes, how large is the page table (in bytes)?
2^19 entries * 2^2 bytes/entry = 2^21 bytes.

3. Show how the below 32-bit address is broken into its 2 component parts (virtual page number and offset).
 31                  13 12            0
+---------------------+--------------+
|         VPN         |    Offset    |   (not to scale)
+---------------------+--------------+
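With 2^13-byte (8 KB) pages, the low 13 bits are the page offset and the high 19 bits are the VPN. This sketch (the helper name is my own) splits a virtual address and rechecks the page-table size from questions 1 and 2.

```python
# Sketch (my own helper, not from the exam): split a 32-bit virtual
# address for 2^13-byte (8 KB) pages.

PAGE_BITS = 13  # 8 KB pages -> 13 offset bits, 19 VPN bits

def split_vaddr(vaddr):
    offset = vaddr & ((1 << PAGE_BITS) - 1)  # bits 12..0
    vpn = vaddr >> PAGE_BITS                 # bits 31..13
    return vpn, offset

# The page table needs one 4-byte entry per virtual page.
entries = 1 << (32 - PAGE_BITS)  # 2^19 entries
table_bytes = entries * 4        # 2^21 bytes
print(entries, table_bytes)      # 524288 2097152
```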
The OS must make one or more PTEs in P1's page table point at (that is, map to) the same physical frame(s) pointed to by PTEs in P2's page table. The OS may set protection bits if the shared data is read-only (like a library), but shared memory may also be used for inter-process communication.

5. How can the OS guarantee that process P1 doesn't get to read or write memory belonging to process P3 in a page-based VM system?
P1 and P3 only ever generate virtual addresses. They don't have the power to generate physical addresses, because every address they issue is translated through the PTEs cached in the TLB. We must ensure that no process is allowed to modify PTEs, either in its page table or in the TLB. We must also ensure that no PTEs in one process's page table ever point to physical pages belonging to another process (unless we want them to share).
To get full credit, there were 3 things to mention (almost no one did):
1. Many processes share the TLB, and many of those processes spend lots of time in similar regions of virtual memory (e.g., their text segments live in the same place in VM). This means that there will be an unusually high number of conflict misses.
2. High associativity reduces conflict misses.
3. High associativity is costly, but this is OK because TLBs are usually very small.