CSE 378 Homework #7: Making it Faster

Due: 6/7/2002

Assignment Goals & Administrivia

In this assignment, you'll develop a cache for your pipelined implementation of the MIPS ISA. You'll also be asked to improve the performance of the entire system by some other means. Ultimately, we want you to apply your creative energies to designing a great machine, and your persuasive powers to convince us why this machine will beat the competition.

Overview

You'll be provided with a component that simulates (in a simple manner) the processor's interface to the memory bus, primary memory, and the character device. This component simulates the latencies inherent in communicating with primary memory or other devices connected to the bus. Your job initially will be simply to overhaul your previous SMOK implementation to utilize this device and record the number of cycles required to run a variety of benchmarks.

Once you have your base machine working, you'll implement memory caches for instructions and data. The cache design will be left up to you, and your job will be to experiment with different replacement policies and write (hit and miss) policies, within the basic guidelines provided below.

The MemoryBus Component

This component simulates a memory bus, primary memory, and the character device. It will entirely replace your memory component, InstructionFetch unit, MemoryInterface unit, and character device. It has seven input ports: DataMemRead, DataMemWrite, DataMemByte, DataMemAddr, DataMemData, InsMemRead, and PC. The InsMemRead port is an enable port for reading an instruction. In general, it should always be set to high (1), because your machine will want to read an instruction every cycle. The other inputs should be self-explanatory based on their names.

How does it work? The component contains a constant register called FullLatency, which is the latency of two memory accesses (as in a load and an instruction fetch occurring on the same cycle). If two memory accesses are requested at the same time, this unit will pretend to spend a number of cycles (defined by FullLatency) doing the bus transaction required to access the requested memory item(s). It does this by way of counters that are implemented internally. When the counters have counted the appropriate number of delay cycles, the actual memory value will be presented on the output. On other cycles, garbage values will be presented. In cases where there is just one memory access (as in just an instruction fetch), only half the FullLatency number of cycles will be required to fetch the item. Accesses to the character device require the same number of cycles as other accesses. The unit provides five output ports: Stall, OutData (the data item fetched), DataValid, Instruction (the instruction fetched), and InstructionValid. The Stall line will be asserted as long as either the OutData or Instruction are not valid.
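The counter behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not the actual SMOK component; the class and method names are made up, and only FullLatency, the half-latency rule for single accesses, and the Stall output come from the description.

```python
# Toy model of the MemoryBus latency counters. A request starts a countdown
# of FullLatency cycles (two simultaneous accesses) or half that (one access);
# Stall stays asserted until the countdown reaches zero.

class MemoryBusModel:
    def __init__(self, full_latency):
        self.full_latency = full_latency
        self.counter = 0

    def request(self, ins_read, data_access):
        """Start a transaction; returns the number of delay cycles."""
        if ins_read and data_access:
            self.counter = self.full_latency        # two accesses: full latency
        elif ins_read or data_access:
            self.counter = self.full_latency // 2   # one access: half latency
        return self.counter

    def tick(self):
        """Advance one cycle; return the value of the Stall output."""
        if self.counter > 0:
            self.counter -= 1
        return self.counter > 0
```

With FullLatency set to 10, a combined load-plus-fetch costs 10 cycles while a lone instruction fetch costs 5, matching the behavior described above.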

How do I connect it? The inputs should be pretty self explanatory. As for the outputs, you only really have to worry about the OutData, Instruction, and Stall ports. The first two should be routed to the usual places in your machine. The Stall line should be routed to all of your pipeline registers. Your entire pipeline should be stalled as long as it is waiting for memory transactions to complete. Think carefully about managing the first two stages of the pipeline -- the IF/ID and ID/EX pipeline registers and the PC update are probably the most complicated aspects. First, the ID/EX should be reset when you would normally reset it (branch taken or load hazard) AND the memory is valid. The PC and IF/ID should only be updated when memory is valid (no stall) and there is no load hazard. IF/ID should be reset as usual (when branches are taken).
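The update and reset conditions above can be written out as plain boolean logic, which may help before wiring them up in SMOK. The signal names here (stall, load_hazard, branch_taken) are illustrative; in your model these are wires feeding the PC and pipeline register enables.

```python
# Hedged sketch of the pipeline control conditions described above:
# PC and IF/ID update only when memory is valid and there is no load hazard;
# IF/ID resets on taken branches as usual; ID/EX resets on a taken branch or
# load hazard, but only once memory is valid.

def pipeline_controls(stall, load_hazard, branch_taken):
    mem_valid = not stall
    update_pc    = mem_valid and not load_hazard   # PC holds on stall or hazard
    update_if_id = mem_valid and not load_hazard   # IF/ID holds the same way
    reset_if_id  = branch_taken                    # squash the fetched instruction
    reset_id_ex  = (branch_taken or load_hazard) and mem_valid
    return update_pc, update_if_id, reset_if_id, reset_id_ex
```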

Take some time to explore the component. Once you import the component into your model (and connect it), you should see performance degrade dramatically (especially with high latency values). At this point you'll have a simulation of a good CPU core with a poor memory hierarchy.

Building a Cache

We'll give you total control over your cache design and implementation. It's easiest to use register files to implement a cache. I use two: one to hold the data, and the other to hold the tag/valid information. Your implementation can be different, of course. One immediate limitation of the MemoryBus component is that it does not simulate wide pathways to main memory, which makes it harder to experiment with cache block sizes larger than a single word (4 bytes). We'll leave that kind of experimentation for later. Once you have implemented even a small cache, you should see dramatic improvements in performance.
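To make the two-register-file organization concrete, here is a minimal direct-mapped lookup in Python: one table for data, one for tag/valid. The cache size, one-word blocks, and the address split are assumptions for illustration; your SMOK design will use register files and combinational logic instead.

```python
# Direct-mapped cache with one-word (4-byte) blocks, mirroring the
# two-register-file organization: data[] holds words, tags[] holds the
# tag (None meaning the line is invalid).

NUM_LINES = 64                     # assumed cache size: 64 one-word lines

data = [0] * NUM_LINES             # data register file
tags = [None] * NUM_LINES          # tag/valid register file

def lookup(addr):
    """Return (hit, value) for a word-aligned address."""
    word  = addr >> 2              # drop the 2 byte-offset bits
    index = word % NUM_LINES       # low bits select the line
    tag   = word // NUM_LINES      # remaining high bits are the tag
    if tags[index] == tag:
        return True, data[index]
    return False, None

def fill(addr, value):
    """Install a word fetched from main memory (a read-miss refill)."""
    word  = addr >> 2
    index = word % NUM_LINES
    tags[index] = word // NUM_LINES
    data[index] = value
```

Note that addresses 0x1000 and 0x2000 map to the same line here, so filling one and then looking up the other misses; that conflict behavior is exactly what set associativity (see "The Vague Part") is meant to reduce.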

Solution Path, First Part

Below are some hints to get you toward a solution. As usual, your main strategy should be to understand the issues on paper (and in your brain) before you start SMOKing.
  1. Get the MemoryBus component integrated into your model. When it works, run some benchmarks and record the results (the number of cycles required) for each benchmark. Save this machine (call it mips-base.smok).
  2. Build an InstructionCache. This is a good place to start because it is simple -- it doesn't have to handle writes, because we are assuming that we'll never modify instructions while the program is running. Call this machine mips-icache.smok, and run and record the same benchmarks.
  3. Now build a DataCache. The DataCache has to deal with writes (hits and misses), so you'll need to choose a write policy. The simplest policy is probably write-through, write-around. Call this machine mips-fullcache.smok, and run and record the same benchmarks.
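The write-through, write-around policy suggested in step 3 is simple enough to state in a few lines. This sketch uses Python dicts for the cache and memory purely for illustration; the point is which structures get written on a hit versus a miss.

```python
# Write-through, write-around store policy:
#   hit  -> update the cached copy AND memory
#   miss -> write memory only; do NOT allocate a cache line

def store(cache, memory, addr, value):
    hit = addr in cache
    if hit:
        cache[addr] = value   # write-through: keep the cached copy current
    memory[addr] = value      # memory is always written
    return hit
```

Because memory is written on every store, every store pays the memory latency; hiding that cost is what the write-buffer enhancement below is for.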

Issues, Details, and Pitfalls

Here is a list of issues that you may or may not encounter.
  1. Save all of your models. You'll need them to compare different approaches!
  2. Make sure your memory latency is always realistic!
  3. You may have to move the latency selection logic external to the MemoryBus component. For example, if you implement a write-through, write-allocate cache (which fetches blocks on write-misses, and then does the write, which will then also write memory), the latency for this case will necessarily be higher than that of a simple memory fetch.
  4. Be careful about loads and stores to/from the character device addresses (0x40000000 and higher). These addresses should be treated as uncacheable, because you'll get stale/incorrect data if you do cache them.
  5. Reading and writing bytes will be a headache for your Data cache. On a read miss, you'll want to be sure to bring the appropriate word that contains the byte in from main memory. On a write hit, you'll want to be sure to just update the byte within the word contained in your cache. Depending on your write hit/miss policy, the writes to the next level of memory should be done with care as well. Getting this right is probably the hardest part of the assignment!
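The byte-handling issue in item 5 boils down to one bit of logic: on a byte-store hit, merge the new byte into the cached 32-bit word without disturbing the other three bytes. The helper below is an illustrative sketch; the little-endian byte numbering is an assumption and should match whatever your machine uses.

```python
# Merge one byte into a cached 32-bit word (byte-store hit case).
# byte_offset selects which of the word's four bytes to replace,
# numbered little-endian here (0 = least significant byte).

def merge_byte(word, byte_offset, byte_value):
    """Return `word` with the byte at `byte_offset` (0..3) replaced."""
    shift = 8 * byte_offset
    mask  = 0xFF << shift
    return (word & ~mask & 0xFFFFFFFF) | ((byte_value & 0xFF) << shift)
```

The read-miss side is the mirror image: fetch the whole containing word from memory into the cache, then extract the requested byte from it.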

The Vague Part

The assignment requires that you build an instruction and data cache (as described above). Below is a list of other enhancements you can make to your machine, cache, or system software to improve the performance of the entire system. You'll be required to choose at least one of them to implement and study its effect on system performance. They're listed below roughly in order of difficulty.
  1. Set associativity. We only require a direct-mapped cache. Build a set-associative (2 or 4-way) and see if/how it improves the performance of your system.
  2. Write buffers. Hide the latency of writes by keeping a small (even 4 items makes a big difference) buffer of pairs [address, value] that describe recent stores. This buffer is drained while the processor is off doing other work. Of course, loads need to check the WriteBuffer before going to main memory, and the machine still has to stall when the buffer is full.
  3. Other write policies. Build a write-back cache of some form (you'll still need to deal with write-misses: write-allocate or write-around).
  4. Larger cache blocks. This is pretty tricky, because our memory system only really provides one word at a time. You can do it, but it's harder.
  5. Anything cache-related with my OK.
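The write buffer in item 2 can be modeled as a small FIFO of (address, value) pairs. The capacity of 4 and the one-retirement-per-free-cycle drain rule are assumptions for illustration; the essential behaviors from the description are that loads check the buffer before memory and that a full buffer stalls the machine.

```python
# Toy model of a 4-entry write buffer: stores queue here instead of going
# straight to memory, and drain during free bus cycles.

from collections import deque

CAPACITY = 4
buffer = deque()

def store(addr, value):
    """Queue a store; returns True if the pipeline must stall (buffer full)."""
    if len(buffer) >= CAPACITY:
        return True
    buffer.append((addr, value))
    return False

def load(addr, memory):
    """Loads must check the buffer (newest entry first) before memory."""
    for a, v in reversed(buffer):
        if a == addr:
            return v
    return memory.get(addr)

def drain(memory):
    """Retire one buffered store to memory during a free bus cycle."""
    if buffer:
        a, v = buffer.popleft()
        memory[a] = v
```

Checking newest-first matters when the buffer holds two stores to the same address: the load must see the most recent value.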

Deliverables: Monday, June 3rd

In order to keep you on track, you'll need to do the following by Monday, June 3rd:
  1. Integrate the MemoryBus component into your model.
  2. Tell us the number of cycles required to execute a benchmark program (which we'll provide) with a full memory latency of 10 cycles versus a zero-latency memory (a perfect machine).
  3. A short (one page) writeup telling us what improvement (from the optional list above) you intend to implement.

Deliverables: Last Day of Class

We'll want your models, as always, and a standard README that gives us an overview of your files. Additionally, you'll be required to turn in a short (3 pages or less, not including graphs or diagrams) report detailing the impact of your improvements. Think of this as a design recommendation. You should make a recommendation (e.g., "Build a write-back, write-around cache of such and such a size, add a write buffer, etc...") and then provide the data to back it up.

How are we going to grade this? You'll be graded on effort, scope, results, creativity, and your writeup.

The Files

Stay tuned. We'll be posting all sorts of goodies, including standard benchmarks, smok.ini files, and possibly a base machine to get you started.

Extra Credit

Still got energy? Talk to me.
dugan@cs.washington.edu