CSE 378 Homework #7: Making it Faster
Due: 6/7/2002
Assignment Goals & Administrivia
In this assignment, you'll develop a cache for your pipelined
implementation of the MIPS ISA. You'll also be asked to improve
the performance of the entire system by some other means. Ultimately,
we want you to apply your creative energies to designing a great
machine, and your persuasive powers to convince us why this machine
will beat the competition.
Overview
You'll be provided with a component that simulates (in a simple
manner) the processor's interface to the memory bus, primary memory,
and the character device. This component simulates the latencies
inherent in communicating with primary memory or other devices
connected to the bus. Your job initially will be simply to overhaul
your previous SMOK implementation to utilize this device and record
the number of cycles required to run a variety of benchmarks.
Once you have your base machine working, you'll implement
memory caches for instructions and data. The cache design will be
left up to you, and your job will be to experiment with different
replacement policies and write (hit and miss) policies, within the
basic guidelines provided below.
The MemoryBus Component
This component simulates a memory bus, primary memory, and the
character device. It will entirely replace your memory component,
InstructionFetch unit, MemoryInterface unit, and character device. It
has seven input ports: DataMemRead, DataMemWrite, DataMemByte,
DataMemAddr, DataMemData, InsMemRead, and PC. The InsMemRead port is
an enable port for reading an instruction. In general, it should
always be set to high (1), because your machine will want to read an
instruction every cycle. The other inputs should be self-explanatory
based on their names.
How does it work? The container holds a constant register called
FullLatency, which is the latency of two memory accesses (as in a load
and an instruction fetch occurring on the same cycle). If there are
two memory accesses requested at the same time, this unit will
pretend to spend a number of cycles (defined by FullLatency)
doing the bus transaction required to access the requested memory
item(s). It does this by way of counters that are implemented
internally. When the counters have counted the appropriate number of
delay cycles, the actual memory value will be presented on the output.
On other cycles, garbage values will be presented. In cases where
there is just one memory access (as in just an instruction fetch),
only half the FullLatency number of cycles will be required to fetch
the item. Accesses to the character device require the same number of
cycles as other accesses. The unit provides five output
ports: Stall, OutData (the data item fetched), DataValid, Instruction
(the instruction fetched), and InstructionValid. The Stall line will
be asserted as long as either OutData or Instruction is not valid.
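If it helps to see the timing behavior in another form, here is a minimal Python sketch of the counter-based delay. The class and its names are purely illustrative, not the actual SMOK internals:

```python
# Illustrative model of the MemoryBus delay behavior: dual accesses cost
# FullLatency cycles; a single access costs half that.

class MemoryBusModel:
    def __init__(self, full_latency):
        self.full_latency = full_latency
        self.counter = 0          # cycles of delay remaining
        self.busy = False

    def request(self, ins_read, data_access):
        """Start a transaction; load the delay counter accordingly."""
        if ins_read and data_access:
            self.counter = self.full_latency        # two accesses this cycle
        elif ins_read or data_access:
            self.counter = self.full_latency // 2   # just one access
        else:
            self.counter = 0
        self.busy = self.counter > 0

    def tick(self):
        """Advance one cycle; True means the outputs are now valid."""
        if self.counter > 0:
            self.counter -= 1
        self.busy = self.counter > 0
        return not self.busy
```

With FullLatency set to 10, a simultaneous load and instruction fetch takes 10 cycles to complete, while a lone instruction fetch takes 5.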
How do I connect it? The inputs should be pretty self-explanatory.
As for the outputs, you only really have to worry about the OutData,
Instruction, and Stall ports. The first two should be routed to the
usual places in your machine. The Stall line should be routed to all
of your pipeline registers. Your entire pipeline should be
stalled as long as it is waiting for memory transactions to
complete. Think carefully about managing the first two stages of
the pipeline -- the IF/ID and ID/EX pipeline registers and the PC
update are probably the most complicated aspects. First, the ID/EX
should be reset when you would normally reset it (branch taken or load
hazard) AND the memory is valid. The PC and IF/ID should only be
updated when memory is valid (no stall) and there is no load hazard.
IF/ID should be reset as usual (when branches are taken).
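The update and reset rules above boil down to a little boolean logic. Here is a Python sketch of one way to express them; the signal names are made up for illustration and won't match your wires exactly:

```python
# Illustrative control logic for the PC, IF/ID, and ID/EX registers,
# following the rules in the text: ID/EX resets on (branch taken or load
# hazard) AND valid memory; PC and IF/ID update only when memory is valid
# and there is no load hazard; IF/ID resets on a taken branch as usual.

def pipeline_controls(stall, branch_taken, load_hazard):
    mem_valid = not stall
    pc_write   = mem_valid and not load_hazard
    ifid_write = mem_valid and not load_hazard
    ifid_reset = branch_taken
    idex_reset = mem_valid and (branch_taken or load_hazard)
    return pc_write, ifid_write, ifid_reset, idex_reset
```

For instance, during a memory stall nothing updates at all, even if a branch resolved taken that cycle; the flush of ID/EX waits until memory is valid again.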
Take some time to explore the component. Once you import the
component into your model (and connect it), you should see performance
degrade dramatically (especially with high latency values). At this
point you'll have a simulation of a good CPU core with a poor memory
hierarchy.
Building a Cache
We'll give you total control over your cache design and
implementation. It's easiest to use register files to implement a
cache. I use two: one to hold the data, and the other to hold the
tag/valid information. Your implementation can be different, of
course. One immediate limitation of the MemoryBus component is that
it does not simulate wide pathways to main memory, which makes it
harder to experiment with cache block sizes larger than a single word
(4 bytes). We'll leave that kind of
experimentation for later. Once you have implemented even a small
cache, you should see dramatic improvements in performance.
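To make the two-register-file idea concrete, here is a rough Python sketch of a direct-mapped cache with word-sized blocks. The class and its names are illustrative only, not a required design:

```python
# Illustrative direct-mapped cache: one array plays the role of the data
# register file, the other the tag/valid register file (None = invalid).
# Blocks are a single word (4 bytes), matching the MemoryBus limitation.

class DirectMappedCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.data = [0] * num_lines        # "data" register file
        self.tags = [None] * num_lines     # "tag/valid" register file

    def _split(self, addr):
        word = addr >> 2                   # drop the 2-bit byte offset
        return word % self.num_lines, word // self.num_lines  # index, tag

    def lookup(self, addr):
        index, tag = self._split(addr)
        if self.tags[index] == tag:
            return True, self.data[index]  # hit
        return False, None                 # miss: go out over the MemoryBus

    def fill(self, addr, word):
        index, tag = self._split(addr)
        self.tags[index] = tag
        self.data[index] = word
```

Note how two addresses that share an index but differ in tag (e.g. 0x1000 and 0x1020 in an 8-line cache) conflict: filling one evicts the other.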
Solution Path, First Part
Below are some hints to get you toward a solution. As usual, your
main strategy should be to understand the issues on paper (and in your
brain) before you start SMOKing.
- Get the MemoryBus component integrated into your model. When it
works, run some benchmarks and record the results (the number of
cycles required) for each benchmark. Save this machine (call it
mips-base.smok).
- Build an InstructionCache. This is a good place to start because
it is simple -- it doesn't have to handle writes, because we are
assuming that we'll never modify instructions while the program is
running. Call this machine mips-icache.smok, and run and record the
same benchmarks.
- Now build a DataCache. The DataCache has to deal with writes
(hits and misses), so you'll need to choose a write policy. The
simplest policy is probably write-through, write-around. Call this
machine mips-fullcache.smok, and run and record the same benchmarks.
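As a sanity check on the write-through, write-around policy, here is a tiny Python sketch. The dictionaries stand in for the cache and memory arrays; everything here is illustrative:

```python
# Illustrative write-through, write-around data cache behavior.
# `cache` and `memory` both map word addresses to word values.

def store_word(cache, memory, addr, word):
    if addr in cache:          # write hit: keep the cached copy current
        cache[addr] = word
    # write-around: on a write miss we do NOT allocate a cache line
    memory[addr] = word        # write-through: memory is always updated

def load_word(cache, memory, addr):
    if addr not in cache:      # read miss: fetch from memory and allocate
        cache[addr] = memory[addr]
    return cache[addr]
```

The payoff of this policy is its simplicity: memory is never stale, so eviction never requires a write-back.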
Issues, Details, and Pitfalls
Here is a list of issues that you may or may not encounter.
- Save all of your models. You'll need them to compare different
approaches!
- Make sure your memory latency is always realistic!
- You may have to move the latency selection logic external to the
MemoryBus component. For example, if you implement a write-through,
write-allocate cache (which fetches blocks on write-misses, and then
does the write, which will then also write memory), the latency for
this case will necessarily be higher than that of a simple memory
fetch.
- Be careful about loads and stores to/from the character device
addresses (0x40000000 and higher). These addresses should be treated
as uncacheable, because you'll get stale/incorrect data if
you do cache them.
- Reading and writing bytes will be a headache for your DataCache.
On a read miss, you'll want to be sure to bring the appropriate
word that contains the byte in from main memory. On a write
hit, you'll want to be sure to just update the byte within the word
contained in your cache. Depending on your write hit/miss policy, the
writes to the next level of memory should be done with care as well.
Getting this right is probably the hardest part of the assignment!
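The byte-within-word bookkeeping can be sketched with a couple of shift-and-mask helpers. This sketch assumes little-endian byte ordering within the word; check your own machine's convention before borrowing the idea:

```python
# Illustrative byte extraction/merging for a 32-bit cached word,
# assuming little-endian byte ordering (verify against your machine).

def extract_byte(word, addr):
    """Read the single byte at `addr` out of the word containing it."""
    shift = (addr & 0x3) * 8
    return (word >> shift) & 0xFF

def insert_byte(word, addr, byte):
    """Merge one byte into the word containing `addr`, leaving the
    other three bytes intact."""
    shift = (addr & 0x3) * 8
    mask = 0xFF << shift
    return (word & ~mask & 0xFFFFFFFF) | ((byte & 0xFF) << shift)
```

On a byte store that hits, you'd apply insert_byte to the cached word; on a byte load, extract_byte pulls the right lane out of the word you fetched or already had cached.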
The Vague Part
The assignment requires that you build an instruction and data cache
(as described above). Below is a list of other enhancements you can
make to your machine, cache, or system software to improve the
performance of the entire system. You'll be required to choose at
least one of them to improve/study the performance of the system.
They're listed below roughly in order of difficulty.
- Set associativity. We only require a direct-mapped cache. Build a
set-associative (2- or 4-way) cache and see whether and how it
improves the performance of your system.
- Write buffers. Hide the latency of writes by keeping a small
(even 4 items makes a big difference) buffer of pairs [address, value]
that describe recent stores. This buffer is drained while the
processor is off doing other work. Of course, loads need to check the
WriteBuffer before going to main memory, and the machine still has to
stall when the buffer is full.
- Other write policies. Build a write-back cache of some form (you'll
still need to deal with write-misses: write-allocate or write-around).
- Larger cache blocks. This is pretty tricky, because our memory
system only really provides one word at a time. You can do it, but
it's harder.
- Anything cache-related with my OK.
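If you pick the write buffer, its behavior can be sketched as follows. The capacity, names, and drain policy here are illustrative only:

```python
# Illustrative write buffer: stores enqueue [address, value] pairs, one
# entry drains to memory per idle bus cycle, and loads snoop the buffer
# (newest entry first) before going to memory.

class WriteBuffer:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = []                  # ordered [addr, value] pairs

    def enqueue(self, addr, value):
        """Returns False (machine must stall) when the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append([addr, value])
        return True

    def snoop(self, addr):
        """Loads check here first; the newest matching entry wins."""
        for a, v in reversed(self.entries):
            if a == addr:
                return v
        return None                        # not buffered: go to memory

    def drain_one(self, memory):
        """Retire the oldest entry while the bus is otherwise idle."""
        if self.entries:
            addr, value = self.entries.pop(0)
            memory[addr] = value
```

Snooping newest-first matters: if the same address is buffered twice, a load must see the most recent store, while draining must still retire entries in order so memory ends up correct.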
Deliverables: Monday, June 3rd
In order to keep you on track, you'll need to do the following by
this date:
- Integrate the MemoryBus component into your model.
- Tell us the number of cycles required to execute a benchmark
program (which we'll provide) with a full memory latency of 10 cycles
versus a zero-latency memory (a perfect machine).
- A short (one page) writeup telling us what improvement (from the
optional list above) you intend to implement.
Deliverables: Last Day of Class
We'll want your models, as always, and a standard README that gives us
an overview of your files. Additionally, you'll be required to turn
in a short (3 pages or less, not including graphs or diagrams) report
detailing the impact of your improvements. Think of this as a design
recommendation. You should make a recommendation (e.g., "Build a
write-back, write-around cache of such and such a size, add a write
buffer, etc...") and then provide the data to back it up.
How are we going to grade this? You'll be graded on effort, scope,
results, creativity, and your writeup.
The Files
Stay tuned. We'll be posting all sorts of goodies, including standard
benchmarks, smok.ini files, and possibly a base machine to get you
started.
Extra Credit
Still got energy? Talk to me.