Lab 4: Dual Core Processing

Assigned: May 14th
Due: 6:00pm, May 30th
Checkpoint Date: BEFORE 6pm, May 22nd

WARNING: This lab is significantly less structured than previous labs. You have much more freedom to design it the way you want, but that freedom comes at the cost of comprehensive test fixtures. You will have to find your own ways of testing the system within the provided framework, so start early, as this will take a lot of time. Also, read the entire lab before you start. This is not a trivial task, and you should keep an eye on what you will have to do to finish the lab while you design your solution.

Description

At this point, you have constructed a fully functional 5-stage pipelined processor. We will now extend this design further by establishing a way for two of these processors to operate together on the same board. In order to do this, we will need a more extensive memory system which allows the two processors to keep memory synchronized between them. This process will involve modifying your existing processor to deal with stall signals from provided instruction and data caches, then modifying the memory system to keep the data caches in the two processors coherent.

In order to keep things simple, you'll be implementing a system where coherency between the caches is maintained by ensuring that a particular block of memory exists in only one of the caches at a time. To do this, you'll have to modify the caches to allow the invalidation of lines and create a coherency engine to manage which cache gets control of a given block of memory at any point in time. This will involve coming up with your own protocol between the data caches and the coherency engine so that no block of memory will exist in both caches at the same time. You will not have to worry about the instruction caches because the board disallows writes to the instruction memory after the initial boot sequence. Finally, you will modify the processors to add a new instruction (not standard to the MIPS architecture) that will let you lock memory for one processor at a time.

By the end of this lab, you will have a fully functional dual core processor running on the FPGAs along with a program that takes advantage of the two processors.

Phase 0: Set up

To start off this lab, you'll need to add a new design to your existing workspace.

Phase 1: Working with the caches

The first phase of this lab will involve modifying your existing processor to work with the caches. We've already designed the caches for you, and all that you have to do is modify your processor to properly utilize the stall signals coming from them. Whenever you receive a stall signal, you need to freeze the entire processor until the load/store has been completed. This applies to both instruction and data memory accesses. You will probably want to begin by adding a Stall port to the processor to get the signal from the top level board diagram.

One thing that you need to be careful of is that subtle things like the clear signals from the hazard unit can change the state of the processor while it is stalled. You will need to find a way to make sure that the processor only changes its state when it is not stalled, even if you have to add logic to do so. Once you have properly established the stall signals on the board, run some tests on it using the included test bench. If everything seems to work as expected, move on to phase 2 and test your board on the hardware to make sure that the single core design works with the caches and memory.
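As a rough illustration of the stall rule above, here is a small behavioural sketch in Python (not HDL, and not part of the provided framework). The class name PipelineRegister and its clock method are hypothetical; the only point is that a stall must override both the normal update and any clear coming from the hazard unit.

```python
class PipelineRegister:
    """Behavioural model of one pipeline register (hypothetical name)."""

    def __init__(self, reset_value=0):
        self.reset_value = reset_value
        self.value = reset_value

    def clock(self, next_value, clear=False, stall=False):
        """Model one rising clock edge and return the stored value."""
        # While stalled, hold the current state: ignore the new value
        # AND the clear request, so a flush cannot slip through.
        if stall:
            return self.value
        # Not stalled: a clear (e.g. a hazard-unit flush) wins over the
        # normal update.
        self.value = self.reset_value if clear else next_value
        return self.value
```

In a real design the same idea usually shows up as gating every register enable and every clear with the inverted stall signal; how you do that in your own processor is up to you.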

Phase 2: Hardware Testing - Round 1

Now that you have your single core processor working in simulation with the caches, you will want to make sure that this design still works on the boards. Synthesize and implement your boards using board.bde as the top level design and the provided bios.ucf as the UCF file. All other settings should be preserved from previous labs. Test your processor on the boards with previous programs (or new ones) and make sure that they still function as expected. If everything still works properly, you are ready to move on to adding the second core to the design.

Phase 3: Adding the second processor

To start on the dual core design, you will have to place a second copy of your CPU on the board. This CPU will require its own IOWedge, data cache, and instruction cache. Wire these up appropriately on the board. Access to the IO devices and instruction memory is provided through the IOArbiter and the existing MemoryArbiter on the board. You should be able to just connect your new instruction cache and IOWedge to these devices and that part of the system will be taken care of.

These arbiters work by forwarding requests from the processors to the IO Devices and Instruction Memory and resolving conflicts by stalling one processor while the other request completes. If a conflict occurs, the processor with priority is granted access to the device in question while the other processor is stalled. Priority between the two processors is swapped after each conflict to ensure that one processor does not manage to keep control of the instruction memory/IO devices and exclude the other processor entirely.
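The arbitration policy described above can be modelled in a few lines of Python. This is only a sketch of the described behaviour, not the provided IOArbiter or MemoryArbiter: the class name AlternatingArbiter, the arbitrate method, and the choice that processor 0 holds priority at reset are all assumptions made for illustration.

```python
class AlternatingArbiter:
    """Two-requester arbiter that swaps priority after each conflict."""

    def __init__(self):
        # Assumption: processor 0 has priority at reset.
        self.priority = 0

    def arbitrate(self, req0, req1):
        """Return (grant, stall0, stall1); grant is 0, 1, or None."""
        if req0 and req1:
            # Conflict: the processor with priority wins, then
            # priority is handed to the other processor.
            winner = self.priority
            self.priority = 1 - winner
        elif req0:
            winner = 0
        elif req1:
            winner = 1
        else:
            return None, False, False
        # Only a processor that requested and lost is stalled.
        stall0 = bool(req0) and winner == 1
        stall1 = bool(req1) and winner == 0
        return winner, stall0, stall1
```

For example, two back-to-back conflicts grant processor 0 and then processor 1, so neither core can be locked out indefinitely.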

You'll notice that there is no existing arbiter for the data cache. In order to maintain coherency, you'll need to construct a coherency engine to make sure that the two data caches stay coherent. This will be explained in the next phase.

Phase 4: Keeping things coherent

In order to make sure that your two cores play nicely, you will be creating a coherency engine. The job of the coherency engine is to ensure that only one cache holds the data for a given address at a time. This means that when one cache loads a line from memory, the other cache will need to check whether it has that line. If it does, it will need to invalidate the line (writing it back to memory first if it is dirty). The coherency engine will also need to ensure that one CPU isn't constantly getting priority access over the other CPU. In order to accomplish these tasks, you will need to devise a coherency protocol between the caches and the coherency engine.
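One way to pin down the invariant before drawing state diagrams is a small software model. The Python sketch below is only a behavioural reference under simplifying assumptions: caches are unbounded dictionaries (no evictions), every access completes instantly, and fairness between the CPUs is not modelled. The names Line and ExclusiveCoherencyModel are hypothetical and do not correspond to anything in the provided files.

```python
class Line:
    """One cached block with a dirty bit."""

    def __init__(self, data):
        self.data = data
        self.dirty = False


class ExclusiveCoherencyModel:
    """A block may live in at most one of the two caches at a time."""

    def __init__(self, memory):
        self.memory = memory      # block address -> data
        self.caches = [{}, {}]    # per core: block address -> Line

    def _acquire(self, core, addr):
        other = self.caches[1 - core]
        if addr in other:
            # Invalidate the other cache's copy, writing it back
            # first if it was modified.
            line = other.pop(addr)
            if line.dirty:
                self.memory[addr] = line.data
        mine = self.caches[core]
        if addr not in mine:
            # Miss: fetch the (now up-to-date) block from memory.
            mine[addr] = Line(self.memory[addr])
        return mine[addr]

    def read(self, core, addr):
        return self._acquire(core, addr).data

    def write(self, core, addr, data):
        line = self._acquire(core, addr)
        line.data = data
        line.dirty = True
```

A model like this can double as a golden reference: replay the same sequence of reads and writes against your simulation and compare the values each core observes.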

The coherency protocol is basically how the caches will talk to the coherency engine and vice versa, so that the caches know when to wait, invalidate, read data, etc. There are a couple of different ways to design this protocol, and it is up to you to decide. We recommend that you sketch/write out your protocol on paper with state diagrams first to make sure you aren't missing anything crucial. There will be a checkpoint about a week into Lab 4 to ensure that everyone has started. We will be interviewing each group to ensure that you have put some effort into the lab. We don't expect a working coherency engine, just that you have a protocol thought out and have started implementing it in software.

There are two parts you will be editing/creating in this stage: the coherency engine and the data caches. You should not need to modify any of the other components of the lab for this section. In order to test your coherency engine/cache, we hook up a single core processor and test it with the bios to ensure that it works. Eventually you will write a program that will test both cores. If you have a partner, one of you can write that program while the other builds the coherency engine.

Phase 5: A new instruction

We will be hijacking the Load Linked instruction (LL) and replacing it with our own instruction: Load and Store. This new instruction will take the contents of the RT register and store it into the memory location specified by the RS register + offset, but at the same time it will load the contents of the memory location specified by the RS register + offset into the RT register. Basically it is a swap between the register and memory.
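The intended effect of Load and Store can be written out as a short Python sketch. The function name load_and_store is hypothetical, the register file is modelled as a list, and memory is modelled as a dictionary keyed by address; alignment and cache behaviour are ignored.

```python
def load_and_store(regs, memory, rs, rt, offset):
    """Swap register rt with the memory word at regs[rs] + offset."""
    addr = regs[rs] + offset
    old = memory[addr]        # value to be loaded into RT
    memory[addr] = regs[rt]   # store the old RT contents
    regs[rt] = old            # RT now holds the previous memory value
```

Note that the old memory value must be captured before the store happens; in hardware the read and the write also have to appear as a single operation to the other core.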

You will need to update your processor to accommodate this new instruction by modifying your controller. Ideally, if you have implemented everything else correctly, that will be all you need to change for this instruction. Do not pick a random instruction to be used for the Load and Store opcode. Refer to the MIPS Volume 2 Instruction Set Guide for the LL instruction opcode and use that. When writing your program to use this instruction in assembly, use the LL syntax (do not try to do something like LAS).

UPDATE: Due to compiler issues, we have changed the instruction that will be "hijacked" for Load and Store. INSTEAD OF LL, USE LWL, which is opcode 34 (decimal; 0x22 in hex).
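To check what your controller should see for this instruction, the following Python sketch packs the fields using the standard MIPS I-type layout (opcode in bits 31:26, rs in 25:21, rt in 20:16, 16-bit immediate in 15:0). The function name encode_load_and_store is hypothetical; the assembler remains the authority on the actual encoding.

```python
LWL_OPCODE = 34  # decimal; 0x22, binary 100010


def encode_load_and_store(rs, rt, offset):
    """Return the 32-bit instruction word for the hijacked LWL."""
    if not (0 <= rs < 32 and 0 <= rt < 32):
        raise ValueError("register numbers must be in 0..31")
    if not (-32768 <= offset <= 32767):
        raise ValueError("offset must fit in a signed 16-bit field")
    # I-type: opcode | rs | rt | immediate (two's complement, 16 bits)
    return (LWL_OPCODE << 26) | (rs << 21) | (rt << 16) | (offset & 0xFFFF)
```

Comparing a few hand-encoded words against the assembler's output is a quick way to confirm that the controller is decoding the opcode you expect.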

Phase 6: Programming

Now that you have a coherency engine and your new instruction, you are ready to write a test program that will use both cores. In order for your CPUs to run different sections of code, you will need to cause them to branch to different parts. This can be accomplished by using our new instruction and a BEQ to do a "load and store and then branch if equals old value." You will probably want to start with a simple program, such as one that shares a counter, in order to actually test your coherency engine.
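The "load and store, then branch on the old value" idea can be sketched in Python before writing the assembly. This is a sequential model only: it assumes the hardware makes each swap appear atomic to the other core, and the names swap, pick_role, and locked_increment, as well as the flag/lock/counter addresses, are hypothetical.

```python
def swap(memory, addr, value):
    """Model of Load and Store: write value, return the old contents."""
    old = memory[addr]
    memory[addr] = value
    return old


def pick_role(memory, flag_addr):
    """Both cores run this; only the first one sees the old value 0."""
    old = swap(memory, flag_addr, 1)
    # In assembly this comparison would be the BEQ on the old value.
    return "first" if old == 0 else "second"


def locked_increment(memory, lock_addr, counter_addr):
    """Increment a shared counter while holding a swap-based lock."""
    while swap(memory, lock_addr, 1) != 0:
        pass  # lock was already held: spin and try again
    memory[counter_addr] += 1
    memory[lock_addr] = 0  # release the lock
```

On the real system, a shared counter that ends up with the wrong total after both cores increment it is a strong sign that either the swap or the coherency protocol is letting both caches hold the block at once.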

Once you have a working dual core processor, write a program that will take advantage of the dual cores. The program needs to actually run faster on dual cores than it would on a single core (at least in theory). Once done, show the working program to a TA for a checkoff and submit your processor through turnin.