Due: Dec 2 via the Catalyst dropbox

Questions about this assignment should go to the TAs on the discussion board.

Introduction

The purpose of this assignment is to further your understanding of cache coherence. In class we studied a simple 3-state (MSI) snooping cache coherency protocol. Although two bits are needed to encode the coherency state in this protocol, only three of the four possible values are used: invalid, shared, and modified. This begs for a coherency protocol that utilizes the fourth value to implement a fourth state that improves multiprocessor performance in some way.

In this project you are going to implement and evaluate a 4-state protocol. The protocol states in the 3-state protocol are (1) Modified (dirty); (2) Shared (clean); and (3) Invalid. You will be modifying this 3-state protocol by splitting the Modified state into two states: Modified, which is a private dirty state; and Exclusive, which is a private clean state.

In this assignment you will justify the usefulness of a 4-state protocol, implement the protocol as a plugin to a Pin-based simulator, and analyze its effectiveness on some benchmarks.

Justifying the Protocol

MESI Cache Diagram

Shown above is a finite state diagram for the 4-state coherence protocol proposed. Hypothesize why distinguishing a Modified state from Exclusive could provide any performance benefit. For this it is important to think about situations in which the 3-state protocol differs from the 4-state protocol. Do these situations always lead to performance improvement? Are there situations in which a 4-state protocol is detrimental? You should argue as though you are trying to convince a design team that this change is a good or a bad idea.

Justify the design in a brief paragraph.

Simulation Infrastructure Background

Your (imaginary) boss and team members were encouraged by your justification, but would like more evidence that your 4-state protocol is a good idea before investing time desiging hardware to do it. You will make a more convincing argument by simulating this new protocol and showing how it performs on some benchmarks.

Luckily someone has given you a simulator called MultiCacheSim, designed for testing multi-processor cache coherence protocols, and has already implemented a plugin for a 3-state MSI protocol. The simulation infrastructure is built using a binary instrumentation tool called Pin. For this assignment, you don't need to understand how to use Pin in detail. In short, Pin runs a specified program subject to a set of rules and modifications described in a "pintool". A pintool is able to identify certain events during a program's execution and call arbitrary code (i.e., code you write) when those events occur. MultiCacheSim intercepts calls to load and store from memory within a program and simulates executing those memory requests on a multi-core cache kept coherent by a protocol plugin.

This assignment uses Pin to dynamically instrument binaries. To ensure a consistent environment and to help with debugging, we use a virtual machine provisioned with everything you'll need pre-installed. To setup this VM, install Vagrant (note: you'll need at least v1.6; the version in Ubuntu's apt-get repository (1.4.3!!) is too old) and VirtualBox, then download and fire up the VM with:

mkdir hw5; cd hw5
vagrant init bholt/cse548pin
vagrant up

This will download a pre-built image. If that for some reason doesn't work, you can try this instead, which starts from the base Ubuntu 14.04 image and provisions it manually:

mkdir hw5; cd hw5
curl -O http://courses.cs.washington.edu/courses/csep548/15au/pinvm/Vagrantfile
vagrant up

To connect to the VM to build and run experiments, etc, you can ssh into the VM with vagrant ssh. Secondly, by default, it will share the directory you created with the VM at /vagrant, so you could, for instance, move the hw5-cache-coherence directory to /vagrant and edit the files from the host.

Note: hw5-cache-coherence directory may be named hw4-cache-coherence in the VM. This assignment was hw 4 in a previous offering, and the VM was not repackaged this time around. We apologize for this naming inconsistency, but the instructions should be clear as to which directory we mean.

Important: the first time you run Pin in the VM, it will complain about not being able to inject code into executables:

E:  The Operating System configuration prevents Pin from using the default (parent) injection mode.
E:  To resolve this, either execute the following (as root):
E:  $ echo 0 > /proc/sys/kernel/yama/ptrace_scope
E:  Or use the "-injection child" option.
E:  For more information, regarding child injection, see Injection section in the Pin User Manual.

To fix this, just run the following:

sudo su
echo 0 > /proc/sys/kernel/yama/ptrace_scope
exit

Set it up yourself

Try this at your own risk. Pin can be installed and run on most Linux distros and supposedly Windows, too (note: support for OSX is only experimental). If you choose to go this route, you can download the homework starting code, MultiCacheSim, Pin and Parsec yourself. However, we only have the reference version of MESI_SMPCache.so built for Linux, so if you cannot run it, you won't have a reference version to compare against.

Implementation

To get you started on your implementation, we have provided you with a plugin that implements the 3-state MSI protocol (MSI_SMPCache). It will benefit you to read this code carefully to understand how the simulator works. In the VM, this starter code is in ~/hw4-cache-coherence. To build the 3-state protocol plugin:

cd hw4-cache-coherence
make

To run MultiCacheSim and load this cache coherence plugin on, for example, ls, run:

pin -t $MULTICACHESIM -protos ./MSI_SMPCache.so -numcaches 2 -- /bin/ls

This command will run /bin/ls in a simulation of 2 processors with caches that are kept coherent by the protocol implemented in MSI_SMPCache.so. However, ls isn't multithreaded, so you won't see anything interesting. For a more interesting test, we'll run some Parsec benchmarks next.

Testing on Parsec

PARSEC is a suite of benchmarks designed to be representative of modern multiprocessor workloads, made up of many different kinds of applications including computer vision, physics simulations, and more. This suite has been installed in the VM for you, but we only have the small inputs (test and simdev) to save on download time, which should be fine for our purposes.

Before running a benchmark, you must build it with the command:

parsecmgmt -a build -p <benchmarkname>

To run a PARSEC benchmark with the cache simulator use the following command:

parsecmgmt -a build -p <benchmarkname>
parsecmgmt -a run -p <benchmarkname> -l -i <input> -d <workdirectory> -n 8 \
  -s "pin -t $MULTICACHESIM -protos `pwd`/MSI_SMPCache.so -numcaches 8 -- "

Some of the simpler and faster-running benchmarks are: blackscholes, ferret, and fluidanimate. It's usually best to run on the simdev input, and <workdirectory> just must be a directory writable by you. Note, this command assumes the current working directory is in the hw5-cache-coherence directory. In order to work correctly, you must specify absolute paths (hence the use of pwd). To run blackscholes, for instance:

parsecmgmt -a run -p blackscholes -i simdev -l -d $HOME/scratch -n 8 -s \
  "pin -t $MULTICACHESIM -protos `pwd`/MESI_SMPCache.so -numcaches 8 -- "

Protocol Plugins

A protocol plugin keeps caches in a MultiCacheSim simulation coherent. The provided MSI_SMPCache is an example that illustrates the essential components of a protocol plugin. A protocol plugin must implement 4 interface methods:

  1. readLine(readPC, addr)
  2. writeLine(writePC, addr)
  3. readRemoteAction(addr)
  4. writeRemoteAction(addr)

readLine() and writeLine():

The readLine and writeLine methods implement protocol actions taken in the event of a read or write. These methods must do several things:

  1. check the tags of the block to determine if the access is a miss
  2. call readRemoteAction() or writeRemoteAction(), which implements the snoop in remote caches
  3. determine the correct state in which to cache the block being accessed
  4. cache the block

readRemoteAction() and writeRemoteAction():

These two methods implement snooping of remote transactions on the bus. The methods are called by a simulated processor when it makes a memory access. In these methods the processor iterates through the other caches, updating their state as though they have snooped its memory access. writeRemoteAction is called from within writeLine() and readRemoteAction is called from within readLine().

These methods return messenger objects that provide information about coherence state. readRemoteAction returns an object with two fields: isShared and providedData. isShared should be true if the line being accessed is in shared state in a remote processor. providedData should be true if the snooping processor put its data on the bus for the accessing processor. The fields of this messenger object can be used to determine whether an access was serviced by a remote cache, or by memory.

writeRemoteAction():

Returns an object with one field, empty, which is not necessary for your simulations.

Implementing Your Protocol Plugin

When building your protocol plugin, you should work from the provided MESI_SMPCache.{h,cpp}. These do not yet implement the MESI protocol. They have simply been copied from the provided MSI_SMPCache.{h,cpp} files, with only the class renamed. This is to enable us to more easily grade the assignments, but it should also provide you with a good place to start from.

The provided MSI plugin implements most of the functionality you will need, and illustrates many helpful simulator mechanisms (how to get a cache line's state, for example). You should be able to implement your protocol by understanding the implementation of the four methods we've given you, and changing the protocol logic to implement the fourth state. Hint: this may involve adding state to RemoteReadService.

You have been provided with a reference solution at: ~/MultiCacheSim/MESI_SMPCache.so. To verify your implementation matches, you can either specify both protocol implementations by listing them (comma-separated) and compare outputs:

parsecmgmt -a run -p blackscholes -i simdev -l -d $HOME/workspace -n 8 -s \
  "pin -t $MULTICACHESIM -protos `pwd`/MESI_SMPCache.so,$HOME/MultiCacheSim/MESI_SMPCache.so -numcaches 8 -- "

Or MultiCacheSim lets you specify a reference version and logs when your implementation differs from the reference. For example, to test your MESI implementation on Blackscholes, use a command like:

parsecmgmt -a run -p blackscholes -i simdev -l -d $HOME/workspace -n 8 -s \
  "pin -t $MULTICACHESIM -protos `pwd`/MESI_SMPCache.so -numcaches 8 -reference $HOME/MultiCacheSim/MESI_SMPCache.so -printOnProtoBug -- "

Note: you should run them both at the same time (either by listing protos or the using -reference) to do comparisons. Different runs may exhibit different non-deterministic behavior, so they aren't directly comparable.

Experiments

To better understand the impact of the 4-state protocol, you will conduct a series of experiments.

When you run your simulations, you should run your 4-state protocol implementation alongside the provided 3-state protocol implementation. By doing so, you will get numbers that came from the same execution trace, and can, therefore, be compared directly. Recall that you can specify a comma-separated list of protocol plugins to MultiCacheSim to simulate multiple protocols on the same execution.

Part of your analysis will be to evaluate the relative scalability of the 4-state protocol to the 3-state protocol. In order to do this analysis, you will have to run your experiments three times--once with 2 threads, once with 4 threads, and once with 8 threads. Take note to change the thread count option that you are providing to parsecmgmt (e.g., -n 8) and the one you are providing to MultiCacheSim (e.g., -numcaches 8).

One goal of this assignment is to give you experience dealing with the heaps of semi-structured data produced by architectural simulators. It is up to you to keep your data organized. Keeping things neat will help you stay focused on the interesting parts of the experiments.

Just choose one or two benchmarks to test with, we don't want to kill you with excessive data collection. Note, the output from the cache simulator should be valid CSV, so you should be able to copy/paste the output of these commands into Excel or your favorite data analysis applications.

Analysis

After completing your experiments, you will analyze the data collected from these experiments. Use this data to evaluate the relative scalability of the 4- and 3-state protocols and to support or refute your hypothesis about their performance.

The first step to your analysis should be to identify the simulation outputs that vary between the 3-state and 4-state protocols. You should try to be as inclusive as possible during this step-don't limit yourself to analyzing only the outputs that you expected to vary. It will probably be helpful during this step to focus on a subset of your data. For example, to try to decide what the variables of interest are, you might want to look at only one or two benchmarks at first. You also probably want to restrict your initial attention to a single thread-count. Plotting the data you've collected will help you to see trends in your data. Once you've decided which variables are of interest, figure out why you are seeing the difference you are seeing.

Next, you should look at how the differences in these variables of interest change as you vary the number of threads in the simulation. For example, does the 4-state protocol lead to a bigger difference in some output value when there are more threads? What conclusions can you draw regarding the performance scalability of these two protocols from the data you've collected?

Write about a paragraph describing the results you found. Include a table or plot with some of your results to help explain the difference you observe. Make sure you say what PARSEC benchmark(s) you ran. Remember, we're not concerned about you doing a full survey of benchmarks, just enough to show us you found something interesting.

Turn-in

Code: Tar and gzip your hw5-cache-coherence directory. If you've modified the makefile at all, make sure your MESI implementation can still be built by just running make.

Writeup: Your writeup should consist of a paragraph explaining the justification/hypothesis, and about a paragraph explaining your findings from the experiments you ran.

Submit both of these to the Catalyst dropbox.

Please ensure that you remove (or comment out) any debug printouts that you have added before submitting.


This assignment is based on the Cache Coherence homeworks from CSE 471. Thank you to Brandon Lucia for the Pin simulator (MultiCacheSim) and the assignment spec.