Lab 4: Caches and Cache Performance

Assigned Friday, May 11
Due Date Monday May 21
Files lab4.tar
Submissions Submit your files via the Canvas assignments page (go to the Labs section, not the Homeworks section). Be sure to follow the detailed instructions at the bottom of this page.

Learning Objectives and Overview

In the first part of this lab, Chip D. Signer, Ph.D., is trying to reverse engineer a competitor's microprocessors to discover their cache geometries and has recruited you to help. Instead of running programs on these processors and inferring the cache layout from timing results, you will approximate his work by using a simulator.

The second part of this lab involves trying to design a Matrix Transpose function that has as few cache misses as possible.

Code for this lab

Running tar xvf lab4.tar from the terminal will extract the lab files to a directory called lab4 with the following files:

Instructions for Part I

Inferring Mystery Cache Geometries  [27 points]

This lab should be done on a 64-bit machine. Use the CSE VM, attu, or your own personal 64-bit computer. You will need the C standard libraries on the machine you use. Otherwise, all the files you need have been provided.

Each of the “processors” is provided as an object file (.o file) against which you will link your code. The file mystery-cache.h defines the function interface that these object files export. It includes a typedef for a type addr_t (an unsigned 8-byte integer) which is what these (pretend) caches use for “addresses”, or you can use any convenient integer type.

typedef unsigned long long addr_t;
typedef unsigned char bool_t;
#define TRUE 1
#define FALSE 0

/** Lookup an address in the cache. Returns TRUE if the access hits,
    FALSE if it misses. */
bool_t access_cache(addr_t address);

/** Clears all words in the cache (and the victim buffer, if
    present). Useful for helping you reason about the cache
    transitions, by starting from a known state. */
void flush_cache(void);

Your job is to fill in the function stubs in cache-test-skel.c which, when linked with one of these cache object files, will determine and then output the cache size, associativity, and block size. You will use the functions above to perform cache accesses and use your observations of which ones hit and miss to determine the parameters of the caches.

Some of the provided object files are named with this information (e.g. cache_65536c_2e_16k.o is a 65536 Byte capacity, 2-way set-associative cache with 16 Byte blocks) to help you check your work. There are also 4 mystery cache object files, whose parameters you must discover on your own.

You can assume that the mystery caches have sizes that are powers of 2 and use a least-recently-used replacement policy. You cannot assume anything else about the cache parameters except what you can infer from the cache size. Finally, the mystery caches are all pretty realistic in their geometries, so use this fact to sanity check your results.

The provided Makefile includes a target cache-test. To use it, set TEST_CACHE to the object file to link against on the command line. That is, from within the lab4 directory run the command:

make cache-test TEST_CACHE=cache_65536c_2e_16k.o

This will create an executable cache-test that will run your cache-inference code against the supplied cache object. Run this executable like so:

./cache-test

and it will print the results to the screen.

You may find the testCache.sh script useful. It makes and runs cache-test with all the object files. You can run this shell script like so:

bash testCache.sh

You are not rewquired to use testCache.sh.

Your Tasks

Complete the three functions in cache-test-skel.c that have /* YOUR CODE GOES HERE */ comments in them.

Additionally, determine the geometry of each of the four mystery caches and list these in a comment, along with your name, at the top of your modified cache-test-skel.c.

Note: You should NOT be calling any functions other than flush_cache and access_cache inside of your functions. For example, you may not call get_block_size() say inside of get_cache_assoc.

Instructions for Part II

Optimizing Matrix Transpose  [15 points]

In Part II, you will write a transpose function in trans.c that causes as few cache misses as possible.

Let A denote a matrix, and Aij denote the component in the ith row and jth column. The transpose of A, denotated AT is a matrix such that Aij = ATji

To help you get started, we have given you an example transpose function in trans.c that computes the transpose of M x N matrix A and stores the results in N x M matrix B:

char trans_desc[] = "Simple row-wise scan transpose";
void trans(int M, int N, int A[M][N], int B[M][N])

The example transpose function is correct, but it is inefficient because the access pattern results in relatively many cache misses.

Your job in Part II is to write a similar function, called transpose_submit, that minimizes the number of cache misses across different sized matrices on a simulated cache with parameters s=5, E=1, b=5:

char trans_desc[] = "Transpose submission";
void transpose_submit(int M, int N, int A[M][N], int B[M][N])

Do not change the description string "Transpose submission" for your transpose_submit function. The autograder searches for this string to determine which transpose function to evaluate for credit.

Trace Files

The Linux program valgrind can generate traces of memory accesses of programs. For example, typing

linux> valgrind --log-fd=1 --tool=lackey -v --trace-mem=yes ls -l

on the command line runs ls -l, captures the memory accesses in order, and prints them to stdout.

valgrind memory accesses have the following form:

I 0400d7d4,8
M 0421c7f0,4
L 04f6b868,8
S 7ff0005c8,8

Each line denotes one or two memory accesses. The format of each line is:

[space]operation address,size

The operation field denotes the type of memory access: "I" denotes an instruction load, "L" a data load, "S" a data store and "M" a data modify (i.e., a data load followed by a data store). There is never a space before each "I". There is always a space before each "M", "L", and "S". The address field specifies a 64-bit hexadecimal memory address. The size field specifies the number of bytes accessed by the operation.

Tools in this lab use valgrind traces of your code running on a cache simulator to evaluate the efficieny of your transpsoe function. See the Hints section for more information.

Programming Rules for Part II

Performance

For Part II, we will evaluate the correctness and performance of your transpose_submit function on 3 different-sized input matrices:

For each matrix size, the performance of your transpose_submit function is evaluated by using valgrind to extract the address trace for your function, and then using the reference simulator to replay this trace on a cache with parameters s = 5, E = 1, b=5 .

Your performance score for each matrix size scales linearly with the number of misses, m, up to some threshold:

Your code must be correct to receive any performance points for a particular size. Your code only needs to be correct for these three cases and you can optimize it specifically for these three cases. In particular, it is perfectly okay for your function to explicitly check for the input sizes and implement separate code optimized for each case.

We have provided you with a driver program, called ./driver.py, that performs a complete evaluation of your transpose code. The driver uses test-trans to evaluate your submitted transpose function on the three matrix sizes. Then it prints a summary of your results and the points you have earned. To run the driver, run this command at the terminal:

python driver.py

Extra Credit

Extra credit will be awarded for code that results in significantly fewer misses than the full score "baselines."

For reference, there exist solutions with m as low as 300 (32 x 32), 1200 (64 x 64), and 1920 (67 x 61).

Tools

We have provided you with an autograding program, called test-trans.c that tests the correctness and performance of each of the transpose functions that you have registered with the autograder.

You can register up to 100 versions of the transpose function in your trans.c file. Eache transpose version has the following form:

/* Header comment */
char trans_simple_desc[] = "A simple transpose";
void trans_simple(int M, int N, int A[N][M], int B[M][N])
{
    /* your transpose code here */
}

Register a particular transpose function witht he autograder by making a call of the form:

registerTransFunction(trans_simple, trans_simple_desc);

in the registerFunctions routine in trans.c. At runtime, the autograder will evaluate each registered transpose function and print the results. Of course, one of the registered functions must be the transpose_submit function that you are submitting for credit:

registerTransFunction(transpose_submit, transpose_submit_desc);

See the default trans.c function for an example of how this works.

The autograder takes the matrix size as input. It uses valgrind to generate a trace of each registered transpose function. It then evaluates each trace by running a chache simulator called csim-ref on a cache with parameters s=5, E=1, b=5, though the csim-ref uses the equlivent notation from the book s=5, E=1, b=5.

For example, to test your registered transpose functions on a 32 x 32 matrix, rebuild test-trans, and then run it with the appropriate values for M and N:

linux> make
linux> ./test-trans -M 32 -N 32
Step 1: Evaluating registered transpose funcs for correctness:
func 0 (Transpose submission): correctness: 1
8
func 1 (Simple row-wise scan transpose): correctness: 1
func 2 (column-wise scan transpose): correctness: 1
func 3 (using a zig-zag access pattern): correctness: 1
Step 2: Generating memory traces for registered transpose funcs.
Step 3: Evaluating performance of registered transpose funcs (s=5, E=1, b=5)
func 0 (Transpose submission): hits:1766, misses:287, evictions:255
func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151
func 2 (column-wise scan transpose): hits:870, misses:1183, evictions:1151
func 3 (using a zig-zag access pattern): hits:1076, misses:977, evictions:945
Summary for official submission (func 0): correctness=1 misses=287

In this example, we have registered four different transpose functions in trans.c. The test-trans program tests each of the registered functions, displays the results for each, and extracts the results for the official submission.

Hints

Lab 4 Reflection

One use of knowing/looking at assembly is understanding what's actually going on, since some “magic” happens between your source code and executable.

Go back to lab0.c and revert the nested for loops in part4() to the “ijk” ordering. Compile lab0.c with the following two optimization levels and compare the dissassembly for part4():

$ gcc -g -std=c99 -o lab0 lab0.c
$ gcc -g -O2 -std=c99 -o lab0opt lab0.c
    1. For the unoptimized code, how many data memory accesses are performed by the execution of these loops? Do not include instruction fetches.
    2. Repeat the previous question but for the optimized code.
    [1 pt]
  1. Provide at least one reason for the reduction in memory accesses via compiler optimization in the previous question.  [1 pt]
  2. Name two different ways that you have been taught to make “faster” code, one from this class and one from another class you have taken. Include an example of each method.  [1 pt]
  3. One way to measure performance is to literally use a stopwatch that you start when a program-execution starts and end when a program-execution ends. Why could this produce inconsistent results? That is, why could running the same program with the same inputs on the same computer take different total amounts of elapsed time?  [1 pt]

Submission Instructions

Submit the following files:

  1. cache-test-skel.c
  2. trans.c
  3. lab4reflect.txt

Make sure that you have included the geometry of the four mystery caches in a comment in cache-test-skel.c and your final transpose code is in a function named transpose_submit.

Once you're satisfied with your solutions, submit them via Canvas (link is at the top of this page).