Lab 4: Caches and Cache Performance

Assigned	Friday, May 11
Due Date	Monday May 21
Files	lab4.tar
Submissions	Submit your files via the Canvas assignments page (go to the Labs section, not the Homeworks section). Be sure to follow the detailed instructions at the bottom of this page.

Learning Objectives and Overview

Gain basic familiarity with cache geometries and how different associativities and line sizes present themselves.
Evaluate a claim made in the lecture slides that two seemingly equivalent ways of writing a program can have vastly different performance.
Gain experience in designing cache-friendly code.

In the first part of this lab, Chip D. Signer, Ph.D., is trying to reverse engineer a competitor's microprocessors to discover their cache geometries and has recruited you to help. Instead of running programs on these processors and inferring the cache layout from timing results, you will approximate his work by using a simulator.

The second part of this lab involves trying to design a Matrix Transpose function that has as few cache misses as possible.

Code for this lab

Browser (click to download): lab4.tar
Terminal: wget https://courses.cs.washington.edu/courses/cse351/18sp/labs/lab4.tar

Running tar xvf lab4.tar from the terminal will extract the lab files to a directory called lab4 with the following files:

cache_*.o - [part I] "Caches" with known parameters for testing your code
mystery*.o - [part I] "Caches" with unknown parameters that you will need to discover
mystery-cache.h - [part I] Defines the function interface that the object files export
cache-test-skel.c - [part I] Skeleton code for determining cache parameters
Makefile - [part I & II] For compiling and linking
testCache.sh - [part I] Optional script that uses the Makefile to run cache-test for all the different caches
trans.c - [part II] Place for your transpose functions
test-trans.c - [part II] code to test efficiency of your functions in trans.c
driver.py - [part II] Script to run all part II tests and calculate score
tracegen.c - [part II] used by test-trans.c to generate memory traces
csim-ref - [part II] Cache Simulator used by test-trans
cachelab.* - [part II] Helper code used by other programs in part II
lab4reflect.txt - For your Reflection responses

Instructions for Part I

Inferring Mystery Cache Geometries [27 points]

This lab should be done on a 64-bit machine. Use the CSE VM, attu, or your own personal 64-bit computer. You will need the C standard libraries on the machine you use. Otherwise, all the files you need have been provided.

Each of the “processors” is provided as an object file (.o file) against which you will link your code. The file mystery-cache.h defines the function interface that these object files export. It includes a typedef for a type addr_t (an unsigned 8-byte integer) which is what these (pretend) caches use for “addresses”, or you can use any convenient integer type.

typedef unsigned long long addr_t;
typedef unsigned char bool_t;
#define TRUE 1
#define FALSE 0

/** Lookup an address in the cache. Returns TRUE if the access hits,
    FALSE if it misses. */
bool_t access_cache(addr_t address);

/** Clears all words in the cache (and the victim buffer, if
    present). Useful for helping you reason about the cache
    transitions, by starting from a known state. */
void flush_cache(void);

Your job is to fill in the function stubs in cache-test-skel.c which, when linked with one of these cache object files, will determine and then output the cache size, associativity, and block size. You will use the functions above to perform cache accesses and use your observations of which ones hit and miss to determine the parameters of the caches.

Some of the provided object files are named with this information (e.g. cache_65536c_2e_16k.o is a 65536 Byte capacity, 2-way set-associative cache with 16 Byte blocks) to help you check your work. There are also 4 mystery cache object files, whose parameters you must discover on your own.

You can assume that the mystery caches have sizes that are powers of 2 and use a least-recently-used replacement policy. You cannot assume anything else about the cache parameters except what you can infer from the cache size. Finally, the mystery caches are all pretty realistic in their geometries, so use this fact to sanity check your results.

The provided Makefile includes a target cache-test. To use it, set TEST_CACHE to the object file to link against on the command line. That is, from within the lab4 directory run the command:

make cache-test TEST_CACHE=cache_65536c_2e_16k.o

This will create an executable cache-test that will run your cache-inference code against the supplied cache object. Run this executable like so:

./cache-test

and it will print the results to the screen.

You may find the testCache.sh script useful. It makes and runs cache-test with all the object files. You can run this shell script like so:

bash testCache.sh

You are not rewquired to use testCache.sh.

Your Tasks

Complete the three functions in cache-test-skel.c that have /* YOUR CODE GOES HERE */ comments in them.

Additionally, determine the geometry of each of the four mystery caches and list these in a comment, along with your name, at the top of your modified cache-test-skel.c.

Note: You should NOT be calling any functions other than flush_cache and access_cache inside of your functions. For example, you may not call get_block_size() say inside of get_cache_assoc.

Instructions for Part II

Optimizing Matrix Transpose [15 points]

In Part II, you will write a transpose function in trans.c that causes as few cache misses as possible.

Let A denote a matrix, and A_ij denote the component in the ith row and jth column. The transpose of A, denotated A^T is a matrix such that A_ij = A^T_ji

To help you get started, we have given you an example transpose function in trans.c that computes the transpose of M x N matrix A and stores the results in N x M matrix B:

char trans_desc[] = "Simple row-wise scan transpose";
void trans(int M, int N, int A[M][N], int B[M][N])

The example transpose function is correct, but it is inefficient because the access pattern results in relatively many cache misses.

Your job in Part II is to write a similar function, called transpose_submit, that minimizes the number of cache misses across different sized matrices on a simulated cache with parameters s=5, E=1, b=5:

char trans_desc[] = "Transpose submission";
void transpose_submit(int M, int N, int A[M][N], int B[M][N])

Do not change the description string "Transpose submission" for your transpose_submit function. The autograder searches for this string to determine which transpose function to evaluate for credit.

Trace Files

The Linux program valgrind can generate traces of memory accesses of programs. For example, typing

linux> valgrind --log-fd=1 --tool=lackey -v --trace-mem=yes ls -l

on the command line runs ls -l, captures the memory accesses in order, and prints them to stdout.

valgrind memory accesses have the following form:

I 0400d7d4,8
M 0421c7f0,4
L 04f6b868,8
S 7ff0005c8,8

Each line denotes one or two memory accesses. The format of each line is:

[space]operation address,size

The operation field denotes the type of memory access: "I" denotes an instruction load, "L" a data load, "S" a data store and "M" a data modify (i.e., a data load followed by a data store). There is never a space before each "I". There is always a space before each "M", "L", and "S". The address field specifies a 64-bit hexadecimal memory address. The size field specifies the number of bytes accessed by the operation.

Tools in this lab use valgrind traces of your code running on a cache simulator to evaluate the efficieny of your transpsoe function. See the Hints section for more information.

Programming Rules for Part II

Include your name and NetID in the header comment for trans.c
Your code in trans.c must compile without warnings to receive credit.
You are allowed to define at most 12 local variables of type int per transpose function. The reason for this restriction is that our testing code is not able to count references to the stack. We want you to limit your references to the stack and focus on the access patterns of the source and destination arrays.
Arrays DO count towards your 12 local ints. For example, if you had 4 ints, you could declare an array of ints up to size 8.
You are not allowed to side-step the previous rule by using any variables of type long or by using any bit tricks to store more than one value to a single variable.
Your transpose function may not use recursion.
If you choose to use helper functions, you may not have more than 12 local variables on the stack at a time between your helper functions and your top level transpose function. For example, if your transpose declares 8 variables, and then you call a function which uses 4 variables, which calls another function which uses 2, you will have 14 variables on the stack, and you will be in violation of the rule.
Your transpose function may not modify array A. You may, however, do whatever you want with the contents of array B.
You are NOT allowed to use any variant of malloc.

Performance

For Part II, we will evaluate the correctness and performance of your transpose_submit function on 3 different-sized input matrices:

32 x 32 (M=32, N=32)
64 x 64 (M=64, N=64)
67 x 61 (M=67, N=61)

For each matrix size, the performance of your transpose_submit function is evaluated by using valgrind to extract the address trace for your function, and then using the reference simulator to replay this trace on a cache with parameters s = 5, E = 1, b=5 .

Your performance score for each matrix size scales linearly with the number of misses, m, up to some threshold:

32 x 32: 5 points if m ≤ 330, 0 points if m > 600
64 x 64: 5 points if m ≤ 1,800, 0 points if m > 2,400
67 x 61: 5 points if m ≤ 2,100, 0 points if m > 3,000

Your code must be correct to receive any performance points for a particular size. Your code only needs to be correct for these three cases and you can optimize it specifically for these three cases. In particular, it is perfectly okay for your function to explicitly check for the input sizes and implement separate code optimized for each case.

We have provided you with a driver program, called ./driver.py, that performs a complete evaluation of your transpose code. The driver uses test-trans to evaluate your submitted transpose function on the three matrix sizes. Then it prints a summary of your results and the points you have earned. To run the driver, run this command at the terminal:

python driver.py

Extra Credit

Extra credit will be awarded for code that results in significantly fewer misses than the full score "baselines."

For reference, there exist solutions with m as low as 300 (32 x 32), 1200 (64 x 64), and 1920 (67 x 61).

Tools

We have provided you with an autograding program, called test-trans.c that tests the correctness and performance of each of the transpose functions that you have registered with the autograder.

You can register up to 100 versions of the transpose function in your trans.c file. Eache transpose version has the following form:

/* Header comment */
char trans_simple_desc[] = "A simple transpose";
void trans_simple(int M, int N, int A[N][M], int B[M][N])
{
    /* your transpose code here */
}

registerTransFunction(trans_simple, trans_simple_desc);

in the registerFunctions routine in trans.c. At runtime, the autograder will evaluate each registered transpose function and print the results. Of course, one of the registered functions must be the transpose_submit function that you are submitting for credit:

registerTransFunction(transpose_submit, transpose_submit_desc);

See the default trans.c function for an example of how this works.

The autograder takes the matrix size as input. It uses valgrind to generate a trace of each registered transpose function. It then evaluates each trace by running a chache simulator called csim-ref on a cache with parameters s=5, E=1, b=5, though the csim-ref uses the equlivent notation from the book s=5, E=1, b=5.

For example, to test your registered transpose functions on a 32 x 32 matrix, rebuild test-trans, and then run it with the appropriate values for M and N:

linux> make
linux> ./test-trans -M 32 -N 32
Step 1: Evaluating registered transpose funcs for correctness:
func 0 (Transpose submission): correctness: 1
8
func 1 (Simple row-wise scan transpose): correctness: 1
func 2 (column-wise scan transpose): correctness: 1
func 3 (using a zig-zag access pattern): correctness: 1
Step 2: Generating memory traces for registered transpose funcs.
Step 3: Evaluating performance of registered transpose funcs (s=5, E=1, b=5)
func 0 (Transpose submission): hits:1766, misses:287, evictions:255
func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151
func 2 (column-wise scan transpose): hits:870, misses:1183, evictions:1151
func 3 (using a zig-zag access pattern): hits:1076, misses:977, evictions:945
Summary for official submission (func 0): correctness=1 misses=287

In this example, we have registered four different transpose functions in trans.c. The test-trans program tests each of the registered functions, displays the results for each, and extracts the results for the official submission.

Hints

Blocking is a useful technique for reducing cache misses. See http://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf and this intel post for more information, as well as materials from section.
The test-trans program saves the trace for the function i in the file trace.fi. These trace files are invaluable debugging tools that can help you understand exactly where the hits and misses for each transpose function are coming from. To debug a particular function, simply run its trace through the reference simulator with the verbose option:
```
linux> ./csim-ref -v -s 5 -E 1 -b 5 -t trace.f0
S 68312c,1 miss
L 683140,8 miss
L 683124,4 hit
L 683120,4 hit
L 603124,4 miss eviction
S 6431a0,4 miss
...
```
If you find mention online of a recursive “cache-oblivious algorithm” for matrix transpose, it is not a good choice for this assignment because recursion is not allowed in your solution. (It is a great idea to play with on your own.)
Since your transpose function is being evaluated on a direct-mapped cache, conflict misses are a potential problem. Think about the potential for conflict misses in your code, especially along the diagonal. Try to think of access patterns that will decrease the number of these conflict misses.
Only memory accesses to the heap are monitored. This may be helpful.... Just be careful not to break any programming rules!
Your code only needs to work on the three array sizes given. While in computer science we often try to avoid writing non-generalized code, in this case it may be quite helpful to optimize to specific cases to reach the desired level of performance.

Lab 4 Reflection

One use of knowing/looking at assembly is understanding what's actually going on, since some “magic” happens between your source code and executable.

Go back to lab0.c and revert the nested for loops in part4() to the “ijk” ordering. Compile lab0.c with the following two optimization levels and compare the dissassembly for part4():

$ gcc -g -std=c99 -o lab0 lab0.c
$ gcc -g -O2 -std=c99 -o lab0opt lab0.c

1. For the unoptimized code, how many data memory accesses are performed by the execution of these loops? Do not include instruction fetches.
2. Repeat the previous question but for the optimized code.
[1 pt]
Provide at least one reason for the reduction in memory accesses via compiler optimization in the previous question. [1 pt]
Name two different ways that you have been taught to make “faster” code, one from this class and one from another class you have taken. Include an example of each method. [1 pt]
One way to measure performance is to literally use a stopwatch that you start when a program-execution starts and end when a program-execution ends. Why could this produce inconsistent results? That is, why could running the same program with the same inputs on the same computer take different total amounts of elapsed time? [1 pt]

Submission Instructions

Submit the following files:

cache-test-skel.c
trans.c
lab4reflect.txt

Make sure that you have included the geometry of the four mystery caches in a comment in cache-test-skel.c and your final transpose code is in a function named transpose_submit.

Once you're satisfied with your solutions, submit them via Canvas (link is at the top of this page).