Assigned | Friday, May 11 |
---|---|
Due Date | Monday May 21 |
Files | lab4.tar |
Submissions | Submit your files via the Canvas assignments page (go to the Labs section, not the Homeworks section). Be sure to follow the detailed instructions at the bottom of this page. |
In the first part of this lab, Chip D. Signer, Ph.D., is trying to reverse engineer a competitor's microprocessors to discover their cache geometries and has recruited you to help. Instead of running programs on these processors and inferring the cache layout from timing results, you will approximate his work by using a simulator.
The second part of this lab involves trying to design a Matrix Transpose function that has as few cache misses as possible.
wget https://courses.cs.washington.edu/courses/cse351/18sp/labs/lab4.tar
Running tar xvf lab4.tar
from the terminal will extract the lab files to a directory called lab4
with the following files:
cache_*.o
- [part I] "Caches" with known parameters for testing your codemystery*.o
- [part I] "Caches" with unknown parameters that you will need to discovermystery-cache.h
- [part I] Defines the function interface that the object files exportcache-test-skel.c
- [part I] Skeleton code for determining cache parametersMakefile
- [part I & II] For compiling and linkingtestCache.sh
- [part I] Optional script that uses the Makefile to run cache-test for all the different cachestrans.c
- [part II] Place for your transpose functionstest-trans.c
- [part II] code to test efficiency of your functions in trans.c
driver.py
- [part II] Script to run all part II tests and calculate scoretracegen.c
- [part II] used by test-trans.c
to generate memory tracescsim-ref
- [part II] Cache Simulator used by test-trans
cachelab.*
- [part II] Helper code used by other programs in part IIlab4reflect.txt
- For your Reflection responsesThis lab should be done on a 64-bit machine. Use the CSE VM, attu, or your own personal 64-bit computer. You will need the C standard libraries on the machine you use. Otherwise, all the files you need have been provided.
Each of the “processors” is provided as an object file
(.o file) against which you will link your code. The file
mystery-cache.h
defines the function interface that these
object files export. It includes a typedef for a
type addr_t
(an unsigned 8-byte integer) which is what
these (pretend) caches use for “addresses”, or you can use any
convenient integer type.
typedef unsigned long long addr_t;
typedef unsigned char bool_t;
#define TRUE 1
#define FALSE 0
/** Lookup an address in the cache. Returns TRUE if the access hits,
FALSE if it misses. */
bool_t access_cache(addr_t address);
/** Clears all words in the cache (and the victim buffer, if
present). Useful for helping you reason about the cache
transitions, by starting from a known state. */
void flush_cache(void);
Your job is to fill in the function stubs in
cache-test-skel.c
which, when linked with one of these
cache object files, will determine and then output the cache size,
associativity, and block size. You will use the functions above to perform cache accesses and use your observations of which ones hit and miss to determine the parameters of the caches.
Some of the provided object files are
named with this information (e.g. cache_65536c_2e_16k.o
is a 65536 Byte capacity, 2-way set-associative cache with
16 Byte blocks) to help you check your work. There are also 4
mystery cache object files, whose parameters you must discover on your
own.
You can assume that the mystery caches have sizes that are powers of 2 and use a least-recently-used replacement policy. You cannot assume anything else about the cache parameters except what you can infer from the cache size. Finally, the mystery caches are all pretty realistic in their geometries, so use this fact to sanity check your results.
The provided Makefile
includes a
target cache-test
. To use it, set TEST_CACHE
to the object file to link against on the command line. That is, from within the
lab4 directory run the command:
make cache-test TEST_CACHE=cache_65536c_2e_16k.o
This will create an executable cache-test
that will run
your cache-inference code against the supplied cache object. Run this
executable like so:
./cache-test
and it will print the results to the screen.
You may find the testCache.sh
script useful. It
makes and runs cache-test
with all the object files.
You can run this shell script like so:
bash testCache.sh
You are not rewquired to use testCache.sh
.
Complete the three functions in cache-test-skel.c
that have
/* YOUR CODE GOES HERE */
comments in them.
Additionally, determine the geometry of each of the four mystery caches and
list these in a comment, along with your name, at the top of your modified
cache-test-skel.c
.
Note: You should NOT be calling any functions other than
flush_cache
and access_cache
inside of
your functions. For example, you may not
call get_block_size()
say inside
of get_cache_assoc
.
In Part II, you will write a transpose function
in trans.c
that causes as few cache misses as
possible.
Let A denote a matrix, and Aij denote the component in the ith row and jth column. The transpose of A, denotated AT is a matrix such that Aij = ATji
To help you get started, we have given you an example transpose
function in trans.c
that computes the transpose of M x N
matrix A and stores the results in N x M matrix B:
char trans_desc[] = "Simple row-wise scan transpose";
void trans(int M, int N, int A[M][N], int B[M][N])
The example transpose function is correct, but it is inefficient because the access pattern results in relatively many cache misses.
Your job in Part II is to write a similar function,
called transpose_submit
, that minimizes the number of
cache misses across different sized matrices on a simulated cache with
parameters s=5, E=1, b=5
:
char trans_desc[] = "Transpose submission";
void transpose_submit(int M, int N, int A[M][N], int B[M][N])
Do not change the description string "Transpose submission"
for your transpose_submit
function.
The autograder searches for this string to determine which transpose function to evaluate for credit.
The Linux program valgrind
can generate traces of memory accesses of programs. For example, typing
linux> valgrind --log-fd=1 --tool=lackey -v --trace-mem=yes ls -l
on the command line runs ls -l
, captures the memory accesses in order, and prints them to stdout
.
valgrind
memory accesses have the following form:
I 0400d7d4,8 M 0421c7f0,4 L 04f6b868,8 S 7ff0005c8,8
Each line denotes one or two memory accesses. The format of each line is:
[space]operation address,size
The operation field denotes the type of memory access: "I" denotes an instruction load, "L" a data load, "S" a data store and "M" a data modify (i.e., a data load followed by a data store). There is never a space before each "I". There is always a space before each "M", "L", and "S". The address field specifies a 64-bit hexadecimal memory address. The size field specifies the number of bytes accessed by the operation.
Tools in this lab use valgrind
traces of your code running on a cache simulator to evaluate the efficieny of your transpsoe
function. See the Hints section for more information.
trans.c
trans.c
must compile without warnings to receive credit.int
per transpose function.
The reason for this restriction is that our testing code is not able to count references to the stack. We want you to limit
your references to the stack and focus on the access patterns of the source and destination arrays.long
or by using any
bit tricks to store more than one value to a single variable.malloc
.For Part II, we will evaluate the correctness and performance of your transpose_submit
function on 3 different-sized input
matrices:
For each matrix size, the performance of your transpose_submit
function is evaluated by using
valgrind
to extract the address trace for your function, and then using the reference simulator to replay
this trace on a cache with parameters s = 5, E = 1, b=5
.
Your performance score for each matrix size scales linearly with the number of misses, m, up to some threshold:
Your code must be correct to receive any performance points for a particular size. Your code only needs to be correct for these three cases and you can optimize it specifically for these three cases. In particular, it is perfectly okay for your function to explicitly check for the input sizes and implement separate code optimized for each case.
We have provided you with a driver program, called ./driver.py
, that performs a complete evaluation
of your transpose code. The driver uses test-trans
to evaluate your submitted transpose function on the three matrix sizes.
Then it prints a summary of your results and the points you have earned. To run the driver, run this command at the terminal:
python driver.py
Extra credit will be awarded for code that results in significantly fewer misses than the full score "baselines."
For reference, there exist solutions with m as low as 300 (32 x 32), 1200 (64 x 64), and 1920 (67 x 61).
We have provided you with an autograding program, called test-trans.c
that tests the correctness
and performance of each of the transpose functions that you have registered with the autograder.
You can register up to 100 versions of the transpose function in your trans.c
file. Eache transpose version has the
following form:
/* Header comment */
char trans_simple_desc[] = "A simple transpose";
void trans_simple(int M, int N, int A[N][M], int B[M][N])
{
/* your transpose code here */
}
Register a particular transpose function witht he autograder by making a call of the form:
registerTransFunction(trans_simple, trans_simple_desc);
in the registerFunctions
routine in trans.c
. At runtime, the autograder will evaluate each registered
transpose function and print the results. Of course, one of the registered functions must be the
transpose_submit
function that you are submitting for credit:
registerTransFunction(transpose_submit, transpose_submit_desc);
See the default trans.c
function for an example of how this works.
The autograder takes the matrix size as input. It uses valgrind
to generate a trace of each registered transpose
function. It then evaluates each trace by running a chache simulator called csim-ref
on a cache with
parameters s=5, E=1, b=5
,
though the csim-ref uses the equlivent notation from the book s=5, E=1, b=5
.
For example, to test your registered transpose functions on a 32 x 32 matrix, rebuild test-trans
, and then run it
with the appropriate values for M and N:
linux> make
linux> ./test-trans -M 32 -N 32
Step 1: Evaluating registered transpose funcs for correctness:
func 0 (Transpose submission): correctness: 1
8
func 1 (Simple row-wise scan transpose): correctness: 1
func 2 (column-wise scan transpose): correctness: 1
func 3 (using a zig-zag access pattern): correctness: 1
Step 2: Generating memory traces for registered transpose funcs.
Step 3: Evaluating performance of registered transpose funcs (s=5, E=1, b=5)
func 0 (Transpose submission): hits:1766, misses:287, evictions:255
func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151
func 2 (column-wise scan transpose): hits:870, misses:1183, evictions:1151
func 3 (using a zig-zag access pattern): hits:1076, misses:977, evictions:945
Summary for official submission (func 0): correctness=1 misses=287
In this example, we have registered four different transpose functions in trans.c
. The test-trans
program tests each of the registered functions, displays the results for each, and extracts the results for the
official submission.
The test-trans
program saves the trace for the function i in the file trace.fi
. These
trace files are invaluable debugging tools that can help you understand exactly where the hits and misses for each transpose function are coming
from. To debug a particular function, simply run its trace through the reference simulator with the verbose option:
linux> ./csim-ref -v -s 5 -E 1 -b 5 -t trace.f0
S 68312c,1 miss
L 683140,8 miss
L 683124,4 hit
L 683120,4 hit
L 603124,4 miss eviction
S 6431a0,4 miss
...
One use of knowing/looking at assembly is understanding what's actually going on, since some “magic” happens between your source code and executable.
Go back to lab0.c
and revert the nested for loops
in part4()
to the “ijk”
ordering. Compile lab0.c
with the following two
optimization levels and compare the dissassembly
for part4()
:
$ gcc -g -std=c99 -o lab0 lab0.c
$ gcc -g -O2 -std=c99 -o lab0opt lab0.c
Submit the following files:
cache-test-skel.c
trans.c
lab4reflect.txt
Make sure that you have included the geometry of the four mystery caches in a comment in cache-test-skel.c
and your final transpose code is in a function named transpose_submit
.
Once you're satisfied with your solutions, submit them via Canvas (link is at the top of this page).