Note: Homework 19 is designed to help give you intuition about caches that should be useful throughout this lab, particularly for Part 2.
In the first part of this lab, Chip D. Signer, Ph.D., is trying to reverse engineer a competitor's microprocessors to discover their cache geometries and has recruited you to help. Instead of running programs on these processors and inferring the cache layout from timing results, you will approximate his work by using a simulator.
The second part of this lab involves trying to design a matrix transpose function that has as few cache misses as possible.
Each of the "processors" is provided as an object file
(.o
file) against which you will link your code.
The file mystery-cache.h
defines the function interface
that these object files export.
It includes a typedef for a type addr_t
(an unsigned
8-byte integer) which is what these (pretend) caches use for
"addresses", or you can use any convenient integer type.
typedef unsigned long long addr_t; typedef unsigned char bool_t; #define TRUE 1 #define FALSE 0 /** Lookup an address in the cache. Returns TRUE if the access hits, FALSE if it misses. */ bool_t access_cache(addr_t address); /** Clears all words in the cache. Useful for helping you reason about the cache transitions, by starting from a known state. */ void flush_cache(void);
Your job is to complete the 3 functions in
cache-test-skel.c
that have
/* YOUR CODE GOES HERE */
comments in them, which, when
linked with one of these cache object files, will determine and then
output the (1) block size, (2) cache size, and (3) associativity.
You will use the functions above to perform cache accesses and use
your observations of which ones hit and miss to determine the
parameters of four mystery caches.
flush_cache
and access_cache
inside of your functions.
For example, you may not call get_block_size()
inside of get_cache_assoc
.
Recall that you will be determining the (1) block size, (2) cache
size, and (3) associativity.
To help sanity check your functions, some of the provided object
files are named with this information (e.g.,
cache_65536c_2e_16k.o
is a 65536 Byte
capacity, 2-way
set-associative cache with 16
Byte blocks).
The provided Makefile
includes a target
cache-test.
To use it, set TEST_CACHE to one of the provided object
files.
For example, the commands:
$ make cache-test TEST_CACHE=caches/cache_65536c_2e_16k.o $ ./cache-test
will create and then run an executable cache-test
that
connects your cache-inference code to the cache object with known
parameters mentioned above.
As you implement the functions in cache-test-skel.c
you
may want to test each function individually.
To do so, give one or more of the arguments
{block_size/size/assoc}
to the executable; examples
below:
$ ./cache-test block_size $ ./cache-test size $ ./cache-test assoc $ ./cache-test block_size size
Do keep in mind that the size
option depends on the
block_size
option working properly and that the
assoc
option likewise depends on the size
option working properly.
cache-test
executable doesn't produce any
output/results, your code likely entered an infinite loop.
You can use Ctrl-c to end the process and then start debugging.
For your convenience, we have provided a script called
testCache.sh
that will test ALL of the object files,
one-by-one.
It is run with the command:
$ bash testCache.sh
You may also give the testCache.sh
script the {block_size/size/assoc}
parameters to show you outputs for only specific functions. Example:
$ bash testCache.sh block_size size
An example transpose function that computes the transpose of M x N
matrix A and stores the results in N x M matrix B
is provided in trans.c
.
It is functionally correct, but inefficient because the access
pattern results in relatively many cache misses:
char trans_desc[] = "Simple row-wise scan transpose"; void trans(int M, int N, int A[M][N], int B[N][M])
Your job in Part 2 is to write a similar function, called
transpose_submit
, that minimizes the number of cache
misses across different sized matrices on a simulated cache with
parameters s = 5
(index bits), E = 1
(associativity), k = 5
(offset bits):
char trans_desc[] = "Transpose submission"; void transpose_submit(int M, int N, int A[M][N], int B[N][M])
"Transpose submission"
for your
transpose_submit
function.
The autograder searches for this string to determine which transpose
function to evaluate for credit.
trans.c
must compile
without warnings to receive credit.malloc
.int
per transpose function.
This means 12 variables in scope at once.
For example, the following code would be allowed:
/* Note: function arguments are not counted as local variables */ void func_ok(int x, int y) { int array1[8]; if (x == y) { int array2[4]; // OK: 12 ints in scope } else { int array3[2]; // OK: 10 ints in scope } }But the following is not:
void func_bad(int x, int y) { int array1[8]; if (x == y) { int array2[2]; // OK: 10 ints in scope if (x > 0) { int array3[4]; // BAD: 14 ints in scope // Even though this branch may not always run, // the MAXIMUM number of local ints exceeds 12. } } }The reason for this restriction is that our testing code is not able to count references to the stack. We want you to limit your references to the stack and focus on the access patterns of the source and destination arrays.
long
or by using
any bit tricks to store more than one value to a single
variable.
We will evaluate the correctness and performance of your
transpose_submit
function on a cache with parameters
s = 5
(index bits), E = 1
(associativity),
k = 5
(offset bits) for 2 different-sized input
matrices:
Your code must be functionally correct (i.e. produce the correct transposed matrix) to receive any performance points for a particular size. Your performance score for each matrix size scales linearly with the number of misses, m, up to some threshold:
Extra credit will be awarded for code that results in fewer misses than the full score "baselines." For reference, there exist solutions with m as low as 259 (32 x 32) and 1052 (64 x 64).
test-trans
)
You can check the performance and correctness scores of your
transpose function(s) by running test-trans
, an
executable generated by the given Makefile
.
Your final submission must go in the given
transpose_submit
function.
However, we provide the ability to try out multiple implementations;
see below for more details.
To test your code for a particular size:
$ make # run make each time you change your code $ ./test-trans -M 32 -N 32 Function 0 (1 total) Step 1: Validating and generating memory traces Step 2: Evaluating performance (s=5, E=1, b=5) func 0 (Transpose submission): hits:1766, misses:288, evictions:255 Summary for official submission (func 0): correctness=1 misses=288
The test program ran the solution function and printed out two pieces of information:
To analyze where the cache hits and misses are coming from, you will
need to examine saved trace files.
test-trans
saves the trace for the function i
in the file trace.fi
.
To debug a particular function, simply run its trace through the
reference simulator with the verbose option:
$ ./support/csim-ref -v -s 5 -E 1 -b 5 -t trace.f0 S 68312c,1 miss L 683140,8 miss L 683124,4 hit L 683120,4 hit L 603124,4 miss eviction S 6431a0,4 miss ...
Each line denotes one or two memory accesses, using the following format:
operation address,size result
You can register up to 100 versions of the transpose function in
your trans.c
file for use in test-trans.c
.
Each transpose version has the following form:
/* Header comment */ char trans_simple_desc[] = "A simple transpose"; void trans_simple(int M, int N, int A[M][N], int B[N][M]) { /* your transpose code here */ }
To register a particular transpose function with the autograder,
make a call in the registerFunctions
routine in
trans.c
of the form:
registerTransFunction(trans_simple, trans_simple_desc);
At runtime, the autograder will evaluate each registered transpose
function and print the results.
Of course, one of the registered functions must be the
transpose_submit
function that you are submitting for
credit:
registerTransFunction(transpose_submit, transpose_submit_desc);
See the default trans.c
function for an example of how
this works.
The following example has four different transpose functions registered and runs on a 32 x 32 matrix. Note that while it displays the results for each registered function, it still extracts the results for the official submission at the end.
$ make # run make each time you change your code $ ./test-trans -M 32 -N 32 Step 1: Evaluating registered transpose funcs for correctness: func 0 (Transpose submission): correctness: 1 func 1 (Simple row-wise scan transpose): correctness: 1 func 2 (column-wise scan transpose): correctness: 1 func 3 (using a zig-zag access pattern): correctness: 1 Step 2: Generating memory traces for registered transpose funcs. Step 3: Evaluating performance of registered transpose funcs (s (Index Bits) = 5, E (Associativity) = 1, k (Offset Bits) = 5) func 0 (Transpose submission): hits:1766, misses:288, evictions:255 func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151 func 2 (column-wise scan transpose): hits:870, misses:1183, evictions:1151 func 3 (using a zig-zag access pattern): hits:1076, misses:977, evictions:945 Summary for official submission (func 0): correctness=1 misses=288
grade_trans.py
)
We have provided you with a driver program, called
grade_trans.py
, that performs an evaluation of your
transpose code's correctness and performance for both matrix
sizes – this is the same program your instructor will use to
evaluate your submission!
To run the driver, type:
$ make # run make each time you change your code $ python3 grade_trans.py Part B: Testing transpose function Running ./test-trans -M 32 -N 32 Running ./test-trans -M 64 -N 64 Cache Lab summary: Points Max pts Misses Trans perf 32x32 5.00 5 288 Trans perf 64x64 5.00 5 1107 Total points 10.00 1
check_trans_vars
)As stated above, you will recieve NO POINTS for Part 2 if you violate the programming rules. To check for violations, run the following command:
$ ./check_trans_vars trans.c
If your code passes the checks, you will see the following:
$ ./check_trans_vars trans.c Variables used by transpose_submit: 2 transpose_submit passes the programming rules checks.
If it fails, you will see something like this:
$ ./check_trans_vars trans.c transpose_submit DOES NOT PASS the programming rules checks: - Maximum variable count exceeded (transpose_submit: 14) - Disallowed variable type(s) used: long
The cache blocking technique is introduced in Lecture 19, so you can refer to those slides and recording. You can find different explanations in and (using the term "Loop Blocking", instead).
The is available to help you get practice with and visualize different cache operations and mechanics. See the associated for more information.
The can help you visualize how matrices are laid out in memory, which may come in handy for Part 2. Just select the parameters of your matrix and you can hover over each matrix element and see its address.
The matrix visualizer purposely restricts you to small sizes to keep things digestible. If you'd like to see a more "realistic" visualization of the matrices for Part 2, see , which shows the last 2 bytes of the address of each cache block as they align to the 32x32 matricies. Red squares notate the start of a section that would fit in cache. Green lines divide individual integers, and black lines divide cache blocks.
Struggling to reduce your misses? Consider the following:
Algorithm breaking the programming rules? Consider the following:
#define
!
These don't count against your variable total.
You can declare constants using #define
and use them
wherever you want.int size = 5; int array[size]; // error: trans.c:24:9: Unable to process due to non-constant value "size" used in array declaration. // Please use a constant integer as the array dimension.Do this:
int array[5]; // OK!
synthesis/
directory of your lab, you will find
lab0.c
, along with two other files:
lab0.d
: The assembly of compiled
lab0.c
, without any optimizations done by the
compiler.lab0opt.d
: The assembly of compiled
lab0.c
, with optimizations done by the compiler
(compiled with the flag -O2
).DO NOT ATTEMPT TO REGENERATE THESE FILES YOURSELF! If you do, your assembly output may differ from ours, especially if you don't compile on the CSE Linux environment!
One use of knowing/looking at assembly is understanding what's actually going on, since some "magic" happens between your source code and executable. While doing the synthesis, we expect you will read the assembly, rather than use GDB, to answer the following questions.
part_4
, the nested loops come between the calls
to clock()
(callq <clock@plt>
in
the assembly).
How many of the assembly instructions in the nested loops access
(read or write to) memory?
Provide the count and list of instructions for both the
unoptimized and optimized code.
For the lists of instructions, you can either copy and paste the
instructions or include the relative address of each instruction
(<func+##>
). [2 pt]
nop*
is a family of instructions that do nothing
(i.e., they don't access memory).big_array_index
, but the
callq <big_array_index>
instruction is part
of the nested loops.i
, j
, and k
"stored?"
Give the corresponding assembly operands (e.g.,
%r8
, 2(%rax)
) and not addresses.
[3 pt]%edx
and
%rcx
correspond to in the C code. [2 pt]
You will submit:
cache-test-skel.c
,
trans.c
,
and
lab4synthesis.txt
.
Make sure that your final transpose code is in a function named
transpose_submit
and that you've double-check your
adherence to the programming rules with
./check_trans_vars trans.c
.
After submitting, please wait until the autograder is done running and double-check that you passed the "File Check" and "Compilation and Execution Issues" tests. If either test returns a score of -1, be sure to read the output and fix any problems before resubmitting. Failure to do so will result in a programming score of ZERO for the lab.
Submit your files to the "Lab 4" assignment on
.
Don't forget to add your partner, if you have one.
If you completed the extra credit, also submit
trans.c
to the "Lab 4 Extra Credit" assignment.