Lab 4: Caches and Cache-Friendly Code
Overview
Learning Objectives:
- Given a provided cache, determine the cache size, block size, and associativity.
- Explain how the different cache parameters impact cache performance.
- Given a set of cache parameters and a task requiring frequent memory accesses, devise a function to complete the task with as few cache misses as possible.
- Given two functions performing the same task in different ways, compare and contrast how each method will perform as a result cache properties.
Note: is designed to help give you intuition about caches that should be useful throughout this lab, particularly for Part 2.
In the first part of this lab, Chip D. Signer, Ph.D., is trying to reverse engineer a competitor's microprocessors to discover their cache geometries and has recruited you to help. Instead of running programs on these processors and inferring the cache layout from timing results, you will approximate his work by using a simulator.
The second part of this lab involves trying to design a matrix transpose function that has as few cache misses as possible.
Code for this lab
Browser: | |
Terminal: | wget https://courses.cs.washington.edu/courses/cse351/24su/labs/lab4.tar.gz |
Unzip: | Running tar xzf lab4.tar.gz from the terminal
will extract the lab files to a directory called
lab4 with the following files:
Part 1
Part 2
Both Parts
Synthesis
|
Lab 4 Part 1 Instructions
Inferring Mystery Cache Geometries [18 points]
Each of the "processors" is provided as an object file
(.o
file) against which you will link your code.
The file mystery-cache.h
defines the function interface
that these object files export.
It includes a typedef for a type addr_t
(an unsigned
8-byte integer) which is what these (pretend) caches use for
"addresses", or you can use any convenient integer type.
typedef unsigned long long addr_t; typedef unsigned char bool_t; #define TRUE 1 #define FALSE 0 /** Lookup an address in the cache. Returns TRUE if the access hits, FALSE if it misses. */ bool_t access_cache(addr_t address); /** Clears all words in the cache. Useful for helping you reason about the cache transitions, by starting from a known state. */ void flush_cache(void);
Your job is to complete the 3 functions in
cache-test-skel.c
that have
/* YOUR CODE GOES HERE */
comments in them, which, when
linked with one of these cache object files, will determine and then
output the (1) block size, (2) cache size, and (3) associativity.
You will use the functions above to perform cache accesses and use
your observations of which ones hit and miss to determine the
parameters of four mystery caches.
- You can assume that all of the caches have sizes that are powers of 2 and use a least recently used (LRU) replacement policy.
- You cannot assume anything else about the cache parameters except what you can infer from the cache size.
flush_cache
and access_cache
inside of your functions.
For example, you may not call get_block_size()
inside of get_cache_assoc
.
Part I Testing
Recall that you will be determining the (1) block size, (2) cache
size, and (3) associativity.
To help sanity check your functions, some of the provided object
files are named with this information (e.g.,
cache_65536c_2e_16k.o
is a 65536 Byte
capacity, 2-way
set-associative cache with 16
Byte blocks).
The provided Makefile
includes a target
cache-test.
To use it, set TEST_CACHE to one of the provided object
files.
For example, the commands:
$ make cache-test TEST_CACHE=caches/cache_65536c_2e_16k.o $ ./cache-test
will create and then run an executable cache-test
that
connects your cache-inference code to the cache object with known
parameters mentioned above.
As you implement the functions in cache-test-skel.c
you
may want to test each function individually.
To do so, give one or more of the arguments
{block_size/size/assoc}
to the executable; examples
below:
$ ./cache-test block_size $ ./cache-test size $ ./cache-test assoc $ ./cache-test block_size size
Do keep in mind that the size
option depends on the
block_size
option working properly and that the
assoc
option likewise depends on the size
option working properly.
cache-test
executable doesn't produce any
output/results, your code likely entered an infinite loop.
You can use Ctrl-c to end the process and then start debugging.
For your convenience, we have provided a script called
testCache.sh
that will test ALL of the object files,
one-by-one.
It is run with the command:
$ bash testCache.sh
You may also give the testCache.sh
script the {block_size/size/assoc}
parameters to show you outputs for only specific functions. Example:
$ bash testCache.sh block_size size
Lab 4 Part 2 Instructions
Optimizing Matrix Transpose [10 points]
An example transpose function that computes the transpose of M x N
matrix A and stores the results in N x M matrix B
is provided in trans.c
.
It is functionally correct, but inefficient because the access
pattern results in relatively many cache misses:
char trans_desc[] = "Simple row-wise scan transpose"; void trans(int M, int N, int A[M][N], int B[N][M])
Your job in Part 2 is to write a similar function, called
transpose_submit
, that minimizes the number of cache
misses across different sized matrices on a simulated cache with
parameters s = 5
(index bits), E = 1
(associativity), k = 5
(offset bits):
char trans_desc[] = "Transpose submission"; void transpose_submit(int M, int N, int A[M][N], int B[N][M])
"Transpose submission"
for your
transpose_submit
function.
The autograder searches for this string to determine which transpose
function to evaluate for credit.
Programming Rules for Part 2
- Your code in
trans.c
must compile without warnings to receive credit. - Your transpose function may not use recursion
or any variant of
malloc
. - Your transpose function may not modify array A. You may, however, do whatever you want with the contents of array B.
- You are allowed to define at most 12 local
variables of type
int
per transpose function. This means 12 variables in scope at once. For example, the following code would be allowed:/* Note: function arguments are not counted as local variables */ void func_ok(int x, int y) { int array1[8]; if (x == y) { int array2[4]; // OK: 12 ints in scope } else { int array3[2]; // OK: 10 ints in scope } }
But the following is not:void func_bad(int x, int y) { int array1[8]; if (x == y) { int array2[2]; // OK: 10 ints in scope if (x > 0) { int array3[4]; // BAD: 14 ints in scope // Even though this branch may not always run, // the MAXIMUM number of local ints exceeds 12. } } }
The reason for this restriction is that our testing code is not able to count references to the stack. We want you to limit your references to the stack and focus on the access patterns of the source and destination arrays. - Your are permitted to use int arrays, but arrays DO count towards your 12 local ints. For example, if you have already declared 4 ints, you could declare an array of ints up to size 8.
- If you choose to use helper functions, you may not have more than 12 local variables on the stack at a time between your helper functions and your top level transpose function. For example, if your transpose declares 8 variables, and then you call a function which uses 4 variables, which calls another function which uses 2, you will have 14 variables on the stack, and you will be in violation of the rule.
- You are not allowed to side-step the previous
rule by using any variables of type
long
or by using any bit tricks to store more than one value to a single variable.
Performance and Grading
We will evaluate the correctness and performance of your
transpose_submit
function on a cache with parameters
s = 5
(index bits), E = 1
(associativity),
k = 5
(offset bits) for 2 different-sized input
matrices:
- 32 x 32 (M=32, N=32)
- 64 x 64 (M=64, N=64)
Your code must be functionally correct (i.e. produce the correct transposed matrix) to receive any performance points for a particular size. Your performance score for each matrix size scales linearly with the number of misses, m, up to some threshold:
- 32 x 32: 5 points if m ≤ 288, 0 points if m > 600
- 64 x 64: 5 points if m ≤ 1700, 0 points if m > 2400
Extra Credit
Extra credit will be awarded for code that results in fewer misses than the full score "baselines." For reference, there exist solutions with m as low as 259 (32 x 32) and 1052 (64 x 64).
Part 2 Testing
Transpose Testing (test-trans
)
You can check the performance and correctness scores of your
transpose function(s) by running test-trans
, an
executable generated by the given Makefile
.
Your final submission must go in the given
transpose_submit
function.
However, we provide the ability to try out multiple implementations;
see below for more details.
To test your code for a particular size:
$ make # run make each time you change your code $ ./test-trans -M 32 -N 32 Function 0 (1 total) Step 1: Validating and generating memory traces Step 2: Evaluating performance (s=5, E=1, b=5) func 0 (Transpose submission): hits:1766, misses:288, evictions:255 Summary for official submission (func 0): correctness=1 misses=288
The test program ran the solution function and printed out two pieces of information:
- The correctness of the solution. If 0, that means your function did not correctly compute the transposition of the matrix.
- The hit, miss and eviction counts. Your score is based only on the number of misses – lower is better.
To analyze where the cache hits and misses are coming from, you will
need to examine saved trace files.
test-trans
saves the trace for the function i
in the file trace.fi
.
To debug a particular function, simply run its trace through the
reference simulator with the verbose option:
$ ./support/csim-ref -v -s 5 -E 1 -b 5 -t trace.f0 S 68312c,1 miss L 683140,8 miss L 683124,4 hit L 683120,4 hit L 603124,4 miss eviction S 6431a0,4 miss ...
Each line denotes one or two memory accesses, using the following format:
operation address,size result
- The operation field denotes the type of memory access: "I" denotes an instruction load, "L" a data load, "S" a data store, and "M" a data modify (i.e., a data load followed by a data store). There is never a space before each "I". There is always a space before each "M", "L", and "S".
- The address field specifies a 64-bit hexadecimal memory address.
- The size field specifies the number of bytes accessed by the operation.
- The result field specifies the result of the memory access with regards to the cache.
Registering Multiple Transpose Functions
You can register up to 100 versions of the transpose function in
your trans.c
file for use in test-trans.c
.
Each transpose version has the following form:
/* Header comment */ char trans_simple_desc[] = "A simple transpose"; void trans_simple(int M, int N, int A[M][N], int B[N][M]) { /* your transpose code here */ }
To register a particular transpose function with the autograder,
make a call in the registerFunctions
routine in
trans.c
of the form:
registerTransFunction(trans_simple, trans_simple_desc);
At runtime, the autograder will evaluate each registered transpose
function and print the results.
Of course, one of the registered functions must be the
transpose_submit
function that you are submitting for
credit:
registerTransFunction(transpose_submit, transpose_submit_desc);
See the default trans.c
function for an example of how
this works.
The following example has four different transpose functions registered and runs on a 32 x 32 matrix. Note that while it displays the results for each registered function, it still extracts the results for the official submission at the end.
$ make # run make each time you change your code $ ./test-trans -M 32 -N 32 Step 1: Evaluating registered transpose funcs for correctness: func 0 (Transpose submission): correctness: 1 func 1 (Simple row-wise scan transpose): correctness: 1 func 2 (column-wise scan transpose): correctness: 1 func 3 (using a zig-zag access pattern): correctness: 1 Step 2: Generating memory traces for registered transpose funcs. Step 3: Evaluating performance of registered transpose funcs (s (Index Bits) = 5, E (Associativity) = 1, k (Offset Bits) = 5) func 0 (Transpose submission): hits:1766, misses:288, evictions:255 func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151 func 2 (column-wise scan transpose): hits:870, misses:1183, evictions:1151 func 3 (using a zig-zag access pattern): hits:1076, misses:977, evictions:945 Summary for official submission (func 0): correctness=1 misses=288
Lab Score (grade_trans.py
)
We have provided you with a driver program, called
grade_trans.py
, that performs an evaluation of your
transpose code's correctness and performance for both matrix
sizes – this is the same program your instructor will use to
evaluate your submission!
To run the driver, type:
$ make # run make each time you change your code $ python3 grade_trans.py Part B: Testing transpose function Running ./test-trans -M 32 -N 32 Running ./test-trans -M 64 -N 64 Cache Lab summary: Points Max pts Misses Trans perf 32x32 5.00 5 288 Trans perf 64x64 5.00 5 1107 Total points 10.00 1
Rules Compliance (check_trans_vars
)
As stated above, you will recieve NO POINTS for Part 2 if you violate the programming rules. To check for violations, run the following command:
$ ./check_trans_vars trans.c
If your code passes the checks, you will see the following:
$ ./check_trans_vars trans.c Variables used by transpose_submit: 2 transpose_submit passes the programming rules checks.
If it fails, you will see something like this:
$ ./check_trans_vars trans.c transpose_submit DOES NOT PASS the programming rules checks: - Maximum variable count exceeded (transpose_submit: 14) - Disallowed variable type(s) used: long
Resources
Cache Blocking
The cache blocking technique is introduced in , so you can refer to those slides and recording. You can find different explanations in and (using the term "Loop Blocking", instead). Here is an that may be useful too.
Cache Simulation
The is available to help you get practice with and visualize different cache operations and mechanics. See the associated for more information.
Matrix Visualization
The can help you visualize how matrices are laid out in memory, which may come in handy for Part 2. Just select the parameters of your matrix and you can hover over each matrix element and see its address.
The matrix visualizer purposely restricts you to small sizes to keep things digestible. If you'd like to see a more "realistic" visualization of the matrices for Part 2, see , which shows the last 2 bytes of the address of each cache block as they align to the 32x32 matricies. Red squares notate the start of a section that would fit in cache. Green lines divide individual integers, and black lines divide cache blocks.
Tips and Hints
Performance Tips
Struggling to reduce your misses? Consider the following:
- Only memory accesses to the heap are monitored. Just be careful not to break any programming rules!
- Since your transpose function is being evaluated on a direct-mapped cache, conflict misses are a potential problem. Think about the potential for conflict misses in your code, especially along the diagonal. Try to think of access patterns that will decrease the number of these conflict misses.
Programming Rules Tips
Algorithm breaking the programming rules? Consider the following:
- If you have variables specific to one case of an "if", you can declare the variable within that block and it will only count against your variable count for code in that part of the "if". Remember — you're only limited by the number of variables that exist at one time. You can declare variables within "if" statements and loops. This is especially helpful if you have different variables for the two matrix sizes.
- Use
#define
! These don't count against your variable total. You can declare constants using#define
and use them wherever you want. - Always use a constant integer size when declaring
arrays.
The variable rules checker will fail if you don't do this.
For example, instead of this:
int size = 5; int array[size]; // error: trans.c:24:9: Unable to process due to non-constant value "size" used in array declaration. // Please use a constant integer as the array dimension.
Do this:int array[5]; // OK!
Lab 4 Synthesis Questions [11 points]
synthesis/
directory of your lab, you will find
lab0.c
, along with two other files:
lab0.d
: The assembly of compiledlab0.c
, without any optimizations done by the compiler.lab0opt.d
: The assembly of compiledlab0.c
, with optimizations done by the compiler (compiled with the flag-O2
).
DO NOT ATTEMPT TO REGENERATE THESE FILES YOURSELF! If you do, your assembly output may differ from ours, especially if you don't compile on the CSE Linux environment!
One use of knowing/looking at assembly is understanding what's actually going on, since some "magic" happens between your source code and executable. While doing the synthesis, we expect you will read the assembly, rather than use GDB, to answer the following questions.
- In
part_4
, the nested loops come between the calls toclock()
(callq <clock@plt>
in the assembly). How many of the assembly instructions in the nested loops access (read or write to) memory? Provide the count and list of instructions for both the unoptimized and optimized code. For the lists of instructions, you can either copy and paste the instructions or include the relative address of each instruction (<func+##>
). [2 pt]- Only count each instruction once; ignore the looping mechanics and multiple accesses in a single instruction.
nop*
is a family of instructions that do nothing (i.e., they don't access memory).- Don't count the instructions found in
big_array_index
, but thecallq <big_array_index>
instruction is part of the nested loops. - Only count the number of data accesses caused by the execution of the instructions and not by their retrieval.
- In the unoptimized code, where are the
loop variables
i
,j
, andk
"stored?" Give the corresponding assembly operands (e.g.,%r8
,2(%rax)
) and not addresses. [3 pt] - In the optimized code, the loop variables
have sort of disappeared!
Name what the values found in registers
%edx
and%rcx
correspond to in the C code. [2 pt] - Take a look at your transpose code for the 32 x 32 matrix. If the starting address of matrix A did not map to the beginning of a cache block, how would the number of misses be affected? (Increase / Stay the Same / Decrease) Briefly explain. [2 pt]
- Take a look at your transpose code for the
64 x 64 matrix.
If you were to run this code for a matrix of size 57 x 71, how
would the number of misses be affected?
(Increase / Stay the Same / Decrease)
Briefly explain. [2 pt]
- You don’t need to modify your code to work with this matrix size. Use what you know about cache images to answer this.
Submission
You will submit:
cache-test-skel.c
,
trans.c
,
and
lab4synthesis.txt
.
Make sure that your final transpose code is in a function named
transpose_submit
and that you've double-check your
adherence to the programming rules with
./check_trans_vars trans.c
.
After submitting, please wait until the autograder is done running and double-check that you passed the "File Check" and "Compilation and Execution Issues" tests. If either test returns a score of -1, be sure to read the output and fix any problems before resubmitting. Failure to do so will result in a programming score of ZERO for the lab.
Submit your files to the "Lab 4" assignment on
.
Don't forget to add your partner, if you have one.
If you completed the extra credit, also submit
trans.c
to the "Lab 4 Extra Credit" assignment.