Lab 4: Caches and Cache-Friendly Code

Assigned

Monday, Jul 29, 2024

Due Date

Wednesday, Aug 7, 2024 at 11:59 pm

Video(s)

Overview

Learning Objectives:

Given a provided cache, determine the cache size, block size, and associativity.
Explain how the different cache parameters impact cache performance.
Given a set of cache parameters and a task requiring frequent memory accesses, devise a function to complete the task with as few cache misses as possible.
Given two functions performing the same task in different ways, compare and contrast how each method will perform as a result cache properties.

Note: is designed to help give you intuition about caches that should be useful throughout this lab, particularly for Part 2.

In the first part of this lab, Chip D. Signer, Ph.D., is trying to reverse engineer a competitor's microprocessors to discover their cache geometries and has recruited you to help. Instead of running programs on these processors and inferring the cache layout from timing results, you will approximate his work by using a simulator.

The second part of this lab involves trying to design a matrix transpose function that has as few cache misses as possible.

Code for this lab

Browser:

Terminal: wget https://courses.cs.washington.edu/courses/cse351/24su/labs/lab4.tar.gz

Unzip:

Running tar xzf lab4.tar.gz from the terminal will extract the lab files to a directory called lab4 with the following files:

Part 1

cache-test-skel.c - Skeleton code for determining cache parameters
testCache.sh - Script that makes and runs cache-test with all the cache object files
caches/cache_*.o - "Caches" with known parameters for testing your code
caches/mystery*.o - "Caches" with unknown parameters that you will need to discover
support/mystery-cache.h - Defines the function interface that the object files export

Part 2

trans.c - Place for your transpose functions
grade_trans.py - Script to run all Part 2 performance tests and calculate score
check_trans_vars - Script to check for compliance with the lab4 variable usage rules
support/test-trans.c - Code to test efficiency of your functions in trans.c
support/tracegen.c - Used by test-trans.c to generate memory traces
support/csim-ref - Cache Simulator used by test-trans
support/cachelab.* - Helper code used by other programs in Part 2

Both Parts

Makefile - For compiling and linking files

Synthesis

synthesis/lab4synthesis.txt - For your Synthesis Question responses
synthesis/lab0.c: A copy of lab0 for your convenience.
synthesis/lab0.d: The assembly of compiled lab0.c, without any optimizations done by the compiler.
synthesis/lab0opt.d: The assembly of compiled lab0.c, with optimizations done by the compiler (compiled with the flag -O2).

Lab 4 Part 1 Instructions

Inferring Mystery Cache Geometries [18 points]

Each of the "processors" is provided as an object file (.o file) against which you will link your code. The file mystery-cache.h defines the function interface that these object files export. It includes a typedef for a type addr_t (an unsigned 8-byte integer) which is what these (pretend) caches use for "addresses", or you can use any convenient integer type.

typedef unsigned long long addr_t;
typedef unsigned char bool_t;
#define TRUE 1
#define FALSE 0

/** Lookup an address in the cache. Returns TRUE if the access hits,
    FALSE if it misses. */
bool_t access_cache(addr_t address);

/** Clears all words in the cache.
    Useful for helping you reason about the cache
    transitions, by starting from a known state. */
void flush_cache(void);

Your job is to complete the 3 functions in cache-test-skel.c that have /* YOUR CODE GOES HERE */ comments in them, which, when linked with one of these cache object files, will determine and then output the (1) block size, (2) cache size, and (3) associativity. You will use the functions above to perform cache accesses and use your observations of which ones hit and miss to determine the parameters of four mystery caches.

You can assume that all of the caches have sizes that are powers of 2 and use a least recently used (LRU) replacement policy.
You cannot assume anything else about the cache parameters except what you can infer from the cache size.

You should NOT be calling any functions other than flush_cache and access_cache inside of your functions. For example, you may not call get_block_size() inside of get_cache_assoc.

Part I Testing

Recall that you will be determining the (1) block size, (2) cache size, and (3) associativity. To help sanity check your functions, some of the provided object files are named with this information (e.g., cache_65536c_2e_16k.o is a 65536 Byte capacity, 2-way set-associative cache with 16 Byte blocks).

The provided Makefile includes a target cache-test. To use it, set TEST_CACHE to one of the provided object files. For example, the commands:

$ make cache-test TEST_CACHE=caches/cache_65536c_2e_16k.o
$ ./cache-test

will create and then run an executable cache-test that connects your cache-inference code to the cache object with known parameters mentioned above.

As you implement the functions in cache-test-skel.c you may want to test each function individually. To do so, give one or more of the arguments {block_size/size/assoc} to the executable; examples below:

$ ./cache-test block_size
$ ./cache-test size
$ ./cache-test assoc
$ ./cache-test block_size size

Do keep in mind that the size option depends on the block_size option working properly and that the assoc option likewise depends on the size option working properly.

If the cache-test executable doesn't produce any output/results, your code likely entered an infinite loop. You can use Ctrl-c to end the process and then start debugging.

For your convenience, we have provided a script called testCache.sh that will test ALL of the object files, one-by-one. It is run with the command:

$ bash testCache.sh

You may also give the testCache.sh script the {block_size/size/assoc} parameters to show you outputs for only specific functions. Example:

$ bash testCache.sh block_size size

Lab 4 Part 2 Instructions

It will be useful for you to review Homework 17 (Ed Lessons) before starting on this part.

Optimizing Matrix Transpose [10 points]

Matrix Transpose: Let A denote a matrix, and A_ij denote the component in the ith row and jth column. The transpose of A, denotated A^T is a matrix such that for all i and j, A_ij = A^T_ji

An example transpose function that computes the transpose of M x N matrix A and stores the results in N x M matrix B is provided in trans.c. It is functionally correct, but inefficient because the access pattern results in relatively many cache misses:

char trans_desc[] = "Simple row-wise scan transpose";
void trans(int M, int N, int A[M][N], int B[N][M])

Your job in Part 2 is to write a similar function, called transpose_submit, that minimizes the number of cache misses across different sized matrices on a simulated cache with parameters s = 5 (index bits), E = 1 (associativity), k = 5 (offset bits):

char trans_desc[] = "Transpose submission";
void transpose_submit(int M, int N, int A[M][N], int B[N][M])

Do NOT change the description string "Transpose submission" for your transpose_submit function. The autograder searches for this string to determine which transpose function to evaluate for credit.

Programming Rules for Part 2

You will recieve NO POINTS for Part 2 if you violate the following rules. As with Lab 1a, we provide you the script we use to check programming rule correctness for Part 2. See the for more information.

Your code in trans.c must compile without warnings to receive credit.
Your transpose function may not use recursion or any variant of malloc.
Your transpose function may not modify array A. You may, however, do whatever you want with the contents of array B.

You are allowed to define at most 12 local variables of type int per transpose function. This means 12 variables in scope at once. For example, the following code would be allowed:

/* Note: function arguments are not counted as local variables */
void func_ok(int x, int y) {
    int array1[8];
    if (x == y) {
        int array2[4];  // OK: 12 ints in scope
    } else {
        int array3[2];  // OK: 10 ints in scope
    }
}

But the following is not:

void func_bad(int x, int y) {
    int array1[8];
    if (x == y) {
        int array2[2];  // OK: 10 ints in scope
        if (x > 0) {
            int array3[4];  // BAD: 14 ints in scope
                            // Even though this branch may not always run,
                            // the MAXIMUM number of local ints exceeds 12.
        }
    }
}

The reason for this restriction is that our testing code is not able to count references to the stack. We want you to limit your references to the stack and focus on the access patterns of the source and destination arrays.

Your are permitted to use int arrays, but arrays DO count towards your 12 local ints. For example, if you have already declared 4 ints, you could declare an array of ints up to size 8.
If you choose to use helper functions, you may not have more than 12 local variables on the stack at a time between your helper functions and your top level transpose function. For example, if your transpose declares 8 variables, and then you call a function which uses 4 variables, which calls another function which uses 2, you will have 14 variables on the stack, and you will be in violation of the rule.
You are not allowed to side-step the previous rule by using any variables of type long or by using any bit tricks to store more than one value to a single variable.

Performance and Grading

We will evaluate the correctness and performance of your transpose_submit function on a cache with parameters s = 5 (index bits), E = 1 (associativity), k = 5 (offset bits) for 2 different-sized input matrices:

32 x 32 (M=32, N=32)
64 x 64 (M=64, N=64)

Your code must be functionally correct (i.e. produce the correct transposed matrix) to receive any performance points for a particular size. Your performance score for each matrix size scales linearly with the number of misses, m, up to some threshold:

32 x 32: 5 points if m ≤ 288, 0 points if m > 600
64 x 64: 5 points if m ≤ 1700, 0 points if m > 2400

Your code only needs to be correct for these two cases and you can optimize it specifically for these two cases. In particular, it is perfectly OK for your function to explicitly check for the input sizes and implement separate code optimized for each case.

Extra Credit

Extra credit will be awarded for code that results in fewer misses than the full score "baselines." For reference, there exist solutions with m as low as 259 (32 x 32) and 1052 (64 x 64).

Part 2 Testing

Transpose Testing (`test-trans`)

You can check the performance and correctness scores of your transpose function(s) by running test-trans, an executable generated by the given Makefile. Your final submission must go in the given transpose_submit function. However, we provide the ability to try out multiple implementations; see below for more details.

To test your code for a particular size:

$ make # run make each time you change your code
$ ./test-trans -M 32 -N 32

Function 0 (1 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 0 (Transpose submission): hits:1766, misses:288, evictions:255

Summary for official submission (func 0): correctness=1 misses=288

The test program ran the solution function and printed out two pieces of information:

The correctness of the solution. If 0, that means your function did not correctly compute the transposition of the matrix.
The hit, miss and eviction counts. Your score is based only on the number of misses – lower is better.

To analyze where the cache hits and misses are coming from, you will need to examine saved trace files. test-trans saves the trace for the function i in the file trace.fi. To debug a particular function, simply run its trace through the reference simulator with the verbose option:

$ ./support/csim-ref -v -s 5 -E 1 -b 5 -t trace.f0
S 68312c,1 miss
L 683140,8 miss
L 683124,4 hit
L 683120,4 hit
L 603124,4 miss eviction
S 6431a0,4 miss
...

Each line denotes one or two memory accesses, using the following format:

operation address,size result

The operation field denotes the type of memory access: "I" denotes an instruction load, "L" a data load, "S" a data store, and "M" a data modify (i.e., a data load followed by a data store). There is never a space before each "I". There is always a space before each "M", "L", and "S".
The address field specifies a 64-bit hexadecimal memory address.
The size field specifies the number of bytes accessed by the operation.
The result field specifies the result of the memory access with regards to the cache.

Registering Multiple Transpose Functions

You can register up to 100 versions of the transpose function in your trans.c file for use in test-trans.c. Each transpose version has the following form:

/* Header comment */
char trans_simple_desc[] = "A simple transpose";
void trans_simple(int M, int N, int A[M][N], int B[N][M])
{
    /* your transpose code here */
}

To register a particular transpose function with the autograder, make a call in the registerFunctions routine in trans.c of the form:

registerTransFunction(trans_simple, trans_simple_desc);

At runtime, the autograder will evaluate each registered transpose function and print the results. Of course, one of the registered functions must be the transpose_submit function that you are submitting for credit:

registerTransFunction(transpose_submit, transpose_submit_desc);

See the default trans.c function for an example of how this works.

The following example has four different transpose functions registered and runs on a 32 x 32 matrix. Note that while it displays the results for each registered function, it still extracts the results for the official submission at the end.

$ make # run make each time you change your code
$ ./test-trans -M 32 -N 32
Step 1: Evaluating registered transpose funcs for correctness:
func 0 (Transpose submission): correctness: 1
func 1 (Simple row-wise scan transpose): correctness: 1
func 2 (column-wise scan transpose): correctness: 1
func 3 (using a zig-zag access pattern): correctness: 1
Step 2: Generating memory traces for registered transpose funcs.
Step 3: Evaluating performance of registered transpose funcs (s (Index Bits) = 5, E (Associativity) = 1, k (Offset Bits) = 5)
func 0 (Transpose submission): hits:1766, misses:288, evictions:255
func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151
func 2 (column-wise scan transpose): hits:870, misses:1183, evictions:1151
func 3 (using a zig-zag access pattern): hits:1076, misses:977, evictions:945
Summary for official submission (func 0): correctness=1 misses=288

Lab Score (`grade_trans.py`)

We have provided you with a driver program, called grade_trans.py, that performs an evaluation of your transpose code's correctness and performance for both matrix sizes – this is the same program your instructor will use to evaluate your submission! To run the driver, type:

$ make # run make each time you change your code
$ python3 grade_trans.py
Part B: Testing transpose function
Running ./test-trans -M 32 -N 32
Running ./test-trans -M 64 -N 64

Cache Lab summary:
                        Points   Max pts      Misses
Trans perf 32x32          5.00         5         288
Trans perf 64x64          5.00         5        1107
          Total points   10.00         1

Rules Compliance (`check_trans_vars`)

As stated above, you will recieve NO POINTS for Part 2 if you violate the programming rules. To check for violations, run the following command:

$ ./check_trans_vars trans.c

If your code passes the checks, you will see the following:

$ ./check_trans_vars trans.c
Variables used by transpose_submit: 2
transpose_submit passes the programming rules checks.

If it fails, you will see something like this:

$ ./check_trans_vars trans.c
transpose_submit DOES NOT PASS the programming rules checks:
        - Maximum variable count exceeded (transpose_submit: 14)
        - Disallowed variable type(s) used: long

Resources

Cache Blocking

The cache blocking technique is introduced in , so you can refer to those slides and recording. You can find different explanations in and (using the term "Loop Blocking", instead). Here is an that may be useful too.

Cache Simulation

The is available to help you get practice with and visualize different cache operations and mechanics. See the associated for more information.

Matrix Visualization

The can help you visualize how matrices are laid out in memory, which may come in handy for Part 2. Just select the parameters of your matrix and you can hover over each matrix element and see its address.

The matrix visualizer purposely restricts you to small sizes to keep things digestible. If you'd like to see a more "realistic" visualization of the matrices for Part 2, see , which shows the last 2 bytes of the address of each cache block as they align to the 32x32 matricies. Red squares notate the start of a section that would fit in cache. Green lines divide individual integers, and black lines divide cache blocks.

Tips and Hints

Performance Tips

Struggling to reduce your misses? Consider the following:

Only memory accesses to the heap are monitored. Just be careful not to break any programming rules!
Since your transpose function is being evaluated on a direct-mapped cache, conflict misses are a potential problem. Think about the potential for conflict misses in your code, especially along the diagonal. Try to think of access patterns that will decrease the number of these conflict misses.

Programming Rules Tips

Algorithm breaking the programming rules? Consider the following:

If you have variables specific to one case of an "if", you can declare the variable within that block and it will only count against your variable count for code in that part of the "if". Remember — you're only limited by the number of variables that exist at one time. You can declare variables within "if" statements and loops. This is especially helpful if you have different variables for the two matrix sizes.
Use #define! These don't count against your variable total. You can declare constants using #define and use them wherever you want.

Always use a constant integer size when declaring arrays. The variable rules checker will fail if you don't do this. For example, instead of this:

int size = 5;
int array[size]; // error: trans.c:24:9: Unable to process due to non-constant value "size" used in array declaration.
                 // Please use a constant integer as the array dimension.

Do this:

int array[5]; // OK!

Lab 4 Synthesis Questions [11 points]

In the synthesis/ directory of your lab, you will find lab0.c, along with two other files:

lab0.d: The assembly of compiled lab0.c, without any optimizations done by the compiler.
lab0opt.d: The assembly of compiled lab0.c, with optimizations done by the compiler (compiled with the flag -O2).

DO NOT ATTEMPT TO REGENERATE THESE FILES YOURSELF! If you do, your assembly output may differ from ours, especially if you don't compile on the CSE Linux environment!

One use of knowing/looking at assembly is understanding what's actually going on, since some "magic" happens between your source code and executable. While doing the synthesis, we expect you will read the assembly, rather than use GDB, to answer the following questions.

In part_4, the nested loops come between the calls to clock() (callq <clock@plt> in the assembly). How many of the assembly instructions in the nested loops access (read or write to) memory? Provide the count and list of instructions for both the unoptimized and optimized code. For the lists of instructions, you can either copy and paste the instructions or include the relative address of each instruction (<func+##>). [2 pt]
- Only count each instruction once; ignore the looping mechanics and multiple accesses in a single instruction.
- nop* is a family of instructions that do nothing (i.e., they don't access memory).
- Don't count the instructions found in big_array_index, but the callq <big_array_index> instruction is part of the nested loops.
- Only count the number of data accesses caused by the execution of the instructions and not by their retrieval.
In the unoptimized code, where are the loop variables i, j, and k "stored?" Give the corresponding assembly operands (e.g., %r8, 2(%rax)) and not addresses. [3 pt]
In the optimized code, the loop variables have sort of disappeared! Name what the values found in registers %edx and %rcx correspond to in the C code. [2 pt]
Take a look at your transpose code for the 32 x 32 matrix. If the starting address of matrix A did not map to the beginning of a cache block, how would the number of misses be affected? (Increase / Stay the Same / Decrease) Briefly explain. [2 pt]
Take a look at your transpose code for the 64 x 64 matrix. If you were to run this code for a matrix of size 57 x 71, how would the number of misses be affected? (Increase / Stay the Same / Decrease) Briefly explain. [2 pt]
- You don’t need to modify your code to work with this matrix size. Use what you know about cache images to answer this.

Submission

You will submit: cache-test-skel.c, trans.c, and lab4synthesis.txt.

Make sure that your final transpose code is in a function named transpose_submit and that you've double-check your adherence to the programming rules with ./check_trans_vars trans.c.

After submitting, please wait until the autograder is done running and double-check that you passed the "File Check" and "Compilation and Execution Issues" tests. If either test returns a score of -1, be sure to read the output and fix any problems before resubmitting. Failure to do so will result in a programming score of ZERO for the lab.

Submit your files to the "Lab 4" assignment on . Don't forget to add your partner, if you have one.
If you completed the extra credit, also submit trans.c to the "Lab 4 Extra Credit" assignment.

It is fine to submit multiple times up until the deadline; we will only grade your last submission. If you do re-submit, you must re-submit BOTH files again AND add your partner again to the new submission.