CSE 410 22wi Homework 6
Cache Aware Programming

Out: Thursday, 17 February
Due: Wednesday, 23 February
Latest Possible Submission: Saturday, 26 February
Turnin: Gradescope

Overview

The goal of this assignment is to observe the impact that cache aware implementation can have on performance. We use two implementations of matrix multiply as our application. You'll run code on klaatu and measure its performance. We are not asking you to do anything more than that. If you'd like to, though, we suggest trying to optimize performance (as measured by CPU time consumed). You can build and the execute the code like this:

    $ gcc -O3 hw6-blocked.c -o myHw6
    $ ./a.out 1600 4

Running the Code

You should run on klaatu.

We provide (on klaatu):

/courses/cse410/22wi/hw6/hw6.c
The source code for naive matrix multiply of two NxN matrices. You can look at it, but you don't have to do anything more than that with it.
/courses/cse410/22wi/hw6/hw6
The compiled/executable version of that code. You can invoke it on klaatu from any directory (for instance, your home directory) like this:
```
        $ /courses/cse410/22wi/hw6/hw6 200
```
It takes one argument (here the argument is "200"), which is the number of rows and columns, N, in the NxN matrices that will be multiplied.
/courses/cse410/22wi/hw6/hw6-blocked.c
The source code for the cache aware (blocked) implementation of matrix multiply.
/courses/cse410/22wi/hw6/hw6-blocked
The executable for the blocked version of matrix multiply. You can invoke it like this:
```
        $ /courses/cse410/22wi/hw6/hw6-blocked 200 10
```
The first argument is the number of rows and columns in the NxN matrices to be multiplied, and the second is the number of rows and columns in the BxB sized blocks into which the overall multiplication is decomposed.

Question 1

Use the Linux time command to measure running time. For instance,

    klaatu:~ $ time /courses/cse410/22wi/hw6/hw6 100

    real    0m0.293s
    user    0m0.002s
    sys     0m0.002s

"real" is wall clock elapsed time, and isn't relevant to us. "user" is CPU time spent executing the program's code, and "sys" is CPU time spent in the operating system on behalf of the program. We will use the sum of those two (in this case, 0.004 sec) as the measure of the program execution time.

Run the measurement repeatedly, each time doubling the parameter N. Go to at least 6400. Record and turn in the results.

Because this matrix multiply implementation is an O(N³) computation, you might expect the CPU time to go up by a factor of 8 each time you double the size of the matrix. What actually happens? Do you have an explanation for why performance behaves the way it does? You do not need to turn in anything about these questions.

Question 2

Now time the execution of the blocked version of implementation with N=1600. Choose different block sizes. Record and turn in the results.

Again, we have some questions to think about, but none of them require that you turn anything in:

Is the blocked version faster than the naive version? By how much?
Try to find the best block size (in terms of CPU time consumed). Is a very small block size best? A very large block size?
Imagine (or even try) experimenting on a much larger matrix. Would the ratio of CPU times consumed by naive and blocked increase, decrease, or remain the same?

Question 3

Use a tool that estimates the L1 (1st level cache) and LL (last level cache) miss rates for reads and writes. Here is an example invocation and resulting output:

klaatu:~/cse410/22wi/homeworks/hw6 $ valgrind --tool=cachegrind ./hw6-blocked 1600 10
==633131== Cachegrind, a cache and branch-prediction profiler
==633131== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==633131== Using Valgrind-3.16.0 and LibVEX; rerun with -h for copyright info
==633131== Command: ./hw6-blocked 1600 10
==633131== 

==633131== 
==633131== I   refs:      34,704,219,294
==633131== I1  misses:             1,120
==633131== LLi misses:             1,113
==633131== I1  miss rate:           0.00%
==633131== LLi miss rate:           0.00%
==633131== 
==633131== D   refs:       9,000,509,476  (8,519,855,395 rd   + 480,654,081 wr)
==633131== D1  misses:       133,791,539  (  133,470,883 rd   +     320,656 wr)
==633131== LLd misses:        52,165,979  (   51,845,363 rd   +     320,616 wr)
==633131== D1  miss rate:            1.5% (          1.6%     +         0.1%  )
==633131== LLd miss rate:            0.6% (          0.6%     +         0.1%  )
==633131== 
==633131== LL refs:          133,792,659  (  133,472,003 rd   +     320,656 wr)
==633131== LL misses:         52,167,092  (   51,846,476 rd   +     320,616 wr)
==633131== LL miss rate:             0.1% (          0.1%     +         0.1%  )

The first output section just tells you about the tool and how you invoked it. The next section is about instruction fetches, and how many misses there were in the first level and last level caches. The third section is about references to data, and cache misses. The final section is a summary of both instruction and data references.

The results from step 2 probably didn't agree with the simple analysis presented in class. Do the more detailed results from this step give you any insights into why? You might want to run the tool on both the naive and blocked versions.

Warning: the tool is a simulator, and so programs take much longer to run on it than they do when run directly on the hardware.

Run the tool on the block matrix multiply implementation for matrix size, N, of 1600 and using the best block size you found in Question 2. Hand in the output of that run.

We suggest trying some other block sizes as well and comparing the results. How big do changes in the miss rate have to be to make a noticable difference in execution time? (Again, notihng to hand in for these questions.)

Turn-in

You should hand in a table of measurements for Question 1, a table of measurements for Question 2, and the output of running the tool for Question 3.

Note that you are simply taking measurements. The right answer is whatever you observe, assuming you have run the experiment correctly.