CSE 326
Project 1
Evaluation of Sorting Algorithms, Part II

Code due: Wednesday, Feb 9th @ 11:00 am
Report due: Wednesday, Feb 9th at the beginning of class

Objectives

  1. Implement some variations of the mergesort algorithm in C.
  2. Evaluate the algorithms and determine parameters with timing studies.
  3. Evaluate the memory performance and instruction counts of the algorithms with the cache simulator CacheGrind.

Overview

Mergesort is one of the best O(n log n) sorting algorithms.  The iterative version of mergesort uses very few instructions, but on large data sets it has poor memory performance because of the repeated long scans over the two arrays.  An alternative to mergesort with better memory performance is multi-mergesort.  In this project you will compare these two algorithms with recursive mergesort in terms of running time and cache performance.  It can be hard to determine the size of the subarrays in multi-mergesort that gives the best performance, so as part of this project you will perform an experiment to determine the best size.  There are several ways to implement the multi-merge tournament in multi-mergesort.  As an option you may want to compare two alternatives: the insertion sort method and the tournament tree method.

Assignment

1.  Implement recursive mergesort (which you have already done in part 1), iterative mergesort, and multi-mergesort in C for 32-bit integers.  Multi-mergesort must take an extra parameter for the size of the subarrays that are sorted before the multi-merge step.  Your algorithms should work for any input size, not just powers of two.
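To illustrate the general structure of the iterative (bottom-up) version, here is a minimal sketch.  The function names are made up for illustration and are not the skeleton code's interface; adapt it to the provided framework.

```c
#include <stdlib.h>
#include <string.h>

/* Merge sorted src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void Merge(int *src, int *dst, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

/* Bottom-up mergesort: merge runs of width 1, 2, 4, ...
   Works for any n, not just powers of two. */
void IterativeMergesort(int *a, int n) {
    int *buf = malloc(n * sizeof(int));
    int *src = a, *dst = buf;
    for (int width = 1; width < n; width *= 2) {
        for (int lo = 0; lo < n; lo += 2 * width) {
            int mid = lo + width     < n ? lo + width     : n;
            int hi  = lo + 2 * width < n ? lo + 2 * width : n;
            Merge(src, dst, lo, mid, hi);
        }
        int *tmp = src; src = dst; dst = tmp;  /* swap array roles */
    }
    if (src != a) memcpy(a, src, n * sizeof(int));
    free(buf);
}
```

Swapping the source and destination arrays each pass avoids a copy per pass; note this is also why the iterative version scans the full length of both arrays repeatedly, which is the source of its poor cache behavior on large inputs.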

2.  For a data set on the order of 10^7 elements, time multi-mergesort for a number of values of the subarray size to determine the best value of this parameter.  Plot the running time for 10 to 20 settings of the subarray size parameter.  Choose points that demonstrate that the best value is fairly precisely determined.

3.  Do a timing study to determine the average running time of each of these mergesort algorithms for data sets ranging from roughly 10^2 to 10^7 elements.  Use the best subarray size determined in part 2.  Show the normalized running times of recursive mergesort, iterative mergesort, and multi-mergesort on a second plot with data set size on a log scale.  Here normalization means dividing the time by the number of keys.  Each plot should have about 10 to 20 points that are equally spaced on the data set size log scale.

4.  Do a cache simulation of these mergesort algorithms using CacheGrind, for data sets ranging from 10^3 to 10^6 elements.  Record instructions ("I refs"), L1 cache misses ("D1 misses"), and L2 cache misses ("L2 misses") for the default cache parameters (16KB I1 / 8KB D1 cache and 512KB L2 cache).  For multi-mergesort choose the subarray size to be 256KB (half the L2 cache size); since an int is 4 bytes, this is equivalent to 64K elements.  Plot normalized instruction counts, L1 cache misses, and L2 cache misses for all three mergesort algorithms with data set size on a log scale.  Each plot should have about 10 to 20 points that are equally spaced on the data set size log scale.

5.  (Optional) Implement two versions of multi-merge: insertion sort and tournament tree. Compare the normalized running times of multi-mergesort using these versions of  multi-merge algorithms. Determine if there is really a difference between these two implementations. If you implement this, clearly state so in your write-up and README files. Include instructions for running both versions of your multi-merge in the README file. (There are no additional points for this optional part of the project.  However, we will keep track of who successfully completes this option.  Success with optional problems will help decide borderline grades at the end of the quarter.)
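As a sketch of the insertion sort method for the multi-merge step: keep the current head of each sorted run in a small array ordered by key, repeatedly output the minimum, and re-insert the advancing run.  The names Run and MultiMerge below are illustrative, not from the skeleton code, and the sketch assumes every run is non-empty.

```c
/* A sorted run is a half-open range [pos, end) of ints. */
typedef struct { int *pos, *end; } Run;

/* k-way merge of k non-empty sorted runs into out,
   maintaining the run heads in sorted order ("insertion sort" method). */
void MultiMerge(Run *runs, int k, int *out) {
    /* Initially order the runs by their first key (insertion sort). */
    for (int i = 1; i < k; i++) {
        Run r = runs[i];
        int j = i - 1;
        while (j >= 0 && *runs[j].pos > *r.pos) { runs[j + 1] = runs[j]; j--; }
        runs[j + 1] = r;
    }
    int n = 0;
    while (k > 0) {
        Run r = runs[0];              /* run with the smallest head */
        out[n++] = *r.pos++;
        if (r.pos == r.end) {         /* run exhausted: drop it */
            for (int j = 1; j < k; j++) runs[j - 1] = runs[j];
            k--;
        } else {                      /* re-insert the advanced run */
            int j = 1;
            while (j < k && *runs[j].pos < *r.pos) { runs[j - 1] = runs[j]; j++; }
            runs[j - 1] = r;
        }
    }
}
```

Each output element costs O(k) comparisons in the worst case here; the tournament tree method reduces this to O(log k), which is the difference the optional experiment is meant to measure.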

Experimental Methodology (same as in part I of the project)

Care must be taken to make sure that the numbers you generate are valid and reflect the true performance of your algorithms. For timing measurements you should use the provided CurrentTime() function to measure the wall clock time. A key issue in using any clock is ensuring that it provides millisecond or better resolution. You will need to run your experiments for several seconds to get accurate measurements.

You will discover the need to make repeated experiments for each of your data points. If you are running your experiment on a multi-tasking system, then your timing measurement can be affected by other tasks running on the same machine. How many trials should you make in order to get good data? What data should be thrown out? A common experimental approach is to make a small number (like 11) of trials and take the median as an indication of the correct value. This is better than taking the average of all 11, which might include bad data. To minimize interference from other tasks you might consider running your experiments on a single-user computer. Even on a single-user machine that is also multi-tasking you should be careful to stop any background jobs. Care must also be taken to avoid reading from or writing to files while you are making measurements. Disk I/O times can easily dominate the computation times you are trying to measure. In general, you should use as little I/O (this includes printf() statements) as possible in your program, and certainly no I/O at all while you are timing the sorting algorithm itself. Even a simple printf() statement has to invoke a kernel system call, and is pretty expensive to perform. If you want to verify the correctness of your sorting algorithm, you may use the provided VerifySortOrder() function or write your own I/O-less way of ensuring that your sort works correctly.
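The median-of-trials approach above might be coded along these lines.  This is only a sketch: CurrentTime() is stood in for here by a clock()-based function (check the skeleton code for the real one's signature), and ToySort is a placeholder for whichever sort you are measuring.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Stand-in for the provided CurrentTime(); returns seconds as a double. */
static double CurrentTime(void) {
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Toy placeholder for the algorithm under test (insertion sort). */
static void ToySort(int *a, int n) {
    for (int i = 1; i < n; i++) {
        int v = a[i], j = i - 1;
        while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; }
        a[j + 1] = v;
    }
}

static int CmpDouble(const void *x, const void *y) {
    double d = *(const double *)x - *(const double *)y;
    return (d > 0) - (d < 0);
}

/* Time `sort` on the same input `trials` times and return the median,
   which discards trials inflated by other tasks on the machine. */
double MedianSortTime(void (*sort)(int *, int), const int *data, int n, int trials) {
    double *t = malloc(trials * sizeof(double));
    int *copy = malloc(n * sizeof(int));
    for (int i = 0; i < trials; i++) {
        memcpy(copy, data, n * sizeof(int));   /* identical input each trial */
        double start = CurrentTime();
        sort(copy, n);                         /* no I/O inside this window */
        t[i] = CurrentTime() - start;
    }
    qsort(t, trials, sizeof(double), CmpDouble);
    double med = t[trials / 2];
    free(copy);
    free(t);
    return med;
}
```

Note that the input is re-copied before every trial, so each trial sorts the same unsorted data rather than re-sorting an already sorted array.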

CacheGrind (same as in part I of the project)

NOTE: You will use different cache parameters for this part of the project. Make sure you use the default parameters instead of specifying an 8KB I1 / 8KB D1 / 64KB L2 cache.

For a part of this project, you have to simulate the memory performance of sort algorithms using CacheGrind. CacheGrind is an x86/Linux cache simulator used for cache miss profiling. Why are we interested in cache performance? In most modern computers, there is a huge gap between memory and CPU speeds (a typical PC might have a 2-3 GHz CPU and 200-400 MHz RAM). If every memory access had to go out to main memory, the CPU would spend most of its time waiting for data and instructions to arrive. The solution to the "memory gap" problem, as it's known, is to organize computer memory into a memory hierarchy. The memory is split up into levels, such that levels closer to the CPU are fast and small, and levels further from the CPU are slow and big. The memory hierarchy in a typical personal computer (circa 2004) might look like this:

Registers (closest to CPU). Access time = 0 cycles, size = measured in bytes
L1 cache. Access time = 0-2 cycles, size = 64+ KB
L2 cache. Access time = ~10 cycles, size = 512+ KB
Main memory. Access time = 100+ cycles, size = 1+ GB
Hard disk (farthest from CPU). Access time = 100,000+ cycles, size = 100+ GB

In order for the CPU to execute near its full speed, the majority (close to 99%) of all memory accesses have to hit in the L1 CPU cache, and the majority of L1 misses have to hit in the L2 cache. The cost of going out to main memory (or page faulting to disk) can easily dominate the costs of most other operations in a program. Because of this, the execution time of many programs is limited in practice by their cache performance. Programs achieve good cache performance by having good locality of reference.

Using CacheGrind will allow you to examine the cache performance of your sort algorithms. You will be simulating a 16KB, 8-way I1 cache with 32 byte lines, an 8KB, 4-way D1 cache with 64 byte lines, and a 512KB, 8-way L2 cache with 64 byte lines. First make sure that you have the correct path. In bash, run

export PATH=$PATH:/cse/courses/cse326/05wi/bin

Then, to run CacheGrind with these options, execute

cachegrind -v <program> <args>

Another way is to use the alias command. Type

alias cg="/cse/courses/cse326/05wi/bin/cachegrind -v"

Now you can run your program in cachegrind by calling

cg <program> <args>

Keep in mind that running a program in CacheGrind is about 100 times slower than running it natively.

Skeleton code

Update: The skeleton code for part 2 is different than the skeleton code for part 1. Please get the updated code. You will have to manually paste in the changes you made to the skeleton code in part 1.

The skeleton code is available in the CSE 326 course directory on attu at /cse/courses/cse326/05wi/prj1-2.tar.gz. To extract the archive, copy it to a folder and extract with the tar command:

cp /cse/courses/cse326/05wi/prj1-2.tar.gz ~/
cd ~
tar zxvf ./prj1-2.tar.gz

Handing in the code

You are required to hand in your project directory with your implementation of the sorting algorithms. Your code should be well commented to explain in English what it does.  A README file should explain how your code can be executed. You may use the framework and skeleton code provided for this project. If you choose to write your own application from scratch, clearly document its usage in the README file. The code will be turned in electronically, no later than 11:00 am on Wednesday, Feb 9th.

Turn-in instructions: a link to the Catalyst turn-in tool will be posted on the course web page.

Handing in the report

The report should have a title, and three sections: introduction, methodology, and results.  The introduction explains the overall project.  The methodology section explains the details of your experiments.  The results section includes all the graphs described above, with accompanying text.  Each graph should have a title, with x and y axes clearly labeled, and with a legend for each line.  With each graph there should be a paragraph explaining what was found by the experiments.  The reports are due at the beginning of lecture on Wednesday, Feb 9th.

Hints:

The programming portion of part II is somewhat larger than part I.  Report anything interesting that you come up with, especially if the numbers don't come out the way you expected.  For plotting you can use Excel, MATLAB, or Mathematica.  There is also a fairly nice tool called gnuplot in the Linux world.


Good luck!