Three of the most widely used sorting algorithms are
insertion sort, mergesort, and quicksort.
Mergesort comes in two basic varieties: recursive and iterative. Insertion sort and quicksort have worst-case running time O(n²), while mergesort has worst-case running time O(n log n). The average-case running time of quicksort is O(n log n). This is all very nice theory, but it is good to know which algorithm should be used in a given situation. Common wisdom states that insertion sort does well only on small input sets and/or sets that are almost sorted. Mergesort uses twice the memory of quicksort and does a lot of copying, so it is slower than quicksort. In this project you will either verify this common wisdom or refute it. There are a number of variations of these algorithms. For example, in quicksort there are several ways to choose the pivot.
1. Implement insertion sort, recursive mergesort, iterative mergesort, and two varieties of quicksort.
2. Do a timing study to determine the average running time of both varieties of mergesort and both varieties of quicksort for data sets ranging from roughly 10² to 10⁷ elements. Show the running times on a log-log plot of running time vs. data set size. Also show the normalized running times of both varieties of mergesort and both varieties of quicksort.
3. Do a cache simulation of both varieties of the quicksort and mergesort algorithms using CacheGrind, for data sets ranging from 10³ to 10⁶ elements. Record instruction counts, L1 cache misses, and L2 cache misses for an 8KB I1/D1 cache and a 64KB L2 cache. Plot normalized instruction counts, normalized L1 cache misses, and normalized L2 cache misses for both varieties of each algorithm.
Experimental Methodology
Care must be taken to make sure that the numbers you generate are valid and
reflect the true performance of your algorithms. For timing measurements you
should use the provided CurrentTime() function to measure wall-clock time.
A key issue in using any clock is ensuring that it provides millisecond
or better resolution. You will need to run your experiments long enough
that the clock's resolution does not dominate the times you are measuring.
You will discover the need to repeat experiments for each of your data points. If you are running your experiment on a multi-tasking system, your timing measurements can be affected by other tasks running on the same machine. How many trials should you make in order to get good data? What data should be thrown out? A common experimental approach is to make a small number (like 11) of trials and take the median as an indication of the correct value. This is better than taking the average of all 11, which might include bad data.

To minimize interference from other tasks you might consider running your experiments on a single-user computer. Even on a single-user machine that is also multi-tasking, you should be careful to stop any background jobs. Care must also be taken to avoid reading from or writing to files while you are making measurements. Disk I/O times can easily dominate the computation times you are trying to measure. In general, you should use as little I/O (this includes printf() statements) as possible in your program, and certainly no I/O at all while you are timing the sorting algorithm itself. Even a simple printf() statement has to invoke a kernel system call, which is expensive to perform. If you want to verify the correctness of your sorting algorithm, you may use the provided VerifySortOrder() function or write your own I/O-less way of ensuring that your sort works correctly.
CacheGrind
For a part of this project, you have to simulate the memory performance of
your sort algorithms using CacheGrind. CacheGrind is an x86/Linux cache simulator
used for cache miss profiling. Why are we interested in cache performance? In
most modern computers, there is a huge gap between memory and CPU speeds (a
typical PC might have a 2-3 GHz CPU and 200-400 MHz RAM). If every memory
access had to go out to main memory, the CPU would spend most of its time
waiting for data and instructions to arrive. The solution to the "memory
gap" problem, as it is known, is to organize computer memory into a memory
hierarchy. The memory is split up into levels, such that levels closer to the
CPU are fast and small, and levels further from the CPU are slow and big. A
typical hierarchy looks like this:
Registers (closest to CPU). Access time = 0 cycles, size = measured in bytes
L1 cache. Access time = 0-2 cycles, size = 64+ KB
L2 cache. Access time = ~10 cycles, size = 512+ KB
Main memory. Access time = 100+ cycles, size = 1+ GB
Hard disk (farthest from CPU). Access time = 100,000+ cycles, size = 100+ GB
In order for the CPU to execute near its full speed, the majority (close to 99%) of all memory accesses have to hit in the L1 CPU cache, and the majority of L1 misses have to hit in the L2 cache. The cost of going out to main memory (or page faulting to disk) can easily dominate the costs of most other operations in a program. Because of this, the execution time of many programs is limited in practice by their cache performance. Programs achieve good cache performance by having good locality of reference.
Using CacheGrind will allow you to examine the cache performance of your sort algorithms. You will be simulating an 8KB, 8-way I1 cache with 32-byte lines, an 8KB, 4-way D1 cache with 64-byte lines, and a 64KB, 4-way L2 cache with 64-byte lines. First make sure that you have the correct path. In bash, run
export PATH=$PATH:/cse/courses/cse326/05wi/bin
Then, to run CacheGrind with these options, execute
cachegrind --I1=8192,8,32 --D1=8192,4,64 --L2=65536,4,64 -v <program> <args>
Keep in mind that running a program under CacheGrind is about 100 times slower than running it natively. Because of this, you will use smaller data sets for mergesort and quicksort.
Skeleton code
The skeleton code is available here. Use 'tar xf proj1.tar' to untar the code.
Handing in the code
You are required to hand in your project directory with your implementation
of your sorting algorithms. Your code should be well commented to explain in
English what it does. A README file should explain how your code can be
compiled and run.
Handing in the report
The report should have a title, and three sections: introduction, methodology, and results. The introduction explains the overall project. The methodology section explains the details of your experiments. No more than a page is needed for these sections. The results section includes the graphs as described above, with accompanying text. Each graph should have a title, with x and y coordinates clearly labeled, and with a legend for each line. With each graph there should be a paragraph explaining what was found by the experiments. You should explain, or speculate on the explanation of, any interesting features in the data. The reports are due at the beginning of lecture on Wednesday, October 26.
The programming portion of this assignment is small, but the overall workload is not as light as it may seem, so don't leave everything for the last minute. Report anything interesting that you come up with, especially if the numbers don't come out the way you expected. For plotting you can use Excel, MATLAB, or Mathematica. There is also a fairly nice tool called gnuplot in the Linux world.
Finally, we encourage you to go beyond the bare assignment requirements when experimenting with CacheGrind. You might want to try various modifications (different pivots, different values of CUTOFF) and compare their cache performance.
Good luck!