Three of the most widely used sorting algorithms are
insertion sort, mergesort, and quicksort.
Mergesort comes in two basic varieties: recursive and iterative. Insertion sort and quicksort have worst-case running time O(n²), while mergesort has worst-case running time O(n log n). The average-case running time of quicksort is O(n log n). This is all very nice theory, but it is good to know which algorithm should be used in a given situation. Common wisdom states that insertion sort does well only on small input sets and/or sets that are almost sorted. Mergesort uses twice the memory of quicksort and does a lot of copying, so it is slower than quicksort. In this project you will either verify this common wisdom or refute it. There are a number of variations of these algorithms. For example, in quicksort there are several ways to choose the pivot.
1. Implement insertion sort, recursive mergesort, iterative mergesort, and two varieties of quicksort.
2. Do a timing study to determine the average running time of both varieties of mergesort and both varieties of quicksort for data sets ranging from roughly 10² to 10⁷ elements. Show the running times on a log-log plot of running time vs. data set size. Also show the normalized running times of both varieties of mergesort and both varieties of quicksort.
3. Do a cache simulation of both varieties of the quicksort and mergesort algorithms using CacheGrind, for data sets ranging from 10³ to 10⁶ elements. Record instruction counts, L1 cache misses, and L2 cache misses for an 8KB I1/D1 cache and a 64KB L2 cache. Plot normalized instruction counts, normalized L1 cache misses, and normalized L2 cache misses for both varieties of each algorithm.
Experimental Methodology
Care must be taken to make sure that the numbers you generate are valid and
reflect the true performance of your algorithms. For timing measurements you
should use the provided CurrentTime() function to measure wall-clock time.
A key issue in using any clock is ensuring that it provides millisecond
or better resolution. You will need to run your experiments long enough
that the clock's resolution does not dominate the times you are measuring.
You will discover the need to repeat experiments for each of your data points. If you are running your experiment on a multi-tasking system, your timing measurements can be affected by other tasks running on the same machine. How many trials should you make in order to get good data? What data should be thrown out? A common experimental approach is to make a small number (like 11) of trials and take the median as an indication of the correct value. This is better than taking the average of all 11, which might include bad data.

To minimize interference from other tasks you might consider running your experiments on a single-user computer. Even on a single-user machine that is also multi-tasking, you should be careful to stop any background jobs. Care must also be taken to avoid reading from or writing to files while you are making measurements. Disk I/O times can easily dominate the computation times you are trying to measure. In general, you should use as little I/O (this includes printf() statements) as possible in your program, and certainly no I/O at all while you are timing the sorting algorithm itself. Even a simple printf() statement has to invoke a kernel system call, which is expensive to perform. If you want to verify the correctness of your sorting algorithm, you may use the provided VerifySortOrder() function or write your own I/O-less way of ensuring that your sort works correctly.
CacheGrind
For a part of this project, you have to simulate the memory performance of
your sort algorithms using CacheGrind. CacheGrind is an x86/Linux cache simulator
used for cache miss profiling. Why are we interested in cache performance? In
most modern computers, there is a huge gap between memory and CPU speeds (a
typical PC might have a 2-3 GHz CPU and 200-400 MHz RAM). If every memory
access had to go out to main memory, the CPU would spend most of its time
waiting for data and instructions to arrive. The solution to the "memory
gap" problem, as it is known, is to organize computer memory into a memory
hierarchy. The memory is split up into levels, such that levels closer to the
CPU are fast and small, and levels further from the CPU are slow and big. A
typical hierarchy looks like this:
Registers (closest to CPU). Access time = 0 cycles, size = measured in bytes
L1 cache. Access time = 0-2 cycles, size = 64+ KB
L2 cache. Access time = ~10 cycles, size = 512+ KB
Main memory. Access time = 100+ cycles, size = 1+ GB
Hard disk (farthest from CPU). Access time = 100,000+ cycles, size = 100+ GB
In order for the CPU to execute near its full speed, the majority (close to 99%) of all memory accesses have to hit in the L1 CPU cache, and the majority of L1 misses have to hit in the L2 cache. The cost of going out to main memory (or page faulting to disk) can easily dominate the costs of most other operations in a program. Because of this, the execution time of many programs is limited in practice by their cache performance. Programs achieve good cache performance by having good locality of reference.
Using CacheGrind will allow you to examine the cache performance of your sort algorithms. You will be simulating an 8KB, 8-way I1 cache with 32-byte lines, an 8KB, 4-way D1 cache with 64-byte lines, and a 64KB, 4-way L2 cache with 64-byte lines. First make sure that you have the correct path. In bash, run
export PATH=$PATH:/cse/courses/cse326/05wi/bin
Then, to run CacheGrind with these options, execute
cachegrind --I1=8192,8,32 --D1=8192,4,64 --L2=65536,4,64 -v <program> <args>
Keep in mind that running a program under CacheGrind is about 100 times slower than running it natively. Because of this, you will use smaller data sets for mergesort and quicksort.
Skeleton code
The skeleton code is available here. Use 'tar xf proj1.tar' to untar the code.
Handing in the code
You are required to hand in your project directory with your implementation
of your sorting algorithms. Your code should be well commented to explain in
English what it does. A README file should explain how your code can be
compiled and run.
Handing in the report
The report should have a title, and three sections: introduction, methodology, and results. The introduction explains the overall project. The methodology section explains the details of your experiments. No more than a page is needed for these sections. The results section includes the graphs as described above, with accompanying text. Each graph should have a title, with x and y coordinates clearly labeled, and with a legend for each line. With each graph there should be a paragraph explaining what was found by the experiments. You should explain, or speculate on the explanation of, any interesting features in the data. The reports are due at the beginning of lecture on Wednesday, October 26.
The programming portion of this assignment is small, but the overall workload is not as light as it may seem, so don't leave everything for the last minute. Report anything interesting that you come up with, especially if the numbers don't come out the way you expected. For plotting you can use Excel, MATLAB, or Mathematica. There is also a fairly nice tool called gnuplot in the Linux world.
Finally, we encourage you to go beyond the bare assignment requirements when experimenting with CacheGrind. You might want to try various modifications (different pivots, different values of CUTOFF) and compare their cache performance.
Good luck!