Three of the most commonly used sorting algorithms are
insertion sort, mergesort, and quicksort.
Mergesort comes in two basic varieties: recursive
and iterative. Insertion sort and quicksort
have worst-case running time O(n²), while
mergesort has worst-case running time O(n log n). The average-case
running time of quicksort is O(n log n).
This is all very nice theory, but it's good to know which
algorithm should be used in a certain situation. Common wisdom states
that insertion sort only does well on small input sets and/or sets that are
almost sorted. Mergesort uses twice as much memory
as quicksort and does a lot of copying, so it is
slower than quicksort. In this project you
will either verify this common wisdom or refute it. There are a number of
variations of these algorithms. For example, in quicksort
there are several ways to choose the pivot.
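As one concrete example, here is a minimal sketch in C of median-of-three pivot selection, a common way to choose the pivot; the function name and parameters are illustrative, not taken from the skeleton code.

    /* Median-of-three pivot selection: order a[lo], a[mid], a[hi] so that
       a[mid] holds the median of the three, then use it as the pivot. */
    static int median_of_three(int a[], int lo, int hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < a[lo]) { int t = a[lo]; a[lo] = a[mid]; a[mid] = t; }
        if (a[hi]  < a[lo]) { int t = a[lo]; a[lo] = a[hi];  a[hi]  = t; }
        if (a[hi] < a[mid]) { int t = a[mid]; a[mid] = a[hi]; a[hi] = t; }
        return mid;  /* a[lo] <= a[mid] <= a[hi] now holds */
    }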
1. Implement
insertion sort, recursive mergesort, iterative mergesort, and two varieties of quicksort.
2. Do a timing study to determine the best value of CUTOFF for both versions of quicksort. For a data set size of roughly N = 10⁶, plot the normalized running time for CUTOFF values ranging from 1 to 25, where the normalized time is the running time divided by N log₂ N. Use your optimal CUTOFF in part 3 below. (A sketch of how a CUTOFF might be used appears after this list.)
3. Do
a timing study to determine the average running time of both varieties of mergesort and both varieties of quicksort
for data sets ranging from roughly 10² to 10⁷ elements.
In separate plots show: (i) the running times on a
log-log plot, running time vs. data set size, and (ii) the normalized running times
of both varieties of mergesort and both varieties of quicksort.
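As a concrete illustration of the CUTOFF in part 2, here is a minimal sketch in C of quicksort switching to insertion sort on small subarrays. This is one common scheme, assumed here for illustration; insertion_sort() and partition() stand in for your own part 1 implementations, and the exact interfaces are hypothetical.

    /* Quicksort with a cutoff: subarrays of at most 'cutoff' elements are
       handed to insertion sort instead of being partitioned further.
       insertion_sort() and partition() are assumed from part 1. */
    void quicksort_cutoff(int a[], int lo, int hi, int cutoff)
    {
        if (hi - lo + 1 <= cutoff) {
            insertion_sort(a, lo, hi);  /* insertion sort wins on small inputs */
            return;
        }
        int p = partition(a, lo, hi);   /* pivot choice is one of your varieties */
        quicksort_cutoff(a, lo, p - 1, cutoff);
        quicksort_cutoff(a, p + 1, hi, cutoff);
    }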
Experimental Methodology
Care must be taken to make sure that the numbers you generate are valid and
reflect the true performance of your algorithms. For timing measurements you
should use the provided CurrentTime() function to measure the wall clock time. A key issue in
using any clock is ensuring that it provides millisecond
or better resolution. You will need to run your experiments for long enough that the
resolution of the clock does not distort your measurements.
You will discover the need to make repeated experiments for each of your data points. If you are running your experiments on a multi-tasking system, then your timing measurements can be affected by other tasks running on the same machine. How many trials should you make in order to get good data? What data should be thrown out? A common experimental approach is to make a small number (like 11) of trials and take the median as an indication of the correct value. This is better than taking the average of all 11 trials, which might include bad data.

To minimize interference from other tasks you might consider running your experiments on a single-user computer. Even on a single-user machine that is also multi-tasking, you should be careful to stop any background jobs. Care must also be taken to avoid reading from or writing to files while you are making measurements. Disk I/O times can easily dominate the computation times you are trying to measure. In general, you should use as little I/O (this includes printf() statements) as possible in your program, and certainly no I/O at all while you are timing the sorting algorithm itself. Even a simple printf() statement has to invoke a kernel system call, and is fairly expensive to perform. If you want to verify the correctness of your sorting algorithm, you may use the provided VerifySortOrder() function or write your own I/O-less way of ensuring that your sort works correctly.
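To make the median-of-trials idea concrete, here is a minimal sketch of a timing harness in C. It assumes CurrentTime() returns wall-clock time in seconds as a double; check the skeleton code for its actual signature. The helper name median_time() and the sort function's interface are illustrative, not part of the skeleton.

    #include <stdlib.h>

    #define TRIALS 11

    double CurrentTime(void);  /* provided by the skeleton code (assumed signature) */

    static int cmp_double(const void *x, const void *y)
    {
        double a = *(const double *)x, b = *(const double *)y;
        return (a > b) - (a < b);
    }

    /* Run the sort TRIALS times on fresh copies of the same input and
       return the median of the measured times. No I/O happens inside
       the timed region. */
    double median_time(void (*sort)(int *, int), const int *data, int *work, int n)
    {
        double t[TRIALS];
        for (int i = 0; i < TRIALS; i++) {
            for (int j = 0; j < n; j++)  /* re-copy so every trial sorts */
                work[j] = data[j];       /* the same unsorted data       */
            double start = CurrentTime();
            sort(work, n);
            t[i] = CurrentTime() - start;
        }
        qsort(t, TRIALS, sizeof t[0], cmp_double);
        return t[TRIALS / 2];            /* median of the 11 trials */
    }

The normalized time used in parts 2 and 3 is then this median divided by N log₂ N.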
Skeleton code
The skeleton code is available here. Use 'tar xf proj1.tar' to untar the code.
Handing in the code
You are required to hand in your project directory with your implementation
of the sorting algorithms. Your code should be well commented, explaining in English
what it does. A README file should explain how your code can be
compiled and run.
Handing in the report
The report should have a title and three sections: introduction, methodology, and results. The introduction explains the overall project. The methodology section explains the details of your experiments. No more than a page is needed for these two sections. The results section includes the graphs described above, with accompanying text. Each graph should have a title, with the x and y axes clearly labeled and with a legend for each line. Each graph should be accompanied by a paragraph explaining what was found by the experiments. You should explain, or speculate on the explanation of, any interesting features in the data. The reports are due at the beginning of lecture on Wednesday, January 30.
The programming portion of this assignment is small, but the overall workload is not as light as it may seem, so don't leave everything until the last minute. Report anything interesting that you come up with, especially if the numbers don't come out the way you expected. For plotting you can use Excel, MATLAB, or Mathematica. There is also a fairly nice tool called gnuplot in the Linux world.
Good luck!