Comparison Sorts Reading
Sorting is one of the most well-studied problems in computer science. Many problems reduce to sorting, such as duplicate finding. Many algorithms run significantly faster on sorted inputs, often for computer architecture reasons. We learn sorting algorithms for many of the same reasons for learning data structures: to learn more general algorithm design patterns and strategies that we can apply towards solving problems.
Problem definition
Formally, sorting is well-defined only when the ordering relation is a total order. Review total ordering in the reading on Sets, Maps, and BSTs.
A sort is a permutation (rearrangement) of a sequence of keys that puts the keys into non-decreasing order relative to a given ordering relation. In the Autocomplete homework, we defined three different total orderings for the `SimpleTerm` class.
- Lexicographic (dictionary) order for the `query` string.
- Lexicographic order for the first `r` characters of the `query` string.
- Descending order for the term’s associated `weight`.
Using string length as the ordering relation, give two valid sorts for "cows", "get", "going", "the".
There are two valid sorts since “the” is considered equivalent to “get” when comparing by string length.
- “the”, “get”, “cows”, “going”
- “get”, “the”, “cows”, “going”
A sort is considered stable if the relative order of equivalent keys is maintained after sorting. Using the answer above, a stable sorting algorithm is guaranteed to return “get”, “the”, “cows”, “going” since “get” came before “the” in the original input.
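The stability guarantee can be checked with a short sketch. It assumes `Arrays.sort` on an object array, which the Java documentation specifies as stable; the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

class StableLengthSort {
    // Returns a copy of the input sorted by string length only.
    static String[] sortByLength(String[] words) {
        String[] copy = Arrays.copyOf(words, words.length);
        // Arrays.sort on an object array is documented to be stable,
        // so equal-length strings keep their original relative order.
        Arrays.sort(copy, Comparator.comparingInt(String::length));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"cows", "get", "going", "the"};
        System.out.println(Arrays.toString(sortByLength(words)));
        // prints [get, the, cows, going]
    }
}
```

Because "get" precedes "the" in the input and both have length 3, the stable sort keeps them in that order.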
Describe the result of stably sorting autocomplete terms first by decreasing weight order and then by lexicographic order.
The resulting sorted output will be in lexicographic order. Terms with the same `query` string will be subsorted in decreasing weight order. If there are multiple “Starbucks” stores in the output, they will be ordered by their weighted importance.
As a result, it’s often (but not always) important to use a stable sorting algorithm when our data contains multiple fields.
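The two-pass stable sort described above might look like the following sketch. `WeightedTerm` is a hypothetical stand-in with a `query` and a `weight`, not the homework's `SimpleTerm` class, and the sketch relies on `List.sort` being stable.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TwoPassStableSort {
    // Hypothetical stand-in for an autocomplete term (not the homework's SimpleTerm).
    static final class WeightedTerm {
        final String query;
        final long weight;

        WeightedTerm(String query, long weight) {
            this.query = query;
            this.weight = weight;
        }

        @Override
        public String toString() {
            return query + ":" + weight;
        }
    }

    static List<WeightedTerm> sortTwice(List<WeightedTerm> terms) {
        List<WeightedTerm> result = new ArrayList<>(terms);
        // Pass 1: decreasing weight.
        result.sort(Comparator.comparingLong((WeightedTerm t) -> t.weight).reversed());
        // Pass 2: lexicographic by query. The sort is stable, so terms with
        // equal queries keep the decreasing-weight order from pass 1.
        result.sort(Comparator.comparing((WeightedTerm t) -> t.query));
        return result;
    }
}
```

With an unstable sort in the second pass, the terms sharing a query could come out in any weight order.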
Give an example of a Java data type that does not contain multiple fields.
Primitive data types such as `int` or `double` are not affected by stability since numbers do not have other fields. Mixing up the relative ordering of two equal numbers has no effect on any client programs.
Strings are also a good example. While we might be interested in defining different ways of sorting strings (such as by length or content), the value of a string is entirely dependent on the characters in the string. The string “Starbucks” is equal to any other string “Starbucks”. It’s only when we introduce weighted terms (representing different stores) that stable sorting matters.
Java follows this rule: `Arrays.sort` uses different sorting algorithms based on your data type. Arrays of primitive types are sorted with a faster but unstable sorting algorithm, whereas arrays of reference types (and lists passed to `Collections.sort`, which can only hold reference types) are sorted with a slower but stable sorting algorithm. Knowing whether or not your algorithm requires stability can yield a constant factor speedup.
Algorithms
We evaluate a sorting algorithm in terms of both asymptotic time complexity (runtime) and asymptotic space complexity (memory usage). The space complexity of an algorithm is based on the additional memory needed to run the algorithm; we don’t count the order N memory needed to represent the unsorted input.
Furthermore, we assume that our unsorted input is always given as an array. To sort linked lists, for example, Java copies all of the linked list items into a temporary array, sorts the temporary array, and then overwrites the items in the original linked list with the sorted sequence.
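The copy, sort, and overwrite steps for a linked list can be sketched as follows. This is an illustration of the idea under the assumption that items are `Comparable`, not the library's actual source; `sortViaArray` is a made-up name.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

class SortViaArray {
    @SuppressWarnings("unchecked")
    static <T extends Comparable<? super T>> void sortViaArray(List<T> list) {
        // 1. Copy all of the items into a temporary array.
        Object[] temp = list.toArray();
        // 2. Sort the temporary array by natural ordering.
        Arrays.sort(temp);
        // 3. Overwrite each list position with the sorted sequence.
        ListIterator<T> it = list.listIterator();
        for (Object item : temp) {
            it.next();
            it.set((T) item);
        }
    }

    public static void main(String[] args) {
        List<Integer> list = new LinkedList<>(Arrays.asList(3, 1, 2));
        sortViaArray(list);
        System.out.println(list); // prints [1, 2, 3]
    }
}
```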
Selection sort
Selection sort maintains a sorted section at the front of the array and gradually grows it by repeatedly selecting the smallest remaining item and swapping it to its proper index.
- Find the smallest item in the array, and swap it with the first item.
- Find the second smallest item in the array, and swap it with the second item.
- Continue until all items in the array are sorted.
Selection sort has a time complexity in Theta(N^2) and space complexity in Theta(1). Review selection sort in the reading on Iterative Algorithm Analysis.
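As a reminder of the steps above, here is a minimal sketch of selection sort on an `int` array. The class and method names are illustrative only.

```java
import java.util.Arrays;

class SelectionSortSketch {
    static void selectionSort(int[] a) {
        // Everything before index i is the sorted section.
        for (int i = 0; i < a.length - 1; i++) {
            // Linear scan for the smallest remaining item.
            int min = i;
            for (int j = i + 1; j < a.length; j++) {
                if (a[j] < a[min]) {
                    min = j;
                }
            }
            // Swap it to its proper index, growing the sorted section by one.
            int tmp = a[i];
            a[i] = a[min];
            a[min] = tmp;
        }
    }

    public static void main(String[] args) {
        int[] a = {5, 2, 8, 1, 3};
        selectionSort(a);
        System.out.println(Arrays.toString(a)); // prints [1, 2, 3, 5, 8]
    }
}
```

The nested loops account for the quadratic runtime, and the only extra memory is a few local variables.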
Heapsort
Heapsort optimizes selection sort by replacing the linear scan with a min-heap data structure. A direct implementation of this approach requires two additional arrays, each the same size as the input: one to store the heap and one to store the sorted output. Each item removed from the heap is one item added to the sorted section.
However, heapsort is typically implemented as an in-place algorithm, one which transforms the input using no extra data structures but allowing a constant number of additional variables. (In other words, constant space complexity.) Since we can’t create new arrays, the input array will need to represent both the heap of unsorted items as well as the sorted output. This raises two problems.
- There’s contention at the front of the array. Both the heap representation and the sorted section want to use the front of the array. In-place heapsort maintains a max-heap rather than a min-heap. Each item removed from the max-heap is conveniently placed at the end of the array, exactly where it should go in the sorted output.
- It’s not clear how to even build a heap representation sharing the same array as input. We might end up accidentally overwriting items. In-place heapsort constructs a heap using in-place, bottom-up heapification.
Bottom-up heapification is an efficient algorithm for heap construction that works by sinking nodes in reverse level order from the “bottom-up”.
What is the runtime for bottom-up heapification?
The runtime of bottom-up heapification is in O(N log N). Each of the N items may need to sink down to a leaf node. At worst, this sink procedure needs to swap log N times.
It’s possible to prove a tighter bound: that the runtime for bottom-up heapification is in Theta(N), though the analysis is out of scope. Intuitively, only the root can sink as many as log N levels, while the leaves (about half of all nodes) don’t need to sink at all.
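Bottom-up heapification might be sketched as below. The sketch assumes a zero-indexed array in which the children of index `i` sit at `2i + 1` and `2i + 2`, and it builds a max-heap; the names `sink` and `maxHeapify` are illustrative.

```java
class BottomUpHeapify {
    // Sinks the item at index i until both children (within size) are no larger.
    static void sink(int[] a, int i, int size) {
        while (2 * i + 1 < size) {
            int child = 2 * i + 1;
            // Pick the larger of the two children.
            if (child + 1 < size && a[child + 1] > a[child]) {
                child++;
            }
            if (a[i] >= a[child]) {
                break;
            }
            int tmp = a[i];
            a[i] = a[child];
            a[child] = tmp;
            i = child;
        }
    }

    // Rearranges the array in place into a max-heap.
    static void maxHeapify(int[] a) {
        // Reverse level order, starting at the last node that has a child;
        // leaves are skipped because they have nowhere to sink.
        for (int i = a.length / 2 - 1; i >= 0; i--) {
            sink(a, i, a.length);
        }
    }
}
```

After `maxHeapify` returns, every parent is at least as large as its children, and no second array was needed.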
In-place heapsort has an overall time complexity in O(N log N) and space complexity in Theta(1).
- Bottom-up max-heapification.
- Repeatedly delete the max item until the heap is empty (N iterations).
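Putting the two steps together, in-place heapsort could look like this sketch. It assumes a zero-indexed max-heap stored at the front of the array, with the sorted section growing from the back; the helper names are illustrative.

```java
import java.util.Arrays;

class InPlaceHeapsort {
    static void heapsort(int[] a) {
        int n = a.length;
        // Step 1: bottom-up max-heapification.
        for (int i = n / 2 - 1; i >= 0; i--) {
            sink(a, i, n);
        }
        // Step 2: repeatedly delete the max. Swapping the root with the last
        // heap item places the max exactly where it belongs in sorted order.
        for (int end = n - 1; end > 0; end--) {
            swap(a, 0, end);
            sink(a, 0, end); // the heap now occupies indices 0..end-1
        }
    }

    private static void sink(int[] a, int i, int size) {
        while (2 * i + 1 < size) {
            int child = 2 * i + 1;
            if (child + 1 < size && a[child + 1] > a[child]) {
                child++;
            }
            if (a[i] >= a[child]) {
                break;
            }
            swap(a, i, child);
            i = child;
        }
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    public static void main(String[] args) {
        int[] a = {4, 1, 3, 9, 7};
        heapsort(a);
        System.out.println(Arrays.toString(a)); // prints [1, 3, 4, 7, 9]
    }
}
```

Each of the N deletions sinks the new root at most log N levels, which gives the O(N log N) bound.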
Merge sort
Merge sort is a recursive sorting algorithm that relies on the fact that an array of length 1 is sorted.
- If the array is of size 0 or 1, return.
- Recursively merge sort the left half.
- Recursively merge sort the right half.
- Merge the two sorted halves.
Merge sort has a time complexity in Theta(N log N) and space complexity in Theta(N). Review merge sort in the reading on Recursive Algorithm Analysis.
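The four steps can be sketched as follows. This version copies each half into a new array, which is one simple way to get the Theta(N) extra space; the class and method names are illustrative.

```java
import java.util.Arrays;

class MergeSortSketch {
    static void mergeSort(int[] a) {
        // Base case: an array of size 0 or 1 is already sorted.
        if (a.length <= 1) {
            return;
        }
        int mid = a.length / 2;
        int[] left = Arrays.copyOfRange(a, 0, mid);
        int[] right = Arrays.copyOfRange(a, mid, a.length);
        mergeSort(left);
        mergeSort(right);
        merge(left, right, a);
    }

    // Merges two sorted arrays into out, where out.length == left.length + right.length.
    private static void merge(int[] left, int[] right, int[] out) {
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length) {
            // Taking from the left on ties is what keeps merge sort stable.
            out[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        }
        while (i < left.length) {
            out[k++] = left[i++];
        }
        while (j < right.length) {
            out[k++] = right[j++];
        }
    }

    public static void main(String[] args) {
        int[] a = {5, 2, 4, 1, 3};
        mergeSort(a);
        System.out.println(Arrays.toString(a)); // prints [1, 2, 3, 4, 5]
    }
}
```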
Summary
Algorithm | Runtime | Space | Stable |
---|---|---|---|
Selection sort | O(N^2) | O(1) | No |
Heapsort | O(N log N) | O(1) | No |
Merge sort | O(N log N) | O(N) | Yes |
In practice, heapsort is typically slower than the competing unstable O(N log N) sort, quicksort, which we’ll see in the next lecture.