
Sorting and Algorithm Bounds Reading

Complete the Reading Quiz by 3:00pm before lecture.

Real-World Sorting

Java decides which sorting algorithm to use based on the type of data to sort. For reference types (i.e., objects such as String), a variant of MergeSort is used. For primitive types (e.g., int, double), Java often uses dual-pivot QuickSort instead. When implementing a general-purpose sorting algorithm, it’s important for the Java developers to optimize for real-world runtime in addition to asymptotic time complexity.

One common optimization to real-world sorting algorithm runtime is to make the sort adaptive: the algorithm does less work when the input already has some structure.

Sorting Reference Types

Java’s MergeSort – the sorting algorithm used for reference types – is adaptive in that it takes advantage of features of the data. Recall that MergeSort works by repeatedly merging the sorted halves of an array. Real-world data is often not totally random: depending on the data source, there’s often some bias in the sampling or collection methods. This bias can take the form of natural runs in the data: subsequences of items in ascending (non-decreasing) or descending order.

For example, a non-decreasing run looks like: a0 <= a1 <= a2 <= ...

Natural runs are exactly what Java’s version of MergeSort is looking to merge. Instead of waiting for recursive calls to produce sorted subarrays, it takes advantage of the runs already present in the input and merges them together to form longer sorted subsequences. This modification of MergeSort has a linear order of growth in the best case, when the input array is already sorted: the entire input array is a single, non-decreasing run, so there is nothing left to merge.
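The idea of merging natural runs can be sketched in a few lines. This is a minimal illustration, not Java’s actual implementation: the class name NaturalMergeSketch and its methods are hypothetical, and for brevity it only detects non-decreasing runs (it does not find and reverse descending runs).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a "natural" merge sort: find the runs that already
// exist in the input, then merge neighbouring runs until one run remains.
final class NaturalMergeSketch {

    /** Start index of every non-decreasing run, followed by a.length as an end marker. */
    static List<Integer> runBoundaries(int[] a) {
        List<Integer> bounds = new ArrayList<>();
        bounds.add(0);
        for (int i = 1; i < a.length; i++) {
            if (a[i - 1] > a[i]) {
                bounds.add(i); // order breaks here, so a new run starts at i
            }
        }
        if (a.length > 0) {
            bounds.add(a.length);
        }
        return bounds;
    }

    static void sort(int[] a) {
        List<Integer> bounds = runBoundaries(a);
        int[] buffer = new int[a.length];
        // An already-sorted array is a single run, so this loop never executes
        // and the total work is the one linear scan in runBoundaries.
        while (bounds.size() > 2) {
            List<Integer> next = new ArrayList<>();
            next.add(0);
            int i = 0;
            for (; i + 2 < bounds.size(); i += 2) {
                merge(a, buffer, bounds.get(i), bounds.get(i + 1), bounds.get(i + 2));
                next.add(bounds.get(i + 2));
            }
            if (i + 1 < bounds.size()) {
                next.add(bounds.get(i + 1)); // odd run left over, carried to the next pass
            }
            bounds = next;
        }
    }

    /** Merges the sorted ranges a[lo, mid) and a[mid, hi) in place, using buffer as scratch space. */
    static void merge(int[] a, int[] buffer, int lo, int mid, int hi) {
        System.arraycopy(a, lo, buffer, lo, hi - lo);
        int left = lo;
        int right = mid;
        for (int k = lo; k < hi; k++) {
            if (right >= hi || (left < mid && buffer[left] <= buffer[right])) {
                a[k] = buffer[left++];
            } else {
                a[k] = buffer[right++];
            }
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 4, 7, 2, 3, 9, 5};
        System.out.println(runBoundaries(data)); // [0, 3, 6, 7]
        sort(data);
        System.out.println(Arrays.toString(data)); // [1, 2, 3, 4, 5, 7, 9]
    }
}
```

In the example, the input contains three runs (1 4 7, then 2 3 9, then 5), so only two merges are needed instead of the full recursion a standard MergeSort would perform.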

Sorting Primitive Types

Java’s QuickSort – the sorting algorithm used for primitive types – is also adaptive. For example, when calling Arrays.sort(int[] a), Java may choose to use InsertionSort instead of QuickSort if the array is very small. In Java 11, the threshold for InsertionSort is 47; that is, arrays with strictly fewer than 47 items will be InsertionSorted rather than QuickSorted. This decision is repeated recursively; that is, InsertionSort is also invoked whenever a partitioned subarray has fewer than 47 items.

/**
 * If the length of an array to be sorted is less than this
 * constant, insertion sort is used in preference to Quicksort.
 */
private static final int INSERTION_SORT_THRESHOLD = 47;

While the exact reasoning for why InsertionSort is faster than QuickSort on small arrays is out of scope, informally InsertionSort’s advantage is in directly fixing inversions. The exact value of 47 was picked through randomized testing; the Java developers determined which sorting algorithm was the most effective for arrays of a given length by comparing the time each sorting algorithm took to sort an array.
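The cutoff can be illustrated with a small hybrid sort. This is a simplified sketch under stated assumptions, not the JDK’s code: the class HybridQuickSortSketch is hypothetical, and it uses a single middle-element pivot with a less/equal/greater partition rather than Java’s dual-pivot scheme. Only the threshold value of 47 comes from the constant quoted above.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: QuickSort that hands small subarrays to InsertionSort.
final class HybridQuickSortSketch {
    // Same value as the JDK constant quoted above; everything else is simplified.
    static final int INSERTION_SORT_THRESHOLD = 47;

    static void sort(int[] a) {
        sort(a, 0, a.length - 1);
    }

    /** Sorts a[lo..hi], both ends inclusive. */
    static void sort(int[] a, int lo, int hi) {
        // The decision is re-made on every recursive call, not just at the top level.
        if (hi - lo + 1 < INSERTION_SORT_THRESHOLD) {
            insertionSort(a, lo, hi);
            return;
        }
        // Single-pivot partition into "less than", "equal to", "greater than" the pivot.
        int pivot = a[lo + (hi - lo) / 2];
        int lt = lo;
        int i = lo;
        int gt = hi;
        while (i <= gt) {
            if (a[i] < pivot) {
                swap(a, lt++, i++);
            } else if (a[i] > pivot) {
                swap(a, i, gt--);
            } else {
                i++;
            }
        }
        sort(a, lo, lt - 1);
        sort(a, gt + 1, hi);
    }

    /** Sorts a[lo..hi] by shifting each item left past the larger items before it. */
    static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int item = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > item) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = item;
        }
    }

    static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(1_000, 0, 10_000).toArray();
        int[] expected = data.clone();
        Arrays.sort(expected);
        sort(data);
        System.out.println("matches Arrays.sort: " + Arrays.equals(data, expected));
    }
}
```

Because every recursive call re-checks the subarray length, InsertionSort ends up handling all of the small leftover pieces once QuickSort has roughly partitioned the input.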

Furthermore, MergeSort’s data-oriented optimization is so effective that Java’s QuickSort implementation switches to optimized MergeSort if the data contains a small number of runs. Remember that Java’s version of MergeSort performs best with a small number of long runs rather than a large number of short (possibly even single-element) runs.

/**
 * The maximum number of runs in merge sort.
 */
private static final int MAX_RUN_COUNT = 67;

In summary, Arrays.sort(int[] a) decides between three sorting algorithms based on the properties of the input array.

  • If the entire array is very structured (fewer than 67 runs), use adaptive MergeSort even though the input type is primitive.
  • (Re-)Decide between QuickSort and InsertionSort:
    • If the input is tiny (fewer than 47 items), use InsertionSort.
    • Otherwise, since the input is large and relatively unstructured, do a three-way partition and for each partition recursively redecide between QuickSort and InsertionSort.
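The decision procedure summarized above can be written out as a small sketch. It follows the bullets in this reading rather than the JDK’s actual control flow, which has additional checks; the class SortDispatchSketch, the Strategy enum, and the run-counting helper are all hypothetical names introduced for illustration. Only the two constants come from the snippets quoted earlier.

```java
// Hypothetical sketch of the decision described in the summary above.
final class SortDispatchSketch {
    static final int INSERTION_SORT_THRESHOLD = 47; // value quoted from the JDK
    static final int MAX_RUN_COUNT = 67;            // value quoted from the JDK

    enum Strategy { ADAPTIVE_MERGE_SORT, QUICK_SORT, INSERTION_SORT }

    /** Counts maximal runs that are either non-decreasing or strictly descending. */
    static int countRuns(int[] a) {
        int runs = 0;
        int i = 0;
        while (i < a.length) {
            runs++;
            int j = i + 1;
            if (j < a.length && a[j] < a[i]) {
                while (j < a.length && a[j] < a[j - 1]) j++;  // extend a descending run
            } else {
                while (j < a.length && a[j] >= a[j - 1]) j++; // extend a non-decreasing run
            }
            i = j; // the next run starts where this one ended
        }
        return runs;
    }

    /** First bullet: structured input goes to adaptive MergeSort. */
    static Strategy chooseForWholeArray(int[] a) {
        if (countRuns(a) < MAX_RUN_COUNT) {
            return Strategy.ADAPTIVE_MERGE_SORT;
        }
        return chooseForSubarray(a.length);
    }

    /** Second bullet: re-decided for every partition during the recursion. */
    static Strategy chooseForSubarray(int length) {
        return length < INSERTION_SORT_THRESHOLD
                ? Strategy.INSERTION_SORT
                : Strategy.QUICK_SORT;
    }

    public static void main(String[] args) {
        int[] sorted = new int[1_000];
        for (int i = 0; i < sorted.length; i++) sorted[i] = i;

        int[] zigzag = new int[200];
        for (int i = 0; i < zigzag.length; i++) zigzag[i] = i % 2;

        System.out.println(chooseForWholeArray(sorted)); // one long run
        System.out.println(chooseForWholeArray(zigzag)); // 100 two-item runs
        System.out.println(chooseForSubarray(46));
    }
}
```

Note that an array with fewer than 47 items always has fewer than 67 runs, so under this reading of the summary the InsertionSort branch matters mainly for the partitions produced during QuickSort’s recursion.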

Theoretical Sorting Bound

The previous section describing real-world optimizations is interesting, but what about arrays containing random values? On such inputs, the runtime of both adaptive MergeSort and QuickSort remains on the order of N log(N). Intuitively, random arrays are unlikely to have long runs, so there’s limited structure for a sorting algorithm to utilize.

In algorithm design, it’s useful to give a lower bound and an upper bound on the time complexity of an “ideal” solution. We used this strategy to help design an algorithm to partition an array around a pivot item by arguing that an N log(N) approach did “unnecessary” work. Similarly, we know that any sorting algorithm needs to look at all of the data at least once, so a theoretical best-possible sorting algorithm must take at least linear time.


Reading Quiz