Sorting and Algorithm Bounds Reading

Complete the Reading Quiz by noon before lecture.

Java decides which sorting algorithm to use based on the type of data to sort. For primitive types, dual-pivot quicksort is used. For reference types, a variant of merge sort is used instead. When implementing a system sort, it’s important for the Java developers to optimize for real-world runtime in addition to asymptotic time complexity.
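As a small illustration of the two cases above, the sketch below calls the primitive overload and the reference-type overload of Arrays.sort. The class name SystemSortDemo and its helper methods are hypothetical, introduced only for this example; which algorithm runs behind each call is an internal detail of the library.

```java
import java.util.Arrays;

final class SystemSortDemo {
    // Sorts a copy with the primitive overload, Arrays.sort(int[]).
    static int[] sortedInts(int[] values) {
        int[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy with the reference-type overload, Arrays.sort(Object[]).
    static String[] sortedStrings(String[] values) {
        String[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // prints [1, 2, 5, 9]
        System.out.println(Arrays.toString(sortedInts(new int[] {5, 2, 9, 1})));
        // prints [apple, fig, pear]
        System.out.println(Arrays.toString(sortedStrings(new String[] {"pear", "apple", "fig"})));
    }
}
```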

One common optimization to real-world sorting algorithm runtime is to make the sort adaptive. When calling Arrays.sort(int[] a), Java switches to insertion sort if the array is very small. In Java 11, arrays with strictly fewer than 47 items are insertion sorted rather than quicksorted.

/**
 * If the length of an array to be sorted is less than this
 * constant, insertion sort is used in preference to Quicksort.
 */
private static final int INSERTION_SORT_THRESHOLD = 47;

While the exact reasoning for why insertion sort is fastest on small arrays is out of scope, informally, insertion sort’s advantage is in directly fixing individual inversions (pairs of items that are out of order). The exact value of 47 was picked through randomized testing. The Java developers determined which sorting algorithm is the most effective for arrays of a given length by comparing the time each sorting algorithm takes to sort arrays of that length.
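The cutoff idea can be sketched as follows. This is a minimal illustration, not the JDK's implementation: for brevity it swaps in a plain single-pivot quicksort where Java uses dual-pivot quicksort, and the class name CutoffSort and all of its methods are hypothetical. Only the threshold value is taken from the constant quoted above.

```java
import java.util.Arrays;
import java.util.Random;

final class CutoffSort {
    // Same value as the Java 11 constant quoted above; the rest is illustrative.
    static final int INSERTION_SORT_THRESHOLD = 47;

    static void sort(int[] a) {
        sort(a, 0, a.length - 1);
    }

    // Sorts a[lo..hi] (inclusive), handing short ranges to insertion sort.
    private static void sort(int[] a, int lo, int hi) {
        if (hi - lo + 1 < INSERTION_SORT_THRESHOLD) {
            insertionSort(a, lo, hi);
            return;
        }
        int p = partition(a, lo, hi);
        sort(a, lo, p - 1);
        sort(a, p + 1, hi);
    }

    // Shifts each item left past every larger item before it.
    private static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int item = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > item) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = item;
        }
    }

    // Single-pivot (Lomuto) partition using the middle item as the pivot.
    // Returns the pivot's final index. Degrades on inputs with many duplicates.
    private static int partition(int[] a, int lo, int hi) {
        swap(a, lo + (hi - lo) / 2, hi);
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                swap(a, i, j);
                i++;
            }
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    public static void main(String[] args) {
        int[] data = new Random(373).ints(200, 0, 1000).toArray();
        sort(data);
        System.out.println(Arrays.toString(data));
    }
}
```

Changing INSERTION_SORT_THRESHOLD and timing the sort on arrays of different lengths is one way to explore how such a cutoff might be tuned.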

Java’s merge sort implementation is also adaptive but goes one step further by taking advantage of features of the data. Recall that merge sort works by repeatedly merging the sorted halves of an array. Real-world data is often not totally random: depending on the data source, there’s often some bias in the sampling or collection methods. This bias can take the form of natural runs in the data: subsequences of items in ascending (non-decreasing) or descending order.

a0 <= a1 <= a2 <= ...

Sorted runs like this are exactly what merge sort is looking to merge. Instead of waiting for evenly split halves of the data, it can make sense to take advantage of these natural runs and gradually merge them together to form longer sorted subsequences. This modification of merge sort has a linear order of growth in the best case when the input array is already sorted: the entire input array is a single, non-decreasing run.
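A minimal sketch of merging natural runs, under stated assumptions: this is a simplified "natural merge sort", not Java's actual merge sort variant, and the class name NaturalMergeSort and its methods are hypothetical. It finds maximal non-decreasing runs, reverses strictly descending runs in place so they become ascending, and then merges neighboring runs pass by pass until one run remains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NaturalMergeSort {
    static void sort(int[] a) {
        if (a.length < 2) return;
        // Run k occupies a[bounds.get(k) .. bounds.get(k + 1) - 1].
        List<Integer> bounds = findRuns(a);
        int[] buffer = new int[a.length];
        while (bounds.size() > 2) {          // more than one run remains
            List<Integer> next = new ArrayList<>();
            next.add(0);
            int k = 0;
            for (; k + 2 < bounds.size(); k += 2) {
                merge(a, buffer, bounds.get(k), bounds.get(k + 1), bounds.get(k + 2));
                next.add(bounds.get(k + 2));
            }
            // An odd run at the end is carried over to the next pass unmerged.
            if (k + 1 < bounds.size()) next.add(bounds.get(k + 1));
            bounds = next;
        }
    }

    // Returns run boundaries. Note: reverses strictly descending runs in place.
    static List<Integer> findRuns(int[] a) {
        List<Integer> bounds = new ArrayList<>();
        bounds.add(0);
        int i = 0;
        while (i < a.length) {
            int start = i++;
            if (i < a.length && a[i] < a[start]) {
                while (i < a.length && a[i] < a[i - 1]) i++;   // strictly descending
                reverse(a, start, i - 1);
            } else {
                while (i < a.length && a[i] >= a[i - 1]) i++;  // non-decreasing
            }
            bounds.add(i);
        }
        return bounds;
    }

    // Merges sorted ranges a[lo..mid-1] and a[mid..hi-1] using the buffer.
    private static void merge(int[] a, int[] buffer, int lo, int mid, int hi) {
        System.arraycopy(a, lo, buffer, lo, hi - lo);
        int i = lo, j = mid;
        for (int k = lo; k < hi; k++) {
            if (i < mid && (j >= hi || buffer[i] <= buffer[j])) a[k] = buffer[i++];
            else a[k] = buffer[j++];
        }
    }

    private static void reverse(int[] a, int lo, int hi) {
        while (lo < hi) {
            int tmp = a[lo];
            a[lo++] = a[hi];
            a[hi--] = tmp;
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 4, 7, 6, 3, 2, 8, 9};
        sort(data);
        System.out.println(Arrays.toString(data));
    }
}
```

On an already sorted array, findRuns reports a single run, so the merging loop never executes and the work is one linear scan, which matches the best case described above.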

This data-oriented optimization is so effective that Java’s quicksort implementation switches to optimized merge sort if the data contains a small number of runs.

/**
 * If the length of an array to be sorted is less than this
 * constant, Quicksort is used in preference to merge sort.
 */
private static final int QUICKSORT_THRESHOLD = 286;

/**
 * The maximum number of runs in merge sort.
 */
private static final int MAX_RUN_COUNT = 67;

In summary, Arrays.sort(int[] a) chooses among three sorting algorithms based on the properties of the input array, checking the following conditions in order.

  • If the array is tiny (fewer than 47 items), use insertion sort.
  • If the array is small (fewer than 286 items), use dual-pivot quicksort.
  • If the array is very structured (fewer than 67 runs), use adaptive merge sort.
  • Otherwise, since the array is large and relatively unstructured, use dual-pivot quicksort.
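The decision list above can be sketched as a small function. This is a simplified model of the summary, not the JDK source: the class SortChooser, the Algorithm enum, and the run-counting helper are hypothetical, and only the three constants come from the snippets quoted earlier.

```java
final class SortChooser {
    static final int INSERTION_SORT_THRESHOLD = 47;
    static final int QUICKSORT_THRESHOLD = 286;
    static final int MAX_RUN_COUNT = 67;

    enum Algorithm { INSERTION_SORT, DUAL_PIVOT_QUICKSORT, ADAPTIVE_MERGE_SORT }

    // Applies the conditions in the same order as the list above.
    static Algorithm choose(int[] a) {
        if (a.length < INSERTION_SORT_THRESHOLD) return Algorithm.INSERTION_SORT;
        if (a.length < QUICKSORT_THRESHOLD) return Algorithm.DUAL_PIVOT_QUICKSORT;
        if (countRuns(a, MAX_RUN_COUNT) < MAX_RUN_COUNT) return Algorithm.ADAPTIVE_MERGE_SORT;
        return Algorithm.DUAL_PIVOT_QUICKSORT;
    }

    // Counts maximal non-decreasing or strictly descending runs,
    // stopping early once the count reaches the given limit.
    static int countRuns(int[] a, int limit) {
        int count = 0;
        int i = 0;
        while (i < a.length && count < limit) {
            int start = i++;
            if (i < a.length && a[i] < a[start]) {
                while (i < a.length && a[i] < a[i - 1]) i++;
            } else {
                while (i < a.length && a[i] >= a[i - 1]) i++;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[] sorted = java.util.stream.IntStream.range(0, 1000).toArray();
        System.out.println(choose(new int[10]));
        System.out.println(choose(new int[100]));
        System.out.println(choose(sorted));
    }
}
```

Stopping the run count early at the limit keeps the check cheap on unstructured input: once too many runs have been seen, there is no need to scan the rest of the array.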

Theoretical Sorting Bound

Given an array containing random values, however, the runtime of both adaptive merge sort and quicksort remains in the order of N log N. Intuitively, random arrays are unlikely to have long runs, so there’s limited structure for a sorting algorithm to exploit.

In algorithm design, it’s useful to give both a lower bound and an upper bound on the time complexity of a solution to a problem. We’ve used this strategy to help in developing algorithms for partitioning an array around a pivot item and for selecting the rank K item. Similarly, we know that any sorting algorithm needs to look at all of the data at least once, so even a theoretical best-possible sorting algorithm must take at least linear time.


Reading Quiz