Multi-Pass Parallel Algorithms¹

Case study: parallel prefix sum
Packing
Quicksort
Merge sort

Case study: parallel prefix sum

In a previous exercise, we developed three algorithms for solving the partial sums problem.

add(int i, double y): Add the value y to the \(i\)-th number.
partialSum(int i): Return the sum of the first i numbers, A[0] through A[i - 1].

A related problem in computer science is known as prefix sum: given an array of \(N\) integers, return a new array where the element at index i is the sum of the input elements at indices 0 through i (inclusive). Consider the following linear-time sequential procedure implementing prefix sum.

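static int[] prefixSum(int[] arr) {
    int[] result = new int[arr.length];
    if (arr.length == 0) {
        return result; // an empty input has an empty prefix sum
    }
    result[0] = arr[0];
    for (int i = 1; i < arr.length; i += 1) {
        // Each entry reuses the running total stored in the previous entry.
        result[i] = result[i - 1] + arr[i];
    }
    return result;
}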

A data dependency makes this algorithm challenging to parallelize: result[i] depends on result[i - 1]. Just as the solution to parallel sum generalizes to parallel fold, the solution to parallel prefix sum generalizes to parallel pack (filter).

The algorithm works in two passes: an up pass that produces a complete binary tree (array representation) and a down pass that computes the prefix sum using the binary tree. Each node in the tree contains the sum of a range \([lo, hi)\) of integers in arr. (This notation includes arr[lo] but excludes arr[hi].)

Root node: \([0, N)\)
Left child: \([lo, mid)\)
Right child: \([mid, hi)\)
Leaf node: \([i, i + 1)\), which is just the value arr[i]. Alternatively, with a sequential cutoff, a leaf covers a small range whose sum is computed sequentially.

We can design a divide-and-conquer parallel algorithm for solving this problem using the fork/join framework. The input is recursively subdivided until the base case is reached. Then, at any non-leaf node, the results of the child nodes are used to compute the current node’s value. The range sums are recursively computed bottom-up, hence the up pass.

Parallel Prefix Sum Up Pass

What is the work and span of the up pass in terms of N?

The execution DAG for the up pass has the same shape as the one for parallel sum, so the up pass has \(\Theta(N)\) work and \(\Theta(\log N)\) span.
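The up pass can be sketched with the fork/join framework as follows. This is a minimal illustration, not a definitive implementation: the names UpPass, Node, BuildTask, and build are hypothetical, the tree uses linked nodes rather than the array representation mentioned above, and there is no sequential cutoff (each leaf holds exactly one element).

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class UpPass {
    /** Tree node storing the sum of arr[lo] through arr[hi - 1]. */
    static final class Node {
        final int lo, hi, sum;
        final Node left, right; // both null at a leaf

        Node(int lo, int hi, int sum, Node left, Node right) {
            this.lo = lo;
            this.hi = hi;
            this.sum = sum;
            this.left = left;
            this.right = right;
        }
    }

    /** Builds the subtree for the range [lo, hi). */
    static final class BuildTask extends RecursiveTask<Node> {
        final int[] arr;
        final int lo, hi;

        BuildTask(int[] arr, int lo, int hi) {
            this.arr = arr;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Node compute() {
            if (hi - lo == 1) {
                // Leaf node [i, i + 1): just the value arr[i].
                return new Node(lo, hi, arr[lo], null, null);
            }
            int mid = lo + (hi - lo) / 2;
            BuildTask leftTask = new BuildTask(arr, lo, mid);
            BuildTask rightTask = new BuildTask(arr, mid, hi);
            leftTask.fork();                  // left child runs in parallel
            Node right = rightTask.compute(); // right child runs in this thread
            Node left = leftTask.join();
            // A non-leaf node combines the results of its two children.
            return new Node(lo, hi, left.sum + right.sum, left, right);
        }
    }

    /** Assumes arr is non-empty. */
    static Node build(int[] arr) {
        return ForkJoinPool.commonPool().invoke(new BuildTask(arr, 0, arr.length));
    }
}
```

For the input [6, 4, 16, 10, 16, 14, 2, 8], the root stores 76, its left child stores 36 for the range \([0, 4)\), and its right child stores 40 for the range \([4, 8)\).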

After the up pass has constructed the binary tree, the down pass computes the prefix sum result. From the up pass, each node stores a range sum \([lo, hi)\). The down pass provides the partial prefix sum \([0, lo)\). Adding these two sums yields the full prefix sum \([0, hi)\).

The root starts with the range sum \([0, 0)\) (an empty range), which is 0.
The left child of a node gets the range sum \([0, lo_p)\) from its parent \(p\).
The right child of a node gets the range sum \([0, lo_p)\) from its parent \(p\) plus the range sum stored in its left sibling.

What is the work and span of the down pass in terms of N?

The down pass traverses the same tree top-down, so its execution DAG is also similar to the one for parallel sum: \(\Theta(N)\) work and \(\Theta(\log N)\) span.
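The down pass rules above can be sketched as a fork/join task that carries the sum of the range \([0, lo)\) down the tree. This is a sketch under stated assumptions: the names DownPass, Node, DownTask, and fromLeft are hypothetical, and to keep the example short and self-contained the tree is built by a sequential stand-in for the up pass.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class DownPass {
    /** Tree node storing the sum of arr[lo] through arr[hi - 1]. */
    static final class Node {
        final int lo, hi, sum;
        final Node left, right; // both null at a leaf

        Node(int lo, int hi, int sum, Node left, Node right) {
            this.lo = lo;
            this.hi = hi;
            this.sum = sum;
            this.left = left;
            this.right = right;
        }
    }

    /** Sequential stand-in for the up pass (assumption: kept sequential for brevity). */
    static Node build(int[] arr, int lo, int hi) {
        if (hi - lo == 1) {
            return new Node(lo, hi, arr[lo], null, null);
        }
        int mid = lo + (hi - lo) / 2;
        Node left = build(arr, lo, mid);
        Node right = build(arr, mid, hi);
        return new Node(lo, hi, left.sum + right.sum, left, right);
    }

    /** Down pass: fromLeft is the sum of the range [0, node.lo). */
    static final class DownTask extends RecursiveAction {
        final Node node;
        final int fromLeft;
        final int[] result;

        DownTask(Node node, int fromLeft, int[] result) {
            this.node = node;
            this.fromLeft = fromLeft;
            this.result = result;
        }

        @Override
        protected void compute() {
            if (node.left == null) {
                // Leaf [lo, lo + 1): sum of [0, lo) plus arr[lo] is the prefix sum at lo.
                result[node.lo] = fromLeft + node.sum;
                return;
            }
            // The left child inherits fromLeft; the right child also adds its left sibling's sum.
            DownTask leftTask = new DownTask(node.left, fromLeft, result);
            DownTask rightTask = new DownTask(node.right, fromLeft + node.left.sum, result);
            leftTask.fork();
            rightTask.compute();
            leftTask.join();
        }
    }

    static int[] prefixSum(int[] arr) {
        int[] result = new int[arr.length];
        if (arr.length == 0) {
            return result;
        }
        Node root = build(arr, 0, arr.length);
        // The root starts with the sum of the empty range [0, 0), which is 0.
        ForkJoinPool.commonPool().invoke(new DownTask(root, 0, result));
        return result;
    }
}
```

Each leaf writes to a distinct index of result, so the parallel tasks never write to the same location.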

Packing

Two-pass algorithms like parallel prefix sum inspire solutions to other, more general problems. Given an input array, pack returns an array containing only the elements of the input that satisfy some condition, in the same order they appear in the input. For example, we might want to pack all of the integers in an array with value greater than 10.

static int[] greaterThanTen(int[] arr) {
    int count = 0;
    for (int i = 0; i < arr.length; i += 1) {
        if (arr[i] > 10) {
            count += 1;
        }
    }
    int[] result = new int[count];
    int index = 0;
    for (int i = 0; i < arr.length; i += 1) {
        if (arr[i] > 10) {
            result[index] = arr[i];
            index += 1;
        }
    }
    return result;
}

Note the resulting index for each item that meets the condition depends on the count of items to its left that also meet the condition. This is exactly the problem that parallel prefix sum can solve. We can implement parallel pack in three steps.

Given an input array [17, 4, 6, 8, 11, 5, 13, 19, 0, 24], the expected result is [17, 11, 13, 19, 24].

Perform a parallel map to produce a bit vector where a 1 indicates the corresponding input element is greater than 10.
[1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
Perform a parallel prefix sum on the bit vector to compute the resulting element indices.
[1, 1, 1, 1, 2, 2, 3, 4, 4, 5]
Perform a parallel map to produce the packed result. For each element in the input, store it in the result at the location specified by the prefix sum (minus 1) only if the element meets the condition in the bit vector (marked 1).
[17, 11, 13, 19, 24]
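The three steps can be sketched with the parallel array utilities in the Java standard library. This is only an illustration: it relies on Arrays.parallelSetAll for the first map, Arrays.parallelPrefix for the prefix sum, and a parallel IntStream for the final map, rather than the hand-written two-pass algorithm described above. The class name ParallelPack is hypothetical.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class ParallelPack {
    static int[] greaterThanTen(int[] arr) {
        int n = arr.length;
        if (n == 0) {
            return new int[0];
        }
        // Step 1: parallel map to a bit vector (1 means the element is greater than 10).
        int[] bits = new int[n];
        Arrays.parallelSetAll(bits, i -> arr[i] > 10 ? 1 : 0);

        // Step 2: parallel prefix sum over a copy of the bit vector.
        int[] prefix = bits.clone();
        Arrays.parallelPrefix(prefix, Integer::sum);

        // Step 3: parallel map that writes each kept element to index prefix[i] - 1.
        // The last prefix entry is the total number of kept elements.
        int[] result = new int[prefix[n - 1]];
        IntStream.range(0, n).parallel().forEach(i -> {
            if (bits[i] == 1) {
                result[prefix[i] - 1] = arr[i];
            }
        });
        return result;
    }
}
```

Every kept element has a distinct prefix sum value, so the writes in the last step never collide.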

Quicksort

We can describe the work involved in sequential quicksort as \(T_1 \in \Theta(N \log N)\) when we assume that the pivot item partitions each subproblem evenly. This is derived from the recurrence relation \(R(N) = 2R(N / 2) + c_1 N + c_0\).

A simple way to speed up the algorithm is to solve the left and right subproblems in parallel. This leads to an algorithm with a span in \(\Theta(N)\) given by the recurrence \(R(N) = R(N / 2) + c_1 N + c_0\). Though this is an improvement over the sequential algorithm, the span is still dominated by the linear time spent partitioning the array sequentially.
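As a rough sketch of this first version, the two recursive calls become fork/join tasks while the partition stays sequential. The class and method names are hypothetical, and the choice of the last element as the pivot (a Lomuto-style partition) is an assumption made for brevity, not something the analysis above depends on.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class ParallelQuicksort {
    /** Sorts the range arr[lo] through arr[hi - 1]. */
    static final class SortTask extends RecursiveAction {
        final int[] arr;
        final int lo, hi;

        SortTask(int[] arr, int lo, int hi) {
            this.arr = arr;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected void compute() {
            if (hi - lo <= 1) {
                return;
            }
            int p = partition(arr, lo, hi); // sequential, linear time
            SortTask left = new SortTask(arr, lo, p);
            SortTask right = new SortTask(arr, p + 1, hi);
            left.fork();     // sort the left subproblem in parallel
            right.compute(); // sort the right subproblem in this thread
            left.join();
        }
    }

    /** Sequential partition around the pivot arr[hi - 1]; returns the pivot's final index. */
    static int partition(int[] arr, int lo, int hi) {
        int pivot = arr[hi - 1];
        int i = lo;
        for (int j = lo; j < hi - 1; j += 1) {
            if (arr[j] < pivot) {
                swap(arr, i, j);
                i += 1;
            }
        }
        swap(arr, i, hi - 1);
        return i;
    }

    static void swap(int[] arr, int i, int j) {
        int tmp = arr[i];
        arr[i] = arr[j];
        arr[j] = tmp;
    }

    static void sort(int[] arr) {
        ForkJoinPool.commonPool().invoke(new SortTask(arr, 0, arr.length));
    }
}
```

A practical version would also switch to a sequential sort below some cutoff size to limit task-creation overhead.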

Describe how to use parallel pack to partition an array.

Use the naive (not in-place) quicksort partition: create an extra array to store the partitioned result. Run one parallel pack to compute the partition of items less than the pivot. Then, place the pivot. Finally, run a second parallel pack to compute the partition of items greater than or equal to the pivot.
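One way to sketch this partition in Java is shown below. The names PackPartition and packInto are hypothetical, the pivot is assumed to be the first element, and each pack is built from the standard library's Arrays.parallelSetAll and Arrays.parallelPrefix rather than a hand-written two-pass prefix sum.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

class PackPartition {
    /**
     * Packs the elements of arr[from] through arr[arr.length - 1] that satisfy keep
     * into out, starting at index outStart. Returns the number of elements packed.
     */
    static int packInto(int[] arr, int from, IntPredicate keep, int[] out, int outStart) {
        int n = arr.length - from;
        if (n <= 0) {
            return 0;
        }
        // Bit vector followed by an in-place parallel prefix sum.
        int[] prefix = new int[n];
        Arrays.parallelSetAll(prefix, i -> keep.test(arr[from + i]) ? 1 : 0);
        Arrays.parallelPrefix(prefix, Integer::sum);
        // Each kept element lands at its prefix sum (minus 1), offset by outStart.
        IntStream.range(0, n).parallel().forEach(i -> {
            if (keep.test(arr[from + i])) {
                out[outStart + prefix[i] - 1] = arr[from + i];
            }
        });
        return prefix[n - 1];
    }

    /** Returns a new array partitioned around the pivot arr[0] (an assumed pivot choice). */
    static int[] partition(int[] arr) {
        int[] out = new int[arr.length];
        if (arr.length == 0) {
            return out;
        }
        int pivot = arr[0];
        // First pack: items less than the pivot.
        int smaller = packInto(arr, 1, x -> x < pivot, out, 0);
        // Place the pivot.
        out[smaller] = pivot;
        // Second pack: items greater than or equal to the pivot.
        packInto(arr, 1, x -> x >= pivot, out, smaller + 1);
        return out;
    }
}
```

For the input [11, 17, 4, 6, 24, 8, 5, 13], this sketch produces [4, 6, 8, 5, 11, 17, 24, 13]. Both packs preserve the relative order of the items they keep.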

Each parallel pack has \(\Theta(\log N)\) span, so the span for parallel quicksort with parallel-pack partitioning is given by the recurrence \(R(N) = R(N / 2) + c_1 \log N + c_0\).

What is the closed form for the span of this optimized parallel quicksort?

The span is in \(\Theta(\log^2 N)\).
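To see why, we can unroll the recurrence. As a simplifying assumption, take \(N\) to be a power of two and \(\log\) to be base 2, so there are \(\log N\) levels before reaching \(R(1)\).

\[
\begin{aligned}
R(N) &= R(1) + \sum_{k=0}^{\log N - 1} \left( c_1 \log \frac{N}{2^k} + c_0 \right) \\
     &= R(1) + c_1 \sum_{k=0}^{\log N - 1} (\log N - k) + c_0 \log N \\
     &= R(1) + c_1 \frac{\log N (\log N + 1)}{2} + c_0 \log N \\
     &\in \Theta(\log^2 N)
\end{aligned}
\]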

What is the parallelism of this algorithm?

The parallelism \(T_1 / T_\infty\) of the algorithm is \(N \log N / \log^2 N = N / \log N\). The span is exponentially smaller than the work, so the parallelism grows almost linearly in \(N\).

Merge sort

The same parallelism challenge is evident in merge sort as well: the span for parallel merge sort is dominated by the time it takes to merge two sorted arrays. We can improve upon the linear time it takes to merge two sorted arrays by designing a faster parallel algorithm.

Determine the median element of the larger sorted array. (Select the item at the middle index.)
Use binary search to find the split point in the smaller array such that all elements to the left are less than the larger array’s median.
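Recursively merge the lower parts of both arrays: the items before the median in the larger array and the items before the split point in the smaller array.
In parallel, recursively merge the upper parts of both arrays: the remaining items of the larger array (starting at the median) and the remaining items of the smaller array (starting at the split point).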
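The steps above can be sketched as a fork/join task. This is a minimal illustration with hypothetical names (ParallelMerge, MergeTask, lowerBound). The cutoff of 4 is an arbitrary small value chosen so that the recursion is exercised on small inputs; it must be at least 2 for the recursion to make progress, and a practical cutoff would be much larger.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class ParallelMerge {
    static final int CUTOFF = 4; // assumed value; must be >= 2

    /** Merges sorted a[aLo, aHi) and sorted b[bLo, bHi) into out, starting at outLo. */
    static final class MergeTask extends RecursiveAction {
        final int[] a, b, out;
        final int aLo, aHi, bLo, bHi, outLo;

        MergeTask(int[] a, int aLo, int aHi, int[] b, int bLo, int bHi, int[] out, int outLo) {
            this.a = a; this.aLo = aLo; this.aHi = aHi;
            this.b = b; this.bLo = bLo; this.bHi = bHi;
            this.out = out; this.outLo = outLo;
        }

        @Override
        protected void compute() {
            int aLen = aHi - aLo;
            int bLen = bHi - bLo;
            if (aLen < bLen) {
                // Swap roles so that a is always the larger range.
                new MergeTask(b, bLo, bHi, a, aLo, aHi, out, outLo).compute();
                return;
            }
            if (aLen + bLen <= CUTOFF) {
                sequentialMerge();
                return;
            }
            // Median of the larger range, then binary search for the split point in the smaller.
            int aMid = aLo + aLen / 2;
            int bSplit = lowerBound(b, bLo, bHi, a[aMid]);
            // The upper parts start right after everything in the lower parts.
            int upperOut = outLo + (aMid - aLo) + (bSplit - bLo);
            MergeTask lower = new MergeTask(a, aLo, aMid, b, bLo, bSplit, out, outLo);
            MergeTask upper = new MergeTask(a, aMid, aHi, b, bSplit, bHi, out, upperOut);
            lower.fork();
            upper.compute();
            lower.join();
        }

        private void sequentialMerge() {
            int i = aLo, j = bLo, k = outLo;
            while (i < aHi && j < bHi) {
                out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
            }
            while (i < aHi) { out[k++] = a[i++]; }
            while (j < bHi) { out[k++] = b[j++]; }
        }
    }

    /** Returns the first index in [lo, hi) whose element is >= key, or hi if there is none. */
    static int lowerBound(int[] arr, int lo, int hi, int key) {
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (arr[mid] < key) { lo = mid + 1; } else { hi = mid; }
        }
        return lo;
    }

    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        ForkJoinPool.commonPool().invoke(new MergeTask(a, 0, a.length, b, 0, b.length, out, 0));
        return out;
    }
}
```

Everything in the lower parts is less than or equal to the median and everything in the upper parts is greater than or equal to it, so the two recursive merges write to disjoint, correctly ordered regions of out.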

What is the worst-case span of parallel merge?

The larger array is always split roughly in half, but the split point in the smaller array could be anywhere in that array. In the worst case, the split point could be either the smallest or largest item in the smaller array.

But since we know the larger array has at least \(N / 2\) items (by definition of being the larger array), the two recursive merge subproblems can only split the original \(N\) items as unevenly as \(N / 4\) and \(3N / 4\). The recurrence for the worst-case span is given by \(R(N) = R(3N / 4) + c_1 \log N + c_0\), which is in \(\Theta(\log^2 N)\).

What is the parallelism of this algorithm?

Plugging the parallel merge span of \(\Theta(\log^2 N)\) into the parallel merge sort recurrence yields \(R(N) = R(N / 2) + \Theta(\log^2 N) \in \Theta(\log^3 N)\).
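Unrolling this recurrence shows where the extra logarithmic factor comes from. As simplifying assumptions, take \(N\) to be a power of two, \(\log\) to be base 2, and the merge span at size \(N\) to be \(c \log^2 N\) for some constant \(c\).

\[
\begin{aligned}
R(N) &= R(1) + \sum_{k=0}^{\log N - 1} c \log^2 \frac{N}{2^k} \\
     &= R(1) + c \sum_{k=0}^{\log N - 1} (\log N - k)^2 \\
     &= R(1) + c \sum_{j=1}^{\log N} j^2 \\
     &= R(1) + c \, \frac{\log N (\log N + 1)(2 \log N + 1)}{6} \\
     &\in \Theta(\log^3 N)
\end{aligned}
\]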

The parallelism \(T_1 / T_\infty\) of the algorithm is \(N \log N / \log^3 N = N / \log^2 N\).

Dan Grossman. 2016. A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency. https://homes.cs.washington.edu/~djg/teachingMaterials/spac/sophomoricParallelismAndConcurrency.pdf
