
Parallel Algorithms¹

Table of contents

  1. Case study: parallel sum
    1. Divide-and-conquer
    2. Fork/join framework
  2. Folding and mapping

In sequential programming, one thing happens at a time. Removing the one-thing-at-a-time assumption complicates writing software. The multiple threads of execution (things performing computations) will somehow need to coordinate so that they can work together to complete a task, or at least not get in each other’s way while they are doing separate things. These notes cover basic concepts of multithreaded programming, i.e., writing programs with multiple threads of execution. We will cover:

  • How to create multiple threads, each with their own local variables but sharing access to objects.
  • How to write and analyze divide-and-conquer algorithms that use threads to produce results more quickly.
  • How to coordinate access to shared objects so that multiple threads using the same data do not produce the wrong answer.

A useful analogy is with cooking. A sequential program is like having one cook who does each step of a recipe in order, finishing one step before starting the next. Often there are multiple steps that could be done at the same time—if you had more cooks. Parallel programming is about using additional computational resources to produce an answer faster.

Each step of the recipe might have divisible subproblems, such as cutting potatoes. Each divisible subproblem can be assigned to its own thread, which is then executed by an available processor. A single cook (processor) can cut all of the potatoes (threads) on their own, but multiple cooks (processors) can cut the potatoes in parallel.

But having more cooks requires extra coordination. One cook may have to wait for another cook to finish something since it might not be possible to move onto the next step of the recipe until all of the potatoes have been cut. Computational resources might also be limited: if you have only one oven, two cooks won’t be able to bake casseroles at different temperatures at the same time. In short, multiple cooks present efficiency opportunities, but also significantly complicate the process of producing a meal. Concurrent programming is about correctly and efficiently controlling access by multiple threads to shared resources.

In practice, the distinction between parallelism and concurrency is not absolute. Many programs have aspects of each. Suppose you had a huge array of values you wanted to insert into a hash table. From the perspective of dividing up the insertions among multiple threads, this is about parallelism. From the perspective of coordinating access to the hash table, this is about concurrency. Also, parallelism does typically need some coordination: even when adding up integers in an array we need to know when the different threads are done with their chunk of the work.

Case study: parallel sum

Consider this O(N) sequential procedure for summing all the values in an array.

static int sum(int[] arr) {
    int result = 0;
    for (int i = 0; i < arr.length; i += 1) {
        result += arr[i];
    }
    return result;
}

If the array is large and we have extra processors available, we can get a more efficient parallel algorithm. Suppose we have 4 processors. We could design a parallel algorithm using 4 threads so that each processor sums a different quarter of the array and stores its result separately. Once all 4 threads have finished, we sum the 4 stored results. More generally, if we have P processors, we can divide the array into P equal segments and get an algorithm that runs in time O(N / P + P), where N / P is for the parallel part and P is for combining the stored results. However, this approach is problematic for three reasons (a sketch of the hard-coded 4-thread version follows the list).

  1. Forward portability. Different computers have different numbers of processors, so hard-coding a solution for exactly P processors will not scale.
  2. Processor availability. The processors available to part of the code can change since the operating system can reassign processors at any time, even when we are in the middle of summing array elements.
  3. Load imbalance. Though unlikely for sum, in general, subproblems may take very different amounts of time to compute.
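
For concreteness, here is a minimal sketch of the hard-coded 4-thread approach using java.lang.Thread. The class names SumThread and FourThreadSum are illustrative, not part of any library, and this is exactly the style of solution the list above argues against.

class SumThread extends java.lang.Thread {
    int lo;
    int hi;
    int[] arr;
    int result;

    SumThread(int[] a, int l, int h) { arr = a; lo = l; hi = h; }

    public void run() {
        // Sum this thread's quarter of the array and store the answer in a field
        for (int i = lo; i < hi; i += 1) {
            result += arr[i];
        }
    }
}

class FourThreadSum {
    static int sum(int[] arr) throws InterruptedException {
        SumThread[] threads = new SumThread[4];
        for (int i = 0; i < 4; i += 1) {
            int lo = i * (arr.length / 4);
            int hi = (i == 3) ? arr.length : (i + 1) * (arr.length / 4);
            threads[i] = new SumThread(arr, lo, hi);
            threads[i].start();            // run this quarter on its own thread
        }
        int result = 0;
        for (int i = 0; i < 4; i += 1) {
            threads[i].join();             // wait for the thread to finish...
            result += threads[i].result;   // ...then it is safe to read its stored result
        }
        return result;
    }
}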

To address these challenges, we will use substantially more threads than there are processors. Rather than dividing the array into 4 equal pieces, we create a thread for every 1000 elements. Assuming a large enough array, the threads will not all run at once, since a processor can run at most one thread at a time. The operating system keeps track of which threads are waiting and keeps all the processors busy until every thread has executed.

Divide-and-conquer

Consider the following divide-and-conquer parallel algorithm for summing all the array elements in some range from lo to hi.

  1. If the range contains at most 1000 elements, return the sum of the range directly.
  2. Otherwise, in parallel:
    1. Recursively sum the elements from lo to the middle of the range.
    2. Recursively sum the elements from the middle of the range to hi.
  3. Add the two results from the previous step.

[Figure: Parallel Sum divide-and-conquer tree]

If we have N processors, then the order of growth of the runtime for this algorithm is the height of the tree: logarithmic!
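
More precisely, with unlimited processors the two recursive sums run at the same time, so the running time T(N) satisfies roughly T(N) = T(N / 2) + c for some constant c (splitting the range and adding the two results), which solves to O(log N).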

Fork/join framework

Java provides a built-in library, the fork/join framework, for efficiently executing this kind of divide-and-conquer parallelism. Each summing subproblem is represented as a SumTask instance.

import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Integer> {
    static final int SEQUENTIAL_THRESHOLD = 1000;

    int lo;
    int hi;
    int[] a;

    SumTask(int[] arr, int l, int h) { lo = l; hi = h; a = arr; }

    public Integer compute() {
        if (hi - lo <= SEQUENTIAL_THRESHOLD) {
            // Solve the current problem directly
            int result = 0;
            for (int i = lo; i < hi; i += 1) {
                result += a[i];
            }
            return result;
        } else {
            // Split the current problem into two subproblems
            int mid = lo + (hi - lo) / 2;  // written this way to avoid int overflow on huge arrays
            SumTask left  = new SumTask(a, lo, mid);
            SumTask right = new SumTask(a, mid, hi);

            // Magic method call: fork a thread to compute the left
            left.fork();
            // Normal recursive call: compute the right on this thread
            int rightResult = right.compute();
            // Magic method call: wait to store the result from the left
            int leftResult = left.join();

            return leftResult + rightResult;
        }
    }
}

What happens if we call left.join() before right.compute()?

The program still returns the correct result, but now everything runs sequentially!

Note that because Java expressions are always evaluated left-to-right, we could replace the last 3 lines of the else branch in compute with just return right.compute() + left.join();. But this is rather subtle: left.join() + right.compute() would destroy the parallelism.
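
In context, the else branch with that compact return would read as follows (same fields and constructor as the SumTask above):

            // Split the current problem into two subproblems
            int mid = lo + (hi - lo) / 2;
            SumTask left  = new SumTask(a, lo, mid);
            SumTask right = new SumTask(a, mid, hi);

            left.fork();                           // left runs on another thread
            return right.compute() + left.join();  // right on this thread, then wait for left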

To start the parallel computation, we invoke the top-level SumTask from a ForkJoinPool instance. The entire program should share exactly one ForkJoinPool, representing all of the processing resources available to the fork/join framework; the common pool returned by ForkJoinPool.commonPool() serves this purpose.

import java.util.concurrent.ForkJoinPool;

class Main {
    static int sum(int[] arr) {
        SumTask task = new SumTask(arr, 0, arr.length);
        int result = ForkJoinPool.commonPool().invoke(task);
        return result;
    }
}

Folding and mapping

The SumTask class is just one instance of a more general higher-order function known as a fold (also known as “reduce”, different from reductions in algorithm theory). A fold takes a collection of data and returns a single result. Just as we can find the sum of all values in an array, we can find the max, min, or count of a particular item by switching out the addition operator. However, only associative operations work. Addition is associative because a + (b + c) = (a + b) + c, so it does not matter how the recursive splits group the elements.
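
For example, a parallel max can reuse the SumTask structure almost unchanged; only the base case and the combining step differ. A minimal sketch, assuming a non-empty array (MaxTask is an illustrative name, not part of the framework):

import java.util.concurrent.RecursiveTask;

class MaxTask extends RecursiveTask<Integer> {
    static final int SEQUENTIAL_THRESHOLD = 1000;

    int lo;
    int hi;
    int[] a;

    MaxTask(int[] arr, int l, int h) { lo = l; hi = h; a = arr; }

    public Integer compute() {
        if (hi - lo <= SEQUENTIAL_THRESHOLD) {
            // Base case: scan the range directly, tracking the largest value
            int result = a[lo];
            for (int i = lo + 1; i < hi; i += 1) {
                result = Math.max(result, a[i]);
            }
            return result;
        } else {
            int mid = lo + (hi - lo) / 2;
            MaxTask left  = new MaxTask(a, lo, mid);
            MaxTask right = new MaxTask(a, mid, hi);
            left.fork();
            int rightResult = right.compute();
            int leftResult = left.join();
            // Combine with max instead of +; max is associative, so the fold is still correct
            return Math.max(leftResult, rightResult);
        }
    }
}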

In addition to parallel folds, parallel map is another general higher-order function. A map performs an operation on each item independently: given an input array, it produces an output array of the same length. A simple example would be multiplying every element of an array by 2.
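
A minimal sketch of that doubling map in the fork/join framework, writing into a separate output array (DoubleTask is an illustrative name; RecursiveAction is used because a map produces no single return value):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class DoubleTask extends RecursiveAction {
    static final int SEQUENTIAL_THRESHOLD = 1000;

    int lo;
    int hi;
    int[] in;
    int[] out;

    DoubleTask(int[] input, int[] output, int l, int h) { in = input; out = output; lo = l; hi = h; }

    public void compute() {
        if (hi - lo <= SEQUENTIAL_THRESHOLD) {
            // Each output element depends only on the corresponding input element
            for (int i = lo; i < hi; i += 1) {
                out[i] = 2 * in[i];
            }
        } else {
            int mid = lo + (hi - lo) / 2;
            DoubleTask left  = new DoubleTask(in, out, lo, mid);
            DoubleTask right = new DoubleTask(in, out, mid, hi);
            left.fork();
            right.compute();
            left.join();
        }
    }
}

As with SumTask, the computation is started by handing the top-level task to the common pool, e.g. ForkJoinPool.commonPool().invoke(new DoubleTask(input, new int[input.length], 0, input.length)).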

  1. Dan Grossman. 2016. A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency. https://homes.cs.washington.edu/~djg/teachingMaterials/spac/sophomoricParallelismAndConcurrency.pdf