Last Updated: July 15, 2010 by Dan Grossman
In CSE332, we are going to learn some basic parallel programming. While these basics apply generally, our project will use a Java library called the "ForkJoin Framework" or sometimes "JSR-166," which is the "code name" for this library's addition into Java.
Similar libraries/languages exist for other environments including Cilk and Intel's Threading Building Blocks (TBB) for C/C++ and the Task Parallel Library for C#. We will focus on using this library, not on how it is implemented. The implementation uses several elegant data structures and algorithms, most notably double-ended work-stealing queues, to provide optimal expected run-time under reasonable assumptions and any number of available processors. However, the implementation and the asymptotic analysis are not simple, so we will say little more about them.
The ForkJoin Framework will be part of Java 7's standard libraries, but Java 7 is not yet released. Under Java 6, an implementation that will meet our needs is available, but you need to follow a few steps to download and use it. We will assume you are using Eclipse 3.5.1 -- it should not be difficult to understand how to modify the instructions for other environments. We will then describe the 3 or 4 classes you will need in CSE332 and show a simple program. Finally, we will mention a few complications in case you stumble into them.
The main web site for JSR-166 is http://gee.cs.oswego.edu/dl/concurrency-interest/index.html. It has much more information than you need for CSE332, which is why we have distilled the basics into these notes. For the javadoc, see http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166ydocs/.
We have included a copy of jsr166.jar
in the files for project 3.
Newer versions are released occasionally and posted at
http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166.jar,
so don't use another copy that is more than a few months old.
To create a project that uses the library, you can follow the steps
below in order. There are alternatives for some of these steps (e.g., you
could put the .jar
file in a different directory), but these should work.
1. Create a new Java project as usual. You will put jsr166.jar and other relevant Java code in it.

2. Copy jsr166.jar into the project so it appears in the package explorer. Then right-click on jsr166.jar and choose "Add to Build Path."

3. Write a class with a main method you can run.

4. In the Run Configuration for your program, under VM arguments, put -Xbootclasspath/p:jsr166.jar exactly like that.
If you instead run javac and java from a command-line, you need jsr166.jar to be in your build path when you compile and you need -Xbootclasspath/p:jsr166.jar as an option when you run java.
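For example, assuming your main class is in a file called MyProgram.java (a hypothetical name) in the same directory as jsr166.jar, the two commands would look something like this:

javac -cp jsr166.jar MyProgram.java
java -Xbootclasspath/p:jsr166.jar MyProgram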
There are only 2-4 classes you even need to know about:

- ForkJoinPool: you create exactly one of these to run all your fork-join tasks in the whole program.

- RecursiveTask<V>: you run a subclass of this in a pool and have it return a result. See the examples below.

- RecursiveAction: just like RecursiveTask<V> except it does not return a result.

- ForkJoinTask<V>: superclass of RecursiveTask<V> and RecursiveAction. fork and join are methods defined in this class. You won't use this class directly, but it is the class with most of the useful javadoc documentation.
For documentation, see http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166ydocs/, but these notes are an attempt to include everything you need to know.
All the classes are in the package java.util.concurrent
, so the
simplest thing to do is have import statements like this:
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
To use the library, the first thing you do is create a ForkJoinPool
.
You should only do this once -- there is no good reason to have
more than one pool in your program. It
is the job of the pool to take all the tasks that can be done in
parallel and actually use the available processors effectively. A
static field holding the pool works great:
public static ForkJoinPool fjPool = new ForkJoinPool();
(The default constructor is for when you want the pool to use all the processors made available to it. That is a good choice.)
If you can compile and run a "Hello, World!" program that includes the field declaration above, then you followed the installation instructions above correctly. Of course, you are not actually using the pool yet.
To use the pool you create a subclass of RecursiveTask<V>
for some
type V
(or you create a subclass of RecursiveAction
).
In your subclass, you override the compute()
method. Then you call the invoke
method on the
ForkJoinPool, passing an object of type RecursiveTask<V>
.
Here is a dumb example:
// define your class
class Incrementor extends RecursiveTask<Integer> {
    int theNumber;
    Incrementor(int x) {
        theNumber = x;
    }
    protected Integer compute() { // must be protected or public to override
        return theNumber + 1;
    }
}

// then in some method in your program use the global pool we made above:
int fortyThree = fjPool.invoke(new Incrementor(42));
The reason this example is dumb is that there is no parallelism. We just
hand an object over to the pool, the pool uses some processor to run
the compute
method, and then we get the answer back. We could just as
well have done:
int fortyThree = (new Incrementor(42)).compute();
Nonetheless, this dumb example shows one nice thing: the idiom for
passing data to the compute()
method is to pass it to the constructor
and then store it into a field. Because you are overriding the
compute
method, it must take zero arguments and return Integer
(or
whatever type argument you use for RecursiveTask
).
The key for non-dumb examples, which is hinted at nicely by the name
RecursiveTask
, is that your compute
method can create other
RecursiveTask
objects and have the pool run them in parallel. First
you create another object. Then you call its fork
method. That
actually starts parallel computation -- fork
itself returns quickly,
but more computation is now going on. When you need the answer, you
call the join
method on the object you called fork
on. The join method will get you the answer that the forked task's
compute() produced. If it is not ready yet, then join
will block (i.e., not
return) until it is ready. So the point is to call fork
"early" and
call join
"late", doing other useful work in-between.
Those are the "rules" of how fork
, join
, and
compute
work, but in
practice a lot of the parallel algorithms you write in this framework
have a very similar form, best seen with an example. What this
example does is just sum all the elements of an array, but uses
parallelism to potentially do different 5000-element segments in
parallel. (The types long
/ Long
are just like int
/
Integer
except they
are 64 bits instead of 32. They can be a good choice if your data can
be large -- a sum could easily exceed 2^32, but exceeding 2^64 is less
likely.)
static final ForkJoinPool fjPool = new ForkJoinPool();
static final int SEQUENTIAL_THRESHOLD = 5000;

class SumArray extends RecursiveTask<Long> {
    int low;
    int high;
    int[] array;

    SumArray(int[] arr, int lo, int hi) {
        array = arr;
        low = lo;
        high = hi;
    }

    protected Long compute() {
        if (high - low <= SEQUENTIAL_THRESHOLD) {
            long sum = 0;
            for (int i = low; i < high; ++i)
                sum += array[i];
            return sum;
        } else {
            int mid = low + (high - low) / 2;
            SumArray left = new SumArray(array, low, mid);
            SumArray right = new SumArray(array, mid, high);
            left.fork();
            long rightAns = right.compute();
            long leftAns = left.join();
            return leftAns + rightAns;
        }
    }
}

long sumArray(int[] array) {
    return fjPool.invoke(new SumArray(array, 0, array.length));
}
How does this code work? A SumArray
object is given an array and a
range of that array. The compute
method sums the elements in that
range. If the range has fewer than SEQUENTIAL_THRESHOLD
elements, it
uses a simple for-loop like you learned in CSE142. Otherwise, it
creates two SumArray
objects for problems of half the size. It uses
fork
to compute the left half in parallel with computing the right
half, which this object does itself by calling right.compute()
. To
get the answer for the left, it calls left.join()
.
Why do we have a SEQUENTIAL_THRESHOLD
? It would be correct instead to
keep recursing until high==low+1
and then return array[low]
. But this
creates a lot more SumArray
objects and calls to fork
, so it will end
up being much less efficient despite the same asymptotic complexity.
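In code, that threshold-free base case would be just the following sketch (the cast is needed so the int element boxes to the Long that compute() returns):

if (high == low + 1)
    return (long) array[low];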
Why do we create more SumArray
objects than we are likely to have
processors? Because it's the framework's job to make a reasonable
number of parallel tasks execute efficiently and to schedule them in a
good way. By having lots of fairly small parallel tasks it can do a
better job, especially if the number of processors available to your
program changes during execution (e.g., because the operating system
is also running other programs) or the tasks end up taking different
amounts of time.
So setting SEQUENTIAL_THRESHOLD
to a good-in-practice value is a
trade-off. The documentation for the ForkJoin framework suggests
creating parallel subtasks until the number of basic computation steps
is somewhere over 100 and less than 10,000. The exact number is not
crucial provided you avoid extremes.
There are a few "gotchas" when using the library that you might need to be aware of:
1. It is tempting to call fork twice for the two subproblems and then call join twice. You can understand that this would be less efficient than just calling compute for one of them, for no benefit, since you are creating more parallel tasks than is helpful. It turns out to be a lot less efficient in your instructor's experience, for reasons that are not entirely clear; see the sketch below.
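Here is a sketch of that less efficient version, using the fields from the SumArray example above:

// less efficient: fork both subproblems instead of computing one directly
left.fork();
right.fork();
long leftAns = left.join();
long rightAns = right.join();
return leftAns + rightAns;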
2. Remember that join blocks until the answer is ready. So if you look at the code:

left.fork();
long rightAns = right.compute();
long leftAns = left.join();
return leftAns + rightAns;

you'll see that the order is crucial. If we had written:

left.fork();
long leftAns = left.join();
long rightAns = right.compute();
return leftAns + rightAns;

our entire array-summing algorithm would have no parallelism since each step would completely compute the left before starting to compute the right. Similarly, this version is non-parallel because it computes the right before starting to compute the left:

long rightAns = right.compute();
left.fork();
long leftAns = left.join();
return leftAns + rightAns;
3. If an exception gets thrown in compute or in some method it calls (yes, helper methods are a good idea for more complicated parallel tasks!), the debugger will not be as much help to you because the call-stack gets "lost" when the exception gets propagated up to you. Catching the exception in compute() and printing the stack trace would probably work; see the sketch below.
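For example, here is a sketch of that workaround, where doRealWork() is a hypothetical helper holding whatever your compute() actually does:

protected Long compute() {
    try {
        return doRealWork(); // hypothetical helper with the real logic
    } catch (RuntimeException e) {
        e.printStackTrace(); // print here, before the call-stack gets "lost"
        throw e;             // still propagate so the pool sees the failure
    }
}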
4. You should not call the invoke method of a ForkJoinPool from within a RecursiveTask or RecursiveAction. Instead you should always call compute or fork directly, even if the object is a different subclass of RecursiveTask or RecursiveAction. You may be conceptually doing a "different" parallel computation, but it is still part of the same DAG of parallel tasks. Only sequential code should call invoke to begin parallelism; the sketch below shows the right and wrong versions.
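As a sketch, where OtherTask is a hypothetical different subclass of RecursiveTask<Long>:

// inside some RecursiveTask's compute() method:
OtherTask other = new OtherTask();
// wrong: re-enters the pool from inside a task
//   Long ans = fjPool.invoke(other);
// right: call compute (or fork and later join) directly
Long ans = other.compute();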