Parallel Algorithm Analysis [1]
The runtime for a parallel algorithm depends not only on the size of the input, \(N\), but also the number of processors, \(P\). We define \(T_P\) to be the time an algorithm takes to run if there are \(P\) processors available during its execution. For example, if an algorithm was the only one running on a quad-core machine, we would be particularly interested in \(T_4\). We care about two special cases in particular.
- \(T_1\) is the work: how long it takes to run on one processor.
- \(T_\infty\) is the span, also called the critical path length or computational depth. This is how long it takes to run on an unlimited number of processors. This is not necessarily constant time: even with an unlimited number of processors, SumTask has \(T_\infty \in \Theta(\log N)\).
In order to model the runtime of a parallel algorithm, we will describe its execution as a directed acyclic graph (DAG).
- Nodes represent pieces of work performed by the algorithm, such as summing the result of two smaller tasks. Each node is a constant-time operation. The order of growth of \(T_1\) is given by the number of nodes in the DAG.
- Edges represent computational dependencies: which nodes rely on the results of other nodes. The order of growth of \(T_\infty\) is given by the length of the longest path in the DAG.
How do we compute the longest path in a DAG?
Recall that we can perform a reduction to the shortest path in a DAG.
- Set the weight of each edge to -1.
- Run the DAG shortest paths algorithm from the source vertex.
- Return the shortest path. This is the longest path in the original DAG.
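The reduction above can be sketched in Java as follows. This is a minimal illustration, not a reference implementation: the class name `DagLongestPath`, the adjacency-list representation, and the example DAG (a hypothetical execution DAG for summing an array split into four base cases) are all assumptions made for this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class DagLongestPath {
    static final int UNREACHED = Integer.MAX_VALUE;

    /** Returns the number of edges on the longest path that starts at source. */
    static int longestPathFrom(List<List<Integer>> adj, int source) {
        int n = adj.size();

        // Topological order via Kahn's algorithm (repeatedly remove in-degree-0 vertices).
        int[] inDegree = new int[n];
        for (List<Integer> outs : adj) {
            for (int w : outs) inDegree[w]++;
        }
        Deque<Integer> ready = new ArrayDeque<>();
        for (int v = 0; v < n; v++) {
            if (inDegree[v] == 0) ready.add(v);
        }
        List<Integer> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            int v = ready.remove();
            order.add(v);
            for (int w : adj.get(v)) {
                if (--inDegree[w] == 0) ready.add(w);
            }
        }

        // DAG shortest paths where every edge has weight -1.
        int[] dist = new int[n];
        Arrays.fill(dist, UNREACHED);
        dist[source] = 0;
        int shortest = 0;
        for (int v : order) {
            if (dist[v] == UNREACHED) continue; // not reachable from source
            for (int w : adj.get(v)) {
                dist[w] = Math.min(dist[w], dist[v] - 1);
            }
            shortest = Math.min(shortest, dist[v]);
        }
        // The most negative shortest path is the longest path in the original DAG.
        return -shortest;
    }

    /** Hypothetical DAG: 0 divides; 1, 2 divide again; 3..6 are base cases; 7, 8, 9 combine. */
    static List<List<Integer>> sumDag() {
        return List.of(
            List.of(1, 2), List.of(3, 4), List.of(5, 6),
            List.of(7), List.of(7), List.of(8), List.of(8),
            List.of(9), List.of(9), List.<Integer>of());
    }

    public static void main(String[] args) {
        List<List<Integer>> dag = sumDag();
        System.out.println("work (nodes) = " + dag.size());
        System.out.println("span (edges on longest path) = " + longestPathFrom(dag, 0));
    }
}
```

On the example DAG, the longest path is 0, 1, 3, 7, 9, so the sketch reports 10 nodes of work and a span of 4 edges.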
For the same reason that we don’t solve runtime recurrences using tree traversals, we typically use an analytical approach (i.e. reasoning with mathematics) rather than this computational approach to determine the runtime of parallel algorithms.
The source node at the top of the DAG represents the computation that divides the array into two equal halves. The sink node at the bottom of the DAG represents the computation that adds together the two sums from the equal halves to produce the final answer. The base cases represent processing a constant number of items in the array. In this DAG, \(T_1 \in \Theta(N)\) while \(T_\infty \in \Theta(\log N)\).
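For reference, a task with this DAG shape might look like the following sketch. It assumes the standard `java.util.concurrent` fork/join classes (`RecursiveTask`, `ForkJoinPool`). The field names, the `sum` helper, and the cutoff value of 1,000 are illustrative assumptions rather than the course's actual SumTask.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    // Assumed cutoff: below this size, sum sequentially (a base case in the DAG).
    static final int SEQUENTIAL_CUTOFF = 1000;

    private final int[] arr;
    private final int lo; // inclusive
    private final int hi; // exclusive

    SumTask(int[] arr, int lo, int hi) {
        this.arr = arr;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= SEQUENTIAL_CUTOFF) {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += arr[i];
            return sum;
        }
        // Divide step: split the range into two equal halves.
        int mid = lo + (hi - lo) / 2;
        SumTask left = new SumTask(arr, lo, mid);
        SumTask right = new SumTask(arr, mid, hi);
        left.fork();                     // run the left half asynchronously
        long rightSum = right.compute(); // compute the right half in this thread
        long leftSum = left.join();      // wait for the left half
        // Combine step: add the two partial sums.
        return leftSum + rightSum;
    }

    /** Convenience wrapper (hypothetical helper) that runs the task on the common pool. */
    static long sum(int[] arr) {
        return ForkJoinPool.commonPool().invoke(new SumTask(arr, 0, arr.length));
    }

    public static void main(String[] args) {
        int[] data = new int[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data)); // sum of 1..100000
    }
}
```

Each call to `compute` above the cutoff corresponds to one divide node and one combine node in the DAG, and each call at or below the cutoff corresponds to a base case.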
Having defined work and span, we can use them to define some other terms more relevant to our real goal of reasoning about \(T_P\).
- \(T_1 / T_P\) is the speedup on \(P\) processors: how many times faster the algorithm runs on \(P\) processors than on one.
- Perfect linear speedup occurs when \(T_P = T_1 / P\): for example, doubling the number of processors \(P\) results in halving the runtime. In practice, computers don’t achieve perfect linear speedup.
- \(T_1 / T_\infty\) (work/span) is the parallelism of an algorithm: the maximum possible speedup, reached in the best-case scenario of unlimited processors. For example, the parallelism of SumTask is \(\Theta(N / \log N)\): the work grows exponentially faster than the span, so the achievable speedup grows almost linearly with \(N\).
Divide-and-conquer algorithms written using the fork/join framework have the following expected time bound: \(T_P \in O(T_1 / P + T_\infty)\).
For a small number of processors (P), which term in the sum is more likely to dominate?
For small \(P\), \(T_1 / P\) is likely to dominate \(T_\infty\), so the runtime is close to the best-possible perfect linear speedup. Constant factors are absorbed by the big-O notation.
For large \(P\), \(T_\infty\) is likely to dominate \(T_1 / P\), so the runtime is limited by the span of the parallel algorithm.
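To see which term dominates, the two terms of the bound can be tabulated for a summing algorithm with \(T_1 = N\) and \(T_\infty = \log_2 N\). This is a rough sketch that assumes all constant factors are 1. The class and method names are made up for illustration.

```java
class ForkJoinBound {
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** The work term T_1 / P, modeled as N / P. */
    static double workTerm(long n, long p) {
        return (double) n / p;
    }

    /** The span term T_inf, modeled as log2(N). */
    static double spanTerm(long n) {
        return log2(n);
    }

    /** The modeled bound T_1 / P + T_inf. */
    static double bound(long n, long p) {
        return workTerm(n, p) + spanTerm(n);
    }

    public static void main(String[] args) {
        long n = 1L << 20; // about one million items
        long[] processors = {1, 4, 1024, 1L << 20};
        for (long p : processors) {
            String dominant = workTerm(n, p) > spanTerm(n) ? "T_1/P" : "T_inf";
            System.out.printf("P=%d: T_1/P=%.1f, T_inf=%.1f, dominant=%s%n",
                p, workTerm(n, p), spanTerm(n), dominant);
        }
    }
}
```

Under these assumptions, with \(N = 2^{20}\) the work term is 262,144 on 4 processors against a span term of 20, while on \(2^{20}\) processors the work term drops to 1 and the span term dominates.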
Thinking in terms of the algorithm-execution DAG, it is rather amazing that a library can achieve this optimal result. While the algorithm is running, it is the framework’s job to choose among all the threads that could run next (not blocked waiting for some other thread to finish) and assign \(P\) of them to processors. In order to achieve this runtime bound, it’s necessary for us as the framework client to design parallel algorithms that scale with the amount of computational resources. In our parallel algorithm analysis, the cost model for computational resources is just \(P\) but things get more complicated in practice. For instance, divide-and-conquer style parallel algorithms need to specify a reasonable sequential cutoff somewhere in the range of 5,000 basic operations (e.g. arithmetic operations, array operations, and method calls).
Oftentimes, not all parts of our algorithms can be parallelized. It is common to think of an algorithm-execution DAG in terms of some entirely parallel parts (e.g. maps and folds) interwoven with some entirely sequential parts. The sequential parts could simply be algorithms that have not been parallelized or they could be inherently sequential, like processing data in a linked list or binary heap data structure.
However, even a little bit of sequential work drastically reduces the parallelism of an algorithm. Suppose a fraction \(S\) of an algorithm's work is sequential and the remaining fraction \(1 - S\) is perfectly parallel. The speedup on \(P\) processors is given by Amdahl’s law.
- Speedup (Amdahl’s law): \(T_1 / T_P = 1 / (S + (1 - S) / P)\)
- Parallelism (Amdahl’s law): \(T_1 / T_\infty = 1 / S\)
If 10% of an algorithm is sequential, what is the parallelism of the algorithm?
The parallelism is \(1 / S = 1 / 0.1 = 10\): even with an infinite number of processors, the speedup is at most 10. No matter how many processors are available, the parallelism of the algorithm is limited by the sequential portions.
Adding a second or third processor can often provide significant speedup, but as the number of processors grows, the benefit quickly diminishes. In other words, once many processors are available, reducing span matters more than reducing work.
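The diminishing returns can be checked numerically with a small sketch of Amdahl's law. The class and method names are illustrative assumptions, and `s` is the sequential fraction as a number between 0 and 1.

```java
class Amdahl {
    /** Speedup on p processors: 1 / (S + (1 - S) / P). */
    static double speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    /** Parallelism (speedup with unlimited processors): 1 / S. */
    static double parallelism(double s) {
        return 1.0 / s;
    }

    public static void main(String[] args) {
        double s = 0.10; // 10% of the work is sequential
        int[] processors = {1, 2, 10, 100, 1000};
        for (int p : processors) {
            System.out.printf("P=%d: speedup=%.2f%n", p, speedup(s, p));
        }
        System.out.printf("limit: %.2f%n", parallelism(s));
    }
}
```

With 10% sequential work, 2 processors give a speedup of about 1.82 and 10 processors about 5.26, but 100 processors give only about 9.17, already close to the limit of 10.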
Even though parallelism is limited by the sequential portions of an algorithm, there are two common workarounds to Amdahl’s law.
- Develop better parallel algorithms. Given enough processors, it is worth parallelizing something (reducing span) even if it means more total computation (increasing work). This is often described as “scalability matters more than performance” where scalability is defined in terms of \(P\).
- Solve larger instances of problems. Suppose the parallel part of an algorithm is in \(O(N^2)\) while the sequential part is in \(O(N)\). With more processors available (increasing \(P\)), we can solve larger problem instances (increasing \(N\)) with only a relatively small increase in running time. This has been an important principle in computer graphics to render larger and more accurate graphics as well as in machine learning to increase the complexity of deep neural networks. Both of these applications use specialized processing units containing hundreds or thousands of processors each.
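As a rough illustration of solving larger instances, the sketch below models the runtime as \(N + N^2 / P\): a sequential part of \(N\) steps plus a perfectly parallel part of \(N^2\) steps shared among \(P\) processors. The model, its constant factors of 1, and the names used are assumptions for illustration only.

```java
class ScaledProblem {
    /** Modeled runtime: sequential part N plus parallel part N^2 / P. */
    static double modeledTime(long n, long p) {
        return n + (double) n * n / p;
    }

    public static void main(String[] args) {
        // Doubling N while quadrupling P keeps the parallel part's time unchanged.
        System.out.println(modeledTime(1000, 10)); // 1000 + 100000
        System.out.println(modeledTime(2000, 40)); // 2000 + 100000
    }
}
```

In this model, going from \(N = 1000\) on 10 processors to \(N = 2000\) on 40 processors raises the modeled time from 101,000 to 102,000 steps, an increase of about 1% for a problem twice as large.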
[1] Dan Grossman. 2016. A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency. https://homes.cs.washington.edu/~djg/teachingMaterials/spac/sophomoricParallelismAndConcurrency.pdf