Different techniques illustrated:
Decompose into independent tasks
Pipelining
Overlapping computation and communication
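
As a rough illustration of the first technique, decomposition into independent tasks, here is a minimal Python sketch. It assumes a matrix-vector product as the workload, which the slides do not specify; the names `row_times_vector` and `parallel_matvec` are hypothetical. Each row's dot product depends on no other row, so every row can be handed out as its own task.

```python
from concurrent.futures import ThreadPoolExecutor


def row_times_vector(row, x):
    """One independent task: dot product of a single matrix row with x."""
    return sum(a * b for a, b in zip(row, x))


def parallel_matvec(A, x, workers=4):
    """Compute y = A x by submitting each row of A as a separate task.

    Assumption: A is a list of rows, x is a list of matching length.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so result i belongs to row i of A.
        return list(pool.map(lambda row: row_times_vector(row, x), A))


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    x = [10, 1]
    print(parallel_matvec(A, x))
```

The thread pool only shows the task structure. In standard CPython the global interpreter lock keeps pure-Python arithmetic like this from running in parallel, so a real speedup would need processes or native code.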
Optimizations
Enlarge task size, e.g. several rows/columns at once
Improve caching by blocking
Reorder computation to use data once
Exploit broadcast communication
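
The blocking optimization can be sketched as follows, assuming matrix multiplication as the example computation (an assumption, not something the slides state). The function name `matmul_blocked` and the `block` parameter are hypothetical. The loops walk over small tiles so that the data touched by the inner loops stays small enough to remain in cache while it is reused.

```python
def matmul_blocked(A, B, block=2):
    """Multiply A (n x m) by B (m x p) one tile at a time.

    Assumption: A and B are non-empty lists of rows with compatible shapes.
    `block` is the tile edge length; a real choice depends on cache size.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    # Outer loops pick a tile of A (rows ii.., cols kk..) and of B (rows kk.., cols jj..).
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # Inner loops do an ordinary multiply restricted to that tile.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        a = A[i][k]  # read once, reused across the j loop
                        for j in range(jj, min(jj + block, p)):
                            C[i][j] += a * B[k][j]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(matmul_blocked(A, B))
```

In pure Python the cache effect is hidden by interpreter overhead, so this only shows the loop structure. The same tiling in C or Fortran is where the benefit would be expected to appear.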