Reconsider the Matrix Multiplication
If every processor had a copy of the A and B matrices, each could compute a rectangular subarray of C
- Memory footprint would be huge, P(mn+np) + Cr
- Transfer time of arrays to each memory would be Θ(mn+np), also huge
- Optimization -- C[i..i+x,j..j+y] requires only rows i..i+x of A and columns j..j+y of B
- Total numeric operations would be O(mpn), which should benefit from a P-way speedup
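
The optimization can be illustrated with a minimal sequential sketch in Python. It assumes A is m x n and B is n x p (inferred from the mn+np and mpn terms), stored as lists of lists. The names block_of_c and matmul_by_blocks are hypothetical, and x and y are taken here to be the block height and width (half-open ranges), which may differ by one from the slide's i..i+x notation. Each pass through the inner loop stands in for one processor: it is handed only a strip of rows of A and a strip of columns of B, never the full matrices.

```python
def block_of_c(a_rows, b_cols):
    """Compute one rectangular block of C from a strip of rows of A
    and a strip of columns of B (each column given as a list)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in b_cols]
            for row in a_rows]


def matmul_by_blocks(A, B, x, y):
    """Multiply A (m x n) by B (n x p), one x-by-y block of C at a time."""
    m, n, p = len(A), len(B), len(B[0])
    # Extract the columns of B once so a strip of columns is a simple slice.
    B_cols = [[B[k][j] for k in range(n)] for j in range(p)]
    C = [[0] * p for _ in range(m)]
    for i in range(0, m, x):
        for j in range(0, p, y):
            # The only data a "processor" owning this block would need:
            a_rows = A[i:i + x]        # at most x rows of A
            b_cols = B_cols[j:j + y]   # at most y columns of B
            block = block_of_c(a_rows, b_cols)
            # Copy the finished block into its place in C.
            for di, row in enumerate(block):
                C[i + di][j:j + len(row)] = row
    return C
```

Under these assumptions a processor holds at most (x+y)n input elements instead of mn+np, while the total arithmetic across all blocks is still proportional to mpn.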