Scalable Universal MM Algorithm
Claimed to be the best practical algorithm
Uses overlap, pipelining, decomposition
Initialize C, blocking all arrays the same
broadcast (segment of) 1st A column to processors in row
broadcast (segment of) 1st B row to processors in column
for i = 2 through n
    broadcast (segment of) next A column to all processors in row
    broadcast (segment of) next B row to all processors in column
    compute i-1 term in dot product for all elements of block
compute last term for all elements of block
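
The loop above can be sketched as a serial NumPy simulation. This is only an illustration under stated assumptions, not a parallel implementation: the function name `summa_simulated`, the parameters `grid_rows` and `grid_cols`, and the even block partitioning are all hypothetical, and the "broadcasts" are modeled as plain array slices. The point it tries to show is the pipelining pattern: each processor already holds the next A-column and B-row segments while it accumulates the previous term as an outer product into its block of C.

```python
import numpy as np

def summa_simulated(A, B, grid_rows, grid_cols):
    """Serially simulate the broadcast/accumulate loop on a hypothetical
    grid_rows x grid_cols processor grid (illustrative names only)."""
    m, n = A.shape
    n_b, p = B.shape
    assert n == n_b, "inner dimensions must match"

    # Assumed blocking: processor (r, c) owns one contiguous block of C.
    row_edges = np.linspace(0, m, grid_rows + 1).astype(int)
    col_edges = np.linspace(0, p, grid_cols + 1).astype(int)
    C = np.zeros((m, p))  # Initialize C

    for r in range(grid_rows):
        for c in range(grid_cols):
            rows = slice(row_edges[r], row_edges[r + 1])
            cols = slice(col_edges[c], col_edges[c + 1])

            # "Broadcast" segments of the 1st A column and 1st B row.
            a_seg = A[rows, 0]
            b_seg = B[0, cols]

            for i in range(1, n):
                # Receive the next segments; on a real machine this
                # communication would overlap with the compute below.
                a_next = A[rows, i]
                b_next = B[i, cols]
                # Accumulate the previous term for every element of the block.
                C[rows, cols] += np.outer(a_seg, b_seg)
                a_seg, b_seg = a_next, b_next

            # Compute the last term for all elements of the block.
            C[rows, cols] += np.outer(a_seg, b_seg)
    return C

# Usage: compare against NumPy's built-in matrix product.
A = np.arange(12, dtype=float).reshape(3, 4)
B = np.arange(8, dtype=float).reshape(4, 2)
C = summa_simulated(A, B, grid_rows=2, grid_cols=2)
matches_reference = np.allclose(C, A @ B)
```

Each block of C is built as a sum of n outer products, so a processor only ever needs one column segment of A and one row segment of B at a time. That is what allows the broadcasts to be pipelined ahead of the computation.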