van de Geijn and Watts’ SUMMA
•Scalable Universal Matrix Multiplication Algorithm
•Claimed to be the best practical algorithm for parallel matrix multiplication
•Uses overlap, pipelining, decomposition …
–Initialize C, using the same block decomposition for A, B, and C
•broadcast (segment of) 1st A column to all processors in row
•broadcast (segment of) 1st B row to all processors in column
–for i = 2 through n
•broadcast (segment of) next A column to all processors in row
•broadcast (segment of) next B row to all processors in column
•compute the (i-1)th term of the dot product for all elements of the local C block
–compute the last (nth) term for all elements of the local C block
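The steps above can be sketched as a sequential simulation. This is only an illustration under stated assumptions, not the authors' implementation: the function name summa, the p x p processor grid, the requirement that p divides n, and the dictionary of per-processor C blocks are all hypothetical choices made for the example. The "broadcasts" are plain Python lookups, so the overlap of communication and computation appears only in the order of the calls (fetch the next column and row, then compute with the previous ones).

```python
def summa(A, B, p):
    """Simulate SUMMA for n x n lists-of-lists A and B on a p x p grid.

    Assumption for this sketch: p divides n, so every block is b x b.
    """
    n = len(A)
    assert n % p == 0, "this sketch assumes the grid size divides n"
    b = n // p  # block edge length

    # Initialize C: each simulated processor (r, c) owns one b x b block.
    C_blocks = {(r, c): [[0] * b for _ in range(b)]
                for r in range(p) for c in range(p)}

    def broadcast(k):
        # Segment of A's column k used by every processor in grid row r.
        a_seg = {r: [A[r * b + i][k] for i in range(b)] for r in range(p)}
        # Segment of B's row k used by every processor in grid column c.
        b_seg = {c: B[k][c * b:(c + 1) * b] for c in range(p)}
        return a_seg, b_seg

    def update(a_seg, b_seg):
        # Each processor adds one term of the dot product to every element
        # of its block (a rank-1 update of the local C block).
        for (r, c), blk in C_blocks.items():
            for i in range(b):
                for j in range(b):
                    blk[i][j] += a_seg[r][i] * b_seg[c][j]

    pending = broadcast(0)            # 1st A column and 1st B row
    for k in range(1, n):             # i = 2 through n on the slide
        nxt = broadcast(k)            # next A column and next B row
        update(*pending)              # compute term i-1
        pending = nxt
    update(*pending)                  # compute the last term

    # Gather the blocks into one matrix (only to inspect the result).
    C = [[0] * n for _ in range(n)]
    for (r, c), blk in C_blocks.items():
        for i in range(b):
            C[r * b + i][c * b:(c + 1) * b] = blk[i]
    return C
```

For example, summa([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2) returns [[19, 22], [43, 50]], the same as the ordinary product. In a real message-passing version the two broadcasts would be row-wise and column-wise collective operations, and each would run concurrently with the preceding update.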