GENERAL INFORMATION:

The LU program factors a dense matrix into the product of a lower
triangular and an upper triangular matrix.  The factorization uses
blocking to exploit temporal locality on individual submatrix elements.
The algorithm used in this implementation is described in

     Woo, S. C., Singh, J. P., and Hennessy, J. L.  The Performance
     Advantages of Integrating Block Data Transfer in Cache-Coherent
     Multiprocessors.  Proceedings of the 6th International Conference
     on Architectural Support for Programming Languages and Operating
     Systems (ASPLOS-VI), October 1994.

Two implementations are provided in the SPLASH-2 distribution:

  (1) Non-contiguous block allocation

      This implementation (contained in the non_contiguous_blocks
      subdirectory) represents the matrix to be factored as a
      two-dimensional array.  This data structure prevents blocks from
      being allocated contiguously, but leads to a conceptually simple
      implementation.

  (2) Contiguous block allocation

      This implementation (contained in the contiguous_blocks
      subdirectory) represents the matrix to be factored as an array
      of blocks.  This data structure allows blocks to be allocated
      contiguously and entirely in the local memory of processors that
      "own" them, thus enhancing data locality.

These programs work under both the Unix FORK and SPROC models.
(Illustrative sketches of the blocked factorization and of the two
block layouts appear at the end of this file.)

RUNNING THE PROGRAM:

To see how to run the program, please see the comment at the top of
the file lu.C, or run the application with the "-h" command line
option.  Three parameters may be specified on the command line; the
ones that are normally changed are the matrix size and the number of
processors.  It is suggested that the block size be kept at B=16,
since this value works well in practice.  If this parameter is
changed, the new value should be reported in any results that are
presented.

BASE PROBLEM SIZE:

The base problem size for a machine with up to 64 processors is a
512x512 matrix with a block size of B=16.

DATA DISTRIBUTION:

Our "POSSIBLE ENHANCEMENT" comments in the source code indicate where
one might want to distribute data and how.  Data distribution has a
small impact on performance on the Stanford DASH multiprocessor.
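
ILLUSTRATIVE SKETCHES:

The following is a minimal, single-threaded sketch of the blocked,
right-looking factorization described under GENERAL INFORMATION above.
It is not the SPLASH-2 code: the names lu_factor, A, N and the tiny
problem size are illustrative assumptions, no pivoting is performed,
and there is no parallelism or block ownership.  L and U overwrite A,
with L carrying an implied unit diagonal.

    /*
     * Sketch only -- not taken from the SPLASH-2 sources.
     * Blocked, right-looking LU factorization without pivoting.
     */
    #include <stdio.h>

    #define N 8                /* matrix dimension (small, for clarity) */
    #define B 4                /* block size; the README suggests B=16  */
    #define MIN(x, y)  ((x) < (y) ? (x) : (y))

    static double A[N][N];

    static void lu_factor(void)
    {
        for (int k = 0; k < N; k += B) {
            int kend = MIN(k + B, N);

            /* 1. Factor the panel A[k..N)[k..kend): ordinary unblocked
                  LU restricted to the block column's B columns.        */
            for (int j = k; j < kend; j++)
                for (int i = j + 1; i < N; i++) {
                    A[i][j] /= A[j][j];                /* L multiplier  */
                    for (int m = j + 1; m < kend; m++)
                        A[i][m] -= A[i][j] * A[j][m];  /* update panel  */
                }

            /* 2. Row blocks to the right of the diagonal block:
                  forward-substitute with the unit-lower block L(k,k).  */
            for (int j = kend; j < N; j++)
                for (int i = k + 1; i < kend; i++)
                    for (int m = k; m < i; m++)
                        A[i][j] -= A[i][m] * A[m][j];

            /* 3. Rank-B update of the trailing (interior) blocks.      */
            for (int i = kend; i < N; i++)
                for (int j = kend; j < N; j++)
                    for (int m = k; m < kend; m++)
                        A[i][j] -= A[i][m] * A[m][j];
        }
    }

    int main(void)
    {
        /* Diagonally dominant test matrix, so no pivoting is needed.   */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = (i == j) ? N + 1.0 : 1.0;

        lu_factor();
        printf("U(0,0) = %g, L(1,0) = %g\n", A[0][0], A[1][0]);
        return 0;
    }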
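
The second sketch contrasts the two allocation schemes described
above.  It is likewise an assumption-laden illustration, not the
SPLASH-2 code: the names alloc_flat and alloc_blocked are invented,
and plain malloc() stands in for the benchmark's own shared-memory
allocation macros.

    /*
     * Sketch only -- not taken from the SPLASH-2 sources.
     * Two ways to store an N x N matrix split into B x B blocks.
     */
    #include <stdlib.h>

    #define N  512              /* matrix dimension                     */
    #define B  16               /* block size                           */
    #define NB (N / B)          /* number of blocks per dimension       */

    /* (1) Non-contiguous block allocation: one flat two-dimensional
           array.  Element (i,j) lives at a[i*N + j], so the B*B
           elements of one block are spread over B separate rows.       */
    double *alloc_flat(void)
    {
        return malloc((size_t)N * N * sizeof(double));
    }

    /* (2) Contiguous block allocation: one pointer per block, each
           block a single contiguous run of B*B doubles that could be
           placed in the local memory of the processor that owns it.
           Element (i,j) lives at
           blocks[(i/B)*NB + (j/B)][(i%B)*B + (j%B)].                    */
    double **alloc_blocked(void)
    {
        double **blocks = malloc((size_t)NB * NB * sizeof(double *));
        for (int b = 0; b < NB * NB; b++)
            blocks[b] = malloc((size_t)B * B * sizeof(double));
        return blocks;
    }

    int main(void)
    {
        double  *flat    = alloc_flat();
        double **blocked = alloc_blocked();

        /* Touch element (17, 33) in both layouts.                      */
        flat[17 * N + 33] = 1.0;
        blocked[(17 / B) * NB + (33 / B)][(17 % B) * B + (33 % B)] = 1.0;

        free(flat);
        for (int b = 0; b < NB * NB; b++)
            free(blocked[b]);
        free(blocked);
        return 0;
    }

With B=16, each contiguous block occupies 16*16*8 = 2048 bytes, small
enough to fit comfortably in cache and to be placed as a unit in the
local memory of the processor that owns it.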