













- ZPL exploits blocked data transfer in two ways
  - Vectorization moves array slices as a single unit -- ZPL vectorizes naturally because it compiles array operations
  - Combining communications to the same destination reduces the per-message overhead and benefits from pipelining
- Communication is also pipelined, allowing communication to overlap with computation
- Goals of combining and pipelining can conflict
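
The benefit of moving data in blocks can be illustrated with a simple linear cost model. This is only a sketch: the startup cost `alpha` and per-word cost `beta` are hypothetical values chosen for illustration, not measurements of any machine or of ZPL's runtime.

```python
def transfer_time(n_words, n_messages, alpha=100.0, beta=1.0):
    """Time to move n_words using n_messages messages.

    alpha: assumed fixed startup overhead per message (hypothetical units)
    beta:  assumed cost per word transferred (hypothetical units)
    """
    return n_messages * alpha + n_words * beta


# Sending a 1000-word array slice one word at a time pays the startup
# overhead 1000 times; sending it as one vectorized message pays it once.
separate = transfer_time(1000, 1000)
vectorized = transfer_time(1000, 1)

# Combining two 500-word messages to the same destination saves one startup.
two_messages = transfer_time(1000, 2)
combined = transfer_time(1000, 1)

print(separate, vectorized, two_messages, combined)
```

Under this model the saving from combining is exactly one `alpha` per merged message. The model does not capture the conflict with pipelining: a combined message cannot leave until all of its data is ready, which delays the point at which communication can start overlapping with computation.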




















### Basics of Denelcor HEP

- First interleaved multithreaded machine (1978-85)
- Each processor had 64 user contexts and 64 privileged contexts, with a 128-way replicated register file and state
- Contention-free memory (20-40 cycles) in a dancehall design
- Processor had an 8-deep pipeline, but only one memory, branch or divide instruction could be in the pipe at a time

Copyright, Lawrence Snyder, 1999


### Basics of Tera Design

- Instructions are [arithmetic, control, memory] or [arithmetic, arithmetic, memory]

- Ready instructions issue on each tick, but there is a 16-tick minimum issue delay between consecutive instructions from the same thread
- Each (memory) instruction has a 3-bit tag telling how many of the following instructions are independent of this memory reference
- Average memory latency w/o contention 70 cycles

## More On Tera


- Since there is a 16-tick minimum issue delay, it takes 16 threads just to keep the processor utilized, before any memory latency is hidden
- Each processor has 128 fully replicated contexts
- Synchronization latency can even be covered
- When everything works, the Tera should approximate a PRAM
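
The thread counts above can be explored with a back-of-the-envelope model. This is a simplified sketch, not a description of the actual Tera issue logic: it assumes every thread repeats the same pattern of one memory reference followed by `lookahead` independent instructions (the count a 3-bit tag can express, 0 to 7), and that the next instruction must wait for the memory reference to return. The 16-tick issue delay and 70-cycle latency are the figures from the slides; the function name is hypothetical.

```python
import math

ISSUE_DELAY = 16   # ticks between consecutive instructions of one thread (from the slides)
MEM_LATENCY = 70   # average uncontended memory latency in cycles (from the slides)


def threads_needed(lookahead, issue_delay=ISSUE_DELAY, latency=MEM_LATENCY):
    """Threads needed to issue an instruction every tick in this simplified model.

    A thread issues one memory reference plus `lookahead` independent
    instructions, spaced issue_delay ticks apart, then waits until both the
    issue delay and the memory latency have elapsed before repeating.
    """
    instrs = lookahead + 1
    period = max(issue_delay * instrs, latency)   # ticks per repetition of the pattern
    return math.ceil(period / instrs)             # one instruction must issue per tick


for k in range(8):
    print(k, threads_needed(k))
```

In this model a thread with no lookahead needs about 70 companions to cover the memory latency, while lookahead of 4 or more brings the requirement down to the 16 threads imposed by the issue delay alone. Both figures fit within the 128 contexts per processor.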





### Latency Tolerance Summary

- Two main approaches: blocked & interleaved
- Approaches differ in their single thread performance
- It may be tough to find all those threads w/o language or programmer assistance
- Programming on the assumption of aggressive latency tolerance may yield a very unportable program
- Some further discussion in Section 11.7


## Reading

- J. T. Schwartz, Ultracomputers, ACM ToPLAS
- Valiant BSP
- Sung-Eun Choi, "Machine Independent Communication Optimization", PhD Dissertation, University of Washington, 1999
- B. J. Smith, Architecture and Applications of the HEP Multiprocessor, Proc. SPIE: Real Time Signal Processing IV 298, pp 241-248



# Parallel Algorithms: LU Decomposition

- Solving systems of linear equations is a critical part of many scientific computations
- Recall that the standard solution "marches" to the lower right corner of the matrix, leading to poor load balance
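
The load-balance problem can be made concrete by counting the update work each row receives during elimination. The sketch below assumes elimination without pivoting and counts one unit of work per updated matrix entry. It compares a block row distribution with a cyclic one; the cyclic distribution is a commonly used alternative introduced here only for comparison, and all function names are hypothetical.

```python
def row_work(n):
    """Update operations applied to each row of an n x n matrix during elimination."""
    work = [0] * n
    for k in range(n - 1):             # pivot step k
        for i in range(k + 1, n):      # only rows below the pivot are updated
            work[i] += n - k - 1       # one update per column right of the pivot
    return work


def per_processor(work, p, owner):
    """Total work per processor, given a function mapping row index to owner."""
    load = [0] * p
    for i, w in enumerate(work):
        load[owner(i)] += w
    return load


n, p = 8, 4
work = row_work(n)
block = per_processor(work, p, lambda i: i // (n // p))   # contiguous rows per processor
cyclic = per_processor(work, p, lambda i: i % p)          # rows dealt out round-robin

print("block: ", block)    # [7, 31, 47, 55]
print("cyclic:", cyclic)   # [22, 32, 40, 46]
```

With the block distribution the processor owning the top rows finishes almost immediately and then idles while the active region marches toward the lower right corner. Dealing rows out cyclically keeps every processor involved for longer, though some imbalance remains.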











# N-body (Barnes Hut) Algorithm

- Construct the tree
- Compute the attractions of the other points by traversing the tree; at a node, if the bodies are close, compute pairwise attractions; if they are distant, compute an approximation and do not traverse any lower
- Totality of attractions induces a new position
- Variations
  - Alternative tree structures
  - Salmon uses an out-of-core algorithm using a space-filling curve to promote locality
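
The tree construction and traversal can be sketched in two dimensions as follows. This is a minimal illustration under stated assumptions, not the formulation used in any particular code: bodies lie in the unit square, have distinct positions and positive mass, the gravitational constant is taken as 1, and "distant" is decided by the common opening test (cell width divided by distance below a threshold `theta`). All names are hypothetical.

```python
import math


class Node:
    """A square cell of the quadtree."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half   # cell centre and half-width
        self.mass = 0.0
        self.mx = self.my = 0.0      # mass-weighted coordinate sums
        self.body = None             # (x, y, m) when the cell holds exactly one body
        self.children = None         # four sub-cells once the cell is split

    def _child(self, x, y):
        return self.children[(x >= self.cx) + 2 * (y >= self.cy)]

    def insert(self, x, y, m):
        if self.body is None and self.children is None:
            self.body = (x, y, m)                    # empty leaf: store the body
        else:
            if self.children is None:                # occupied leaf: split it
                h = self.half / 2
                self.children = [Node(self.cx + dx * h, self.cy + dy * h, h)
                                 for dy in (-1, 1) for dx in (-1, 1)]
                bx, by, bm = self.body
                self.body = None
                self._child(bx, by).insert(bx, by, bm)
            self._child(x, y).insert(x, y, m)
        self.mass += m
        self.mx += m * x
        self.my += m * y


def build(bodies):
    """Construct the tree over the unit square (assumed bounding box)."""
    root = Node(0.5, 0.5, 0.5)
    for x, y, m in bodies:
        root.insert(x, y, m)
    return root


def accel(node, x, y, theta=0.5):
    """Acceleration at (x, y) due to the bodies in node, with G = 1."""
    if node.children is None:
        if node.body is None or node.body[:2] == (x, y):
            return 0.0, 0.0                          # empty cell, or the body itself
        px, py, m = node.body                        # close: pairwise attraction
    else:
        px, py, m = node.mx / node.mass, node.my / node.mass, node.mass
    dx, dy = px - x, py - y
    dist = math.hypot(dx, dy)
    distant = dist > 0 and 2 * node.half < theta * dist
    if node.children is not None and not distant:
        ax = ay = 0.0
        for child in node.children:                  # too close: traverse lower
            cax, cay = accel(child, x, y, theta)
            ax += cax
            ay += cay
        return ax, ay
    f = m / dist ** 3                                # body, or cell's centre of mass
    return f * dx, f * dy


def direct_accel(bodies, x, y):
    """Reference all-pairs sum, for comparison only."""
    ax = ay = 0.0
    for bx, by, m in bodies:
        if (bx, by) != (x, y):
            d = math.hypot(bx - x, by - y)
            ax += m * (bx - x) / d ** 3
            ay += m * (by - y) / d ** 3
    return ax, ay


bodies = [(0.1, 0.2, 1.0), (0.8, 0.3, 2.0), (0.4, 0.9, 1.5), (0.7, 0.7, 0.5)]
root = build(bodies)
print(accel(root, 0.1, 0.2, theta=0.5), direct_accel(bodies, 0.1, 0.2))
```

Setting `theta` to 0 disables the approximation, so the traversal visits every body and reproduces the all-pairs result. Larger values prune more of the tree at the cost of accuracy.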
