CSE 596 Assignment 2 Solution

University of Washington Department of Computer Science & Engineering

Parallel Computing

CSE Home

Syllabus

Programming with ZPL

Assignments

Assignment 1 (and soln)

Assignment 2 (html) (pdf) (ppt)

Assignment 3

Assignment 2 Solution

Assumptions
People made a variety of assumptions about the program, we tried to allow for these assumptions as best as possible. Here are some of the assumptions people made:

Messages per operation - a read or write (i.e. invalidate) requires some number of operations to perform. (e.g. an invalidate might be two messages, Invalidate and Acknowledge, a read might require four, IWantX, JoeHasX, HeyJoeWhatIsX, Its27). Some of you will notice little i=2 and r=2 notes on your papers, that a note to myself when I checked your math. In addition a few people counted "self messages" in their tally.
Cache size - Most people took advantage of our statement that you can assume the cache is large enough to hold all the data you'll need. A few people actually took into account the effects of a smaller cache.
Zero buffer - Most people took advantage of the "free" buffer of zero values around the matrix. A couple of people used the zero buffer but "charged" for using it (i.e. counted those reads as messages.) Although this is technically wrong (after the second iteration, every processor will have all the zero buffer data it needs in its local cache, and since the zeros never get changed, those cache lines will never be invalidated and will always be there), I didn't take off points for this. A couple of people did things like resize the matrix so that it was 96x96 after including the zero buffer. They justified this by arguing that having a 98x98 matrix (including zeros) would ruin cache alignment. This was a reasonable way to go as well. (Though 96x96 with a zero buffer doesn't necessarily mess up cache alignment, since the zero buffer could be stored elsewhere, sort of like a flood buffer.)
Wrapping - Most people just used the zero buffer, but a couple used a toroidal matrix instead, making communication necessary on all four edges of each submatrix.
Synchronized steady state computation - Most people assumed the computation took place one iteration at a time, and all processors were synced to be in the same iteration. (I think that this is actually required for correctness, unless you are very careful about synchronizing every write so that you don't write a value that hasn't yet been used by another processor that needs it for the previous iteration.) A couple of people analyzed the case where the processors got "out of phase".
For purposes of this solution, the assumptions we make are:

96x96 matrix plus a separate buffer of zero values (with cache alignment preserved)
Infinite cache
32 byte cache lines with 8 byte (e.g. double) values, so 4 values per cache line
2 messages per invalidate (i=2), 4 messages per remote read (r=4). I use these values primarily to ease readibility below. It will help distinguish how reads and invalidates contribute differentially to the total.
Data stored in row-major order within each submatrix
Data processed in row-major order (though it doesn't really matter with an infinite cache)
Self-messages (messages from a processor to itself) are not counted.
First, some terminology. I'll consider the following processor layout:
P0 P1 P2 P3
I'll be using -1 bounded matrices. I.e. the full logical matrix runs from -1 to 97 with rows (and columns) -1 and 97 being the zero buffers. Since no communication ever takes place for the zero buffers after the first two iterations, I'll essentially ignore these values. To be explicit, P0 is home for the data values from (-1, -1) to (47, 47), P1 has (-1, 48) to (47, 97), P2 has (48,-1) to (97,47) and P3 has (48,48) to (97,97).
First off, most people recognized that the computation was symmetric and computing messages for one processor and then multiplying by four was sufficient. I will analyze P0 (the upper left hand processor) and then multiply by four.
Before describing the actual messages, let's look at what parts of the computation access data from another processor.
The computation of all data values between (0, 0) and (46, 46) use only local values, and so involve no communication at all. The only commication is done in column 47 and row 47. For cells (0..46, 47), (i.e. in column 47, for each row from 0 to 46), three values are used from P1. For cells (47, 0..46), three values are used from P2. Finally for cell (47, 47), two values are used from P1, two from P2 and one from P3.
Some people left the analysis at that, but this can be decieving. Although three external values are used to compute each cell in the final column (for example), these values are read in and cached locally, so each one need only be communicated once. So, for example, both cells (0, 48) and (1,48) are read by P0 when computing cell (0,47). Now the data in (1,48) is also used to compute cell (1,47), but by then it's in P0's cache and there's no need to do another remote read. It won't be written to by P1 in this iteration (that's what S' is for), so the cached value will remain valid. So we're really concerned with the number of items that must be communicated, not the number of times they are used (or the order they are used in.) We can thank the infinite cache for this simplification. The total number of external values that are used is 49 from P1 (-1..47, 48), plus 49 from P2 (48, -1..47) plus 1 from P3 (48, 48).
Next we need to figure out how many cache-lines these data values are stored in. The column (-1..47, 48) is easy, there is one cache line involved per row. Similarly, the one value on P3 at (48, 48) will be in another cache line. However for the row (48, -1..47), the data are sequential, and so every four values are packed into a single cache line. (Except for (-1, 48) which is stored elsewhere.) Thus the row requires only 12 cache lines total.
Having determined the number of external values used and the number of cache lines required to hold them, we can tackle the issue of messages sent. In order to do this, we need to take a look at the state of the computation (in particular cache status) after the previous iteration has completed. In the n-1st iteration, we were using S' to read from and writing to S. All the "internal" cells, those that weren't involved in communication, are marked as modified in both S and S', but since they are never shared, we can continue modifying them with no impact on communication cost. The interesting cases are those cache lines containing data that was used in communication. The values in S were written in the previous iteration, so any off-processor copies were invalidated. This means that all off-processor values from S that are used in the current iteration will require reads. Similarly, in the previous iteration, S' was being read from. That means that all the off-processor accesses to these values were recorded as shared reads in the value's home processor. Thus in S, every cache line described above requires one read and in S' every cache line requires one invalidate per sharer. This is all with the exception of the zero buffer rows and columns. These generate no messages at all.
So we can now compute the total number of messages required. There are 48 reads sent from P0 to P1 to read the values in the shared column of data, 12 reads sent from P0 to P2 to read the values in the shared row and one read sent from P0 to P3 to read the single value on the diagonal. Similarly, there are 48 + 12 + 1 invalidates sent from P0 to other processors to invalidate the shared values in S'. So if r is the number of messages required for a read (in our case, 4), and i is the number required for an invalidate (in our case, 2), then the number of messages for a single processor is
(48 + 12 + 1) * r + (48 + 12 + 1) * i
= 61 * r + 61 * i
= 61 * (r + i)
= 61 * (4 + 2)
= 61 * 6
= 366
Multiply this value by 4 for the four processors and you get 366 * 4 = 1464 messages for the entire computation.
Common mistakes
The most common mistakes people made were:

Double counting values that are cached locally - some people didn't take into account that once a value is read from a neighboring processor, it is cached locally and doesn't need to be reread even if it gets used again.
Not taking the size of the cache line into account - some people didn't divide the final row of the sub-matrix by 4 to take into account the sharing taking place there.
Not handling invalidates - some people computed the number of read messages sent, but didn't recognize that invalidations must also be sent when S' is written to.

Department of Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to carlson]