|CSE Home||Other Quarters||ZPL||About Us||Search||Contact Info|
Assignment 2 Solution
AssumptionsPeople made a variety of assumptions about the program, we tried to allow for these assumptions as best as possible. Here are some of the assumptions people made:
P0 P1 P2 P3I'll be using -1 bounded matrices. I.e. the full logical matrix runs from -1 to 97 with rows (and columns) -1 and 97 being the zero buffers. Since no communication ever takes place for the zero buffers after the first two iterations, I'll essentially ignore these values. To be explicit, P0 is home for the data values from (-1, -1) to (47, 47), P1 has (-1, 48) to (47, 97), P2 has (48,-1) to (97,47) and P3 has (48,48) to (97,97).
First off, most people recognized that the computation was symmetric and computing messages for one processor and then multiplying by four was sufficient. I will analyze P0 (the upper left hand processor) and then multiply by four.
Before describing the actual messages, let's look at what parts of the computation access data from another processor.
The computation of all data values between (0, 0) and (46, 46) use only local values, and so involve no communication at all. The only commication is done in column 47 and row 47. For cells (0..46, 47), (i.e. in column 47, for each row from 0 to 46), three values are used from P1. For cells (47, 0..46), three values are used from P2. Finally for cell (47, 47), two values are used from P1, two from P2 and one from P3.
Some people left the analysis at that, but this can be decieving. Although three external values are used to compute each cell in the final column (for example), these values are read in and cached locally, so each one need only be communicated once. So, for example, both cells (0, 48) and (1,48) are read by P0 when computing cell (0,47). Now the data in (1,48) is also used to compute cell (1,47), but by then it's in P0's cache and there's no need to do another remote read. It won't be written to by P1 in this iteration (that's what S' is for), so the cached value will remain valid. So we're really concerned with the number of items that must be communicated, not the number of times they are used (or the order they are used in.) We can thank the infinite cache for this simplification. The total number of external values that are used is 49 from P1 (-1..47, 48), plus 49 from P2 (48, -1..47) plus 1 from P3 (48, 48).
Next we need to figure out how many cache-lines these data values are stored in. The column (-1..47, 48) is easy, there is one cache line involved per row. Similarly, the one value on P3 at (48, 48) will be in another cache line. However for the row (48, -1..47), the data are sequential, and so every four values are packed into a single cache line. (Except for (-1, 48) which is stored elsewhere.) Thus the row requires only 12 cache lines total.
Having determined the number of external values used and the number of cache lines required to hold them, we can tackle the issue of messages sent. In order to do this, we need to take a look at the state of the computation (in particular cache status) after the previous iteration has completed. In the n-1st iteration, we were using S' to read from and writing to S. All the "internal" cells, those that weren't involved in communication, are marked as modified in both S and S', but since they are never shared, we can continue modifying them with no impact on communication cost. The interesting cases are those cache lines containing data that was used in communication. The values in S were written in the previous iteration, so any off-processor copies were invalidated. This means that all off-processor values from S that are used in the current iteration will require reads. Similarly, in the previous iteration, S' was being read from. That means that all the off-processor accesses to these values were recorded as shared reads in the value's home processor. Thus in S, every cache line described above requires one read and in S' every cache line requires one invalidate per sharer. This is all with the exception of the zero buffer rows and columns. These generate no messages at all.
So we can now compute the total number of messages required. There are 48 reads sent from P0 to P1 to read the values in the shared column of data, 12 reads sent from P0 to P2 to read the values in the shared row and one read sent from P0 to P3 to read the single value on the diagonal. Similarly, there are 48 + 12 + 1 invalidates sent from P0 to other processors to invalidate the shared values in S'. So if r is the number of messages required for a read (in our case, 4), and i is the number required for an invalidate (in our case, 2), then the number of messages for a single processor is
(48 + 12 + 1) * r + (48 + 12 + 1) * i
Multiply this value by 4 for the four processors and you get 366 * 4 = 1464 messages for the entire computation.
Common mistakesThe most common mistakes people made were:
Department of Computer Science & Engineering|
University of Washington
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to carlson]