| HIGH PERFORMANCE                                                                                                                                                                                                                                | COMPUTER ARCHIT                                                                                                                                                                                                                                                          | ECTURE final                                                                                                                      | exam 23-12-2020                                                                                                                                                                                                         | MATR.NO                                                                                                                                         |                                                                |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------|
|                                                                                                                                                                                                                                                 |                                                                                                                                                                                                                                                                          |                                                                                                                                   |                                                                                                                                                                                                                         | SURNAME                                                                                                                                         |                                                                |
|                                                                                                                                                                                                                                                 |                                                                                                                                                                                                                                                                          |                                                                                                                                   |                                                                                                                                                                                                                         | FIRST NAME                                                                                                                                      |                                                                |
| <ul> <li>Processor Operations: Red<br/>Bus Transactions: Read a<br/>memory), Write back a bl</li> <li>There are 4 states: 1) the<br/>just loaded (<b>D</b> – Dirty); 3<br/>has updated it after loadin<br/>processors issued exactly</li> </ul> | ead ( <b>PrRd</b> ) and Write ( <b>Pr</b><br>data-block from Memory<br>ock onto the bus ( <b>Flush</b> –<br>copy is not valid ( <b>I</b> – Inva<br>) the copy has been just lo<br>g the copy); 4) the copy is<br>one update to that block).<br>which is activated by rem | Wr).<br>(BusRd), Propagate<br>the data block goes t<br>lid); 2) the copy is in<br>aded in the current ca<br>s shared but some oth | nce protocol with the followin<br>a PrWr on the bus ( <b>BusUpd</b><br>to memory).<br>one cache only and we assur<br>ache and other copies exist in<br>her processor had issued a wr<br>onse to a bus transaction to in | – a single data-word goes to<br>me that it has been modified<br>a other caches ( <b>S0</b> – shared a<br>ite into that block ( <b>S1</b> – shar | even if it has been<br>ind no other processor<br>red and other |
| critical section. The initial of                                                                                                                                                                                                                | condition is such that proc<br>ets the lock once and exits<br>lw R1, mylock #<br>bne R1, R0, Lock #                                                                                                                                                                      | essor 1 has the lock a<br>s the program. These<br>R1 = &mylock<br># if (R1 != 0) jumn<br># atomically_do {R2                      | <pre>1 = &amp;mylock mylock = 1;}</pre>                                                                                                                                                                                 | spinning on their caches wai<br>ne lock and unlock:                                                                                             |                                                                |

Note1: the semantic of the TAS (Test And Set) instruction is the following: atomically reads the specified memory location (mylock) and writes a one into that memory location (mylock). Note2: this implementation of the Lock tries to minimize the probability to have the bus locked by the TAS (this implementation is also known as Test-and-Test-and-Set). Note3: the lock is closed when

Bus Transactions/Comments

Initially, P1 holds the lock

BusUpd - P1 releases the lock

Bus Transactions/Comments

Initially, P1 holds the lock

BusUpd - P1 releases the lock

By using the following tables, show the operations and bus transactions (or comments) in the best case (least number of transactions) and in the worst

# write 0 into &mylock

mylock==1 and it is open when mylock==0. Also assume that NO write is propagated on the bus if the TAS finds a closed lock (i.e., if the TAS fails).

P4

**S**0

**S**1

P4

S0

S1

P2

**S**0

**S**1

P2

S0

S1

P3

S0

**S**1

P3

S0

**S**1

sw 0, mylock

ret

ret

P1

**S**0

**S**0

P1

S0

S0

Unlock:

Processor

Operation (Init.state)

Processor

Operation

(Init.state)

sw1

3

sw1

case (highest number of transactions)

Case A: Bus Trans.

Number

Case B: Bus Trans.

Number

---1

2)

| ) (POINTS 15/40) Write an MPI function that reads a color array (int color[1024]) and writes an array "int histogram[256]" that contains the              |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| frequency of each of 256 possible colors (the 256 values are the value that each element of color[] can assume). A serial or serialized version has to be |
| avoided. The program should be written in a way that it exploits parallelism as offered by MPI. Reference scalar version:                                 |

unsigned char Color[1024]; int Histogram[256]; void histo\_scalar(int \*histogram, unsigned char \*color, int size, int wid, int nw) { for(int i=0; i<size; i++ ) histogram[ color[i] ] += 1;</pre>

Hints: use send and receive to distribute the work among the available workers/nodes; use MPI primitives: MPI Send, MPI Recv, MPI Iprobe.

## HIGH PERFORMANCE COMPUTER ARCHITECTURE final exam 23-12-2020 (solution trace) REV.231123

The state diagram for this protocol: 1)



The BEST case happens if the interleaving of the operations is such that each processor attempts and get access to the critical section one after the other.

| Bus Trans. | Processor    | P1 | P2 | P3 | P4 | Bus Transactions/Comments                                                |
|------------|--------------|----|----|----|----|--------------------------------------------------------------------------|
| Number     | Operation    |    |    |    |    |                                                                          |
|            | (Init.state) | S0 | S0 | S0 | S0 | Initially, P1 holds the lock                                             |
| 1          | sw1          | S0 | S1 | S1 | S1 | BusUpd1 – P1 releases the lock                                           |
|            | lw2          | S0 | S0 | S1 | S1 | P2 reads the lock (and it finds it open, i.e. ==0)                       |
| 2          | TAS2         | S1 | S0 | Ι  | Ι  | BusUpd2 – P2 tries to lock and succeeds                                  |
| 3          | sw2          | Ι  | D  | Ι  | Ι  | BusUpd2 – P2 releases the lock                                           |
| 4          | lw3          | Ι  | S0 | S0 | Ι  | <b>BusRd3+Flush2</b> –P3 reads the lock (and it finds it open, i.e. ==0) |
| 5          | TAS3         | Ι  | S1 | S0 | Ι  | BusUpd3 – P3 tries to lock and succeeds                                  |
| 6          | sw3          | Ι  | Ι  | D  | Ι  | BusUpd3 – P3 releases the lock                                           |
| 7          | lw4          | Ι  | Ι  | S0 | S0 | BusRd4+Flush3 –P4 reads the lock (and it finds it open, i.e. ==0)        |
| 8          | TAS4         | Ι  | Ι  | S1 | S0 | BusUpd4 – P4 tries to lock and succeeds                                  |
| 9          | sw4          | Ι  | Ι  | Ι  | D  | <b>BusUpd4</b> – P4 releases the lock                                    |

The WORST case happens if the interleaving of the operations is such that each processor attempts simultaneously the "lw" to read the status of mylock and then simultaneously try to get the access through the TAS instruction. In this cache each processor has cached the lock and if the TAS fails no bus transaction is issued

| Bus Trans. | Processor    | P1 | P2              | P3              | P4 | Bus Transactions/Comments                                                          |
|------------|--------------|----|-----------------|-----------------|----|------------------------------------------------------------------------------------|
| Number     | Operation    |    |                 |                 |    |                                                                                    |
|            | (Init.state) | S0 | S0              | S0              | S0 | Initially, P1 holds the lock                                                       |
| 1          | sw1          | S0 | S1              | S1              | S1 | BusUpd1 P1 releases the lock                                                       |
|            | lw2          | S0 | S0              | S1              | S1 | P2 reads the lock $\rightarrow$ reads $0 \rightarrow$ the first bne doesn't branch |
|            | lw3          | S0 | S0              | S0              | S1 | P3 reads the lock $\rightarrow$ reads $0 \rightarrow$ the first bne doesn't branch |
|            | lw4          | S0 | S0              | S0              | S0 | P4 reads the lock $\rightarrow$ reads $0 \rightarrow$ the first bne doesn't branch |
| 2          | TAS2         | S1 | S0              | S1              | S1 | BusUpd2 – P2 gets the lock and updates the others                                  |
| 3          | TAS3         | Ι  | <mark>S1</mark> | S0              | I  | TAS fails: no lock <mark>→ BusUpd3</mark>                                          |
| 4,5        | TAS4         | Ι  | I               | <mark>S1</mark> | S0 | BusRd4 - TAS fails: no lock → BusUpd4                                              |
| 6,7        | sw2          | Ι  | S0              | I               | S1 | BusRd2+BusUpd2 P2 releases the lock                                                |
| 9          | lw3          | Ι  | S0              | S0              | S1 | BusRd3 - P3 reads the lock                                                         |
|            | lw4          | Ι  | S0              | S0              | S0 | P4 reads the lock                                                                  |
| 10         | TAS3         | Ι  | S1              | S0              | S1 | BusUpd3 – P3 gets the lock and updates the others                                  |
|            | TAS4         | Ι  | I               | S1              | S0 | TAS fails: no lock                                                                 |
| 11         | sw3          | Ι  | Ι               | S0              | S1 | BusUpd3 P3 releases the lock                                                       |
|            | lw4          | Ι  | Ι               | S0              | S0 | P4 reads the lock                                                                  |
| 12         | TAS4         | Ι  | Ι               | S1              | S0 | BusUpd4 – P4 gets the lock and updates the others                                  |
| 13         | sw4          | Ι  | Ι               | Ι               | D  | BusUpd4 – P4 releases the lock                                                     |

2) This is the MPI code for a possible implementation of the requested function:

## #define MASTER 0

| efine MASTER 0                                                                                                                                                                               | if (wid == MASTER) { // Process the partial results                               |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| efine TAG_GENERAL 1                                                                                                                                                                          | <pre>int w, done = 0;</pre>                                                       |
| id histo_mpi(int *histogram, char *color, int size, int wid, int nw) {                                                                                                                       |                                                                                   |
| int i, histogram_private[256];                                                                                                                                                               | // Accumulate the result                                                          |
| <pre>for(i=0; i&lt;256; i++) histogram_private[i] = 0;</pre>                                                                                                                                 | <pre>for (int i= 0; i&lt;256; ++i) histogram[i] += histogram_private[i];</pre>    |
| MPI_Status Stat;                                                                                                                                                                             |                                                                                   |
| int ssize = (size - 1) $/nw + 1;$                                                                                                                                                            | do {// Get partial histograms from slaves                                         |
| <pre>char *slice = malloc(ssize * sizeof(char));</pre>                                                                                                                                       | <pre>for (w=1; w<nw; ++w)="" check<="" pre="" robin="" round="" {=""></nw;></pre> |
|                                                                                                                                                                                              | <pre>int dataWaitingFlag;</pre>                                                   |
| if (wid == MASTER) { // Distribute the work                                                                                                                                                  | MPI_Iprobe(w, TAG_GENERAL, MPI_COMM_WORLD,                                        |
| for (int i= 0; i <ssize; ++i)="" 1st="" assign="" master<="" slice="" slice[i]="color[i];" td="" to=""><td><pre>&amp;dataWaitingFlag, MPI_STATUS_IGNORE);</pre></td></ssize;>                | <pre>&amp;dataWaitingFlag, MPI_STATUS_IGNORE);</pre>                              |
| for (int w=1; w <nw; ++w)="" other="" send="" slaves<="" slices="" td="" the="" to=""><td>if (dataWaitingFlag) { // Get the message</td></nw;>                                               | if (dataWaitingFlag) { // Get the message                                         |
| MPI_Send(color + w*ssize, ssize, MPI_CHAR, w, TAG_GENERAL, MPI_COMM_WORLD);                                                                                                                  | MPI_Recv(histogram_private, 256, MPI_INT, w,                                      |
| } else { // slave                                                                                                                                                                            | TAG GENERAL, MPI_COMM_WORLD, &Stat);                                              |
| int dataWaitingFlag; // Wait until a message is there to be received                                                                                                                         | ++done;                                                                           |
| do MPI Iprobe(MASTER, TAG GENERAL, MPI COMM WORLD,                                                                                                                                           |                                                                                   |
| <pre>&amp;dataWaitingFlag, MPI STATUS IGNORE);</pre>                                                                                                                                         | // Accumulate the result                                                          |
| while (!dataWaitingFlag);                                                                                                                                                                    | <pre>for (int i= 0; i&lt;256; ++i) histogram[i] += histogram private[i];</pre>    |
| MPI Recv(slice, ssize, MPI CHAR, MASTER, TAG GENERAL, MPI COMM WORLD, &Stat);                                                                                                                | }                                                                                 |
| }                                                                                                                                                                                            | }                                                                                 |
|                                                                                                                                                                                              | $\}$ while (done < nw - 1);                                                       |
| // Processing data                                                                                                                                                                           | } else // slave: send back the partial result                                     |
| for (int i= 0; i <ssize; ++i)="" free(slice);<="" histogram="" private[slice[i]]++;="" td=""><td>MPI Send(histogram private, 256, MPI INT, MASTER, TAG GENERAL, MPI COMM WORLD</td></ssize;> | MPI Send(histogram private, 256, MPI INT, MASTER, TAG GENERAL, MPI COMM WORLD     |
|                                                                                                                                                                                              | ,                                                                                 |

d histo\_mpi(int \*histogram, char \*color, int size, int wid, int int rc, i, histogram\_private[HISTOGRAM\_BIN\_COUNT]; for(i=0; i<HISTOGRAM\_BIN\_COUNT; i++) histogram\_private[i] = 0; int ssize = (size - 1) /mw + 1; char \*slice = malloc(ssize \* sizeof(char)); wid, int nw)