MATR.NO.

SURNAME

FIRST NAME

(POINTS 25/40) Consider a four-processor bus-based multiprocessor using the MSI protocol. Each processor executes a TAS instruction to lock and gain access to an empty critical section. The initial condition is such that processor 1 has the lock and processor 2, 3, and 4 are spinning on their caches waiting for the lock to be released. Every processor gets the lock once and exits the program. These are the implementations of the lock and unlock:

| LOCK:   | IW R1, mylock           | # R1 = &mylock                                            |
|---------|-------------------------|-----------------------------------------------------------|
|         | bne R1, R0, Lock        | <pre># if (R1 != 0) jump to Lock</pre>                    |
|         | TAS R1, mylock          | <pre># atomically_do {R1 = &amp;mylock mylock = 1;}</pre> |
|         | bne R1, R0, Lock<br>ret | <pre># if (R1 != 0) jump to Lock</pre>                    |
| Unlock: | sw 0, mylock<br>ret     | # write 0 into &mylock                                    |

Note1: the semantic of the TAS (Test And Set) instruction is the following: atomically reads the specified memory location (mylock) and writes a one into that memory location (mylock). Note2: this implementation of the Lock tries to minimize the probability to have the bus locked by the TAS (this implementation is also known as Test-and-Set). Note3: the lock is closed when mylock==1 and it is open when mylock==0.

By using the following tables, show the operations and bus transactions (or comments): A) in the best case (least number of transactions) and B) in the worst case (highest number of transactions)

| A) Best    | case:        |    |    |    |    |                                       |
|------------|--------------|----|----|----|----|---------------------------------------|
| Bus Trans. | Processor    | P1 | P2 | P3 | P4 | Bus Transactions/Comments             |
| Number     | Operation    |    |    |    |    |                                       |
|            | (Init.state) | S  | S  | S  | S  | Initially, P1 holds the lock          |
| 1          | sw1          | М  | Ι  | Ι  | Ι  | <b>BusUpgr</b> – P1 releases the lock |
|            |              |    |    |    |    |                                       |
|            |              |    |    |    |    |                                       |
|            |              |    |    |    |    |                                       |
|            |              |    |    |    |    |                                       |
|            |              |    |    |    |    |                                       |
|            |              |    |    |    |    |                                       |
|            |              |    |    |    |    |                                       |
|            |              |    |    |    |    |                                       |
|            |              |    |    |    |    |                                       |

B) Worst case:

| /          | st case:     |    | 7.0 | <b>D</b> 2 | 54 |                                |
|------------|--------------|----|-----|------------|----|--------------------------------|
| Bus Trans. | Processor    | P1 | P2  | P3         | P4 | Bus Transactions/Comments      |
| Number     | Operation    |    |     |            |    |                                |
|            | (Init.state) | S  | S   | S          | S  | Initially, P1 holds the lock   |
| 1          | sw1          | М  | Ι   | Ι          | Ι  | BusUpgr – P1 releases the lock |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            | -  |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |
|            |              |    |     |            |    |                                |

2) (POINTS 15/40) Write a CILK function that reads a color array (int color[1024]) and writes an array "int histogram[256]" that contains the frequency of each of 256 possible colors (the 256 values are the value that each element of color[] can assume). A serial or serialized version has to be avoided. The program should be written in a way that it exploits Thread Level Parallelism as offered by CILK. Reference scalar version:

| typedef unsigned int uint;                                                                          |
|-----------------------------------------------------------------------------------------------------|
| typedef unsigned char uchar;                                                                        |
| <pre>#define HISTOGRAM_BIN_COUNT 256</pre>                                                          |
| uchar Color[1024];                                                                                  |
| uint Histogram[HISTOGRAM_BIN_COUNT];                                                                |
| <pre>void histo_scalar(uint *histogram, uchar *color, uint size) {</pre>                            |
| <pre>for(uint i=0; i<size; )="" +="1;&lt;/pre" ]="" color[i]="" histogram[="" i++=""></size;></pre> |
|                                                                                                     |

## HIGH PERFORMANCE COMPUTER ARCHITECTURE final exam 18-12-2019 (solution trace)

1) Remembering the state diagram for the MSI protocol:



1A) The best case happens if the interleaving of the operations is such that each processor attempts and get access to the critical section one after the other.

| Bus Trans. | Processor    | P1 | P2 | P3 | P4 | Bus Transactions/Comments               |
|------------|--------------|----|----|----|----|-----------------------------------------|
| Number     | Operation    |    |    |    |    |                                         |
|            | (Init.state) | S  | S  | S  | S  | Initially, P1 holds the lock            |
| 1          | sw1          | М  | Ι  | Ι  | Ι  | BusUpgr – P1 releases the lock          |
| 2          | lw2          | S  | S  | Ι  | Ι  | BusRd/Flush –P2 reads the lock          |
| 3          | TAS2         | Ι  | М  | Ι  | Ι  | BusUpgr – P2 tries to lock and succeeds |
|            | sw2          | Ι  | М  | Ι  | Ι  | P2 releases the lock                    |
| 4          | 1w3          | Ι  | S  | S  | Ι  | BusRd/Flush –P3 reads the lock          |
| 5          | TAS3         | Ι  | Ι  | М  | Ι  | BusUpgr – P3 tries to lock and succeeds |
|            | sw3          | Ι  | Ι  | М  | Ι  | P3 releases the lock                    |
| 6          | 1w4          | Ι  | Ι  | S  | S  | BusRd/Flush –P4 reads the lock          |
| 7          | TAS4         | Ι  | Ι  | Ι  | М  | BusUpgr – P4 tries to lock and succeeds |
|            | sw4          | Ι  | Ι  | Ι  | М  | P4 releases the lock                    |

1B) The worst case happens if the interleaving of the operations is such that each processor attempts simultaneously the "lw" to read the status of mylock and then simultaneously try to get the access through the TAS instruction.

| Bus Trans. | Processor    | P1 | P2 | P3 | P4 | Bus Transactions/Comments              |
|------------|--------------|----|----|----|----|----------------------------------------|
| Number     | Operation    |    |    |    |    |                                        |
|            | (Init.state) | S  | S  | S  | S  | Initially, P1 holds the lock           |
| 1          | sw1          | М  | Ι  | Ι  | Ι  | BusUpgr P1 releases the lock           |
| 2          | lw2          | S  | S  | Ι  | Ι  | BusRd/Flush – P2 reads the lock        |
| 3          | lw3          | S  | S  | S  | Ι  | <b>BusRd</b> – P3 reads the lock       |
| 4          | lw4          | S  | S  | S  | S  | <b>BusRd</b> – P4 reads the lock       |
| 5          | TAS2         | Ι  | М  | Ι  | Ι  | <b>BusUpgr</b> – P2 gets the lock      |
| 6          | TAS3         | Ι  | Ι  | М  | Ι  | BusRdX/Flush no lock                   |
| 7          | TAS4         | Ι  | Ι  | Ι  | М  | BusRdX/Flush no lock                   |
| 8          | st2          | Ι  | М  | Ι  | Ι  | BusRdX/Flush P2 releases the lock      |
| 9          | lw3          | Ι  | S  | S  | Ι  | BusRd/Flush – P3 reads the lock        |
| 10         | 1w4          | Ι  | S  | S  | S  | <b>BusRd</b> – P4 reads the lock       |
| 11         | TAS3         | Ι  | Ι  | М  | Ι  | BusUpgr – P3 gets the lock             |
| 12         | TAS4         | Ι  | Ι  | Ι  | М  | BusRdX/Flush no lock                   |
| 13         | sw3          | Ι  | Ι  | М  | Ι  | BusRdX/Flush P3 releases the lock      |
| 14         | lw4          | Ι  | Ι  | S  | S  | <b>BusRd/Flush</b> – P4 reads the lock |
| 15         | TAS4         | Ι  | Ι  | Ι  | М  | BusUpgr – P4 gets the lock             |
|            | sw4          | Ι  | Ι  | Ι  | М  | P4 releases the lock                   |
|            |              |    |    |    |    |                                        |

\* Depending on the implementation a BusRdX could be directly associated to the (atomic) TAS instruction.

2) This is the CILK code for a possible implementation of the requested kernel:

```
typedef unsigned int uint;
                                                 void histo_cilk2(uint *histogram, uchar *color, uint size)
typedef unsigned char uchar;
                                                 {
#define HISTOGRAM_BIN_COUNT 256
                                                     if (size == 1) {
uchar Color[1024];
                                                         Cilk_lock(&lock[*color]);
uint Histogram[HISTOGRAM_BIN_COUNT];
                                                         histogram[*color]++;
                                                         Cilk_unlock(&lock[*color]);
#define Cilk_lockvar pthread_mutex_t
                                                     ł
#define Cilk_lock pthread_mutex_lock
                                                     else {
                                                         cilk_spawn histo_cilk2(histogram, color, size/2);
cilk_spawn histo_cilk2(histogram, color + size/2, size - size/2);
#define Cilk_unlock pthread_mutex_unlock
Cilk_lockvar lock[HISTOGRAM_BIN_COUNT];
                                                          cilk_sync;
                                                     }
                                                }
```