| HIGH  | PERFORMANCE | COMPUTER | ARCHITECTURE | final | exam | 19-12-2017 | MATR.NO. |  |
|-------|-------------|----------|--------------|-------|------|------------|----------|--|
| (REV) | STON 1 1)   |          |              |       |      |            | SURNAME  |  |

FIRST NAME

(POINTS 25/40) Consider a four-processor bus-based multiprocessor using the MESI protocol. Each processor executes 1) a TAS instruction to lock and gain access to an empty critical section. The initial condition is such that processor 1 has the lock and processor 2, 3, and 4 are spinning on their caches waiting for the lock to be released. Every processor gets the lock once and exits the program. These are the implementations of the lock and unlock:

| Lock:   | lw R1, mylock<br>bne R1, R0, Lock<br>TAS R1, mylock<br>bne R1, R0, Lock<br>ret | <pre># R1 = &amp;mylock # if (R1 != 0) jump to Lock # atomically_do {R1 = &amp;mylock mylock = 1;} # if (R1 != 0) jump to Lock</pre> |
|---------|--------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|
| Unlock: | sw 0, mylock                                                                   | # write 0 into &mylock                                                                                                               |

Note1: the semantic of the TAS (Test And Set) instruction is the following: atomically reads the specified memory location (mylock) and writes a one into that memory location (mylock). Note2: this implementation of the Lock tries to minimize the probability to have the bus locked by the TAS (this implementation is also known as Test-and-Test-and-Set). Note3: the lock is closed when mylock==1 and it is open when mylock==0.

By using the following tables, show the operations and bus transactions (or comments): A) in the best case (least number of transactions) and B) in the worst case (highest number of transactions)

| A) Best    | A) Best case: |    |    |    |    |                                       |  |  |  |  |
|------------|---------------|----|----|----|----|---------------------------------------|--|--|--|--|
| Bus Trans. | Processor     | P1 | P2 | P3 | P4 | Bus Transactions/Comments             |  |  |  |  |
| Number     | Operation     |    |    |    |    |                                       |  |  |  |  |
|            | (Init.state)  | S  | S  | S  | S  | Initially, P1 holds the lock          |  |  |  |  |
| 1          | sw1           | М  | Ι  | Ι  | Ι  | <b>BusUpgr</b> – P1 releases the lock |  |  |  |  |
|            |               |    |    |    |    |                                       |  |  |  |  |
|            |               |    |    |    |    |                                       |  |  |  |  |
|            |               |    |    |    |    |                                       |  |  |  |  |
|            |               |    |    |    |    |                                       |  |  |  |  |
|            |               |    |    |    |    |                                       |  |  |  |  |
|            |               |    |    |    |    |                                       |  |  |  |  |
|            |               |    |    |    |    |                                       |  |  |  |  |
|            |               |    |    |    |    |                                       |  |  |  |  |
|            |               |    |    |    |    |                                       |  |  |  |  |

B) Worst case:

ret

| Bus Trans.<br>Number | Processor<br>Operation | P1 | P2 | P3 | P4 | Bus Transactions/Comments      |
|----------------------|------------------------|----|----|----|----|--------------------------------|
|                      | (Init.state)           | S  | S  | S  | S  | Initially, P1 holds the lock   |
| 1                    | sw1                    | М  | Ι  | Ι  | Ι  | BusUpgr – P1 releases the lock |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |
|                      |                        |    |    |    |    |                                |

(POINTS 15/40) Write a CUDA program that reads a color array (int color[1024]) and writes an array "int histogram[256]" 2) that contains the frequency of each of 256 possible colors (the 256 values are the values that each element of color[] can assume). The program should be written in a way that it exploits Thread Level Parallelism as offered by CUDA (a serial or serialized version has to be avoided). Hint: try to perform operations in a hierarchical way and use CUDA shared memory.

HIGH PERFORMANCE COMPUTER ARCHITECTURE final exam 19-12-2017 (solution trace)

1) Remembering the state diagram for the MESI protocol:



1A) The best case happens if the interleaving of the operations is such that each processor attempts and get access to the critical section one after the other.

| Bus Trans. | Processor    | P1 | P2 | P3 | P4 | Bus Transactions/Comments                      |
|------------|--------------|----|----|----|----|------------------------------------------------|
| Number     | Operation    |    |    |    |    |                                                |
|            | (Init.state) | S  | S  | S  | S  | Initially, P1 holds the lock                   |
| 1          | sw1          | М  | Ι  | Ι  | Ι  | <b>BusUpgr</b> – P1 releases the lock          |
| 2          | lw2          | S  | S  | Ι  | Ι  | BusRd(Ls)/Flush –P2 reads the lock             |
| 3          | TAS2         | Ι  | М  | Ι  | Ι  | <b>BusUpgr</b> – P2 tries to lock and succeeds |
|            | sw2          | Ι  | М  | Ι  | Ι  | P2 releases the lock                           |
| 4          | lw3          | Ι  | S  | S  | Ι  | BusRd(Ls)/Flush –P3 reads the lock             |
| 5          | TAS3         | Ι  | Ι  | Μ  | Ι  | <b>BusUpgr</b> – P3 tries to lock and succeeds |
|            | sw3          | Ι  | Ι  | Μ  | Ι  | P3 releases the lock                           |
| 6          | lw4          | Ι  | Ι  | S  | S  | BusRd(Ls)/Flush –P4 reads the lock             |
| 7          | TAS4         | Ι  | Ι  | Ι  | М  | <b>BusUpgr</b> – P4 tries to lock and succeeds |
|            | sw4          | Ι  | Ι  | Ι  | М  | P4 releases the lock                           |

1B) The worst case happens if the interleaving of the operations is such that each processor attempts simultaneously the "lw" to read the status of mylock and then simultaneously try to get the access through the TAS instruction.

| Bus Trans. | Processor    | P1 | P2 | P3 | P4 | Bus Transactions/Comments                                    |
|------------|--------------|----|----|----|----|--------------------------------------------------------------|
| Number     | Operation    |    |    |    |    |                                                              |
|            | (Init.state) | S  | S  | S  | S  | Initially, P1 holds the lock                                 |
| 1          | sw1          | М  | Ι  | Ι  | Ι  | BusUpgr P1 releases the lock                                 |
| 2          | lw2          | S  | S  | Ι  | Ι  | BusRd(Ls)/Flush – P2 reads the lock                          |
| 3          | lw3          | S  | S  | S  | Ι  | BusRd(Ls)/Flush – P3 reads the lock                          |
| 4          | lw4          | S  | S  | S  | S  | BusRd(Ls)/Flush – P4 reads the lock                          |
| 5          | TAS2         | Ι  | Μ  | Ι  | Ι  | <b>BusUpgr</b> – P2 gets the lock                            |
| 6          | TAS3         | Ι  | Ι  | М  | Ι  | BusRd(Ls')(I to E)/Flush/(E to M) [or BusRdX/Flush*] no lock |
| 7          | TAS4         | Ι  | Ι  | Ι  | Μ  | BusRd(Ls')(I to E)/Flush/(E to M) [or BusRdX/Flush*] no lock |
| 8          | st2          | Ι  | Μ  | Ι  | Ι  | <b>BusRdX</b> P2 releases the lock                           |
| 9          | lw3          | Ι  | S  | S  | Ι  | BusRd(Ls)/Flush – P3 reads the lock                          |
| 10         | lw4          | Ι  | S  | S  | S  | BusRd(Ls)/Flush – P4 reads the lock                          |
| 11         | TAS3         | Ι  | Ι  | М  | Ι  | <b>BusUpgr</b> – P3 gets the lock                            |
| 12         | TAS4         | Ι  | Ι  | Ι  | Μ  | BusRd(Ls')(I to E)/Flush/(E to M) [or BusRdX/Flush*] no lock |
| 13         | sw3          | Ι  | Ι  | М  | Ι  | BusRdX P3 releases the lock                                  |
| 14         | lw4          | Ι  | Ι  | S  | S  | BusRd(Ls)/Flush – P4 reads the lock                          |
| 15         | TAS4         | Ι  | Ι  | Ι  | М  | <b>BusUpgr</b> – P4 gets the lock                            |
|            | sw4          | Ι  | Ι  | Ι  | М  | P4 releases the lock                                         |
|            |              |    |    |    |    |                                                              |
|            |              |    |    |    |    |                                                              |
|            |              |    |    |    |    |                                                              |

\* Depending on the implementation a BusRdX could be directly associated to the (atomic) TAS instruction.

2) This is the CUDA code for a possible implementation of the requested kernel (tested on Tesla C1060 with Compute Capability 1.3 and CUDA 4.1):

```
#include <stdio.h>
#include <cuda runtime.h</pre>
#include <cutil_inline.h>
#include <device_functions.h>
#include <sm_11_atomic_functions.h>
typedef unsigned char uchar;
typedef unsigned int uint;
#define HISTOGRAM BIN COUNT 256
#define N 1024
__global___void histogram3(uint* histogram, uchar* color, int size)
     shared uint data[HISTOGRAM BIN COUNT];
    // I ni ti a l i z a t i o n
    int stride = blockDim.x;
    for (int i = threadIdx.x; i < HISTOGRAM_BIN_COUNT; i += stride)</pre>
       data[i] = 0;
    ____syncthreads();
    // Calculate private histogram
stride = blockDim.x * gridDim.x;
    for (uint i = threadIdx.x + blockDim.x * blockIdx.x;
        i < size; i += stride)</pre>
        atomicAdd( &data[color[i]], 1);
     syncthreads();
    //Update global histogram
    stride = blockDim.x;
    for (uint i = threadIdx.x; i < HISTOGRAM BIN COUNT; i += stride)</pre>
        atomicAdd( &(histogram[i]), data[i] );
ł
int main() {
    uchar* hColor = (uchar*)malloc(N * sizeof(uchar));
    uint* hHistogram3 = (uint*)malloc(HISTOGRAM BIN COUNT * sizeof(uint));
    dim3 block, grid;
    uchar* dColor;
    uint* dHistogram;
    cudaMalloc(&dHistogram, HISTOGRAM_BIN_COUNT * sizeof(uint));
    cudaMalloc(&dColor, N * sizeof(uchar));
    srand(2017);
    for (uint i = 0; i < N; ++i) hColor[i] = (uchar)(rand() % 256);</pre>
    cudaMemcpy(dColor, hColor, N * sizeof(uchar), cudaMemcpyHostToDevice);
    cudaMemset(dHistogram, 0, HISTOGRAM BIN COUNT * sizeof(uint));
    block.x = 512;
    grid.x = (N + block.x - 1) / block.x;
    histogram3<<<<grid,block>>>(dHistogram, dColor, N);
    cudaMemcpy(hHistogram3, dHistogram,
       HISTOGRAM_BIN_COUNT * sizeof(uint), cudaMemcpyDeviceToHost);
    for (int i = 0; i < HISTOGRAM_BIN_COUNT; ++i)</pre>
        printf("%d ", hHistogram3[i]); printf("\n");
}
```

## Makefile: