NVIDIA Compute Sanitizer Part 2

CisMine Ng
3 min readMar 5, 2024

--

In this article, I will continue discussing how to use the NVIDIA Compute Sanitizer. Please read these articles: NVIDIA Compute Sanitize Part 1, Data Hazard before reading this one.

Following up on Part 1, Part 2 will cover the remaining two tools:

  • Racecheck, a shared memory data access hazard detection tool
  • Synccheck for thread synchronization hazard detection

Racecheck

As NVIDIA has mentioned about the NVIDIA Compute Sanitizer, Racecheck is used to check for hazards when using shared memory. So, if you test on global memory, it will not yield any results.

Code

__global__ void sumWithSharedMemory(int* input) {
__shared__ int sharedMemory[4];

int tid = threadIdx.x;
int i = blockIdx.x * blockDim.x + threadIdx.x;

sharedMemory[tid] = input[i];

for (int stride = 1; stride < blockDim.x; stride *= 2) {

// __syncthreads(); -----> barrier

if (tid % (2 * stride) == 0) {
sharedMemory[tid] += sharedMemory[tid + stride];
}
}

printf("blockIdx.x=%d --> %d\n", blockIdx.x, sharedMemory[tid]);

}

This code snippet is identical to the one in the Data Hazard article, and here is how it works.

The only difference is that instead of using global memory, here we use shared memory.

Shared memory is a topic I will discuss separately since it is a very important concept when talking about CUDA. So, in this article, you only need to understand that instead of using global memory to perform the addition 1+2+3+4 in parallel, we use shared memory.

And now we use the NVIDIA Compute Sanitizer to check for data hazards with a command line.

compute-sanitizer --tool racecheck --show-backtrace no  ./a.out

Here, you will find a surprising fact that the result is still correct even though there has been a data hazard. As I mentioned in the previous article about the phenomenon of “undefined behavior,” this is exactly it. We cannot determine whether it will cause an error or not. It may be that my machine produces the correct result, BUT yours might differ ==> thus, the phenomenon of “undefined behavior” can be quite troublesome.

At this point, if we use __syncthreads(), it will solve this problem.

Regarding Synccheck, from my tests and searches through NVIDIA’s blog, I noticed they didn’t mention anything about the code, so I cannot provide an illustration for you. I will skip this part, but I will add it later if I find any relevant information (if you find something, please comment below).

Exercise

The actual error in our case is just 4 data hazards (N = 4), but why does the image above show us having 2 data hazard errors (4 and 8)?

Hint: 1 data hazard = 1 read or 1 write.

Is 4 data hazards comprised of 4 reads or 4 writes?

And is 8 data hazards comprised of 8 reads, 8 writes, or 4 reads and 4 writes?

--

--

CisMine Ng
CisMine Ng

Written by CisMine Ng

My name is CisMine Ng. I am currently pursuing a B.Sc. degree. I am interested in the following topics: DL in Computer Vision, Parallel Programming With Cuda.

No responses yet