Data Hazard

CisMine Ng
Feb 13, 2024

When working with parallel code, we often run into data hazards, a kind of bug that is hard to fix because it is a logic error: the program compiles and runs but produces wrong results. Tools such as NVIDIA Compute Sanitizer now make these bugs somewhat easier to track down. In this article, I explain what a data hazard is and illustrate it with an example.

It would be better if you read the article on Synchronization — Asynchronization before reading this one.

A data hazard occurs when multiple threads read and write the same value and those accesses conflict with each other.

When discussing data hazards, we encounter two issues:

  • Data Race: Two or more threads access the same stored variable at the same time without synchronization, and at least one of the accesses is a write. It is often described in terms of “write after read” or “read after write” orderings. One thread can overwrite data that another thread is reading or preparing to write, so the value that ends up being read or stored is unpredictable.
  • Race Condition: This concept is broader and not limited to data access. A race condition occurs when the final result of a system depends on the timing or order of events that the program does not control, so the outcome can change from run to run.
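To make the data race definition concrete, here is a minimal sketch in which many threads increment one counter. The kernel names and the `run_counter` helper are hypothetical names chosen for this illustration. The first kernel has a data race; the second removes it with `atomicAdd`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Data race: every thread does an unsynchronized read-modify-write on *counter.
__global__ void count_racy(int *counter)
{
    *counter = *counter + 1;
}

// No data race: atomicAdd performs the read-modify-write as one indivisible step.
__global__ void count_atomic(int *counter)
{
    atomicAdd(counter, 1);
}

// Hypothetical helper: launches one of the kernels and returns the final count.
int run_counter(bool use_atomic, int blocks, int threads)
{
    int *d_counter = nullptr;
    int h_counter = 0;

    cudaMalloc((void **)&d_counter, sizeof(int));
    cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);

    if (use_atomic)
        count_atomic<<<blocks, threads>>>(d_counter);
    else
        count_racy<<<blocks, threads>>>(d_counter);

    // This blocking copy runs after the kernel on the default stream.
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}

int main()
{
    printf("racy   : %d\n", run_counter(false, 4, 256));
    printf("atomic : %d\n", run_counter(true, 4, 256));
    return 0;
}
```

With 4 blocks of 256 threads, the atomic version reports 1024. The racy version usually reports less, and the number can differ between runs. Note that even in the atomic version the order in which threads increment the counter is not fixed; that is harmless here only because addition gives the same total in any order.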

In summary, just remember: when coding in CUDA, be careful whenever multiple threads access the same value.

Illustration

#include <stdio.h>
#include <cuda_runtime.h>

#define ARRAY_SIZE 4

// Pairwise (tree) sum of d_array; the total should end up in d_array[0].
__global__ void sum(int *d_array)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    for (int stride = 1; stride < ARRAY_SIZE; stride *= 2)
    {
        // __syncthreads(); -----> barrier

        // Only threads 0 and 2 work in step 1, only thread 0 in step 2.
        if (threadIdx.x % (2 * stride) == 0)
        {
            d_array[id] += d_array[id + stride];
        }
    }
    printf("blockIdx.x=%d --> %d\n", blockIdx.x, d_array[id]);
}

int main()
{
    int h_array[ARRAY_SIZE] = {1, 2, 3, 4};
    int *d_array;

    cudaMalloc((void **)&d_array, sizeof(int) * ARRAY_SIZE);
    cudaMemcpy(d_array, h_array, sizeof(int) * ARRAY_SIZE, cudaMemcpyHostToDevice);

    sum<<<1, ARRAY_SIZE>>>(d_array);

    // Wait for the kernel to finish so its printf output is flushed.
    cudaDeviceSynchronize();

    cudaFree(d_array);

    return 0;
}

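The code works in two steps. In step 1 (stride = 1), thread 0 computes 1 + 2 = 3 and thread 2 computes 3 + 4 = 7. In step 2 (stride = 2), thread 0 adds the two partial sums, 3 + 7 = 10, so d_array[0] should hold the total.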

Here, we do not synchronize the threads, which leads to a data race: if thread 0 moves on to step 2 before thread 2 has finished computing 3 + 4 in step 1, it reads the old value 3 and computes 3 + 3 = 6 instead of 3 + 7 = 10.

To solve this problem, we just need to place a barrier with __syncthreads(), which makes the threads in a block wait for each other until the slowest one has reached the barrier.
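As a sketch of that fix, the kernel below enables the barrier and copies the result back to the host instead of printing from the device. The names `sum_synced` and `reduce_example` are hypothetical and used only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ARRAY_SIZE 4

__global__ void sum_synced(int *d_array)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    for (int stride = 1; stride < ARRAY_SIZE; stride *= 2)
    {
        // Every thread in the block reaches this barrier on every iteration,
        // so writes from the previous step are visible before the next read.
        __syncthreads();

        if (threadIdx.x % (2 * stride) == 0)
        {
            d_array[id] += d_array[id + stride];
        }
    }
}

// Hypothetical helper: reduces {1, 2, 3, 4} on the GPU and returns d_array[0].
int reduce_example()
{
    int h_array[ARRAY_SIZE] = {1, 2, 3, 4};
    int *d_array = nullptr;

    cudaMalloc((void **)&d_array, sizeof(int) * ARRAY_SIZE);
    cudaMemcpy(d_array, h_array, sizeof(int) * ARRAY_SIZE, cudaMemcpyHostToDevice);

    sum_synced<<<1, ARRAY_SIZE>>>(d_array);

    // This blocking copy runs after the kernel on the default stream.
    cudaMemcpy(h_array, d_array, sizeof(int) * ARRAY_SIZE, cudaMemcpyDeviceToHost);
    cudaFree(d_array);
    return h_array[0];
}

int main()
{
    printf("sum = %d\n", reduce_example()); // prints "sum = 10"
    return 0;
}
```

The barrier sits outside the `if` on purpose: `__syncthreads()` has to be reached by all threads of the block, so placing it inside a branch that only some threads take would be an error.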

With __syncthreads() added, the output shows the correct sum: d_array[0] is 10.
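The introduction mentioned NVIDIA Compute Sanitizer. Its racecheck tool looks for hazards on shared memory, so the sketch below is a shared-memory variant of the same reduction rather than the global-memory version used above. The kernel name, the helper, the `racy` command-line switch, and the file name in the comments are all assumptions made for this example.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Build (file name is an assumption): nvcc -o sum_shared sum_shared.cu
// Check the unsynchronized path:      compute-sanitizer --tool racecheck ./sum_shared racy

#define ARRAY_SIZE 4

__global__ void sum_shared(const int *d_in, int *d_out, bool use_barrier)
{
    __shared__ int s[ARRAY_SIZE];

    s[threadIdx.x] = d_in[threadIdx.x];
    __syncthreads(); // make the loaded values visible before reducing

    for (int stride = 1; stride < ARRAY_SIZE; stride *= 2)
    {
        if (threadIdx.x % (2 * stride) == 0)
        {
            s[threadIdx.x] += s[threadIdx.x + stride];
        }
        // use_barrier has the same value for every thread in the block,
        // so either all threads reach this barrier or none do.
        if (use_barrier)
        {
            __syncthreads();
        }
    }

    if (threadIdx.x == 0)
    {
        *d_out = s[0];
    }
}

// Hypothetical helper: reduces {1, 2, 3, 4} and returns the value written by thread 0.
int shared_sum(bool use_barrier)
{
    int h_in[ARRAY_SIZE] = {1, 2, 3, 4};
    int h_out = 0;
    int *d_in = nullptr;
    int *d_out = nullptr;

    cudaMalloc((void **)&d_in, sizeof(int) * ARRAY_SIZE);
    cudaMalloc((void **)&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(int) * ARRAY_SIZE, cudaMemcpyHostToDevice);

    sum_shared<<<1, ARRAY_SIZE>>>(d_in, d_out, use_barrier);

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main(int argc, char **argv)
{
    // Pass "racy" as the first argument to skip the barrier inside the loop.
    bool use_barrier = !(argc > 1 && strcmp(argv[1], "racy") == 0);
    printf("sum = %d\n", shared_sum(use_barrier));
    return 0;
}
```

Run without arguments, the program uses the barrier and the sum is 10. Run with `racy` under racecheck, the tool may flag the read of `s[2]` in step 2 against the write to `s[2]` in step 1; what exactly gets reported can depend on the GPU and the tool version.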
