Data Hazard

CisMine Ng
3 min read · Feb 13, 2024

When we talk about parallelism, we often run into the phenomenon of a data hazard, a bug that can be quite a headache to fix because it is a logical error. Fortunately, we now have tools like NVIDIA Compute Sanitizer that make tracking down this bug somewhat easier. In this article, I will explain what a data hazard is and illustrate it.

It would be better if you read the article on Synchronization — Asynchronization before reading this one.

When multiple threads read and write the same value, their accesses can conflict; this phenomenon is called a data hazard.

When discussing data hazards, we encounter two issues:

  • Data Race: This usually relates to “write after read” or “read after write” orderings, but at its core it is about simultaneous access (reading and/or writing) to a stored variable without synchronization. One thread can overwrite data that another thread is reading or about to write, leaving the variable with conflicting values.
  • Race Condition: This concept is broader and is not limited to data access. A race condition occurs when the final result of a system depends on the unpredictable timing or ordering of events, producing undefined behavior.

In summary, just remember: when coding in CUDA, be mindful of situations where multiple threads access the same value for processing, as in the sketch below.
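To make the data-race case concrete, here is a minimal sketch (not from the article): 256 threads all increment the same counter with a plain read-modify-write, so updates from different threads overwrite each other and the final count usually ends up far below 256. Replacing the increment with atomicAdd removes the race.

#include <stdio.h>
#include <cuda_runtime.h>

// Every thread increments the same counter with a plain read-modify-write.
// The loads and stores from different threads interleave, so updates are lost.
__global__ void count_unsafe(int *counter)
{
    *counter += 1;            // data race: no synchronization between threads
    // atomicAdd(counter, 1); // the race-free alternative
}

int main()
{
    int h_counter = 0;
    int *d_counter;

    cudaMalloc((void **)&d_counter, sizeof(int));
    cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);

    count_unsafe<<<1, 256>>>(d_counter);

    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("counter = %d (expected 256)\n", h_counter); // typically far less than 256

    cudaFree(d_counter);
    return 0;
}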

Illustration

#include <stdio.h>
#include <cuda_runtime.h>

#define ARRAY_SIZE 4


__global__ void sum(int *d_array)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // Tree-based reduction: step 1 sums adjacent pairs, step 2 combines the partial sums.
    for (int stride = 1; stride < ARRAY_SIZE; stride *= 2)
    {
        // __syncthreads(); -----> barrier

        if (threadIdx.x % (2 * stride) == 0)
        {
            d_array[id] += d_array[id + stride];
        }
    }
    printf("blockIdx.x=%d --> %d\n", blockIdx.x, d_array[id]);
}

int main()
{
    int h_array[ARRAY_SIZE] = {1, 2, 3, 4};
    int *d_array;

    cudaMalloc((void **)&d_array, sizeof(int) * ARRAY_SIZE);
    cudaMemcpy(d_array, h_array, sizeof(int) * ARRAY_SIZE, cudaMemcpyHostToDevice);

    sum<<<1, ARRAY_SIZE>>>(d_array);

    // Wait for the kernel to finish so the device-side printf output is flushed.
    cudaDeviceSynchronize();

    cudaFree(d_array);

    return 0;
}

The code performs a tree-based reduction: in step 1, adjacent pairs are summed (thread 0 computes 1 + 2 = 3 and thread 2 computes 3 + 4 = 7); in step 2, thread 0 combines the two partial sums into 3 + 7 = 10.

Here we do not synchronize the threads, which leads to a data race: thread 0 can move on to step 2 and read d_array[2] before thread 2 has finished writing 3 + 4 = 7 in step 1, so it computes 3 + 3 = 6 instead of 3 + 7 = 10.

To solve this problem, we just need to place a barrier so that the threads wait for each other until the slowest one has finished each step, using the command __syncthreads().
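Concretely, that means enabling the commented-out barrier at the top of the loop body in the kernel above. __syncthreads() sits outside the if because every thread of the block must reach it:

    for (int stride = 1; stride < ARRAY_SIZE; stride *= 2)
    {
        __syncthreads(); // wait until every thread has finished the previous step's writes

        if (threadIdx.x % (2 * stride) == 0)
        {
            d_array[id] += d_array[id + stride];
        }
    }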

After adding __syncthreads(), the output shows the correct total of 10 in d_array[0].
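As mentioned at the start, NVIDIA Compute Sanitizer can also help catch this class of bug. A minimal sketch, assuming the program above is saved as data_hazard.cu (a file name I chose): compile with nvcc, then run the binary under the racecheck tool. Note that racecheck reports hazards on __shared__ memory, so to have it flag this particular example you would first stage d_array in shared memory inside the kernel.

nvcc data_hazard.cu -o data_hazard
compute-sanitizer --tool racecheck ./data_hazard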


Written by CisMine Ng

My name is CisMine Ng. I am currently pursuing a B.Sc. degree. I am interested in the following topics: deep learning in computer vision and parallel programming with CUDA.
