Data Hazard
When working with parallel code, we often run into data hazards. They are painful to fix because they are logic errors: the program compiles and runs, but it produces wrong or inconsistent results. Tools such as NVIDIA Compute Sanitizer now make these bugs somewhat easier to track down. In this article, I will explain what a data hazard is and illustrate it with an example.
It would be better if you read the article on Synchronization — Asynchronization before reading this one.
A data hazard occurs when multiple threads read and write the same value without coordination, so their accesses conflict and the result depends on which thread gets there first.
When discussing data hazards, we encounter two issues:
- Data Race: This is usually described in terms of “read after write” or “write after read” ordering, but the core idea is that two or more threads access the same stored variable at the same time, at least one of them writes, and nothing synchronizes them. One thread can then overwrite data that another thread is still reading or about to write, so the value each thread sees is unpredictable.
- Race Condition: This concept is broader and not limited to data access. A race condition occurs when the final result of a system depends on the timing or order in which events happen, so the outcome can change from run to run.
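As a small sketch of the data race described above, the two kernels below let every thread update one shared counter, first with a plain read-modify-write and then with `atomicAdd`. The kernel names, the `run_counter` helper, and the thread count of 256 are assumptions made for this illustration only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread performs an unsynchronized
// read-modify-write on the same address, which is a data race.
__global__ void racy_increment(int *counter)
{
    *counter = *counter + 1;
}

// Hypothetical kernel: atomicAdd makes the read-modify-write indivisible,
// so no update is lost.
__global__ void atomic_increment(int *counter)
{
    atomicAdd(counter, 1);
}

// Hypothetical helper: launches one block of `threads` threads on a counter
// that starts at 0 and returns the final value.
int run_counter(bool use_atomic, int threads)
{
    int h_counter = 0;
    int *d_counter;
    cudaMalloc((void **)&d_counter, sizeof(int));
    cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);
    if (use_atomic)
        atomic_increment<<<1, threads>>>(d_counter);
    else
        racy_increment<<<1, threads>>>(d_counter);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}

int main()
{
    // The racy result is not guaranteed to be 256; updates can be lost.
    printf("racy:   %d\n", run_counter(false, 256));
    printf("atomic: %d\n", run_counter(true, 256));
    return 0;
}
```

The racy version may print a different number on different runs or GPUs, which is exactly what makes this kind of bug hard to reproduce.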
In summary, just remember: when coding in CUDA, be careful whenever multiple threads access the same value, especially when at least one of them writes to it.
Illustration
#include <stdio.h>
#include <cuda_runtime.h>
#define ARRAY_SIZE 4
// Pairwise reduction: after the loop, d_array[0] should hold the sum of the block.
__global__ void sum(int *d_array)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    for (int stride = 1; stride < ARRAY_SIZE; stride *= 2)
    {
        // __syncthreads(); -----> barrier
        if (threadIdx.x % (2 * stride) == 0)
        {
            d_array[id] += d_array[id + stride];
        }
    }
    printf("blockIdx.x=%d --> %d\n", blockIdx.x, d_array[id]);
}

int main()
{
    int h_array[ARRAY_SIZE] = {1, 2, 3, 4};
    int *d_array;
    cudaMalloc((void **)&d_array, sizeof(int) * ARRAY_SIZE);
    cudaMemcpy(d_array, h_array, sizeof(int) * ARRAY_SIZE, cudaMemcpyHostToDevice);
    sum<<<1, ARRAY_SIZE>>>(d_array);
    cudaDeviceSynchronize(); // wait for the kernel so its printf output is flushed
    cudaFree(d_array);
    return 0;
}
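The kernel performs a pairwise reduction in two steps. In step 1, thread 0 computes 1 + 2 = 3 and thread 2 computes 3 + 4 = 7. In step 2, thread 0 adds the two partial sums: 3 + 7 = 10.
Here, we do not synchronize the threads, which leads to a data race: if thread 0 moves on to step 2 before thread 2 has finished computing 3 + 4 in step 1, it reads the old value 3 and produces 3 + 3 = 6 instead of 3 + 7 = 10.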
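To make the two orderings concrete, here is a host-only replay of the same additions in a fixed order. It is a sequential sketch, not GPU code, and the `replay` function and its flag are hypothetical names introduced for this illustration.

```cuda
#include <cstdio>

// Hypothetical helper: replays the additions of threads 0 and 2 sequentially.
// When thread0_runs_ahead is true, thread 0 performs step 2 before thread 2
// has finished step 1.
int replay(bool thread0_runs_ahead)
{
    int a[4] = {1, 2, 3, 4};
    a[0] += a[1];          // step 1, thread 0: 1 + 2 = 3
    if (thread0_runs_ahead)
    {
        a[0] += a[2];      // step 2, thread 0 reads the stale a[2] = 3
        a[2] += a[3];      // step 1, thread 2 finishes too late
    }
    else
    {
        a[2] += a[3];      // step 1, thread 2: 3 + 4 = 7
        a[0] += a[2];      // step 2, thread 0: 3 + 7 = 10
    }
    return a[0];
}

int main()
{
    printf("in order:   %d\n", replay(false)); // prints 10
    printf("racy order: %d\n", replay(true));  // prints 6
    return 0;
}
```

On a real GPU the order is not chosen by a flag; it depends on how the hardware schedules the threads, which is why the result is unreliable without a barrier.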
To solve this problem, we just need to place a barrier that makes the threads in the block wait for each other until the slowest one has arrived, using __syncthreads().
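A minimal sketch of the kernel with the barrier enabled might look like the following. The names `sum_with_barrier` and `reduce_on_device` are assumptions for this example, and the result is copied back to the host instead of being printed from the kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ARRAY_SIZE 4

// Same reduction as above, with the barrier at the top of each step.
__global__ void sum_with_barrier(int *d_array)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    for (int stride = 1; stride < ARRAY_SIZE; stride *= 2)
    {
        // Every thread in the block reaches this call, including the ones
        // that skip the addition below, so the previous step is complete
        // and visible before anyone reads from it.
        __syncthreads();
        if (threadIdx.x % (2 * stride) == 0)
        {
            d_array[id] += d_array[id + stride];
        }
    }
}

// Hypothetical helper: runs the kernel on {1, 2, 3, 4} and returns d_array[0].
int reduce_on_device()
{
    int h_array[ARRAY_SIZE] = {1, 2, 3, 4};
    int *d_array;
    cudaMalloc((void **)&d_array, sizeof(int) * ARRAY_SIZE);
    cudaMemcpy(d_array, h_array, sizeof(int) * ARRAY_SIZE, cudaMemcpyHostToDevice);
    sum_with_barrier<<<1, ARRAY_SIZE>>>(d_array);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_array, d_array, sizeof(int) * ARRAY_SIZE, cudaMemcpyDeviceToHost);
    cudaFree(d_array);
    return h_array[0];
}

int main()
{
    printf("sum = %d\n", reduce_on_device()); // prints sum = 10
    return 0;
}
```

Note that __syncthreads() only synchronizes threads within a single block, which is enough here because the kernel is launched with one block. The call also sits outside the `if`, so all threads reach it.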
With __syncthreads() in place, thread 0 always sees the finished 7 from step 1 and prints the expected total, 10.