Data Hazard

Feb 13, 2024

When working with parallel code, we often run into data hazards. They are a headache to fix because they are logic errors rather than compile-time errors. Tools such as NVIDIA Compute Sanitizer now make these bugs somewhat easier to track down. In this article, I will explain what a data hazard is and illustrate it with an example.

It would be better if you read the article on Synchronization — Asynchronization before reading this one.

A data hazard occurs when multiple threads read and write the same value and their accesses conflict.

When discussing data hazards, we encounter two issues:

Data Race: This usually shows up as a “write after read” or “read after write” problem. More precisely, it is simultaneous access to the same stored variable by several threads, at least one of which writes, without synchronization. One thread can overwrite data that another thread is reading or about to write, so the resulting value is unpredictable.
Race Condition: This concept is broader and not limited to data access. A race condition occurs when the final result of a system depends on the timing or order in which events happen, so the result can differ from run to run.
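
As a minimal sketch of the data race described above, the program below has every thread increment one counter, first without synchronization and then with atomicAdd. The kernel names racy_increment and atomic_increment and the constant NUM_THREADS are hypothetical, chosen only for this illustration. It also assumes a CUDA-capable GPU is available and skips error checking for brevity.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define NUM_THREADS 256  // hypothetical thread count for this illustration

// Unsafe: every thread does an unsynchronized read-modify-write on *counter.
__global__ void racy_increment(int *counter)
{
    *counter = *counter + 1;  // data race: threads read and write one location
}

// Safe: atomicAdd performs the read-modify-write as one indivisible operation.
__global__ void atomic_increment(int *counter)
{
    atomicAdd(counter, 1);
}

int main()
{
    int *d_counter;
    int h_racy = 0, h_atomic = 0;

    cudaMalloc((void **)&d_counter, sizeof(int));

    // Run the racy version. cudaMemcpy waits for the kernel to finish.
    cudaMemset(d_counter, 0, sizeof(int));
    racy_increment<<<1, NUM_THREADS>>>(d_counter);
    cudaMemcpy(&h_racy, d_counter, sizeof(int), cudaMemcpyDeviceToHost);

    // Run the atomic version on a freshly zeroed counter.
    cudaMemset(d_counter, 0, sizeof(int));
    atomic_increment<<<1, NUM_THREADS>>>(d_counter);
    cudaMemcpy(&h_atomic, d_counter, sizeof(int), cudaMemcpyDeviceToHost);

    // The racy result is unpredictable; the atomic result equals NUM_THREADS.
    printf("racy   = %d (may be less than %d)\n", h_racy, NUM_THREADS);
    printf("atomic = %d\n", h_atomic);

    cudaFree(d_counter);
    return h_atomic == NUM_THREADS ? 0 : 1;
}
```

In the racy kernel, increments are lost whenever two threads read the same old value before either writes back. An atomic operation is one way to remove that kind of race; the barrier used later in this article is another, suited to cases where threads must wait for each other between steps.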

In summary, just remember: when coding in CUDA, watch out for any place where multiple threads access the same value.

Illustration

#include <stdio.h>
#include <cuda_runtime.h>

#define ARRAY_SIZE 4


__global__ void sum(int *d_array)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    for (int stride = 1; stride < ARRAY_SIZE; stride *= 2)
    {
        // __syncthreads();  // barrier, left commented out to show the data race

        if (threadIdx.x % (2 * stride) == 0)
        {
            d_array[id] += d_array[id + stride];
        }
    }
    printf("blockIdx.x=%d --> %d\n", blockIdx.x, d_array[id]);
}

int main()
{
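    int h_array[ARRAY_SIZE] = {1, 2, 3, 4};
    int *d_array;

    cudaMalloc((void **)&d_array, sizeof(int) * ARRAY_SIZE);
    cudaMemcpy(d_array, h_array, sizeof(int) * ARRAY_SIZE, cudaMemcpyHostToDevice);

    // One block of ARRAY_SIZE threads, one thread per element.
    sum<<<1, ARRAY_SIZE>>>(d_array);
    cudaDeviceSynchronize(); // wait for the kernel so its printf output is flushed

    cudaFree(d_array);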

    return 0;
}

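The code works as a two-step tree reduction. In step 1, thread 0 computes 1 + 2 = 3 and thread 2 computes 3 + 4 = 7. In step 2, thread 0 adds the two partial sums to get 3 + 7 = 10.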

Here, the threads are not synchronized, which leads to a data race. If thread 0 moves on to step 2 before thread 2 has finished computing 3 + 4 in step 1, it reads the old value 3 and computes 3 + 3 = 6 instead of 3 + 7 = 10.

To solve this problem, we place a barrier that makes the threads in the block wait for each other until the slowest thread has caught up, using the function __syncthreads().

After adding __syncthreads(), the output shows the correct total of 10.
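
The fix can be sketched as a complete program. This is a minimal version that assumes a single block, since __syncthreads() only synchronizes threads within one block. The kernel name sum_synced is hypothetical, and copying the result back to the host to check it is an addition for illustration, not part of the original example. Error checking is omitted for brevity.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define ARRAY_SIZE 4

// Same reduction as above, with the barrier enabled.
__global__ void sum_synced(int *d_array)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    for (int stride = 1; stride < ARRAY_SIZE; stride *= 2)
    {
        // Barrier: every thread in the block waits here, so partial sums
        // written in the previous step are visible before they are read.
        __syncthreads();

        if (threadIdx.x % (2 * stride) == 0)
        {
            d_array[id] += d_array[id + stride];
        }
    }
}

int main()
{
    int h_array[ARRAY_SIZE] = {1, 2, 3, 4};
    int *d_array;

    cudaMalloc((void **)&d_array, sizeof(int) * ARRAY_SIZE);
    cudaMemcpy(d_array, h_array, sizeof(int) * ARRAY_SIZE, cudaMemcpyHostToDevice);

    // One block of ARRAY_SIZE threads, one thread per element.
    sum_synced<<<1, ARRAY_SIZE>>>(d_array);

    // Copy the result back; this call waits for the kernel to finish.
    cudaMemcpy(h_array, d_array, sizeof(int) * ARRAY_SIZE, cudaMemcpyDeviceToHost);
    cudaFree(d_array);

    // Element 0 holds the total: (1 + 2) + (3 + 4) = 10.
    printf("sum = %d\n", h_array[0]);
    return h_array[0] == 10 ? 0 : 1;
}
```

Note that the barrier sits outside the if statement, so every thread in the block reaches it. Calling __syncthreads() inside a branch that only some threads take can hang the kernel or give undefined results.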

Written by CisMine Ng

My name is CisMine Ng. I am currently pursuing a B.Sc. degree. I am interested in the following topics: DL in Computer Vision, Parallel Programming With Cuda.
