This is an introduction to CUDA parallel programming. Today we will learn how to program using CUDA: we will write a “Hello World” program using a CUDA kernel and try to understand how the code executes on the CPU and the GPU.

As we all know, CUDA is a platform and programming model for CUDA-enabled GPUs. The platform exposes GPUs for general-purpose computing. CUDA provides C/C++ language extensions and APIs for programming and managing GPUs. So, before starting, make sure you have an NVIDIA GPU. I am assuming that you have the environment ready.

CUDA C/C++

In CUDA programming, both the CPU and the GPU are used for computing. Typically, we refer to the CPU as the host and the GPU as the device. The CPU and the GPU are separate processors with their own memory spaces. Typically, we run serial workloads on the CPU and offload parallel computation to the GPU.
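Because the host and the device have separate memory spaces, data must be moved between them explicitly. As a minimal sketch (the names h_data and d_data are only illustrative, not part of any example in this post), copying data to the GPU and back looks like this:

#include <cuda.h>

int main() {
    int h_data[4] = {1, 2, 3, 4};  // host (CPU) memory
    int *d_data;                   // device (GPU) memory

    cudaMalloc(&d_data, sizeof(h_data));                                // allocate memory on the GPU
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice); // copy host -> device
    /* ... launch kernels that operate on d_data ... */
    cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost); // copy device -> host
    cudaFree(d_data);
    return 0;
}

We will not need explicit copies for the “Hello World” examples below, but almost every real CUDA program follows this pattern.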

Kernel

In CUDA programming, a kernel is essentially a function that is executed on the GPU. These functions are written to perform computations in parallel, meaning they can execute on hundreds or thousands of threads simultaneously.

Instead of running a complex operation sequentially as you would in traditional CPU-based programming, you can use CUDA kernels to break down the operation into smaller tasks that can be performed at the same time, greatly enhancing processing speed for suitable tasks.

Here’s a very basic example of what a CUDA kernel might look like:

__global__ void myKernel(void)
{
    printf("Hello CUDA world!!!");
}

In this example, __global__ is a CUDA keyword indicating that the function runs on the GPU and can be called from the CPU. This function, myKernel, prints the string “Hello CUDA world!!!” to standard output from the GPU.

Now let us write a complete program and run it to see the output.

#include <stdio.h>
#include <cuda.h>

__global__ void cuda_hello(){
    printf("Hello CUDA World!!!\n");
}

int main() {
    cuda_hello<<<1,1>>>(); 
    return 0;
}

The major difference between the C and CUDA implementations is the __global__ specifier and the <<<...>>> syntax. The __global__ specifier indicates a function that runs on the device (GPU). Such functions can be called from host code, e.g. from the main() function in the example, and are known as “kernels”.

When a kernel is called, its execution configuration is provided through the <<<...>>> syntax, e.g. cuda_hello<<<1,1>>>(). In CUDA terminology, this is called a “kernel launch”.
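In general, the configuration takes the form kernel_name<<<number_of_blocks, threads_per_block>>>(arguments). As a small sketch (the variable names numBlocks and threadsPerBlock are only illustrative):

int numBlocks = 1;        // blocks in the grid
int threadsPerBlock = 1;  // threads in each block
cuda_hello<<<numBlocks, threadsPerBlock>>>();  // launches numBlocks * threadsPerBlock threads in total

We will come back to blocks and threads at the end of this post.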

Compiling CUDA programs

Compiling a CUDA program is similar to compiling a C program. NVIDIA provides a CUDA compiler called nvcc in the CUDA Toolkit; CUDA code is typically stored in a file with the extension .cu. For example,

$> nvcc hello.cu -o hello

Try to run the “hello” executable:
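$> ./hello
$>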

What!!! Why is there no output? 🙁

Don’t be surprised. There is no output on the terminal because the program did not wait for the GPU execution. Here we have sent the kernel to the GPU, but we never waited for it to finish, so the program exits before the GPU’s output buffer is flushed. Now try the code below; compile and run it.

#include <stdio.h>
#include <cuda.h>

__global__ void cuda_hello(){
    printf("Hello CUDA World!!!\n");
}

int main() {
    cuda_hello<<<1,1>>>(); 
    cudaDeviceSynchronize();
    return 0;
}

This time you will see “Hello CUDA World!!!” printed on the terminal. The call to cudaDeviceSynchronize() blocks the CPU until all GPU threads have finished, at which point the GPU’s printf buffer is flushed to the terminal.
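As a side note, cudaDeviceSynchronize() returns a cudaError_t, so it is good practice to check it; this also surfaces errors that occurred during kernel execution. A minimal sketch of such a check (not part of the original example):

cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    // report any error raised during or after the kernel launch
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
}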

Exercise 2

#include <stdio.h>
#include <cuda.h>

__global__ void cuda_hello(){
    printf("Hello CUDA World!!!\n");
}

int main() {
    cuda_hello<<<1,1>>>();
    printf("Hello CPU world\n");
    cudaDeviceSynchronize();
    return 0;
}

Compile this code and run it. This program can produce either of two outputs:

Hello CPU world
Hello CUDA World!!!

or

Hello CUDA World!!!
Hello CPU world

The question is: why is the output not deterministic? The answer is that the kernel launch is asynchronous. The CPU does not wait for the GPU, so its printf usually executes first, and the GPU’s buffered output only appears once cudaDeviceSynchronize() flushes it; the exact order is not guaranteed. To make it deterministic, let’s write it again.

#include <stdio.h>
#include <cuda.h>

__global__ void cuda_hello(){
    printf("Hello CUDA World!!!\n");
}

int main() {
    cuda_hello<<<1,1>>>();
    cudaDeviceSynchronize();
    printf("Hello CPU world\n");
    return 0;
}

This time, the output from the GPU will be printed first and then from the CPU.
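That is:

Hello CUDA World!!!
Hello CPU world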

Now, let us see what the output of the below code is.

#include <stdio.h>
#include <cuda.h>

__global__ void cuda_hello(){
    printf("Hello CUDA World!!!\n");
}

int main() {
    cuda_hello<<<1,1>>>();
    cuda_hello<<<1,1>>>();
    cuda_hello<<<1,1>>>();
    cudaDeviceSynchronize();
    printf("Hello CPU world\n");
    return 0;
}

The output of this code is deterministic. It will print “Hello CUDA World!!!” 3 times and then “Hello CPU world” once.
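The full output is:

Hello CUDA World!!!
Hello CUDA World!!!
Hello CUDA World!!!
Hello CPU world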

Let us try the below code and try to understand what the output could be without executing it.

#include <stdio.h>
#include <cuda.h>

__global__ void cuda_hello(){
    printf("%d : Hello CUDA World!!!\n", threadIdx.x);
}

int main() {
    cuda_hello<<<1,100>>>();
    cudaDeviceSynchronize();
    printf("Hello CPU world\n");
    return 0;
}

You can see that “Hello CUDA World!!!” gets printed 100 times, but the order of the lines changes from run to run. In the above code, we are executing the cuda_hello kernel using 100 threads, and CUDA makes no guarantee about the order in which those threads run or write to the output buffer.

In cuda_hello<<<1,100>>>(), 1 is the number of blocks and 100 is the number of threads per block.
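When a kernel is launched with more than one block, each thread can compute a unique global index from its block and thread coordinates. The example above does not need this, but as a sketch of the standard idiom (the kernel name indexed_hello is hypothetical):

__global__ void indexed_hello() {
    // global index = offset of this block + offset of this thread within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    printf("%d : Hello CUDA World!!!\n", i);
}

For example, indexed_hello<<<4,25>>>() would also launch 100 threads in total, with i ranging from 0 to 99.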

Homework

Try to rewrite the below loop using a CUDA kernel in an efficient manner:

for(int i = 0; i < 100; i++) {
    printf("i = %d", i);
}

Please keep in mind that our goal is to run it in parallel.

Hint: let’s print each line from one thread, i.e. we can use 100 threads to print all 100 lines.
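If you want to check your answer, here is one possible solution sketch following the hint (the kernel name print_i is my own; spoiler ahead):

#include <stdio.h>
#include <cuda.h>

__global__ void print_i() {
    // each of the 100 threads prints one line; the order is not guaranteed
    printf("i = %d\n", threadIdx.x);
}

int main() {
    print_i<<<1,100>>>();
    cudaDeviceSynchronize();
    return 0;
}

Note that, unlike the serial loop, the lines may appear in any order.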
