If you’re venturing into the world of parallel computing, specifically on NVIDIA’s GPUs, you’re likely to come across some unique terminology. Understanding these terms is the foundation of programming effectively with CUDA. In this post, we will demystify terms like Kernel, Thread, Block, Grid, Device, Host, and the different types of memory in CUDA. Our aim is to provide clear, concise definitions and explanations to help you grasp the CUDA programming model and make your GPU programming journey smoother. So, whether you’re a beginner just starting out or a seasoned programmer looking for a refresher, this guide is for you!


Kernel

It is a function that runs on the GPU and is executed N times in parallel by N different CUDA threads, as opposed to only once like regular C/C++ code. Below is an example of a kernel named myKernel, which we launch with 100 threads (a single block of 100 threads).

#include <cstdio>

// Kernel definition: runs on the GPU, once per thread
__global__ void myKernel(void)
{
    printf("Hello CUDA World!!!\n");
}

int main() {
    // Launch 1 block of 100 threads
    myKernel<<<1, 100>>>();
    // Wait for the kernel to finish so its printf output is flushed
    cudaDeviceSynchronize();
    return 0;
}

Thread

A thread in CUDA is the fundamental unit of execution. Each CUDA thread runs an instance of a kernel and has its own set of registers. The threads are grouped into blocks, and these blocks collectively form a grid. Each thread has a unique ID within its block and the grid, and this ID is often used to calculate memory addresses and offsets.
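For example, a common pattern is to combine the block and thread indices into a single global index that selects which element a thread works on. Below is a minimal sketch; the kernel name addOne and its array arguments are illustrative, not part of any particular API.

// Each thread computes its own global index and processes one element
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n)                                      // guard against surplus threads
        data[i] += 1.0f;
}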

Block

A group of threads that can cooperate with each other by sharing data through fast on-chip shared memory. All threads in a block are guaranteed to run on the same streaming multiprocessor and can therefore interact with each other more efficiently than threads in different blocks.
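As a rough sketch of this cooperation (the kernel name reverseBlock and the 256-thread launch are illustrative assumptions), threads in a block can stage data in shared memory and synchronize with __syncthreads() before reading each other’s results:

// Threads of one block cooperate to reverse an array held in shared memory
// (assumes a launch like reverseBlock<<<1, 256>>>(d_data))
__global__ void reverseBlock(int *data)
{
    __shared__ int tile[256];            // visible to every thread in this block
    int t = threadIdx.x;
    tile[t] = data[t];                   // each thread loads one element
    __syncthreads();                     // wait until the whole block has loaded
    data[t] = tile[blockDim.x - 1 - t];  // read an element written by another thread
}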

Grid

A group of blocks that are executed on the GPU. All blocks in a grid can be run independently and possibly in parallel, depending on the capabilities of the device.
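Grid and block dimensions are specified at launch time with dim3 and can be one-, two-, or three-dimensional. The sketch below reuses the hypothetical addOne kernel from the Thread section and assumes d_data is a device array of N floats; it simply picks enough blocks to cover all N elements.

int N = 1000000;                          // illustrative problem size
dim3 block(256);                          // 256 threads per block
dim3 grid((N + block.x - 1) / block.x);   // enough blocks to cover all N elements
addOne<<<grid, block>>>(d_data, N);       // launch a grid of independent blocks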

Device

The GPU itself. In CUDA programming, the device runs kernels, each of which is executed by multiple threads.

Host

The CPU and its memory (RAM). The host code is the part of the program that runs on the CPU and is used to manage and direct the computation on the device (GPU).

Memory Hierarchy

CUDA provides different memory types that can be used in different ways, including global, shared, constant, and local memory.

Global Memory

The main device memory. It can be read from and written to by all threads as well as from the host, and it is the largest but slowest memory available in CUDA.
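A typical pattern is for the host to allocate global memory with cudaMalloc, copy inputs into it, and copy results back once a kernel has finished. A minimal sketch, where h_data, d_data, and the array size are illustrative:

float h_data[256];                                   // host (CPU) array
float *d_data;                                       // pointer to global memory
cudaMalloc(&d_data, sizeof(h_data));                 // allocate global memory
cudaMemcpy(d_data, h_data, sizeof(h_data),
           cudaMemcpyHostToDevice);                  // host -> device copy
// ... launch a kernel that reads and writes d_data ...
cudaMemcpy(h_data, d_data, sizeof(h_data),
           cudaMemcpyDeviceToHost);                  // device -> host copy
cudaFree(d_data);                                    // release global memory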

Shared Memory

Fast on-chip memory located on each streaming multiprocessor. It is shared among all threads in a block and is typically used for data that threads in the same block need to exchange or communicate.
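Shared memory can be declared with a fixed size inside the kernel (as in the Block example above) or sized dynamically at launch time. A sketch of the dynamic form, with an illustrative kernel name and sizes:

// Dynamically sized shared memory: the byte count is the third launch parameter,
// e.g. scaleShared<<<1, 128, 128 * sizeof(float)>>>(d_data);
__global__ void scaleShared(float *data)
{
    extern __shared__ float tile[];   // size supplied at launch time
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();
    data[t] = tile[t] * 2.0f;
}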

Constant Memory

Read-only memory for the device. It is cached and best suited for data that does not change over the course of a kernel execution.
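A sketch of declaring a constant-memory symbol and filling it from the host before launching kernels; the name coeffs and its size are illustrative:

__constant__ float coeffs[16];           // read-only on the device, cached

// Host side: copy data into the constant symbol before launching kernels
float h_coeffs[16] = {};                 // illustrative host data
cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));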

Local Memory

Private per-thread memory, used for automatic variables that don’t fit into the device’s register file. Its use is managed automatically by the CUDA compiler.

Streaming Multiprocessors (SMs)

The processing units of the GPU on which thread blocks are executed. Each SM can schedule and execute multiple blocks at the same time.

GEMM

GEMM stands for GEneral Matrix Multiplication, a standard routine in linear algebra. It performs the multiplication of two matrices and adds the result to a third matrix. GEMM is a building block for many scientific and engineering applications, and its efficient implementation is crucial for leveraging the computational power of GPUs.
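In BLAS notation the operation is C = alpha * A * B + beta * C. A deliberately naive CUDA kernel for it might look like the sketch below, assuming row-major M x K, K x N, and M x N matrices and one thread per output element; this is an illustration, not a tuned implementation.

// Naive GEMM: C = alpha * A * B + beta * C (row-major, one thread per C element)
__global__ void naiveGemm(const float *A, const float *B, float *C,
                          int M, int N, int K, float alpha, float beta)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}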

SGEMM

SGEMM stands for Single precision (fp32) GEneral Matrix Multiplication. It’s similar to GEMM, but deals with single-precision floating-point numbers. This operation is a key part of many scientific and machine learning applications, where the precision of double-precision numbers is not required, but the speed of single-precision computation is beneficial.
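In practice SGEMM is usually called through a tuned library such as cuBLAS rather than written by hand. A sketch of such a call, assuming M, N, K are the matrix dimensions and d_A, d_B, d_C are device buffers already filled with column-major data (cuBLAS’s convention):

#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);

float alpha = 1.0f, beta = 0.0f;
// C = alpha * A * B + beta * C for column-major M x K, K x N, M x N matrices
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            M, N, K,
            &alpha, d_A, M,    // A and its leading dimension
                    d_B, K,    // B and its leading dimension
            &beta,  d_C, M);   // C and its leading dimension

cublasDestroy(handle);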

Asynchronous Operations

An asynchronous operation is an operation that is initiated by a CUDA thread and executed asynchronously, as if by another thread. In a well-formed program, one or more CUDA threads synchronize with the asynchronous operation; the CUDA thread that initiated it is not required to be among the synchronizing threads.
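A common example is work issued into a CUDA stream: the host thread enqueues copies and a kernel launch, continues immediately, and later synchronizes with the stream. A sketch, assuming d_data, h_data, and bytes are already set up (and noting that h_data should be pinned with cudaMallocHost for the copies to be truly asynchronous):

cudaStream_t stream;
cudaStreamCreate(&stream);

// Enqueue work: these calls return immediately; the copies and the kernel
// execute asynchronously in the stream
cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
myKernel<<<1, 100, 0, stream>>>();
cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);

// Synchronize: wait for everything queued in the stream to complete
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);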

Compute Capability

The compute capability of a device is represented by a version number, also sometimes called its “SM version”. This version number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the present GPU.
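The compute capability (along with other hardware properties, such as the number of SMs) can be queried at runtime with cudaGetDeviceProperties. A minimal sketch:

#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    return 0;
}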
