GPU Kernels — First Notes · Harikanth Lingutla

Moving from PyTorch's high-level APIs to CUDA kernels is like switching from driving a car to rebuilding its engine. These are my first structured notes.

Memory hierarchy

Level	Latency (approx.)	Capacity
Registers	~1 cycle	up to 255 32-bit registers per thread
Shared memory	~20 cycles	48–228 KB/SM
L2 cache	~200 cycles	MB scale
HBM (global memory)	~400+ cycles	GB scale
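
The capacity figures above change from one GPU generation to the next, so it can help to ask the runtime what the installed card actually has. This is a minimal sketch, assuming a CUDA toolkit with nvcc and a GPU at device index 0. The `to_kib` helper is my own, added only for printing. Latencies are not reported by this query and would have to be measured with microbenchmarks.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): convert bytes to KiB for printing.
static double to_kib(size_t bytes) {
    return static_cast<double>(bytes) / 1024.0;
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // assumption: device 0
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("Device:                   %s\n", prop.name);
    std::printf("32-bit registers per SM:  %d\n", prop.regsPerMultiprocessor);
    std::printf("Shared memory per SM:     %.0f KiB\n",
                to_kib(prop.sharedMemPerMultiprocessor));
    std::printf("Shared memory per block:  %.0f KiB (default limit)\n",
                to_kib(prop.sharedMemPerBlock));
    std::printf("L2 cache:                 %.1f MiB\n",
                to_kib(static_cast<size_t>(prop.l2CacheSize)) / 1024.0);
    std::printf("Global memory:            %.1f GiB\n",
                to_kib(prop.totalGlobalMem) / (1024.0 * 1024.0));
    return 0;
}
```

Building it with something like `nvcc device_info.cu -o device_info` should work; the file name is arbitrary.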

A minimal vector add

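// Each thread computes one element of c = a + b.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    // Global index: block offset plus the thread's position within its block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // The grid is rounded up to whole blocks, so the last block may overshoot n.
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}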
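
The kernel on its own does nothing until the host allocates device memory, copies the inputs over, and launches it. Below is one way that host side might look. It is a sketch, not the only way to do it: the problem size, the block size of 256, and the `check` and `grid_size_for` helpers are my own assumptions for illustration.

```cuda
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Same kernel as in the notes, repeated so this file compiles on its own.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical helper: number of blocks needed to cover n elements (ceiling division).
static int grid_size_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical helper: abort with a message on any CUDA runtime error.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    const int n = 1 << 20;                          // assumed problem size
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n, 0.0f);

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    check(cudaMalloc(&d_a, bytes), "cudaMalloc(a)");
    check(cudaMalloc(&d_b, bytes), "cudaMalloc(b)");
    check(cudaMalloc(&d_c, bytes), "cudaMalloc(c)");

    check(cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice), "copy b");

    const int threads = 256;                        // assumed block size
    const int blocks = grid_size_for(n, threads);
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    check(cudaGetLastError(), "kernel launch");     // catches bad launch configs

    // This blocking copy on the default stream also waits for the kernel to finish.
    check(cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost), "copy c");

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_c[i] != 3.0f) ++mismatches;           // 1.0f + 2.0f is exact in float
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return mismatches == 0 ? 0 : 1;
}
```

The ceiling division in `grid_size_for` is the reason the kernel needs its `idx < n` guard: when `n` is not a multiple of the block size, the last block contains threads with nothing to do.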

Occupancy tradeoffs

More threads per block is not always better. Each SM has a fixed pool of registers and shared memory, so the amount a block consumes directly limits how many blocks can be resident on that SM at once.
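
To see the tradeoff in numbers, the CUDA runtime can report how many blocks of a given kernel fit on one SM for a chosen block size. The sketch below asks that question for the `vector_add` kernel from earlier at a few block sizes. The list of block sizes, the use of device 0, and the `occupancy_fraction` helper are my own assumptions. A kernel this light will probably show high occupancy everywhere; the numbers get more interesting for kernels that use many registers or a lot of shared memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as in the notes, repeated so this file compiles on its own.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical helper: resident threads as a fraction of the SM's thread limit.
static float occupancy_fraction(int blocks_per_sm, int threads_per_block,
                                int max_threads_per_sm) {
    return static_cast<float>(blocks_per_sm * threads_per_block) /
           static_cast<float>(max_threads_per_sm);
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // assumption: device 0
        std::fprintf(stderr, "could not query device 0\n");
        return 1;
    }

    const int block_sizes[] = {64, 128, 256, 512, 1024};     // assumed candidates
    for (int block_size : block_sizes) {
        int blocks_per_sm = 0;
        // Last argument is dynamic shared memory per block; this kernel uses none.
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, vector_add, block_size, 0);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "occupancy query failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }
        float occ = occupancy_fraction(blocks_per_sm, block_size,
                                       prop.maxThreadsPerMultiProcessor);
        std::printf("block size %4d: %2d blocks/SM, occupancy %.0f%%\n",
                    block_size, blocks_per_sm, occ * 100.0f);
    }
    return 0;
}
```

High occupancy is not a goal in itself. It only measures how many threads are available to hide latency, so a kernel can run well below 100% and still be fast.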

Next steps

Triton for higher-level kernel authoring
Roofline analysis for kernel bottlenecks
Integrating custom ops with PyTorch