GPU Kernels — First Notes

Notes from learning the CUDA memory hierarchy and occupancy, and from writing my first custom kernels.

Moving from PyTorch's high-level APIs to CUDA kernels is like switching from driving a car to rebuilding its engine. These are my first structured notes.

Memory hierarchy

Level           Latency        Capacity
Registers       ~1 cycle       tiny
Shared memory   ~20 cycles     48–228 KB/SM
L2              ~200 cycles    MB scale
HBM             ~400+ cycles   GB scale
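
The capacities in the table vary by GPU generation, so it can help to ask the device directly. The following is a minimal sketch, assuming a single GPU at device index 0 and compilation with nvcc. It prints only capacities: the latencies above are rough figures and are not exposed through the runtime API. Note that `totalGlobalMem` reports device memory of whatever type the card has, which is HBM on data-center parts.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: inspect the first visible GPU
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // One line per level of the hierarchy, fastest and smallest first.
    std::printf("Device:                  %s\n", prop.name);
    std::printf("Registers per SM:        %d (32-bit each)\n",
                prop.regsPerMultiprocessor);
    std::printf("Shared memory per SM:    %zu KB\n",
                prop.sharedMemPerMultiprocessor / 1024);
    std::printf("Shared memory per block: %zu KB (default limit)\n",
                prop.sharedMemPerBlock / 1024);
    std::printf("L2 cache:                %d KB\n", prop.l2CacheSize / 1024);
    std::printf("Device memory:           %zu MB\n",
                prop.totalGlobalMem / (1024 * 1024));
    return 0;
}
```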

A minimal vector add

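// Each thread adds one pair of elements: c[idx] = a[idx] + b[idx].
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    // Global index of this thread across the 1D grid.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // The grid is usually rounded up past n, so guard the tail.
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}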
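
The kernel on its own does nothing until the host allocates device memory, copies the inputs over, and launches a grid. Here is a minimal sketch of that host side, assuming a problem size of 2^20 elements and 256 threads per block (both arbitrary choices of mine). The `check` helper is hypothetical, not part of CUDA. The kernel is repeated so the file compiles on its own with nvcc.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical helper: print a message and exit on any CUDA runtime error.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(1);
    }
}

int main() {
    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n, 0.0f);

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    check(cudaMalloc(&d_a, bytes), "cudaMalloc a");
    check(cudaMalloc(&d_b, bytes), "cudaMalloc b");
    check(cudaMalloc(&d_c, bytes), "cudaMalloc c");
    check(cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice), "copy b");

    const int threads = 256;                        // assumed block size
    const int blocks = (n + threads - 1) / threads; // round up to cover all n
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    check(cudaGetLastError(), "kernel launch");

    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    check(cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost), "copy c");

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_c[i] != 3.0f) ++mismatches;  // 1.0f + 2.0f is exact in float
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return mismatches == 0 ? 0 : 1;
}
```

The rounded-up grid size is the reason the kernel needs its `idx < n` guard: whenever n is not a multiple of the block size, the last block contains threads with no element to process.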

Occupancy tradeoffs

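More threads per block is not always better. Each SM has a fixed register file and a fixed amount of shared memory, so the registers and shared memory a block consumes directly limit how many blocks can be resident on an SM at once. A larger block that needs more of either can leave the SM with fewer resident threads overall.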
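
The runtime can report this limit for a given kernel and block size. The sketch below, assuming device 0 and a handful of block sizes I picked arbitrarily, calls `cudaOccupancyMaxActiveBlocksPerMultiprocessor` for the vector add kernel and converts the result into a fraction of the SM's maximum resident threads. A kernel this small uses few registers and no shared memory, so I would expect the numbers to get more interesting on heavier kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // assumption: device 0
        std::fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }

    const int block_sizes[] = {64, 128, 256, 512, 1024};  // assumed candidates
    for (int block_size : block_sizes) {
        int blocks_per_sm = 0;
        // Last argument: dynamic shared memory per block in bytes (none here).
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, vector_add, block_size, 0);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "block size %d: %s\n", block_size,
                         cudaGetErrorString(err));
            continue;
        }
        // Theoretical occupancy: resident threads over the SM's thread limit.
        const int resident_threads = blocks_per_sm * block_size;
        const double occupancy =
            static_cast<double>(resident_threads) / prop.maxThreadsPerMultiProcessor;
        std::printf("block size %4d: %2d blocks/SM, %4d threads/SM, occupancy %.0f%%\n",
                    block_size, blocks_per_sm, resident_threads, occupancy * 100.0);
    }
    return 0;
}
```

This is theoretical occupancy only. It says how many warps could be resident, not whether the kernel actually runs faster at that setting, so it is a starting point for profiling rather than a replacement for it.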

Next steps

  • Triton for higher-level kernel authoring
  • Roofline analysis for kernel bottlenecks
  • Integrating custom ops with PyTorch