GPU Kernels — First Notes
Notes from learning the CUDA memory hierarchy and occupancy, and from writing my first custom kernels.
Moving from PyTorch high-level APIs to CUDA kernels is like switching from driving to rebuilding the engine. These are my first structured notes.
Memory hierarchy
| Level | Latency | Capacity |
|---|---|---|
| Registers | ~1 cycle | up to 255 per thread (64K 32-bit registers per SM on recent NVIDIA GPUs) |
| Shared memory | ~20 cycles | 48–228 KB/SM |
| L2 | ~200 cycles | MB scale |
| HBM | ~400+ cycles | GB scale |
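To make the table concrete, here is a minimal sketch of a per-block sum that reads each element from global memory (HBM) once, does the rest of the work in shared memory, and writes one value per block back out. The names `block_sum`, `sum_on_gpu`, and `kBlockSize` are my own for illustration. The block size of 256 is an assumption (the tree reduction needs a power of two), and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // assumed block size; must be a power of two

// Each block sums up to kBlockSize elements of `in` and writes one partial sum.
// Must be launched with blockDim.x == kBlockSize.
__global__ void block_sum(const float* in, float* partial, int n) {
    __shared__ float tile[kBlockSize];  // on-chip, shared by the whole block

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // One global-memory read per thread; out-of-range threads contribute 0.
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Tree reduction carried out entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // One global-memory write per block.
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = tile[0];
    }
}

// Hypothetical host helper: runs block_sum, then adds the partials on the CPU.
float sum_on_gpu(const std::vector<float>& host_in) {
    const int n = static_cast<int>(host_in.size());
    if (n == 0) return 0.0f;
    const int blocks = (n + kBlockSize - 1) / kBlockSize;

    float* d_in = nullptr;
    float* d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kBlockSize>>>(d_in, d_partial, n);

    // Blocking copy on the default stream, so it waits for the kernel.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

The point of the sketch is the traffic pattern, not speed: every intermediate add touches `tile` in shared memory, and global memory sees only one read per element and one write per block.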
A minimal vector add
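```cuda
// One thread per output element: c[i] = a[i] + b[i].
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    // Global index of this thread across the whole grid.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // The last block may extend past n, so guard the access.
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```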
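The kernel alone does nothing until the host allocates device buffers, copies the inputs over, and launches enough blocks to cover `n`. A minimal sketch of that host side follows. The wrapper name `vector_add_host` and the block size of 256 are my assumptions, not part of the original notes, and only the launch and the final copy are checked for errors.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// Same kernel as in the notes, repeated so this sketch compiles on its own.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical host wrapper: copies inputs to the GPU, launches, copies back.
std::vector<float> vector_add_host(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    if (a.size() != b.size()) throw std::invalid_argument("size mismatch");
    const int n = static_cast<int>(a.size());
    std::vector<float> c(n);
    if (n == 0) return c;
    const size_t bytes = n * sizeof(float);

    float* d_a = nullptr;
    float* d_b = nullptr;
    float* d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // ceil(n / threads)
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Launch errors (bad configuration, etc.) surface via cudaGetLastError.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Blocking copy on the default stream, so it waits for the kernel.
        err = cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
    return c;
}
```

The ceiling division is why the kernel needs its `idx < n` guard: unless `n` is a multiple of the block size, the last block has threads with nothing to do.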
Occupancy tradeoffs
More threads per block is not always better. Each SM has a fixed register file and a fixed pool of shared memory, so the registers and shared memory a block consumes directly limit how many blocks can be resident on that SM at once.
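One way to see this limit without guessing is to ask the runtime. The sketch below uses `cudaOccupancyMaxActiveBlocksPerMultiprocessor` to report how many blocks of a kernel fit on one SM for a given block size and dynamic shared memory request. The kernel `scale_with_scratch` and the helpers `resident_blocks` and `occupancy_fraction` are hypothetical names of mine; the kernel exists only so there is something to query and is never launched here.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel that stages data through dynamic shared memory.
// It is only queried in this sketch, never launched.
__global__ void scale_with_scratch(float* data, int n) {
    extern __shared__ float scratch[];  // size chosen at launch time
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        scratch[threadIdx.x] = data[idx];
        data[idx] = 2.0f * scratch[threadIdx.x];
    }
}

// Blocks of scale_with_scratch that can be resident on one SM at once,
// given a block size and a dynamic shared memory request in bytes.
int resident_blocks(int block_size, size_t dynamic_smem_bytes) {
    int blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks, scale_with_scratch, block_size, dynamic_smem_bytes);
    return blocks;
}

// Resident threads divided by the SM's thread limit, on the current device.
float occupancy_fraction(int block_size, size_t dynamic_smem_bytes) {
    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    const int blocks = resident_blocks(block_size, dynamic_smem_bytes);
    return static_cast<float>(blocks * block_size) /
           static_cast<float>(prop.maxThreadsPerMultiProcessor);
}
```

Comparing `resident_blocks(256, 0)` with `resident_blocks(256, 32 * 1024)` on the same GPU shows the shared memory effect directly: the second call can never report more blocks than the first, and on many devices it reports noticeably fewer. The exact numbers depend on the device, so I have not listed any.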
Next steps
- Triton for higher-level kernel authoring
- Roofline analysis for kernel bottlenecks
- Integrating custom ops with PyTorch