February 28, 2025

Transformer Architecture Notes

Evergreen reference on transformer blocks, attention, and positional encoding.


A compact reference for the transformer architecture as used in modern LLMs.

Core block

Input → LayerNorm → Multi-Head Attention → Residual
     → LayerNorm → FFN → Residual → Output
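
The diagram above can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: it reads the diagram as a pre-norm block in which each "Residual" step adds the sublayer's input back to its output, it omits LayerNorm's learned scale and shift, and the names `layer_norm`, `transformer_block`, `toy_attention`, and `toy_ffn` are hypothetical stand-ins chosen for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # Assumption: the learned scale and shift parameters are left out for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_block(x, attention, ffn):
    # x: (seq_len, d_model); attention and ffn are callables that preserve shape.
    x = x + attention(layer_norm(x))  # LayerNorm -> Multi-Head Attention -> Residual
    x = x + ffn(layer_norm(x))        # LayerNorm -> FFN -> Residual
    return x

# Hypothetical toy sublayers, used only to check shapes.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))

def toy_attention(h):
    # Placeholder mixing step: every token receives the average of all tokens.
    return np.broadcast_to(h.mean(axis=0, keepdims=True), h.shape)

def toy_ffn(h):
    # Two-layer MLP with a ReLU in between.
    return np.maximum(h @ W1, 0.0) @ W2

x = rng.normal(size=(5, d_model))
y = transformer_block(x, toy_attention, toy_ffn)
```

Because both sublayers are added to the running activation rather than replacing it, a block whose sublayers output zeros returns its input unchanged.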

Multi-head attention

For each head $h$, with that head's query, key, and value matrices $Q$, $K$, $V$ and key dimension $d_k$:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
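
As a rough illustration of the formula, the following NumPy sketch computes scaled dot-product attention for one head and then runs several heads side by side. The projection matrices `W_q`, `W_k`, `W_v`, `W_o` and the choice to split `d_model` evenly across heads are assumptions made for the example; they are not specified in these notes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    # x: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model) (hypothetical parameters).
    n, d_model = x.shape
    d_k = d_model // num_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_k, (h + 1) * d_k)  # this head's slice of the features
        out, _ = attention(Q[:, cols], K[:, cols], V[:, cols])
        heads.append(out)
    # Concatenate the per-head outputs and mix them with the output projection.
    return np.concatenate(heads, axis=-1) @ W_o
```

Dividing by $\sqrt{d_k}$ keeps the scores from growing with the key dimension, which would otherwise push the softmax toward near one-hot rows.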

Positional encoding

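Many modern LLMs (for example, Llama) use RoPE (Rotary Position Embedding) rather than absolute sinusoidal encodings. Instead of adding a position vector to the input embeddings, RoPE rotates the query and key vectors by position-dependent angles, so the attention score between two tokens depends on their relative offset.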
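
A minimal sketch of the rotation, assuming the common base of 10000 and pairing adjacent dimensions (other pairing conventions exist in practice). The function name `rope` and its signature are hypothetical.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # x: (seq_len, d) float array with even d; positions: (seq_len,) token indices.
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # (d/2,) one frequency per pair
    angles = positions[:, None] * inv_freq[None, :]  # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # Rotate each (even, odd) pair of features by its position-dependent angle.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(m, n):
    # Dot product between a query at position m and a key at position n.
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T)[0, 0]
```

In this sketch `score(3, 1)` and `score(10, 8)` agree up to floating-point error, since both pairs of positions are two steps apart. The rotation is applied to queries and keys only; values are left unrotated.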

Key variants

Encoder-only: BERT-style
Decoder-only: GPT, Llama
Encoder-decoder: T5, original Transformer