February 28, 2025
Transformer Architecture Notes
Evergreen reference on transformer blocks, attention, and positional encoding.
A compact reference for the transformer architecture as used in modern LLMs.
Core block
Input → LayerNorm → Multi-Head Attention → Residual
→ LayerNorm → FFN → Residual → Output
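The diagram above places LayerNorm before each sublayer (a pre-norm layout), with each "Residual" step adding the sublayer's input back to its output. A minimal NumPy sketch of that wiring follows. The names `layer_norm` and `pre_norm_block` and the callables `attn` and `ffn` are hypothetical placeholders, not from any library, and the learned gain and bias of LayerNorm are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # Simplification: no learned scale/shift parameters.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attn, ffn):
    # x: (seq_len, d_model); attn and ffn are hypothetical callables
    # that map (seq_len, d_model) -> (seq_len, d_model).
    x = x + attn(layer_norm(x))  # LayerNorm -> attention -> residual add
    x = x + ffn(layer_norm(x))   # LayerNorm -> FFN -> residual add
    return x
```

Because each sublayer only adds to the residual stream, a block whose sublayers output zeros returns its input unchanged.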
Multi-head attention
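For each head $i$:
$$\mathrm{head}_i = \mathrm{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$
where $Q_i$, $K_i$, and $V_i$ are that head's query, key, and value projections of the input and $d_k$ is the per-head key dimension. The head outputs are concatenated and passed through a final linear projection.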
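As a rough illustration, standard scaled dot-product multi-head attention can be sketched in NumPy as below. This assumes a single unbatched sequence, square projection matrices, and `d_model` divisible by `n_heads`. The function and parameter names (`multi_head_attention`, `w_q`, `w_k`, `w_v`, `w_o`, `causal`) are illustrative choices, not taken from the notes or from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, causal=False):
    # x: (seq_len, d_model); each weight matrix: (d_model, d_model).
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Per-head attention scores, scaled by sqrt of the head dimension.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    if causal:
        # Block attention to later positions (decoder-style masking).
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    heads = softmax(scores, axis=-1) @ v  # (n_heads, seq_len, d_head)
    # Concatenate heads, then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o
```

With `causal=True`, the output at a given position depends only on that position and earlier ones, which is the masking used by decoder-only models.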
Positional encoding
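Many modern LLMs, Llama among them, use RoPE (Rotary Position Embedding), which rotates query and key vectors by a position-dependent angle, rather than the absolute sinusoidal encodings that the original Transformer adds to its input embeddings.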
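A minimal NumPy sketch of the rotary idea is shown below, under a few assumptions: the function name `rope` is hypothetical, the frequency base of 10000 is the commonly used default, and dimensions are paired as (first half, second half) rather than as adjacent interleaved pairs. Implementations differ on that pairing, so treat this as one possible layout rather than a reference implementation.

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, d) query or key vectors for one head; d must be even.
    seq_len, d = x.shape
    assert d % 2 == 0, "head dimension must be even"
    half = d // 2
    # One rotation frequency per dimension pair: base^(-2j/d).
    freqs = base ** (-np.arange(half) / half)
    # Angle for position p and pair j is p * freqs[j].
    angles = np.arange(seq_len)[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    # Assumed pairing: dimension j is rotated together with j + half.
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Applying `rope` to both queries and keys before the dot product makes the attention score between two positions depend on their offset rather than on their absolute positions, and the rotation leaves vector norms unchanged.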
Key variants
- Encoder-only: BERT-style
- Decoder-only: GPT, Llama
- Encoder-decoder: T5, original Transformer