February 28, 2025

Transformer Architecture Notes

Evergreen reference on transformer blocks, attention, and positional encoding.

A compact reference for the transformer architecture as used in modern LLMs.

Core block

Input → LayerNorm → Multi-Head Attention → Residual
     → LayerNorm → FFN → Residual → Output
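
The diagram above places LayerNorm before each sub-layer (pre-norm ordering). A minimal sketch of that data flow in plain Python follows. The names `layer_norm` and `pre_norm_block` are illustrative, not taken from any library. The learned scale and shift of LayerNorm are omitted, and the attention and FFN sub-layers are passed in as callables so that only the wiring is shown.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance.

    Assumption: the learned scale/shift parameters are left out for brevity.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_block(tokens, attention, ffn):
    """Apply one pre-norm block to a list of token vectors.

    `attention` maps a whole sequence to a sequence (tokens interact);
    `ffn` maps a single token vector to a vector (tokens are independent).
    Both are hypothetical callables supplied by the caller.
    """
    # Sub-layer 1: LayerNorm -> attention -> add back the un-normalized input.
    normed = [layer_norm(t) for t in tokens]
    attn_out = attention(normed)
    tokens = [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, attn_out)]
    # Sub-layer 2: LayerNorm -> FFN -> add back the input to this sub-layer.
    ffn_out = [ffn(layer_norm(t)) for t in tokens]
    return [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, ffn_out)]
```

Because each residual adds the sub-layer output to its un-normalized input, a block whose sub-layers return zeros passes its input through unchanged.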

Multi-head attention

For each head $h$:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
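
As a rough illustration of the formula for a single head, the sketch below computes scaled dot-product attention on small lists of lists. It uses plain Python for readability. The function names are illustrative, and the per-head input projections, masking, and the concatenation across heads are assumed to happen elsewhere.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head attention.

    Q: n x d_k, K: m x d_k, V: m x d_v, each given as a list of row vectors.
    Returns an n x d_v list of row vectors.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled dot product of this query with every key: q . k / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out
```

A query of all zeros scores every key equally, so its output is the plain mean of the value vectors. This gives a quick sanity check.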

Positional encoding

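Many modern LLMs, including Llama, use RoPE (Rotary Position Embedding) rather than absolute sinusoidal encodings. RoPE rotates the query and key vectors by angles proportional to token position, so the attention score between two tokens depends on their relative offset.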
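
A minimal sketch of the rotation behind RoPE, under stated assumptions: it pairs adjacent dimensions (the interleaved convention; some implementations pair the first half with the second half instead) and uses a frequency base of 10000. The helper names `rope` and `dot` are illustrative.

```python
import math

def rope(x, position, base=10000.0):
    """Rotate the pairs (x[0], x[1]), (x[2], x[3]), ... by position-dependent angles.

    Assumption: adjacent dimensions are paired and `base` is 10000.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        # Earlier pairs rotate quickly with position, later pairs slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        # Standard 2-D rotation of the pair (x[i], x[i + 1]) by angle theta.
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))
```

With this construction, `dot(rope(q, m), rope(k, n))` depends only on the offset `m - n`, not on the absolute positions. Shifting both positions by the same amount leaves the attention score unchanged up to floating-point error.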

Key variants

  • Encoder-only: BERT-style
  • Decoder-only: GPT, Llama
  • Encoder-decoder: T5, original Transformer