A Mathematical Framework for Transformer Circuits
About this week
This is a founding document of transformer mechanistic interpretability. The authors rewrite attention-only transformers in a mathematically equivalent form where the weights become directly readable: the residual stream is a passive communication channel, each head is an independent additive unit, and every head splits into a QK circuit (where to attend) and an OV circuit (what to write). Under this lens, a one-layer model is just an ensemble of bigram and “skip-trigram” tables, bugs included. The payoff comes at two layers, where heads compose through the residual stream: K-composition with a previous-token head produces induction heads, which find earlier occurrences of the current token and copy what came next. That is a real in-context algorithm rather than a lookup table, and the paper’s candidate mechanism for in-context learning in large models. Optional quiet reading starts at 7 pm; discussion follows at 8 pm.
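For anyone who wants a concrete picture before the session, the induction pattern summarized above ([A][B] ... [A] -> [B]) can be sketched as a toy lookup over a token list. This is an illustration of the behavior only, not the paper's code: the function name `induction_predict` and the hard "most recent match" rule are assumptions made for this sketch, whereas a real head uses learned weights and soft attention.

```python
from typing import List, Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Toy induction rule: [A][B] ... [A] -> predict [B].

    Hypothetical helper for illustration; hard matching stands in for
    the soft attention a trained model would use.
    """
    if len(tokens) < 2:
        return None

    # "Layer 1" (previous-token head): every position records which
    # token came immediately before it.
    prev_token: List[Optional[str]] = [None] + list(tokens[:-1])

    current = tokens[-1]

    # "Layer 2" (induction head): the query is the current token, the
    # keys are the previous-token records (the K-composition step).
    # Scan from the most recent position back and copy the token found
    # at the first matching position.
    for k in range(len(tokens) - 1, 0, -1):
        if prev_token[k] == current:
            return tokens[k]
    return None


if __name__ == "__main__":
    print(induction_predict(["Mr", "Dursley", "was", "Mr"]))
```

Splitting the function into a "previous-token" step and a "match and copy" step mirrors the two-layer composition: the second step can only work because the first has already written previous-token information where the second can read it.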