Interactive · Guide

How LLMs Work

4 steps from raw text to output — click and interact to learn

1

Tokenization

Split text into small "tokens" with numeric IDs

A model doesn't read text character by character — it splits it into tokens, which can be words, parts of words, or punctuation. Each token gets a numeric ID for lookup in the embedding table.

Example · click a token to see its ID · numbers shown = token IDs in the vocabulary

💡

"tokenization" splits into token + ization — this subword approach lets models handle unseen words by recognising familiar pieces.
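The subword idea can be sketched in a few lines: a toy greedy longest-match tokenizer over a tiny, made-up vocabulary (real tokenizers such as BPE learn their vocabulary and merge rules from data, and the IDs below are invented for illustration).

```python
# Hypothetical mini-vocabulary mapping subword pieces to token IDs.
VOCAB = {"token": 1001, "ization": 1002, "the": 1003, "weather": 1004}

def tokenize(word, vocab):
    """Greedily split a word into the longest known subword pieces."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest substring first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character falls back to itself
            i += 1
    return tokens

print(tokenize("tokenization", VOCAB))  # ['token', 'ization']
```

Because the fallback is single characters, even a word with no known pieces still produces *some* token sequence, which is why subword models never hit a hard "unknown word" wall.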

2

Embeddings

Map each token ID to a high-dimensional numeric vector

Each token ID is looked up in an embedding table to retrieve a vector of hundreds of numbers. Words with similar meanings end up geometrically close — that's how the model "knows" king and queen are related.

vector space (2D projection) · hover to explore

💡

Vector arithmetic works: king − man + woman ≈ queen — the dashed arrows show this relationship. The model learns it entirely from data; nothing is hard-coded.
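That arithmetic can be checked with a toy sketch. The 3-dimensional vectors below are invented for illustration (real embeddings have hundreds of learned dimensions), but the mechanics are the same: compute king − man + woman and find the nearest vector by cosine similarity.

```python
import numpy as np

# Illustrative, hand-picked 3-D "embeddings" (not from any real model).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.2, 0.7]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```

In a real model the analogy is only approximate, which is why the nearest-neighbour search (rather than exact equality) is the right way to test it.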

3

Self-Attention

Each token "looks at" other tokens to gather context

Attention lets the model understand context — when processing one token, the model scores every token in the sequence (including itself). A high attention weight means that token is important for understanding the current one.

Click a token to see its attention weights · colour scale runs from low weight to high weight
💡

A Transformer runs multiple attention heads in parallel — each learns different patterns: some track syntax, some track coreference (pronouns → nouns), some track semantic similarity. Their outputs are concatenated and passed to the next layer.
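A single attention head boils down to a few matrix operations. The sketch below uses random weights purely to show the shapes and the weight normalisation; in a trained model the projection matrices Wq, Wk, Wv are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # each token scored against every token
    weights = softmax(scores, axis=-1)       # each row is a distribution summing to 1
    return weights @ V, weights              # output: weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dim embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(weights.sum(axis=-1))  # every row sums to 1.0
```

Multi-head attention simply runs several independent (Wq, Wk, Wv) triples like this in parallel and concatenates their outputs, as described above.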

4

Token Generation

Pick the next token from a probability distribution

After all attention layers, the model produces a probability distribution over the entire vocabulary (~50,000 tokens) and samples the next token from it, controlled by a parameter called temperature.

Example · context: "The weather today is" · slide temperature from 0.1 (deterministic) to 2.0 (random) to see how the top 5 candidate tokens shift

💡

Low temperature (0.1) → sharpens the distribution so the highest-probability token is almost always chosen. Great for code and facts.
High temperature (1.5+) → flattens the distribution, producing more varied output. Better for creative writing.
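Temperature is just a division of the raw scores (logits) before the softmax. A minimal sketch, with made-up candidate scores, showing how 0.1 sharpens the distribution while 2.0 flattens it:

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    """Temperature-scale logits, softmax, then sample a token index."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for 4 candidate tokens

_, cold = sample_next(logits, temperature=0.1)
_, hot = sample_next(logits, temperature=2.0)
print(cold.round(3))  # almost all mass on the top token
print(hot.round(3))   # much flatter spread across candidates
```

At temperature 0.1 the top candidate takes essentially all the probability mass, which is why low-temperature output is nearly deterministic; at 2.0 the runners-up keep a real chance of being picked.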