MicroGPT explained interactively
Andrej Karpathy wrote a 200-line Python script that trains and runs a GPT from scratch, with no libraries or dependencies, just pure Python. The script contains the algorithm that powers LLMs like ChatGPT.
Let's walk through it piece by piece and watch each part work. Andrej did a walkthrough on his blog, but here I take a more visual approach, tailored for beginners.
The dataset
The model trains on 32,000 human names, one per line: emma, olivia, ava, isabella, sophia... Each name is a document. The model's job is to learn the statistical patterns in these names and generate plausible new ones that sound like they could be real.
By the end of training, the model produces names like "kamon", "karai", "anna", and "anton". The model has learned which characters tend to follow which, which sounds are common at the start vs. the end, and how long a typical name runs. From ChatGPT's perspective, your conversation is just a document. When you type a prompt, the model's response is a statistical document completion.
Numbers, not letters
Neural networks work with numbers, not characters. So we need a way to convert text into a sequence of integers and back. The simplest possible tokenizer assigns one integer to each unique character in the dataset. The 26 lowercase letters get ids 0 through 25, and we add one special token called BOS (Beginning of Sequence) with id 26 that marks where a name starts and ends.
Type a name below and watch it get tokenized. Each character maps to its integer id, and BOS tokens wrap both ends:
Tokenizer (interactive widget showing vocab size, sequence length, and unique chars: a-z + BOS)
The integer values themselves have no meaning. Token 4 isn't "more" than token 2. Each token is just a distinct symbol, like assigning a different color to each letter. Production tokenizers like tiktoken (used by GPT-4) work on chunks of characters for efficiency, giving a vocabulary of ~100,000 tokens, but the principle is the same.
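The character tokenizer described above fits in a few lines of plain Python. This is a sketch with illustrative variable names (`stoi`, `itos`, `encode`, `decode` are not necessarily microgpt's own):

```python
import string

# vocab: ids 0-25 for a-z, plus a BOS token (id 26) wrapping each name
chars = string.ascii_lowercase
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char
BOS = len(chars)  # 26

def encode(name):
    return [BOS] + [stoi[ch] for ch in name] + [BOS]

def decode(tokens):
    return "".join(itos[t] for t in tokens if t != BOS)

print(encode("emma"))           # [26, 4, 12, 12, 0, 26]
print(decode(encode("emma")))   # emma
```

Encoding then decoding round-trips exactly, which is the whole contract of a tokenizer.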
The prediction game
Here's the core task: given the tokens we've seen so far, predict what comes next. We slide through the sequence one position at a time. At position 0, the model sees only BOS and must predict the first letter. At position 1, it sees BOS and the first letter and must predict the second letter. And so on.
Step through the sequence below and watch the context grow while the target shifts forward:
Next-token prediction
Each step produces one training example: the context on the left is the input, the green token on the right is what the model should predict. For the name "emma", that's five input-target pairs. This sliding window is how all language models train, including ChatGPT.
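The sliding window above can be written as a one-liner over the token ids. Using the a-z + BOS scheme from the tokenizer section, "emma" encodes to [26, 4, 12, 12, 0, 26]:

```python
tokens = [26, 4, 12, 12, 0, 26]  # BOS e m m a BOS

# each position yields (context seen so far, token to predict next)
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for ctx, target in pairs:
    print(ctx, "->", target)
# five input-target pairs for "emma", one per prediction step
```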
From scores to probabilities
At each position, the model outputs 27 raw numbers, one per possible next token. These numbers (called logits) can be anything: positive, negative, large, small. We need to convert them into probabilities that are positive and sum to 1. Softmax does this by exponentiating each score and dividing by the total.
Adjust the logits below and watch the probability distribution change. Notice how one large logit dominates, and the exponential amplifies differences.
Softmax (interactive widget with adjustable logits, e.g. a: 2.0, b: 1.0, c: 0.5, d: -0.5, e: 3.0, other: 0.0)
Here's the actual softmax code from microgpt. Step through it to see the intermediate values at each line:
Softmax step-through
def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp()
            for val in logits]
    total = sum(exps)
    return [e / total for e in exps]
The subtraction of the max value before exponentiating doesn't change the result mathematically (it divides numerator and denominator by the same constant, which cancels out) but prevents overflow. Without it, a large logit like 1000 would overflow exp().
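A quick way to convince yourself of this shift invariance is a plain-float version of the same function (a sketch, not microgpt's Value-based code):

```python
import math

def softmax(logits):
    m = max(logits)  # subtracting the max keeps exp() in a safe range
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

small = softmax([2.0, 1.0, 3.0])
shifted = softmax([1002.0, 1001.0, 1003.0])  # same logits + 1000
print(small)
print(shifted)  # identical probabilities; a naive exp(1003) would overflow
```

Both calls produce the same distribution, even though exponentiating the shifted logits directly would blow past the float range.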
Measuring surprise
How wrong was the prediction? We need a single number that captures "the model thought the correct answer was unlikely." If the model assigns probability 0.9 to the correct next token, the loss is low (0.1). If it assigns probability 0.01, the loss is high (4.6). The formula is -log(p), where p is the probability the model assigned to the correct token. This is called cross-entropy loss.
Drag the slider to adjust the probability of the correct token and watch the loss change:
Cross-entropy loss
The curve has two properties that make it useful. First, it's zero when the model is perfectly confident in the right answer (p = 1). Second, it goes to infinity as the model assigns near-zero probability to the truth (p → 0), which punishes confident wrong answers severely. Training minimizes this number.
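The numbers quoted above fall straight out of the formula; a two-line check:

```python
import math

for p in (0.9, 0.5, 0.1, 0.01):
    print(f"P(correct) = {p:>4}  loss = {-math.log(p):.2f}")
# loss is ~0.11 at p = 0.9 and ~4.61 at p = 0.01, matching the values above
```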
Tracking every calculation
To improve, the model needs to answer: "for each of my 4,192 parameters, if I nudge it up by a tiny amount, does the loss go up or down, and by how much?" Backpropagation computes this by walking the computation backward, applying the chain rule at each step.
Every mathematical operation (add, multiply, exp, log) is a node in a graph. Each node remembers its inputs and knows its local derivative. The backward pass starts at the loss (where the gradient is trivially 1.0) and multiplies local derivatives along every path back to the inputs.
Step through the forward pass, then the backward pass for a small example where L = a·b + a with a = 2, b = 3:
Computation graph
Now step through the actual Value class code. Watch how each operation records its children and local gradients, then how backward() walks the graph in reverse, accumulating gradients:
Value class and backward()
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data,
                     (self, other), (1, 1))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     (self, other), (other.data, self.data))

    def backward(self):
        # topological sort
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad
Notice that a has a gradient of 4.0, not 3.0. That's because a is used in two places: once in the multiplication (∂(a·b)/∂a = b = 3) and once in the addition (∂(c + a)/∂a = 1). The gradients from both paths sum up: 3 + 1 = 4. This is the multivariable chain rule in action. If a value contributes to the loss through multiple paths, the total derivative is the sum of contributions from each path.
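You can check this gradient accumulation directly: build L = a·b + a with the Value class, call backward(), and read off the gradients. For a runnable snippet, a condensed copy of the class is included here:

```python
# condensed copy of the Value class from above
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     (self, other), (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

a = Value(2.0)
b = Value(3.0)
L = a * b + a          # forward: 2*3 + 2 = 8
L.backward()
print(L.data, a.grad, b.grad)  # 8.0 4.0 2.0
```

a.grad comes out as 4.0 because `+=` in backward() accumulates the contribution from each path (3 from the multiply, 1 from the add), exactly as described above.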
This is the same algorithm that PyTorch's loss.backward() runs, operating on scalars instead of tensors. Same algorithm, just smaller and slower.
From IDs to meaning
We know how to measure error and how to trace that error back to every parameter. Now let's build the model itself, starting with how it represents tokens.
[...]