Building a Transformer from Scratch — What, Why, and How

Ziad Tamim / June 27, 2026 • 5 min read

The Transformer is the foundation of nearly every large language model you've heard of: GPT, LLaMA, BERT, Gemini. But most people use it as a black box. In this course, we build one from scratch in PyTorch, with every component in its own file, every tensor shape explained, and every piece of math written out. By the end you'll know exactly what happens between "input text" and "generated output."


Before the Transformer — The RNN Problem

Before 2017, sequence modelling was dominated by Recurrent Neural Networks (RNNs). The idea was simple: process one token at a time, carry a hidden state forward, and let that state accumulate memory of what came before.

[Figure: RNN and its limitations]

It worked — until it didn't. RNNs have three fundamental problems:

  • Slow sequential computation. Each token must wait for the previous one to finish. You can't parallelise across a sequence, which makes training on long texts painfully slow.
  • Vanishing gradients. Gradients have to travel backwards through every timestep and are rescaled at each one. By the time they reach the early tokens they have often shrunk to near zero, so the model effectively stops learning long-range relationships.
  • Poor long-range memory. Related to the above: information from 200 tokens ago is practically gone by the time you need it. The hidden state is a bottleneck with finite capacity.

These weren't small inconveniences — they were architectural limits that capped what RNNs could ever do.
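To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN forward pass. The sizes and the weight names (`W_xh`, `W_hh`, `b_h`) are illustrative assumptions, not code from the course repository. The point to notice is the `for` loop: step `t` needs the hidden state from step `t - 1`, so the timesteps cannot run in parallel.

```python
import torch

torch.manual_seed(0)

T, C, H = 6, 4, 8             # sequence length, input size, hidden size (illustrative)
x = torch.randn(T, C)         # one sequence of T token vectors

# Hypothetical weights of a vanilla (Elman-style) RNN cell
W_xh = torch.randn(C, H) * 0.1
W_hh = torch.randn(H, H) * 0.1
b_h = torch.zeros(H)

h = torch.zeros(H)            # hidden state starts at zero
states = []
for t in range(T):            # step t cannot start until step t-1 has produced h
    h = torch.tanh(x[t] @ W_xh + h @ W_hh + b_h)
    states.append(h)

states = torch.stack(states)  # (T, H): one hidden state per timestep
print(states.shape)
```

Everything the model knows about earlier tokens has to fit into the single vector `h`, which is the finite-capacity bottleneck described above.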


The Transformer — Attention Is All You Need

In 2017, Vaswani et al. published Attention Is All You Need and replaced the recurrence entirely with a mechanism called self-attention.

[Figure: Transformer architecture]

Instead of processing tokens one at a time, self-attention looks at all tokens simultaneously and lets each one directly attend to every other. The result:

  • Full parallelism. No sequential dependency between positions: during training, the entire sequence runs through the model at once. (Generation still produces one token at a time.)
  • O(1) path length between any two tokens. Gradients flow directly between positions, so long-range learning actually works.
  • Explicit attention patterns. The model learns which tokens to pay attention to, rather than hoping a hidden state carries that information forward.
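The three properties above can be seen in a minimal sketch of single-head scaled dot-product self-attention. The sizes and projection matrices (`W_q`, `W_k`, `W_v`) here are random placeholders chosen for illustration; in a real model they are learned, and a later post covers the mechanism in detail.

```python
import math
import torch

torch.manual_seed(0)

T, C = 5, 16                             # sequence length, embedding size (illustrative)
x = torch.randn(T, C)                    # all T token vectors are available at once

# Hypothetical projection matrices for queries, keys, and values
W_q = torch.randn(C, C) / math.sqrt(C)
W_k = torch.randn(C, C) / math.sqrt(C)
W_v = torch.randn(C, C) / math.sqrt(C)

q, k, v = x @ W_q, x @ W_k, x @ W_v      # each (T, C)

scores = (q @ k.T) / math.sqrt(C)        # (T, T): every token scored against every other
weights = torch.softmax(scores, dim=-1)  # each row is a distribution over positions
out = weights @ v                        # (T, C): weighted mix of value vectors

print(weights.shape, out.shape)
```

There is no loop over positions: the `(T, T)` score matrix connects every pair of tokens in a single matrix multiplication, and `weights` is the explicit attention pattern.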

The original paper showed an encoder-decoder architecture designed for translation. We build the decoder-only variant — the right-hand stack, with the cross-attention removed. This is the GPT family: a left-to-right language model that predicts the next token at every position.
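As a rough sketch of what "left-to-right" means in practice, the snippet below builds a causal mask and the shifted next-token targets. The token IDs are made up for illustration, and the all-zero scores are a placeholder standing in for real attention scores.

```python
import torch

T = 4

# Lower-triangular mask: position t may attend to positions 0..t only
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

scores = torch.zeros(T, T)                         # placeholder attention scores
scores = scores.masked_fill(~mask, float("-inf"))  # hide future positions
weights = torch.softmax(scores, dim=-1)
print(weights)  # with all-zero scores, row t spreads its weight evenly over positions 0..t

# Next-token prediction: the targets are the inputs shifted left by one
tokens = torch.tensor([10, 11, 12, 13, 14])        # hypothetical token IDs
inputs, targets = tokens[:-1], tokens[1:]
print(inputs.tolist(), targets.tolist())
```

Because every position has its own target, one forward pass over a sequence of length T yields T training signals.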


What This Course Covers

This is a hands-on course. Every post corresponds to one component of the model, one Python file in the codebase, and one interactive visualisation. The learning path is:

#   Component               What you'll understand
--  ----------------------  ---------------------------------------------
01  This post               RNNs, the Transformer idea, course structure
02  Dataset & Batching      How raw text becomes training batches
03  Tokenization            Converting text to integer IDs
04  Input Embedding         Token ID → learned vector
05  Positional Encoding     Giving the model a sense of order
06  Self-Attention          The core mechanism: Q, K, V, causal mask
07  Multi-Head Attention    Running attention in parallel across heads
08  Feed-Forward Network    The MLP inside every block
09  LayerNorm & Residuals   Keeping training stable
10  Transformer Block       Putting attention + FFN together
11  The Full GPT Model      End-to-end: embedding → blocks → LM head
12  Training                Loss, backprop, AdamW, checkpoints
13  Sampling & Generation   Temperature, top-k, autoregressive loop

Each post follows the same structure: intuition → math → code → interactive widget so you can build understanding at multiple levels.


What You'll Build

A character-level GPT trained on the Tiny Shakespeare dataset (~1 MB of text). It's small enough to train on a laptop GPU in under an hour, yet it has the same decoder-only architecture as GPT-2 at a much smaller scale. After training, the model generates Shakespeare-style text character by character:

ROMEO:
What light through yonder window breaks?
It is the east, and Juliet is the sun.
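"Character-level" means the vocabulary is simply the set of distinct characters in the corpus. The sketch below shows the idea on a tiny stand-in string. The names `stoi`, `itos`, `encode`, and `decode` are illustrative choices and may differ from the tokenizer in the course repository.

```python
text = "ROMEO:\nWhat light"   # tiny stand-in for the Tiny Shakespeare corpus

chars = sorted(set(text))                      # vocabulary = every distinct character
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}       # integer ID -> character


def encode(s: str) -> list[int]:
    """Map each character to its integer ID."""
    return [stoi[ch] for ch in s]


def decode(ids: list[int]) -> str:
    """Map integer IDs back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode("light")
print(ids, repr(decode(ids)))
```

Generation runs this in reverse: the model predicts one integer ID at a time, and `decode` turns the growing list of IDs back into text.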

The full codebase is at github.com/Ziad-Tamim/nanoGPT-from-scratch — one Python file per component, fully tested, with math documentation alongside every module.


What You Need

To follow the theory posts: nothing — just read.

To run the code:

  • Python 3.11+
  • PyTorch 2.6+ (with CUDA if you have an NVIDIA GPU)
  • uv for package management (or pip)

    git clone https://github.com/Ziad-Tamim/nanoGPT-from-scratch
    cd nanoGPT-from-scratch
    uv sync
    uv run python scripts/prepare_data.py   # downloads Tiny Shakespeare
    uv run python -m nanogpt.train          # starts training

No prior transformer experience required — just basic Python and a rough idea of what a neural network is.


Learning Outcomes

By the end of this course you will:

  • Understand every tensor shape that flows through a Transformer: (B, T, C), meaning batch, sequence length, and embedding channels, won't be mysterious
  • Know why self-attention works — not just that it does, but what the Q, K, V matrices are actually computing
  • Be able to read LLaMA, BERT, or GPT-2 source code and recognise every component
  • Have trained a real language model and generated text from it
  • Understand the gap between a base language model (what we build) and a chat assistant (instruction tuning + RLHF)
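As a small preview of the shape notation, the sketch below traces a batch of token IDs through an embedding layer and an output projection. The sizes are illustrative assumptions, the Transformer blocks that would sit between the two layers are omitted, and the variable names are not taken from the course repository.

```python
import torch
import torch.nn as nn

B, T, C, V = 2, 8, 32, 65          # batch, sequence length, embedding size, vocab size (illustrative)

idx = torch.randint(0, V, (B, T))  # (B, T) integer token IDs
embed = nn.Embedding(V, C)         # lookup table: token ID -> C-dimensional vector
lm_head = nn.Linear(C, V)          # projects each vector to one score per vocabulary entry

x = embed(idx)                     # (B, T, C)
logits = lm_head(x)                # (B, T, V)

print(idx.shape, x.shape, logits.shape)
```

Reading the shapes left to right: `B` independent sequences, `T` positions in each, and either a `C`-dimensional representation or `V` next-token scores at every position.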


Next up: Dataset & Batching — how we download Tiny Shakespeare, split it into training and validation sets, and turn raw text into the (B, T) batches the model consumes.
