Building a Transformer from Scratch — What, Why, and How

Ziad Tamim / June 27, 2026 • 5 min read

The Transformer is the foundation of every large language model you've heard of — GPT, LLaMA, BERT, Gemini. But most people use it as a black box. In this course, we build one from scratch in PyTorch — every component in its own file, every tensor shape explained, every piece of math written out. By the end you'll know exactly what happens between "input text" and "generated output."

Before the Transformer — The RNN Problem

Before 2017, sequence modelling was dominated by Recurrent Neural Networks (RNNs). The idea was simple: process one token at a time, carry a hidden state forward, and let that state accumulate memory of what came before.

Figure: RNN and its limitations

It worked — until it didn't. RNNs have three fundamental problems:

Slow sequential computation. Each token must wait for the previous one to finish. You can't parallelise across a sequence, which makes training on long texts painfully slow.
Vanishing gradients. Gradients have to travel backwards through every timestep. By the time they reach the early tokens, they've shrunk to near zero — the model simply stops learning long-range relationships.
Poor long-range memory. Related to the above: information from 200 tokens ago is practically gone by the time you need it. The hidden state is a bottleneck with finite capacity.

These weren't small inconveniences — they were architectural limits that capped what RNNs could ever do.
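To make the first two problems concrete, here is a minimal sketch using a scalar RNN in plain Python. It is an illustration only, not code from the course repository: the function names and the weights `w_x = 0.5` and `w_h = 0.9` are arbitrary assumptions. The loop shows why computation is sequential, and the chain-rule product shows how the gradient reaching the first step shrinks as the sequence grows.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.9, h0=0.0):
    """Run a scalar tanh RNN and return the hidden state after every step."""
    h = h0
    states = []
    for x in inputs:
        # Strictly sequential: step t cannot start until h from step t-1 exists.
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def grad_last_wrt_initial(states, w_h=0.9):
    """Chain rule: d h_T / d h_0 is the product of w_h * (1 - h_t^2) over all steps."""
    grad = 1.0
    for h in states:
        grad *= w_h * (1.0 - h * h)  # derivative of tanh(a) is 1 - tanh(a)^2
    return grad

states = rnn_forward([1.0] * 50)  # hypothetical input: fifty identical tokens
for steps in (1, 10, 50):
    print(steps, grad_last_wrt_initial(states[:steps]))
```

Each factor in the product is smaller than 1 here, so the printed gradient falls towards zero as the number of steps increases. Real RNNs use weight matrices rather than scalars, but the same repeated multiplication is what drives the effect.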

The Transformer — Attention Is All You Need

In 2017, Vaswani et al. published Attention Is All You Need and replaced the recurrence entirely with a mechanism called self-attention.

Figure: Transformer architecture

Instead of processing tokens one at a time, self-attention looks at all tokens simultaneously and lets each one directly attend to every other. The result:

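Full parallelism during training. There is no sequential dependency between positions, so the entire sequence runs through the model at once. (Generation still emits one token at a time.)
O(1) path length between any two tokens. Any position can reach any other in a single attention step, versus up to n recurrent steps in an RNN, so gradients for long-range relationships no longer have to survive a long chain of multiplications.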
Explicit attention patterns. The model learns which tokens to pay attention to, rather than hoping a hidden state carries that information forward.

The original paper showed an encoder-decoder architecture designed for translation. We build the decoder-only variant — the right-hand stack, with the cross-attention removed. This is the GPT family: a left-to-right language model that predicts the next token at every position.
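As a rough sketch of what "left-to-right" self-attention computes, the plain-Python function below applies scaled dot-product attention with a causal mask. It is a simplification under stated assumptions: there are no learned Q, K, V projections (the same toy vectors are reused for all three), there is a single head, and the names `softmax` and `causal_self_attention` are illustrative rather than taken from the course codebase.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """q, k, v: lists of T vectors (lists of floats). Returns (outputs, weights)."""
    T, d = len(q), len(q[0])
    outputs, weights = [], []
    for i in range(T):
        # Causal mask: position i may only attend to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out = [sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        weights.append(w + [0.0] * (T - i - 1))  # masked positions get weight 0
        outputs.append(out)
    return outputs, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

The explicit loop over positions is only for readability. No position depends on another position's output, so a tensor library can compute all rows in one batched matrix multiplication, which is where the parallelism comes from.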

What This Course Covers

This is a hands-on course. Every post corresponds to one component of the model, one Python file in the codebase, and one interactive visualisation. The learning path is:

#	Component	What you'll understand
01	This post	RNNs, the Transformer idea, course structure
02	Dataset & Batching	How raw text becomes training batches
03	Tokenization	Converting text to integer IDs
04	Input Embedding	Token ID → learned vector
05	Positional Encoding	Giving the model a sense of order
06	Self-Attention	The core mechanism — Q, K, V, causal mask
07	Multi-Head Attention	Running attention in parallel across heads
08	Feed-Forward Network	The MLP inside every block
09	LayerNorm & Residuals	Keeping training stable
10	Transformer Block	Putting attention + FFN together
11	The Full GPT Model	End-to-end: embedding → blocks → LM head
12	Training	Loss, backprop, AdamW, checkpoints
13	Sampling & Generation	Temperature, top-k, autoregressive loop

Each post follows the same structure (intuition → math → code → interactive widget), so you can build understanding at multiple levels.

What You'll Build

A character-level GPT trained on the Tiny Shakespeare dataset (~1 MB of text). It's small enough to train on a laptop GPU in under an hour, yet it uses the same decoder-only design as GPT-2, just at a far smaller scale and with single characters instead of subword tokens. After training, the model generates Shakespeare-style text character by character:

ROMEO:
What light through yonder window breaks?
It is the east, and Juliet is the sun.

The full codebase is at github.com/Ziad-Tamim/nanoGPT-from-scratch — one Python file per component, fully tested, with math documentation alongside every module.
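"Character-level" means the vocabulary is simply the set of distinct characters in the training text, each mapped to an integer ID. The sketch below shows that idea in a few lines. It is a hedged illustration, not the repository's implementation: the sample string and the names `stoi`, `itos`, `encode`, and `decode` are assumptions chosen for this example.

```python
# A tiny stand-in for the training text (the real dataset is about 1 MB).
text = "ROMEO:\nIt is the east, and Juliet is the sun."

# The vocabulary is every distinct character, sorted for a stable ordering.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}       # integer ID -> character

def encode(s):
    """Convert a string into a list of integer IDs."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Convert a list of integer IDs back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("Juliet")
print(len(chars), "distinct characters")
print(ids)
print(decode(ids))  # round-trips back to "Juliet"
```

A character that never appears in the training text has no ID, so `encode` raises a `KeyError` for it. That is acceptable for a toy model and is one reason production models use subword tokenizers instead.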

What You Need

To follow the theory posts: nothing — just read.

To run the code:

Python 3.11+
PyTorch 2.6+ (with CUDA if you have an NVIDIA GPU)
uv for package management (or pip)

git clone https://github.com/Ziad-Tamim/nanoGPT-from-scratch
cd nanoGPT-from-scratch
uv sync
uv run python scripts/prepare_data.py   # downloads Tiny Shakespeare
uv run python -m nanogpt.train          # starts training

No prior transformer experience required — just basic Python and a rough idea of what a neural network is.

Learning Outcomes

By the end of this course you will:

Understand every tensor shape that flows through a Transformer: (B, T, C), meaning batch size, sequence length, and channels (the embedding dimension), won't be mysterious
Know why self-attention works — not just that it does, but what the Q, K, V matrices are actually computing
Be able to read LLaMA, BERT, or GPT-2 source code and recognise every component
Have trained a real language model and generated text from it
Understand the gap between a base language model (what we build) and a chat assistant (instruction tuning + RLHF)

Resources

These are the resources that informed this course — worth reading alongside the posts:

Attention Is All You Need — Vaswani et al. (2017) — the original paper
The Illustrated Transformer — Jay Alammar — the best visual explanation
Let's build GPT — Andrej Karpathy — the video that inspired this codebase
The Annotated Transformer — Harvard NLP — line-by-line walkthrough of the original paper

Next up: Dataset & Batching — how we download Tiny Shakespeare, split it into training and validation sets, and turn raw text into the (B, T) batches the model consumes.
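As a preview of that idea, here is a minimal plain-Python sketch of splitting a token sequence and sampling (B, T) batches. Everything specific in it is an assumption for illustration: the 90/10 split, the function names, and the use of nested lists instead of tensors. The next post may well do it differently.

```python
import random

def train_val_split(ids, train_fraction=0.9):
    """Split a token sequence into a training part and a validation part."""
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

def get_batch(ids, batch_size, block_size, rng):
    """Sample batch_size random windows of length block_size.

    Returns (inputs, targets); targets are the inputs shifted one step
    to the right, so every position is trained to predict the next token.
    Requires len(ids) > block_size.
    """
    inputs, targets = [], []
    for _ in range(batch_size):
        start = rng.randrange(len(ids) - block_size)
        inputs.append(ids[start:start + block_size])
        targets.append(ids[start + 1:start + block_size + 1])
    return inputs, targets

token_ids = list(range(100))              # stand-in for encoded text
train_ids, val_ids = train_val_split(token_ids)
xb, yb = get_batch(train_ids, batch_size=4, block_size=8, rng=random.Random(0))
print(len(xb), len(xb[0]))                # B rows, T columns
```

Here B is 4 and T is 8. The one-step shift between `xb` and `yb` is what turns a plain stream of text into next-token prediction examples.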
