Building a Transformer from Scratch — What, Why, and How
Ziad Tamim / June 27, 2026 • 5 min read

The Transformer is the foundation of nearly every large language model you've heard of: GPT, LLaMA, BERT, Gemini. But most people use it as a black box. In this course, we build one from scratch in PyTorch, with every component in its own file, every tensor shape explained, and every piece of math written out. By the end you'll know exactly what happens between "input text" and "generated output."
Before the Transformer — The RNN Problem
Before 2017, sequence modelling was dominated by Recurrent Neural Networks (RNNs). The idea was simple: process one token at a time, carry a hidden state forward, and let that state accumulate memory of what came before.

It worked — until it didn't. RNNs have three fundamental problems:
- Slow sequential computation. Each token must wait for the previous one to finish. You can't parallelise across a sequence, which makes training on long texts painfully slow.
- Vanishing gradients. Gradients have to travel backwards through every timestep. By the time they reach the early tokens, they've shrunk to near zero — the model simply stops learning long-range relationships.
- Poor long-range memory. Related to the above: information from 200 tokens ago is practically gone by the time you need it. The hidden state is a bottleneck with finite capacity.
These weren't small inconveniences — they were architectural limits that capped what RNNs could ever do.
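To make the first two problems concrete, here is a minimal sketch of a scalar RNN, h_t = tanh(w * h_{t-1} + x_t), that tracks how much the final state still depends on the initial one. The function name, the weight `w = 0.5`, and the constant inputs are illustrative assumptions, not part of the course codebase.

```python
import math


def gradient_through_time(w, inputs):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = 0.0
    grad = 1.0
    for x in inputs:
        # Strictly sequential: step t cannot start until h from step t-1 exists.
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2).
        grad *= w * (1.0 - h * h)
    return abs(grad)


# Illustrative settings (assumptions): w = 0.5 and a constant input of 0.1.
for steps in (10, 50, 200):
    print(steps, gradient_through_time(w=0.5, inputs=[0.1] * steps))
```

Each step multiplies the gradient by a factor below 0.5 in this setup, so the signal reaching the earliest tokens decays exponentially with sequence length. With a larger weight the same product can blow up instead, which is the exploding-gradient side of the same problem.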
The Transformer — Attention Is All You Need
In 2017, Vaswani et al. published Attention Is All You Need and replaced the recurrence entirely with a mechanism called self-attention.

Instead of processing tokens one at a time, self-attention looks at all tokens simultaneously and lets each one directly attend to every other. The result:
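- Full parallelism. No sequential dependency during training: every position in the sequence is processed in one pass. (Generation is still token by token, since each new token depends on the ones before it.)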
- O(1) path length between any two tokens. Gradients flow directly between positions, so long-range learning actually works.
- Explicit attention patterns. The model learns which tokens to pay attention to, rather than hoping a hidden state carries that information forward.
The original paper showed an encoder-decoder architecture designed for translation. We build the decoder-only variant — the right-hand stack, with the cross-attention removed. This is the GPT family: a left-to-right language model that predicts the next token at every position.
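As a rough preview of what "left-to-right" means in code, the sketch below implements single-head scaled dot-product attention with a causal mask. It is a simplified illustration rather than the course's implementation: the function name, the plain weight matrices `w_q`, `w_k`, `w_v`, and the tensor sizes are all assumptions made for the example.

```python
import math

import torch


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over x of shape (B, T, C)."""
    T = x.size(1)
    q = x @ w_q  # queries, (B, T, C)
    k = x @ w_k  # keys,    (B, T, C)
    v = x @ w_v  # values,  (B, T, C)

    # Every position scores every other position in one matrix product.
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # (B, T, T)

    # Causal mask: position t may only look at positions <= t.
    allowed = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~allowed, float("-inf"))

    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights


torch.manual_seed(0)
x = torch.randn(2, 5, 8)  # illustrative sizes: B=2, T=5, C=8
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The mask is what turns general self-attention into a next-token predictor: because future positions receive zero weight, the output at position t depends only on tokens up to t.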
What This Course Covers
This is a hands-on course. Every post corresponds to one component of the model, one Python file in the codebase, and one interactive visualisation. The learning path is:
| # | Component | What you'll understand |
|---|---|---|
| 01 | This post | RNNs, the Transformer idea, course structure |
| 02 | Dataset & Batching | How raw text becomes training batches |
| 03 | Tokenization | Converting text to integer IDs |
| 04 | Input Embedding | Token ID → learned vector |
| 05 | Positional Encoding | Giving the model a sense of order |
| 06 | Self-Attention | The core mechanism — Q, K, V, causal mask |
| 07 | Multi-Head Attention | Running attention in parallel across heads |
| 08 | Feed-Forward Network | The MLP inside every block |
| 09 | LayerNorm & Residuals | Keeping training stable |
| 10 | Transformer Block | Putting attention + FFN together |
| 11 | The Full GPT Model | End-to-end: embedding → blocks → LM head |
| 12 | Training | Loss, backprop, AdamW, checkpoints |
| 13 | Sampling & Generation | Temperature, top-k, autoregressive loop |
Each post follows the same structure: intuition → math → code → interactive widget so you can build understanding at multiple levels.
What You'll Build
A character-level GPT trained on the Tiny Shakespeare dataset (~1 MB of text). It's small enough to train on a laptop GPU in under an hour, yet it uses the same decoder-only architecture as GPT-2, only much smaller and with a character vocabulary instead of BPE tokens. After training, the model generates Shakespeare-style text character by character:
ROMEO:
What light through yonder window breaks?
It is the east, and Juliet is the sun.
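"Character-level" means the vocabulary is simply the set of distinct characters in the training text. A minimal sketch of such a tokenizer is shown here; the names `stoi`, `itos`, `encode`, and `decode` are assumptions chosen for the example, and the short sample string stands in for the full dataset.

```python
# Stand-in for the full Tiny Shakespeare text (assumption for illustration).
text = "ROMEO:\nWhat light through yonder window breaks?"

# Vocabulary: every distinct character, in a stable sorted order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}      # integer ID -> character


def encode(s):
    """Map a string to a list of integer token IDs."""
    return [stoi[ch] for ch in s]


def decode(ids):
    """Map a list of integer token IDs back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode("ROMEO")
print(ids, decode(ids))
```

Encoding followed by decoding returns the original string, which is a handy sanity check before training on the IDs.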
The full codebase is at github.com/Ziad-Tamim/nanoGPT-from-scratch — one Python file per component, fully tested, with math documentation alongside every module.
What You Need
To follow the theory posts: nothing — just read.
To run the code:
- Python 3.11+
- PyTorch 2.6+ (with CUDA if you have an NVIDIA GPU)
- uv for package management (or pip)
    git clone https://github.com/Ziad-Tamim/nanoGPT-from-scratch
    cd nanoGPT-from-scratch
    uv sync
    uv run python scripts/prepare_data.py  # downloads Tiny Shakespeare
    uv run python -m nanogpt.train         # starts training
No prior transformer experience required — just basic Python and a rough idea of what a neural network is.
Learning Outcomes
By the end of this course you will:
- Understand every tensor shape that flows through a Transformer: (B, T, C) won't be mysterious
- Know why self-attention works: not just that it does, but what the Q, K, V matrices are actually computing
- Be able to read LLaMA, BERT, or GPT-2 source code and recognise every component
- Have trained a real language model and generated text from it
- Understand the gap between a base language model (what we build) and a chat assistant (instruction tuning + RLHF)
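As a small taste of the shape notation, the sketch below traces a batch through an embedding table and an output projection. Here B is the batch size, T the sequence length, and C the embedding width; the specific sizes are illustrative assumptions, and the Transformer blocks that sit between the two layers are omitted because they keep the (B, T, C) shape unchanged.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions): batch, sequence length, width, vocabulary.
B, T, C, vocab_size = 4, 8, 32, 65

token_ids = torch.randint(0, vocab_size, (B, T))  # (B, T) integer IDs

embed = nn.Embedding(vocab_size, C)  # token ID -> learned vector
lm_head = nn.Linear(C, vocab_size)   # vector -> one score per vocabulary entry

x = embed(token_ids)  # (B, T, C)
logits = lm_head(x)   # (B, T, vocab_size)

print(token_ids.shape, x.shape, logits.shape)
```

Reading the shapes left to right gives the whole pipeline in miniature: integers in, one vector per token in the middle, and one score per possible next token at every position on the way out.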
Resources
These are the resources that informed this course — worth reading alongside the posts:
- Attention Is All You Need — Vaswani et al. (2017) — the original paper
- The Illustrated Transformer — Jay Alammar — the best visual explanation
- Let's build GPT — Andrej Karpathy — the video that inspired this codebase
- The Annotated Transformer — Harvard NLP — line-by-line walkthrough of the original paper
Next up: Dataset & Batching — how we download Tiny Shakespeare, split it into training and validation sets, and turn raw text into the (B, T) batches the model consumes.