Transformer from Scratch #04 — Input Embedding
Ziad Tamim / June 27, 2026 • 6 min read

The Transformer is a mathematical machine — it only understands numbers. Before any attention or learning can happen, we need to convert raw text into a matrix of numbers. That conversion is called the input embedding, and it happens in two stages: tokenization, then the embedding lookup. Together they turn a sentence into a 3D block of numbers that the rest of the model can work with.
Stage 1 — Tokenization
The first step is splitting text into units called tokens and giving each one a unique integer ID.
In our model we use character-level tokenization — each unique character in the corpus is one token. The process is simple:
- The corpus is scanned once to collect every unique character → this is the vocabulary
- Each unique character is assigned an integer ID — just a dictionary lookup, no learning
- Same character always gets the same ID
- The entire corpus is then encoded into one long list of integers
Vocabulary (65 unique characters in Shakespeare):
'C' → 0, 'T' → 3, 'A' → 4, 'H' → 12, 'E' → 43, ... (illustrative IDs; the real numbers depend on how the vocabulary is ordered)
So the string "THECAT" becomes a list of integers:
T → 3
H → 12
E → 43
C → 0
A → 4
T → 3 ← same character, same ID, always
For a character-level tokenizer, vocab size is just the count of unique characters found in the corpus: 65 for Shakespeare. GPT-2, which uses subword tokens instead of characters, has a vocabulary of roughly 50,000 tokens.
This step has no learning. It is pure bookkeeping — a dictionary lookup.
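The tokenization steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's actual code: the toy `corpus` string and the helper names `encode` and `decode` are assumptions, and sorting the characters is just one reasonable way to fix the ID order, so the IDs it produces will differ from the illustrative ones shown above.

```python
# Toy corpus standing in for the Shakespeare text (an assumption for illustration).
corpus = "THE CAT SAT\nTHE HAT"

# Scan the corpus once and collect every unique character: the vocabulary.
vocab = sorted(set(corpus))
vocab_size = len(vocab)

# Fixed lookup tables: character -> ID and ID -> character. Nothing is learned here.
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(text):
    """Map each character to its integer ID."""
    return [char_to_id[ch] for ch in text]

def decode(ids):
    """Map integer IDs back to the original characters."""
    return "".join(id_to_char[i] for i in ids)

ids = encode("THECAT")
print(vocab_size, ids)
print(decode(ids))
```

Because the mapping is a plain dictionary, encoding and then decoding always returns the original text, and the two Ts in "THECAT" always map to the same integer.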
Stage 2 — The Embedding Table
Now we have a list of integers. But integers carry no meaning — the number 3 doesn't tell the model anything about the character T. We need to convert each ID into a rich vector of numbers that the model can reason about.
That's what the embedding table does.
The embedding table is a matrix of shape (V, C):
- V rows: one row per token in the vocabulary
- C columns: the model dimension (d_model), e.g. 128 or 384

- A matrix of shape (V, C) is created — V rows (one per vocab token), C columns (model dimension)
- Filled with small random numbers at the start — the model knows nothing yet
- When a token ID comes in, its row is pulled out of the table — just an index lookup, no computation
- The output is a (B, T, C) tensor — B sequences, T tokens each, C features per token
For "THECAT", the lookup goes like this (the numbers are illustrative):
Token ID 3 (T) → row 3 → [1718.424, 354.345, 3.452, ..., 342.743, 53.622]
Token ID 12 (H) → row 12 → [43.746, 186.633, 952.207, ..., 1.345, 524.456]
Token ID 43 (E) → row 43 → [524.456, 954.356, 171.424, ..., 3.452, 186.633]
Token ID 0 (C) → row 0 → [1.345, 952.207, 564.247, ..., 43.746, 835.942]
Token ID 4 (A) → row 4 → [354.345, 564.247, 835.942, ..., 3.452, 53.622]
Token ID 3 (T) → row 3 → [1718.424, 354.345, 3.452, ..., 342.743, 53.622] ← identical to the first T
Both Ts pull out the exact same row — because they are the same ID pointing to the same position in the table.
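Here is a small sketch of the lookup in plain Python, using nested lists in place of a real tensor. The standard deviation of 0.02 for the random initialisation, the tiny `C = 8`, and the function name `embed` are assumptions chosen for readability, not values taken from the course.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

V, C = 65, 8  # vocab size from the article; C kept tiny here for readability

# The table: V rows of C small random numbers. Before training these mean nothing.
table = [[random.gauss(0.0, 0.02) for _ in range(C)] for _ in range(V)]

def embed(ids):
    """Look up one row per token ID; returns a T x C list of vectors."""
    return [table[i] for i in ids]

ids = [3, 12, 43, 0, 4, 3]  # the illustrative IDs used for "THECAT" above
vectors = embed(ids)

print(len(vectors), len(vectors[0]))  # T rows, C numbers per row
print(vectors[0] == vectors[5])       # both Ts share row 3
print(vectors[0] == vectors[1])       # T and H come from different rows
```

The lookup does no arithmetic at all: `embed` only indexes into `table`, which is why the first and last vectors are the very same row.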
What Does Each Number in the Vector Mean?
At initialisation: nothing. They are random.
After training: each of the C numbers encodes some learned feature about that token. You can think of each dimension as a "feature slot" — one might capture whether this character is a vowel, another might capture whether it commonly starts a word, another might capture its typical grammatical role. The model decides what goes in each slot by learning what helps it predict the next token.
Nobody programs these features in. They emerge purely from training.
Training — How the Random Numbers Become Meaningful
- The random rows mean nothing initially — predictions are wrong
- Backprop computes gradients that flow back into the table, and the optimizer nudges the rows after every step
- Only the rows that were used in that batch receive a gradient: if T appeared in the batch, row 3 gets nudged; rows for characters that weren't seen get no gradient from that step
- Over thousands of steps, tokens that appear in similar contexts drift toward similar vectors
- The meaning is not programmed in — it emerges from predicting the next token correctly
After training, 'a' and 'e' tend to end up closer together in the vector space than 'a' and '!', not because we told the model to, but because vowels appear around similar characters. This is what people mean when they say embeddings capture meaning: tokens used in similar contexts get similar vectors.
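To make the "only used rows get a gradient" point concrete, here is a hand-rolled sketch of the backward pass of a lookup followed by one plain SGD step. The upstream gradients are made up, the helper names `embedding_grad` and `sgd_step` are hypothetical, and plain SGD without weight decay is an assumption. With momentum-based optimizers or weight decay, rows that were not in the batch can still move.

```python
def embedding_grad(ids, upstream, vocab_size):
    """Scatter per-position gradients back into table rows.

    ids:      token IDs for one sequence (length T)
    upstream: gradient of the loss w.r.t. each looked-up vector (T x C)
    Returns a V x C gradient for the whole table.
    """
    C = len(upstream[0])
    grad = [[0.0] * C for _ in range(vocab_size)]
    for token_id, g in zip(ids, upstream):
        for j in range(C):
            grad[token_id][j] += g[j]  # repeated tokens accumulate
    return grad

def sgd_step(table, grad, lr):
    """Plain SGD update, applied in place."""
    for row, grad_row in zip(table, grad):
        for j in range(len(row)):
            row[j] -= lr * grad_row[j]

V, C = 5, 2
table = [[0.1 * (i + 1), -0.1 * (i + 1)] for i in range(V)]
before = [row[:] for row in table]

ids = [3, 1, 3]  # token 3 appears twice; tokens 0, 2 and 4 never appear
upstream = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # made-up gradients

grad = embedding_grad(ids, upstream, V)
sgd_step(table, grad, lr=0.5)

changed = [i for i in range(V) if table[i] != before[i]]
print(changed)
```

Token 3 appears twice, so its row receives the sum of both position gradients, while the rows for tokens that never appeared are left exactly as they were.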
The Output — A 3D Tensor (B, T, C)
After the lookup, you don't just have one vector — you have a whole batch of sequences, each made up of multiple tokens. The output of the embedding stage is a 3D tensor with shape:
(B, T, C)
B = batch size — how many sequences are processed in parallel (e.g. 32)
T = sequence length — how many tokens per sequence (e.g. 128)
C = model dimension — how many numbers per token vector (e.g. 384)
Sampling batches at random means each batch is a fresh mix of sequences from across the entire corpus. Two batches are very unlikely to be identical, and the model sees a wide variety of text at every step.
Sequence 1: [ [384 numbers], [384 numbers], ... ] ← 128 rows
Sequence 2: [ [384 numbers], [384 numbers], ... ] ← 128 rows
...
Sequence 32: [ [384 numbers], [384 numbers], ... ] ← 128 rows
Every single token in every single sequence now has its own row of C numbers. This is the raw material the rest of the Transformer operates on.
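The sketch below shows one way a (B, T, C) block could be assembled: pick B random start offsets in the encoded corpus, slice T token IDs from each, then look every ID up in the table. The random stand-in for the encoded corpus, the small sizes, and the helper names `get_batch` and `embed_batch` are all assumptions for illustration.

```python
import random

random.seed(0)

V, C = 65, 4  # vocab size from the article; small C for readability
B, T = 3, 5   # small batch size and sequence length for the sketch

# Stand-in for the encoded corpus: one long list of token IDs.
data = [random.randrange(V) for _ in range(1000)]
table = [[random.gauss(0.0, 0.02) for _ in range(C)] for _ in range(V)]

def get_batch(data, batch_size, seq_len):
    """Pick batch_size random start offsets and slice seq_len IDs from each."""
    starts = [random.randrange(len(data) - seq_len) for _ in range(batch_size)]
    return [data[s:s + seq_len] for s in starts]

def embed_batch(batch):
    """(B, T) token IDs -> (B, T, C) nested list of vectors."""
    return [[table[i] for i in seq] for seq in batch]

batch = get_batch(data, B, T)
x = embed_batch(batch)
shape = (len(x), len(x[0]), len(x[0][0]))
print(shape)  # (B, T, C)
```

With the article's example sizes, the same code would produce a block of 32 sequences, each holding 128 vectors of 384 numbers.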
One Problem — The Table Has No Sense of Order
Notice that both Ts in THECAT — at position 0 and position 5 — get identical vectors. The embedding table has no concept of position. It only knows what a token is, not where it sits in the sequence.
This is a problem. "cat sat" and "sat cat" would produce the same set of embedding vectors, just in a different order — and since the Transformer reads everything at once, it couldn't tell them apart.
That's exactly why the next stage — Positional Encoding — exists. It adds a unique position signal on top of each token's embedding vector, so the model can distinguish position 0 from position 5, even when the token is the same.
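A quick way to see the problem is to embed "cat sat" and "sat cat" and compare the results. In this sketch (random table, hypothetical `embed` helper, tiny `C`) the two outputs differ as ordered lists but contain exactly the same vectors, and a repeated character gets the same vector at every position.

```python
import random

random.seed(0)

C = 4
vocab = sorted(set("cat sat"))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
table = [[random.gauss(0.0, 0.02) for _ in range(C)] for _ in vocab]

def embed(text):
    """Return one table row per character, in sequence order."""
    return [table[char_to_id[ch]] for ch in text]

a = embed("cat sat")
b = embed("sat cat")

same_order = a == b                     # the sequences differ position by position
same_multiset = sorted(a) == sorted(b)  # but hold exactly the same vectors
same_t = a[2] == a[6]                   # 't' at position 2 and 't' at position 6

print(same_order, same_multiset, same_t)
```

Nothing in the vectors themselves records where a token sat in the sequence, which is the gap the position signal is meant to fill.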
Summary
- The Transformer can't read text — it needs numbers
- Tokenization scans the corpus once, assigns each unique character a fixed integer ID — no learning, just a dictionary. Vocab size = count of unique tokens in the corpus
- The entire corpus is encoded into one long list of integers
- The embedding table is a matrix of shape (V, C): one row per token, C numbers wide, filled with random numbers at the start
- Looking up an embedding is just pulling a row from the table, no computation
- The rows are learned during training via backprop — only rows used in each batch get updated
- Similar tokens end up with similar vectors — not by design, but as a side effect of predicting the next token correctly
- The output is a (B, T, C) tensor, e.g. 32 sequences × 128 tokens × 384 numbers each, built from a fresh random mix of sequences from across the corpus every step
- The table has no concept of position — that problem is solved in the next stage
Next up: Positional Encoding — how the model learns where each token sits in the sequence, even though it reads all tokens simultaneously.