Transformer from Scratch #04 — Input Embedding

Ziad Tamim / June 27, 2026 • 6 min read


The Transformer is a mathematical machine: it only understands numbers. Before any attention or learning can happen, we need to convert raw text into a matrix of numbers. That conversion is called the input embedding, and it happens in two stages: tokenization, then the embedding lookup. Together they turn a batch of text sequences into a 3D block of numbers that the rest of the model can work with.

Stage 1 — Tokenization

The first step is splitting text into units called tokens and giving each one a unique integer ID.

In our model we use character-level tokenization — each unique character in the corpus is one token. The process is simple:

The corpus is scanned once to collect every unique character → this is the vocabulary
Each unique character is assigned an integer ID — just a dictionary lookup, no learning
Same character always gets the same ID
The entire corpus is then encoded into one long list of integers

Vocabulary (65 unique characters in Shakespeare; the IDs shown here are illustrative):
'C' → 0,  ' ' → 1,  '!' → 2,  'T' → 3,  'A' → 4,  ...,  'H' → 12,  ...,  'E' → 43,  ...

So the word "THECAT" becomes a list of integers:

T → 3
H → 12
E → 43
C → 0
A → 4
T → 3      ← same character, same ID, always

Vocab size is just the number of distinct tokens in the vocabulary: 65 for character-level Shakespeare, and 50,257 for GPT-2, which uses subword tokens instead of characters.

This step has no learning. It is pure bookkeeping — a dictionary lookup.
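
The bookkeeping above can be sketched in a few lines of plain Python. This is a minimal illustration, not the exact code used in this series: the tiny `corpus` string and the names `stoi`, `itos`, `encode`, and `decode` are assumptions made for the example, and the IDs it produces come from sorted character order, so they differ from the illustrative IDs in the THECAT table.

```python
# Minimal character-level tokenizer sketch (all names are illustrative).
corpus = "THE CAT SAT\nTHE HAT"  # tiny stand-in for the real corpus

# Scan the corpus once and collect every unique character.
vocab = sorted(set(corpus))
vocab_size = len(vocab)

# Fixed lookup tables: character -> ID and ID -> character.
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for ch, i in stoi.items()}

def encode(text):
    """Map each character to its integer ID."""
    return [stoi[ch] for ch in text]

def decode(ids):
    """Map integer IDs back to characters."""
    return "".join(itos[i] for i in ids)

ids = encode("THECAT")
print(vocab_size, ids)   # the first and last ID are equal: both are 'T'
print(decode(ids))       # round-trips back to the original text
```

Nothing here is trained. The mapping is fixed the moment the vocabulary is built, which is why the same character always gets the same ID.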

Stage 2 — The Embedding Table

Now we have a list of integers. But integers carry no meaning — the number 3 doesn't tell the model anything about the character T. We need to convert each ID into a rich vector of numbers that the model can reason about.

That's what the embedding table does.

The embedding table is a matrix of shape (V, C):

V rows — one row per token in the vocabulary
C columns — the model dimension (d_model), e.g. 128 or 384

Figure: Input Embedding Diagram

A matrix of shape (V, C) is created — V rows (one per vocab token), C columns (model dimension)
Filled with small random numbers at the start — the model knows nothing yet
When a token ID comes in, its row is pulled out of the table — just an index lookup, no computation
The output is a (B, T, C) tensor — B sequences, T tokens each, C features per token

When a token ID arrives, its row is pulled straight out of the table. That's it — no computation, no math. Just an index into a matrix.

Illustrative values at initialisation:
Token ID 3  (T) →  row 3  →  [ 0.017, -0.004,  0.021, ..., -0.013,  0.006]
Token ID 12 (H) →  row 12 →  [-0.009,  0.015,  0.002, ...,  0.011, -0.018]
Token ID 43 (E) →  row 43 →  [ 0.005,  0.010, -0.017, ...,  0.003,  0.019]
Token ID 0  (C) →  row 0  →  [ 0.013, -0.010,  0.006, ..., -0.004,  0.008]
Token ID 4  (A) →  row 4  →  [-0.004,  0.006,  0.008, ...,  0.003, -0.005]
Token ID 3  (T) →  row 3  →  [ 0.017, -0.004,  0.021, ..., -0.013,  0.006]  ← identical to the first T

Both Ts pull out the exact same row — because they are the same ID pointing to the same position in the table.
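
Here is a small sketch of that lookup in plain Python, with the table stored as a list of rows. The model dimension of 8, the 0.02 scale of the random numbers, and the function name `embed` are assumptions chosen to keep the example readable. A deep learning framework would normally provide this as an embedding layer.

```python
import random

random.seed(0)  # reproducible illustration only

V, C = 65, 8  # vocabulary size and a small model dimension for display

# (V, C) table filled with small random numbers; the 0.02 scale is an assumption.
table = [[random.gauss(0.0, 0.02) for _ in range(C)] for _ in range(V)]

def embed(token_ids):
    """Return one row of the table per token ID (a plain index lookup)."""
    return [table[i] for i in token_ids]

ids = [3, 12, 43, 0, 4, 3]  # "THECAT" with the IDs from the example above
vectors = embed(ids)

print(len(vectors), len(vectors[0]))  # T rows, C numbers each
print(vectors[0] is vectors[5])       # both Ts point at the very same row
```

Because the lookup only indexes into the table, the two Ts are not just equal vectors, they are the same row.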

What Does Each Number in the Vector Mean?

At initialisation: nothing. They are random.

After training: each of the C numbers encodes some learned feature about that token. You can think of each dimension as a "feature slot": one might capture whether this character is a vowel, another whether it commonly starts a word, another whether it is punctuation. In practice the learned features are usually spread across many dimensions and are rarely this easy to name. The model decides what goes in each slot by learning what helps it predict the next token.

Nobody programs these features in. They emerge purely from training.

Training — How the Random Numbers Become Meaningful

The random rows mean nothing initially — predictions are wrong
Backprop flows gradients back into the table and nudges the rows after every step
Only the rows that were used in that batch receive a gradient: if T appeared in the batch, row 3 gets nudged; with plain gradient descent, rows for characters that weren't seen don't change
Over thousands of steps, tokens that appear in similar contexts drift toward similar vectors
The meaning is not programmed in — it emerges from predicting the next token correctly
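
To make the "only used rows get nudged" point concrete, here is a sketch of one plain SGD step on a toy table. The function name `sgd_update`, the learning rate, and the gradient values are all made up for illustration. In a real model the gradients come from backprop through the rest of the network, and optimizers with momentum or weight decay can also move rows that were not in the batch.

```python
def sgd_update(table, token_ids, grad_out, lr=0.1):
    """Apply one plain SGD step to the rows selected by token_ids.

    grad_out[k] is the gradient of the loss with respect to the k-th
    looked-up vector. A row that was looked up several times accumulates
    all of its gradients before the update.
    """
    row_grads = {}
    for tok, g in zip(token_ids, grad_out):
        acc = row_grads.setdefault(tok, [0.0] * len(g))
        for j, gj in enumerate(g):
            acc[j] += gj
    for tok, g in row_grads.items():
        table[tok] = [w - lr * gj for w, gj in zip(table[tok], g)]
    return sorted(row_grads)  # IDs of the rows that were touched

table = [[0.0, 0.0] for _ in range(5)]            # toy (V=5, C=2) table
token_ids = [3, 1, 3]                             # token 3 appears twice
grad_out = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0]]   # made-up gradients

touched = sgd_update(table, token_ids, grad_out)
print(touched)  # only rows 1 and 3 were in the batch
print(table)    # rows 0, 2 and 4 are still all zeros
```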

After training, 'a' and 'e' typically end up closer together in the vector space than 'a' and '!', not because we told the model to, but because vowels appear around similar characters. This is what people mean when they say embeddings capture meaning: tokens used in similar contexts get similar vectors.
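
One common way to measure "closer together" is cosine similarity. The sketch below shows the calculation on three hypothetical vectors. The numbers in `emb` are invented for illustration and are not taken from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical trained rows, invented for illustration only.
emb = {
    "a": [0.9, 0.1, 0.3],
    "e": [0.8, 0.2, 0.4],
    "!": [-0.5, 0.9, -0.1],
}

print(cosine_similarity(emb["a"], emb["e"]))  # close to 1: similar direction
print(cosine_similarity(emb["a"], emb["!"]))  # negative: pointing apart
```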

The Output — A 3D Tensor (B, T, C)

After the lookup, you don't just have one vector — you have a whole batch of sequences, each made up of multiple tokens. The output of the embedding stage is a 3D tensor with shape:

(B, T, C)

B = batch size        — how many sequences are processed in parallel (e.g. 32)
T = sequence length   — how many tokens per sequence (e.g. 128)
C = model dimension   — how many numbers per token vector (e.g. 384)

Batches are sampled at random, so each batch is a fresh mix of sequences from across the entire corpus. Two batches are almost never identical, and the model sees a wide variety of text at every step.
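
A minimal sketch of that sampling, assuming the corpus has already been encoded into one long list of integers. The function name `get_batch`, its parameters, and the stand-in `data` list are hypothetical, and a real training loop would usually also return the next-token targets.

```python
import random

def get_batch(data, batch_size, block_size, rng):
    """Pick batch_size random windows of block_size token IDs from data."""
    starts = [rng.randrange(len(data) - block_size + 1) for _ in range(batch_size)]
    return [data[s:s + block_size] for s in starts]

data = list(range(1000))   # stand-in for the encoded corpus
rng = random.Random(0)     # seeded so the sketch is reproducible

batch = get_batch(data, batch_size=32, block_size=128, rng=rng)
print(len(batch), len(batch[0]))  # B sequences, T token IDs each
```

Each sequence is a contiguous slice of the corpus starting at a random offset, so the text inside a sequence keeps its order while the batch as a whole is a random mix.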

Sequence 1:  [ [384 numbers], [384 numbers], ... ]  ← 128 rows
Sequence 2:  [ [384 numbers], [384 numbers], ... ]  ← 128 rows
...
Sequence 32: [ [384 numbers], [384 numbers], ... ]  ← 128 rows

Every single token in every single sequence now has its own row of C numbers. This is the raw material the rest of the Transformer operates on.
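
The shape bookkeeping is easy to check with nested lists. This sketch uses deliberately small sizes (B=4, T=6, C=8) instead of the 32, 128, and 384 above, and random token IDs in place of real text. Both choices are assumptions made for readability.

```python
import random

rng = random.Random(0)
V, B, T, C = 65, 4, 6, 8  # small sizes so the shapes are easy to inspect

# (V, C) embedding table with small random values.
table = [[rng.gauss(0.0, 0.02) for _ in range(C)] for _ in range(V)]

# A (B, T) batch of token IDs, random here in place of real text.
token_ids = [[rng.randrange(V) for _ in range(T)] for _ in range(B)]

# Lookup: replace every ID with its row, giving a (B, T, C) nested list.
x = [[table[tok] for tok in seq] for seq in token_ids]

shape = (len(x), len(x[0]), len(x[0][0]))
print(shape)  # (4, 6, 8)
```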

One Problem — The Table Has No Sense of Order

Notice that both Ts in THECAT — at position 0 and position 5 — get identical vectors. The embedding table has no concept of position. It only knows what a token is, not where it sits in the sequence.

This is a problem. "cat sat" and "sat cat" would produce the same set of embedding vectors, just in a different order — and since the Transformer reads everything at once, it couldn't tell them apart.
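
We can check the "same vectors, different order" claim directly. The sketch below builds a tiny character-level table just for these two strings. The table values and the helper name `embed` are illustrative assumptions.

```python
import random

rng = random.Random(0)
C = 4
chars = sorted(set("cat sat"))
stoi = {ch: i for i, ch in enumerate(chars)}
table = [[rng.gauss(0.0, 0.02) for _ in range(C)] for _ in chars]

def embed(text):
    """Look up one table row per character."""
    return [table[stoi[ch]] for ch in text]

a = embed("cat sat")
b = embed("sat cat")

print(a == b)                  # False: the rows arrive in a different order
print(sorted(a) == sorted(b))  # True: exactly the same rows, rearranged
```

Nothing inside an individual vector says where it came from, so any step that treats the rows as an unordered collection sees the two inputs as the same.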

That's exactly why the next stage — Positional Encoding — exists. It adds a unique position signal on top of each token's embedding vector, so the model can distinguish position 0 from position 5, even when the token is the same.

Summary

The Transformer can't read text — it needs numbers
Tokenization scans the corpus once, assigns each unique character a fixed integer ID — no learning, just a dictionary. Vocab size = count of unique tokens in the corpus
The entire corpus is encoded into one long list of integers
The embedding table is a matrix of shape (V, C) — one row per token, C numbers wide, filled with random numbers at the start
Looking up an embedding is just pulling a row from the table — no computation
The rows are learned during training via backprop, and only rows used in each batch receive a gradient
Similar tokens end up with similar vectors — not by design, but as a side effect of predicting the next token correctly
The output is a (B, T, C) tensor: 32 sequences × 128 tokens × 384 numbers each, drawn as a fresh random mix from across the corpus every step
The table has no concept of position — that problem is solved in the next stage

Next up: Positional Encoding — how the model learns where each token sits in the sequence, even though it reads all tokens simultaneously.
