Day 02 — Phase 1 — 90 minutes

A transformer reads all your words at once, weighs what matters, picks the next token.

Day 1 gave you the mental model. Day 2 opens the hood — no equations, no linear algebra. Just the actual mechanics, explained the way they work.

Reading time: 15 minutes, or one long espresso
§ 01 / tokenization

Before the model sees a single word, it splits everything into sub-word chunks called tokens.

Not characters. Not words. Sub-words — the smallest units that let the model handle any text without an infinite vocabulary. GPT-4 has ~100,000 of them.
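To make sub-word splitting concrete, here is a minimal sketch. It is not BPE as real tokenizers implement it; it is a simplified greedy longest-match scheme, and the vocabulary `VOCAB` is hand-picked for illustration only. It shows how a small vocabulary plus a single-character fallback can encode any input.

```python
# Toy vocabulary: hypothetical, chosen by hand for this example.
# Real vocabularies are learned from data and hold ~100,000 entries.
VOCAB = {"token", "ization", " is", " powerful", " power", "ful"}

def tokenize(text, vocab):
    """Split text into the longest matching vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it alone so any text can be encoded.
            tokens.append(text[i])
            i += 1
    return tokens

pieces = tokenize("tokenization is powerful", VOCAB)
print(pieces)       # ['token', 'ization', ' is', ' powerful']
print(len(pieces))  # 4
```

Text the vocabulary has never seen falls back to single characters, which is one way to see why unfamiliar scripts and long numbers can cost more tokens.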

Why it matters
You pay per token, not per word
Hindi costs 3–4× more than English
Code tokenises differently than prose
Numbers are notoriously expensive
tokenizer demo
> tokenization is powerful
4 tokens · 24 chars · ratio: 6.0 chars/token
Sub-word splitting keeps the vocabulary small.
"Hello, how are you?" in different languages — token cost
English: 4 tok
Spanish: 5 tok
French: 5 tok
Hindi: 15 tok
Arabic: 13 tok
Chinese: 7 tok
§ 02 / embeddings

Every token becomes a point in high-dimensional space. Meaning = distance.

GPT-3's largest model used 12,288 dimensions (documented in the 2020 paper). Modern models use similar or larger spaces. We can only visualise 2, but the principle holds: words that mean similar things cluster together. The model doesn't know language — it knows geometry.

[Figure: 2-D embedding plot. Axes: semantic dimension 1, semantic dimension 2. Clusters: royalty (king, queen, prince), animals (dog, cat, horse), cities (Paris, London, Tokyo), code (Python, React, Claude).]
vector arithmetic

Because meaning is geometry, you can do arithmetic on concepts. The famous one:

gender direction in embedding space: king − man + woman ≈ queen
Embedding: a vector (list of numbers) representing a token's meaning in context.
Dimensions: GPT-3 (documented) used 12,288. Each dimension captures some aspect of meaning, and no one assigned them. OpenAI hasn't disclosed GPT-4's architecture.
Distance: cosine similarity. Vectors pointing in the same direction = similar meaning.
↑ this is what makes semantic search work
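Here is a small sketch of cosine similarity and the king/queen arithmetic. The 2-D vectors in `EMBEDDINGS` are made up for illustration (real embeddings are learned and have thousands of dimensions), and `nearest` is a hypothetical helper, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: axis 0 loosely "royalty", axis 1 loosely "female".
EMBEDDINGS = {
    "king": [0.9, 0.1],
    "queen": [0.9, 0.9],
    "prince": [0.8, 0.15],
    "man": [0.1, 0.1],
    "woman": [0.1, 0.9],
}

def nearest(vector, exclude=()):
    """Return the word whose embedding points most nearly the same way."""
    candidates = [w for w in EMBEDDINGS if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(vector, EMBEDDINGS[w]))

# king - man + woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])
]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Semantic search follows the same pattern: embed the query, then rank stored vectors by cosine similarity.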
§ 03 / self-attention

Every word asks: which other words should I pay attention to?

This is self-attention. Each token simultaneously looks at all other tokens and decides what matters. In the example sentence below, "it" puts its largest attention weight (48%) on "animal": pronoun disambiguation, solved by geometry.

Before transformers, RNNs processed tokens one-by-one, left to right. Attention replaced that with parallel, global lookup — and it's why transformers scaled.

Query (Q): what this token is looking for.
Key (K): what this token has to offer.
Value (V): what gets passed if there's a match.
A 70B model such as Llama 3 70B runs 64 attention heads in parallel in each layer. Each head can learn to track different relationships (syntax, coreference, semantics), all at once.
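The Query/Key/Value idea can be sketched as single-head scaled dot-product attention for one token. This is a simplified illustration, assuming the query, key, and value vectors are already computed; the numbers below are made up, and real models derive Q, K, and V from learned weight matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query token."""
    d = len(query)
    # Match the query against every key; scale by sqrt(d).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    # Output is the attention-weighted sum of the value vectors.
    width = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return weights, output

# Hypothetical vectors: the query for "it" lines up best with the key for "animal".
words = ["animal", "street", "tired"]
keys = [[2.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_it = [2.0, 0.0]

weights, output = attend(query_it, keys, values)
print(words[weights.index(max(weights))])  # animal
```

A multi-head layer runs several of these in parallel, each with its own Q, K, and V projections, and combines the outputs.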
attention weights from "it" in "The animal didn't cross the street because it was tired"
The: 2%
animal: 48%
didn't: 4%
cross: 8%
the: 2%
street: 12%
because: 6%
it: 1%
was: 10%
tired: 7%
↑ "it" → "animal" 48% — pronoun resolved
§ 04 / temperature & sampling

The model gives you probabilities. You choose how random to be.

After the transformer computes logits (raw scores), a softmax converts them to probabilities. Temperature divides the logits before softmax — low = confident, high = chaotic. Then a sampling strategy picks the winner.
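As a rough sketch of that step: divide the logits by the temperature, then apply softmax. The function name and the example logits are illustrative. Treating temperature 0 as "always pick the top token" is a common convention that this sketch assumes, since dividing by zero is undefined.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities, scaled by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumed convention: temperature 0 means greedy (argmax) decoding.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical raw scores for three tokens
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
print(sharp[0] > flat[0])  # True: low temperature concentrates probability
```

Lower temperatures push probability onto the top token; higher temperatures spread it across the alternatives.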

prompt: "The capital of France is ___"
strategy: Top-K · temperature: 0.70 (slider range: 0.0 deterministic to 2.0 chaotic)
Paris: 90.1%
Lyon: 4.5%
the: 2.9%
a: 1.2%
France: 0.8%
there: 0.5%
∑ active = 97% (tokens outside the Top-K pool are excluded from sampling)
Lower temperatures collapse the distribution onto the top token; higher temperatures flatten it.
§ 05 / wait, what?

Things that'll rewire your mental model

These are the things no one tells you in the blog posts.

01 / 06
Transformers have no idea about order — by default

Self-attention treats input as a set, not a sequence. Word order only enters because you explicitly add positional encodings. Without them, 'the cat sat' = 'sat cat the.' The original paper used sinusoidal encodings; modern models use learned or rotary (RoPE) variants.

why it matters: Position encoding is an engineering patch bolted on top of attention — not inherent to the architecture.
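A minimal sketch of the sinusoidal scheme from the original paper: even indices use sine, odd indices use cosine, with wavelengths that grow along the vector. The `cat` embedding and the helper `add_position` are illustrative assumptions, not taken from any particular model.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    encoding = []
    for i in range(0, d_model, 2):
        # i runs over the even indices, so the exponent is i / d_model.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

def add_position(embedding, position):
    """Add the positional encoding to a token embedding, element by element."""
    encoding = sinusoidal_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, encoding)]

cat = [0.2, 0.7, -0.1, 0.4]  # hypothetical embedding for "cat"
print(sinusoidal_encoding(0, 4))                     # [0.0, 1.0, 0.0, 1.0]
print(add_position(cat, 1) != add_position(cat, 2))  # True
```

The same token now produces a different vector at each position, which is how order gets into the model.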
02 / 06
A 70B model is 140GB of floating point numbers

70 billion parameters × 2 bytes (fp16) = 140GB. That's your entire 'AI brain' — nothing more than a giant matrix multiplication engine operating on learned weights.

why it matters: This is why you can't run Llama 70B on your laptop but can run 7B: it's literally just RAM.
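The arithmetic is easy to reproduce. This sketch counts weights only (no activations or cache) and uses 1 GB = 10^9 bytes, matching the figure above. The `int8` and `int4` rows are an added assumption to show how lower-precision storage shrinks the number.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, dtype="fp16"):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

print(weight_memory_gb(70e9))          # 140.0
print(weight_memory_gb(7e9))           # 14.0
print(weight_memory_gb(70e9, "int4"))  # 35.0
```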
03 / 06
Each transformer layer adds to the answer, not replaces it

Information flows through a residual stream: every layer adds a small update (delta) on top of what came before rather than replacing it. OpenAI has never disclosed GPT-4's exact depth, but open-weight models like Llama 3 70B have 80 layers. As a rough rule of thumb, early layers handle syntax and later layers handle more abstract reasoning.

why it matters: Because of residual connections, you can sometimes remove late layers with surprisingly little quality loss.
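The "add, don't replace" pattern fits in a few lines. The layers here are hypothetical stand-ins (simple functions returning small updates), not real attention or MLP blocks.

```python
def run_layers(x, layers):
    """Pass a vector through layers; each one adds its update to the stream."""
    for layer in layers:
        delta = layer(x)
        # Residual connection: keep what was there and add the update.
        x = [xi + di for xi, di in zip(x, delta)]
    return x

# Hypothetical stand-in layers.
def scale_update(x):
    return [0.1 * v for v in x]

def fixed_update(x):
    return [0.05, -0.05]

def no_op_update(x):
    return [0.0, 0.0]  # a layer that contributes nothing

stream = [1.0, 2.0]
full = run_layers(stream, [scale_update, fixed_update, no_op_update])
pruned = run_layers(stream, [scale_update, fixed_update])
print(full == pruned)  # True
```

If a layer's update is close to zero, dropping it barely changes the stream, which is the intuition behind pruning layers.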
04 / 06
The model stores MORE concepts than it has neurons

Through superposition, models represent more features than dimensions by overlapping them. A neuron doesn't represent one concept — it represents many, at an angle. This is why interpretability is hard.

why it matters: Linear representation hypothesis: it's all vectors, just weirdly packed.
05 / 06
Training a frontier model costs tens to hundreds of millions of dollars

Analysts estimate that GPT-4's training run cost roughly $63M to over $100M in compute (SemiAnalysis, 2023). OpenAI has never confirmed a figure. Training runs gradient descent over trillions of tokens on thousands of H100-class GPUs for months. Inference is far cheaper per request (a few cents per call), but training is a one-time capital bet.

why it matters: This is why open-source (Llama, Mistral) matters: you can fine-tune a pre-trained model without paying the training bill.
06 / 06
Context window = working memory. There is no long-term memory by default

Everything the model 'knows' in a conversation fits in the context window. Past that, it forgets. There's no retrieval, no hard disk — just the current token sequence. Memory is a product feature, not a model feature.

why it matters: RAG, memory systems, and fine-tuning all exist to work around this fundamental constraint.
§ 06 / recap
01
the chain
Text → tokens (BPE)
Tokens → embeddings (vectors)
Embeddings → attention (Q·K·V)
Attention → logits → softmax → token
← the full pipeline
02
the three knobs
Temperature: controls confidence vs randomness
Top-K: limits the candidate pool by rank
Top-P: limits the candidate pool by cumulative probability
← tune these in your API calls
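Here is a sketch of how Top-K and Top-P narrow the candidate pool before sampling, using the "capital of France" probabilities from § 04. The function names are illustrative, and real inference libraries may apply these filters in a different order or with extra edge-case handling.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

def sample(probs, rng=random):
    """Draw one token according to the (filtered) probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens])[0]

probs = {"Paris": 0.901, "Lyon": 0.045, "the": 0.029,
         "a": 0.012, "France": 0.008, "there": 0.005}

print(list(top_k_filter(probs, 3)))    # ['Paris', 'Lyon', 'the']
print(list(top_p_filter(probs, 0.9)))  # ['Paris']
print(sample(top_k_filter(probs, 3)) in {"Paris", "Lyon", "the"})  # True
```

With K = 3 the kept tokens cover 97.5% of the original probability mass before renormalising.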
§ 07 / homework

Two things before next session.

Don't skip these. Day 3 builds directly on them.

01

Paste a paragraph into the OpenAI tokenizer. Count the tokens. Then paste the same paragraph in Hindi or Arabic. Compare.

platform.openai.com/tokenizer — takes 2 minutes, sticks forever.

02

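Call the Claude API (or any LLM API) with temperature=0, then with a high temperature, same prompt. Notice the difference. Check your provider's allowed range first: Claude's API accepts 0 to 1, while OpenAI's accepts 0 to 2, so temperature=1.5 only works on the latter.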

If you don't have API access, use Claude.ai and notice how responses change on re-runs.

next up
Day 3 → Prompt Engineering.
System prompts, few-shot, chain-of-thought. The craft of talking to LLMs.
System prompt anatomy
Few-shot prompting
Chain-of-thought (CoT)
Prompt injection & jailbreaks
When to prompt vs fine-tune