VAIBHAV LODHA · KLOVR.AI · DAY 02

FIELD NOTES / 02

Day 02 — Phase 1 — 90 minutes

Math-free. No equations.

A transformer reads all your words at once, weighs what matters, picks the next token.

Day 1 gave you the mental model. Day 2 opens the hood — no equations, no linear algebra. Just the actual mechanics, explained the way they work.

Reading time

15 minutes

— or one long espresso

§ 01 / tokenization

text → numbers → meaning

Before the model sees a single word, it splits everything into sub-word chunks called tokens.

Not characters. Not words. Sub-words: units small enough to spell out any text, so the model never needs an infinite vocabulary. GPT-4's tokenizer has ~100,000 of them.
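To make "sub-word chunks" concrete, here is a minimal sketch of byte-pair encoding (BPE) on a toy corpus. It is an illustration only: the corpus, the function names, and the character-level starting point are assumptions made for this example, and production tokenizers work on bytes with a much larger, pre-trained merge table.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most common adjacent symbol pair across all words, or None."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)

merges = learn_bpe("low low low low lower", num_merges=2)
print(tokenize("low", merges))    # ['low']
print(tokenize("lower", merges))  # ['low', 'e', 'r']
```

The frequent word collapses into one token, while the rarer one is spelled out from smaller pieces. That is the trade that keeps the vocabulary finite.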

Why it matters

→ You pay per token, not per word

→ Hindi costs 3–4× more than English

→ Code tokenises differently than prose

→ Numbers are notoriously expensive

tokenizer demo

> tokenization is powerful

4 tokens · 24 chars · ratio: 6.0 chars/token

Sub-word splitting keeps the vocab small.

"Hello, how are you?" in different languages — token cost

English: 4 tok
Spanish: 5 tok
French: 5 tok
Hindi: 15 tok
Arabic: 13 tok
Chinese: 7 tok

§ 02 / embeddings

words as coordinates

Every token becomes a point in high-dimensional space. Meaning = distance.

GPT-3's largest model used 12,288 dimensions (documented in the 2020 paper). Modern models use similar or larger spaces. We can only visualise 2, but the principle holds: words that mean similar things cluster together. The model doesn't know language — it knows geometry.

[Figure: 2-D scatter plot of token embeddings. Axes: semantic dimension 1, semantic dimension 2. Clusters: royalty (king, queen, prince) · animals (dog, cat, horse) · cities (Paris, London, Tokyo) · code (Python, React, Claude).]

vector arithmetic

Because meaning is geometry, you can do arithmetic on concepts. The famous one:

king − man + woman ≈ queen

(the gender direction in embedding space)
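The arithmetic can be tried on a toy example. The 2-D vectors below are invented for illustration (axis 0 stands for "royalty", axis 1 for "gender"); real embeddings are learned, have thousands of dimensions, and the analogy only holds approximately.

```python
# Toy 2-D embeddings, made up for this sketch: (royalty, gender).
EMBEDDINGS = {
    "king":   (1.0,  1.0),
    "queen":  (1.0, -1.0),
    "prince": (0.8,  1.0),
    "man":    (0.0,  1.0),
    "woman":  (0.0, -1.0),
}

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))

def nearest(vector, exclude=()):
    """Return the word whose embedding is closest to `vector` (squared distance)."""
    def squared_distance(word):
        return sum((x - y) ** 2 for x, y in zip(EMBEDDINGS[word], vector))
    candidates = [word for word in EMBEDDINGS if word not in exclude]
    return min(candidates, key=squared_distance)

result = add(subtract(EMBEDDINGS["king"], EMBEDDINGS["man"]), EMBEDDINGS["woman"])
print(nearest(result, exclude=("king", "man", "woman")))  # queen
```

Excluding the input words from the lookup mirrors how such analogies are usually evaluated; without that step the nearest neighbour is often one of the inputs.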

Embedding

A vector (list of numbers) representing a token's meaning in context.

Dimensions

GPT-3 (documented) used 12,288. Each dimension captures some aspect of meaning — no one assigned them. OpenAI hasn't disclosed GPT-4's architecture.

Distance

Cosine similarity: vectors pointing the same direction = similar meaning.

↑ this is what makes semantic search work
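A minimal sketch of cosine similarity and the ranking step behind semantic search. The three "document" vectors and the query vector are hypothetical 3-D stand-ins; in practice they would come from an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, documents):
    """Return document names sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in documents.items()]
    return [name for _score, name in sorted(scored, reverse=True)]

# Hypothetical embeddings, invented for this example.
DOCS = {
    "pets":        (0.9, 0.1, 0.0),
    "travel":      (0.1, 0.9, 0.1),
    "programming": (0.0, 0.1, 0.9),
}
query = (0.8, 0.2, 0.1)
print(rank_by_similarity(query, DOCS))  # ['pets', 'travel', 'programming']
```

Only direction matters here, not length: scaling a vector by any positive number leaves its cosine similarity unchanged.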

§ 03 / self-attention

the core mechanism

Every word asks: which other words should I pay attention to?

This is self-attention. Each token simultaneously looks at all other tokens and decides what matters. In the demo sentence below, "it" attends most strongly to "animal" (48%). That's pronoun disambiguation, solved by geometry.

Before transformers, RNNs processed tokens one-by-one, left to right. Attention replaced that with parallel, global lookup — and it's why transformers scaled.

Query (Q)

What this token is looking for.

Key (K)

What this token has to offer.

Value (V)

What gets passed if there's a match.

A 70B model such as Llama 3 70B runs 64 attention heads in parallel in every layer. Each head learns to track different relationships: syntax, coreference, semantics, all at once.
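The Query / Key / Value roles can be sketched as a single attention head in plain Python. This is a minimal illustration under simplifying assumptions: the tiny 2-D vectors are invented, and the learned projection matrices that produce Q, K and V in a real model are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    For each query: score it against every key, softmax the scores,
    then return the weighted average of the values.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up vectors: the query lines up with the first key, so it should
# pull mostly from the first value.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[5.0, 0.0]]
outputs, weights = attention(queries, keys, values)
print(weights[0])  # first weight is much larger than the second
print(outputs[0])
```

Each query is scored independently, which is why the whole thing can run in parallel across tokens.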

attention weights from "it" in: "The animal didn't cross the street because it was tired"

animal: 48%
street: 12%
was: 10%
(weights for the other words are not shown)

↑ "it" → "animal" 48% — pronoun resolved

§ 04 / temperature & sampling

the randomness knob

The model gives you probabilities. You choose how random to be.

After the transformer computes logits (raw scores), a softmax converts them to probabilities. Temperature divides the logits before softmax — low = confident, high = chaotic. Then a sampling strategy picks the winner.
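A minimal sketch of the temperature step described above. The logits are made-up numbers, and treating temperature 0 as "always pick the top token" is a common convention assumed here rather than something softmax defines on its own.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then softmax them into probabilities."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy (argmax) decoding.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical raw scores for three tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures push more probability onto the top token; higher ones flatten the distribution toward uniform.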

prompt: "The capital of France is ___"

strategy: Top-K · temperature: 0.70 (slider range: 0.0 deterministic to 2.0 chaotic)

Paris: 90.1%
Lyon: 4.5%
the: 2.9%
(token label not captured): 1.2%
France: 0.8%
there: 0.5%

∑ active = 97% · tokens outside the active set are excluded from sampling

§ 05 / wait, what?

DAY 02 · 6 FACTS

Things that'll rewire your mental model

These are the things no one tells you in the blog posts.

01 / 06

Transformers have no idea about order — by default

Self-attention treats input as a set, not a sequence. Word order only enters because you explicitly add positional encodings. Without them, 'the cat sat' = 'sat cat the.' The original paper used sinusoidal encodings; modern models use learned or rotary (RoPE) variants.

why it matters: Position encoding is an engineering patch bolted on top of attention — not inherent to the architecture.
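For reference, a small sketch of the sinusoidal scheme from the original paper: even dimensions get a sine, odd dimensions a cosine, with wavelengths that grow across the vector. The function names and the 4-dimensional toy size are assumptions for illustration.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

def add_position(token_embedding, pos):
    """Add the position signal to a token embedding, element by element."""
    position = sinusoidal_position(pos, len(token_embedding))
    return [t + p for t, p in zip(token_embedding, position)]

cat = [0.2, -0.1, 0.4, 0.3]  # made-up embedding for the token "cat"
print(sinusoidal_position(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(add_position(cat, 1))
print(add_position(cat, 2))  # same token, different position, different vector
```

Once the position is mixed in, the same token at two positions is no longer the same input, which is how order reaches attention.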

02 / 06

A 70B model is 140GB of floating point numbers

70 billion parameters × 2 bytes (fp16) = 140GB. That's your entire 'AI brain' — nothing more than a giant matrix multiplication engine operating on learned weights.

why it matters: This is why you can't run Llama 70B on your laptop but can run 7B: it's literally just RAM.
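The arithmetic is simple enough to script. This sketch counts only the weights; the 0.5-byte figure for 4-bit quantization is an added assumption, and real inference also needs memory for activations and the KV cache.

```python
def weights_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

print(weights_memory_gb(70_000_000_000, 2))    # 140.0  (70B at fp16)
print(weights_memory_gb(7_000_000_000, 2))     # 14.0   (7B at fp16)
print(weights_memory_gb(70_000_000_000, 0.5))  # 35.0   (70B at 4 bits per weight, assumed)
```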

03 / 06

Each transformer layer adds to the answer, not replaces it

Information flows through a residual stream: every layer adds a small update (delta) on top of what came before, not a full replacement. OpenAI has never disclosed GPT-4's exact depth, but open-weight models like Llama 3 70B have 80 layers. Roughly speaking, early layers tend to handle syntax and later layers more abstract reasoning.

why it matters: Because of residual connections, you can sometimes remove late layers with surprisingly little quality loss.
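A toy sketch of the residual stream. The "layers" here are stand-in functions invented for the example (a real layer is attention plus a feed-forward network); the point is only the `x = x + delta` pattern.

```python
def run_residual_stream(x, layers):
    """Pass a vector through layers that each ADD an update instead of replacing x."""
    for layer in layers:
        delta = layer(x)
        x = [xi + di for xi, di in zip(x, delta)]
    return x

def small_update(x):
    """Hypothetical layer: nudges every value by 10%."""
    return [0.1 * xi for xi in x]

start = [1.0, 2.0]
full = run_residual_stream(start, [small_update] * 4)
pruned = run_residual_stream(start, [small_update] * 3)  # drop the last layer
print(full)
print(pruned)  # still close to `full`: only one small delta is missing
```

Because each layer only contributes a delta, skipping one leaves the rest of the stream intact. How much quality is lost in a real model depends on what that layer had learned.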

04 / 06

The model stores MORE concepts than it has neurons

Through superposition, models represent more features than dimensions by overlapping them. A neuron doesn't represent one concept — it represents many, at an angle. This is why interpretability is hard.

why it matters: Linear representation hypothesis: it's all vectors, just weirdly packed.
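A toy picture of superposition: three features squeezed into two dimensions by pointing them 120 degrees apart. Everything here is an invented illustration, and it only works cleanly because one feature is active at a time (sparsity).

```python
import math

# Three feature directions in a 2-D space, spaced 120 degrees apart.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def encode(active_feature):
    """Represent a single active feature as its direction vector."""
    return DIRECTIONS[active_feature]

def read_out(vector):
    """Project the vector onto every feature direction."""
    return [sum(v * d for v, d in zip(vector, direction)) for direction in DIRECTIONS]

readouts = read_out(encode(1))
print([round(r, 2) for r in readouts])  # [-0.5, 1.0, -0.5]
```

The active feature reads out at 1.0 while the other two show interference of -0.5. The features overlap, yet the right one is still recoverable, which is the trade superposition makes.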

05 / 06

Training a frontier model costs tens to hundreds of millions of dollars

Analysts estimate GPT-4's training run cost between $63M and $100M+ in compute (SemiAnalysis, 2023). OpenAI has never confirmed a figure. Training means gradient descent over trillions of tokens on thousands of GPUs (H100-class today) for months. Inference is much cheaper, a few cents per call, but training is a one-time capital bet.

why it matters: This is why open-source (Llama, Mistral) matters: you can fine-tune a pre-trained model without paying the training bill.

06 / 06

Context window = working memory. There is no long-term memory by default

Everything the model 'knows' in a conversation fits in the context window. Past that, it forgets. There's no retrieval, no hard disk — just the current token sequence. Memory is a product feature, not a model feature.

why it matters: RAG, memory systems, and fine-tuning all exist to work around this fundamental constraint.
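A minimal sketch of what "past that, it forgets" looks like in an application: keep only the most recent messages that fit a token budget. The function name, the sample messages, and counting tokens by splitting on whitespace are all simplifying assumptions; a real app would use the model's tokenizer.

```python
def fit_to_context(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the newest messages whose combined token count fits in max_tokens."""
    kept, used = [], 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # everything older than this is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = [
    "my name is Asha",      # 4 "tokens"
    "what is BPE",          # 3
    "and what is my name",  # 5
]
print(fit_to_context(history, max_tokens=8))
# ['what is BPE', 'and what is my name']
```

The first message no longer fits, so the model is asked for a name it never sees. Memory features exist to put that kind of information back into the window.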

§ 06 / recap

the chain

Text → tokens (BPE)

Tokens → embeddings (vectors)

Embeddings → attention (Q·K·V)

Attention → logits → softmax → token

← the full pipeline

the three knobs

Temperature

controls confidence vs randomness

Top-K

limits candidate pool by rank

Top-P

limits by cumulative probability

← tune these in your API calls
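A sketch of how Top-K and Top-P trim the candidate pool, using the probabilities from the "capital of France" demo. The `<unlabeled>` key is a placeholder for the one token whose label is missing from the demo, and the function names are made up for this example.

```python
def renormalize(probs):
    """Rescale the surviving probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {token: p / total for token, p in probs.items()}

def top_k(probs, k):
    """Keep the k most likely tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return renormalize(dict(ranked[:k]))

def top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)

PROBS = {
    "Paris": 0.901, "Lyon": 0.045, "the": 0.029,
    "<unlabeled>": 0.012,  # placeholder: this token's label is not in the demo
    "France": 0.008, "there": 0.005,
}
print(list(top_k(PROBS, 3)))     # ['Paris', 'Lyon', 'the']
print(list(top_p(PROBS, 0.90)))  # ['Paris']
print(list(top_p(PROBS, 0.95)))  # ['Paris', 'Lyon', 'the']
```

Top-K always keeps a fixed number of candidates; Top-P keeps more when the model is unsure and fewer when it is confident. Sampling then draws from whatever survives.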

§ 07 / homework

Two things before next session.

Don't skip these. Day 3 builds directly on them.

Paste a paragraph into the OpenAI tokenizer. Count the tokens. Then paste the same paragraph in Hindi or Arabic. Compare.

platform.openai.com/tokenizer — takes 2 minutes, sticks forever.

Call the Claude API (or any LLM API) with temperature=0, then with the highest temperature it accepts (1.0 for Claude; OpenAI's API goes up to 2.0), same prompt. Notice the difference.

If you don't have API access, use Claude.ai and notice how responses change on re-runs.

next up

Day 3 → Prompt Engineering.

System prompts, few-shot, chain-of-thought. The craft of talking to LLMs.

— System prompt anatomy

— Few-shot prompting

— Chain-of-thought (CoT)

— Prompt injection & jailbreaks

— When to prompt vs fine-tune
