Tokens, Context Windows & Parameters
Every time you use an LLM, three things shape the interaction: how your text is broken into tokens, how much text fits in the context window, and how parameters like temperature control the output. Understand these and you understand the invisible constraints of every AI conversation.
Tokens — The Atoms of LLMs
What Are Tokens?
LLMs don't read text the way you do — word by word. They break text into tokens, which are chunks of characters. A token might be a whole word, part of a word, a single character, or even a punctuation mark.
Example tokenization:
"Tokenization is fascinating!"
→ ["Token", "ization", " is", " fascinating", "!"]
= 5 tokens
Rules of Thumb
- 1 token ≈ 4 characters in English (or about ¾ of a word)
- 1,000 tokens ≈ 750 words, a handy figure for quick estimates
- Common words are usually 1 token: "the", "is", "and"
- Uncommon words get split: "tokenization" → "token" + "ization"
- Numbers are often multiple tokens: "2024" might be "20" + "24"
- Non-English languages typically use more tokens per word
- Code is often more token-dense than prose
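These rules of thumb are easy to turn into a quick estimator. A minimal sketch, using only the two approximations above (~4 characters per token and ~0.75 words per token), not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the two rules of thumb:
    ~4 characters per token, and ~0.75 words per token."""
    by_chars = len(text) / 4          # 1 token ≈ 4 characters
    by_words = len(text.split()) / 0.75  # 1,000 tokens ≈ 750 words
    return round((by_chars + by_words) / 2)

# A 1,000-word document lands in the low thousands of tokens:
doc = "word " * 1000
print(estimate_tokens(doc))  # → 1292
```

Real token counts vary by model and content, so treat this as a ballpark, not a bill.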
Why Tokens Matter to You
Every token you send or receive has cost, speed, and quality implications: pricing is typically per token, generation time grows with token count, and tokens spent on filler dilute the model's attention on what matters.
Context Windows — The Model's Working Memory
What Is a Context Window?
The context window is the total amount of text (in tokens) that an LLM can "see" at once — your prompt, any system instructions, conversation history, and the response it's generating. Everything must fit inside this window.
Think of it like a whiteboard with a fixed size. Once it's full, the model can't see anything beyond it.
Context Window Sizes (2025)
| Model | Context Window | Rough Equivalent |
|---|---|---|
| GPT-4o | 128K tokens | ~96,000 words (~300 pages) |
| Claude 3.5 Sonnet | 200K tokens | ~150,000 words (~500 pages) |
| Gemini 2.5 Pro | 1M+ tokens | ~750,000 words (several books) |
| LLaMA 3 (default) | 8K tokens | ~6,000 words (~20 pages) |
| Mistral 7B | 32K tokens | ~24,000 words (~80 pages) |
Context Window Tradeoffs
- Bigger is not always better — Models can struggle to find relevant information in very long contexts ("lost in the middle" problem)
- Cost scales linearly — Filling a 128K context window costs 16x more than an 8K window
- Latency increases — Processing longer contexts takes more time before the first token appears
- Quality may degrade — Attention is spread thinner across longer inputs
Practical tip: Don't dump everything into the context window because you can. Give the model only what it needs. Shorter, focused contexts generally produce better results than massive context dumps.
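One way to apply this tip in code is to keep only as much recent conversation history as fits a token budget. A hedged sketch: `count_tokens` below is the crude ~4-characters-per-token approximation rather than a real tokenizer, and the budget value is an arbitrary example:

```python
def count_tokens(text: str) -> int:
    return len(text) // 4  # crude ~4 chars/token approximation

def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break                    # older messages no longer fit
        kept.append(msg)
        used += cost
    return list(reversed(kept))      # restore chronological order

# Three messages of ~5000, ~2000, and ~1000 tokens; a 4000-token budget
# keeps the two newest and drops the oldest.
history = ["a" * 20000, "b" * 8000, "c" * 4000]
print([len(m) for m in trim_history(history, budget=4000)])  # → [8000, 4000]
```

Production systems often combine this with summarizing the dropped messages, but the principle is the same: focused context beats a full window.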
Temperature — Controlling Randomness
What Is Temperature?
When an LLM predicts the next token, it calculates a probability for every possible token. Temperature controls how the model chooses among these possibilities: a low temperature sharpens the distribution so the most likely tokens dominate (predictable, repeatable output), while a high temperature flattens it so less likely tokens get picked more often (varied, sometimes surprising output).
Temperature 0.3–0.7 is the sweet spot for most production tasks. Run the same prompt at different temperatures and the difference is obvious: at 0 you get nearly identical output every time, while near 1.0 the wording changes on every run.
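Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch with invented logits for three candidate tokens (real models score tens of thousands):

```python
import math

def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities via softmax, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
print(apply_temperature(logits, 0.2))  # low temp: the top token dominates (~99%)
print(apply_temperature(logits, 1.5))  # high temp: the distribution flattens
```

Dividing by a small temperature exaggerates the gaps between scores; dividing by a large one shrinks them, which is exactly why low temperatures feel deterministic and high ones feel creative.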
When to Adjust Temperature
| Task | Recommended Temperature | Why |
|---|---|---|
| Code generation | 0.0–0.2 | Code needs to be correct, not creative |
| Factual Q&A | 0.0–0.3 | Facts should be consistent |
| Summarization | 0.3–0.5 | Needs accuracy with natural phrasing |
| General conversation | 0.5–0.8 | Natural and varied |
| Creative writing | 0.7–1.0 | Variety and surprise are the point |
| Brainstorming | 0.8–1.2 | Want unexpected ideas |
Top-p (Nucleus Sampling)
Top-p is another way to control randomness. Instead of adjusting how "spread out" the probabilities are (temperature), top-p limits which tokens are even considered:
- top-p = 1.0 — Consider all tokens (no filtering)
- top-p = 0.9 — Only consider the smallest set of top tokens whose probabilities add up to 90%
- top-p = 0.1 — Only consider the very top tokens (very focused)
In practice, most people adjust either temperature or top-p, not both. If you're just starting out, stick with temperature and leave top-p at 1.0.
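The filtering step can be sketched directly: sort tokens by probability, keep the smallest set whose cumulative probability reaches p, then renormalize so the survivors sum to 1. The token names and probabilities here are invented for illustration:

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize the kept probabilities to sum to 1."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: pr / total for t, pr in kept.items()}

probs = {"the": 0.5, "a": 0.3, "an": 0.15, "xylophone": 0.05}
print(top_p_filter(probs, 0.9))  # keeps "the", "a", "an"; drops the long tail
print(top_p_filter(probs, 0.1))  # keeps only "the"
```

Note how top-p adapts to the distribution: when the model is confident, few tokens survive; when it is uncertain, more do. Temperature, by contrast, reshapes every distribution the same way.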
Other Parameters You'll Encounter
- Max tokens — The maximum number of tokens the model will generate in its response. Setting this too low cuts the response short; too high wastes budget if you're paying per token.
- Stop sequences — Strings that tell the model to stop generating. Useful for structured output: "stop when you see \n\n" or "stop at END".
- Frequency penalty — Reduces repetition by penalizing tokens that have already appeared. Useful when the model gets stuck in loops.
- Presence penalty — Encourages the model to talk about new topics by penalizing any token that has appeared at all (regardless of frequency).
- System prompt — Not technically a "parameter" but acts like one. It sets the model's behavior, persona, and constraints for the entire conversation.
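Most chat-completion APIs expose these settings as fields on the request. A generic sketch of how they fit together; the field names follow a common OpenAI-style convention and the model name is a placeholder, so check your provider's documentation for the exact spelling:

```python
request = {
    "model": "example-model",  # placeholder, not a real model name
    "messages": [
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize tokenization in two sentences."},
    ],
    "temperature": 0.4,        # summarization: accuracy with natural phrasing
    "top_p": 1.0,              # leave at 1.0 when steering with temperature
    "max_tokens": 150,         # cap the response length (and the cost)
    "stop": ["\n\n"],          # stop generating at the first blank line
    "frequency_penalty": 0.2,  # discourage repeated tokens
    "presence_penalty": 0.0,   # no extra push toward new topics
}
```

The system prompt rides along in `messages` rather than as a top-level parameter, which is why it behaves like a setting even though it is just text.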
Putting It All Together
When you send a message to an LLM, here's the full picture:
- Your prompt, the system prompt, and the conversation history are broken into tokens.
- Everything, including room for the response, must fit inside the context window.
- The model generates the response one token at a time, with temperature and top-p shaping each choice.
- Generation stops when it hits max tokens, a stop sequence, or a natural ending.
Every message goes through this loop, and understanding it helps you optimize cost, speed, and quality.
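The whole loop fits in a toy end-to-end sketch: a fake "model" that scores a tiny vocabulary, wired through temperature scaling, a max-token cap, and a stop token. Everything here is invented for illustration; real models score enormous vocabularies and condition on the context:

```python
import math
import random

def sample(logits: dict[str, float], temperature: float) -> str:
    """Pick one token: softmax over logits, scaled by temperature."""
    m = max(l / temperature for l in logits.values())
    weights = {t: math.exp(l / temperature - m) for t, l in logits.items()}
    r, acc = random.random() * sum(weights.values()), 0.0
    for token, w in weights.items():
        acc += w
        if acc >= r:
            return token
    return token  # floating-point fallback: return the last token

def generate(score_fn, max_tokens: int, stop: str, temperature: float) -> str:
    """Token-by-token generation with a max-token cap and a stop token."""
    output = []
    for _ in range(max_tokens):
        token = sample(score_fn(output), temperature)
        if token == stop:
            break
        output.append(token)
    return " ".join(output)

# A fake "model": always proposes the same three-token distribution.
fake_model = lambda ctx: {"tokens": 2.0, "matter": 1.5, ".": 1.0}
print(generate(fake_model, max_tokens=5, stop=".", temperature=0.3))
```

The output varies run to run (that's the temperature at work), but it always respects the cap and the stop token, which is the shape of every real generation loop.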
Checklist: Do You Understand This?
- Can you estimate how many tokens are in a 1,000-word document?
- What happens if your prompt + response exceeds the context window?
- What temperature would you use for code generation vs. brainstorming?
- Why is a 128K context window not always better than 8K?
- What does "max tokens" control?