Tokens, Context Windows & Parameters
Every time you use an LLM, three things shape the interaction: how your text is broken into tokens, how much text fits in the context window, and how parameters like temperature control the output. Understand these and you understand the invisible constraints of every AI conversation.
Tokens — The Atoms of LLMs
What Are Tokens?
LLMs don't read text the way you do — word by word. They break text into tokens, which are chunks of characters. A token might be a whole word, part of a word, a single character, or even a punctuation mark.
Example tokenization:
"Tokenization is fascinating!"
→ ["Token", "ization", " is", " fascinating", "!"]
= 5 tokens
Rules of Thumb
- 1 token ≈ 4 characters in English (or about ¾ of a word)
- 1,000 tokens ≈ 750 words, a handy figure for quick estimates
- Common words are usually 1 token: "the", "is", "and"
- Uncommon words get split: "tokenization" → "token" + "ization"
- Numbers are often multiple tokens: "2024" might be "20" + "24"
- Non-English languages typically use more tokens per word
- Code is often more token-dense than prose
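These rules of thumb are easy to turn into a quick estimator. A minimal sketch, using only the two approximations above (~4 characters per token and ~0.75 words per token), not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the two rules of thumb:
    ~4 characters per token, and ~0.75 words per token."""
    by_chars = len(text) / 4          # 1 token ≈ 4 characters
    by_words = len(text.split()) / 0.75  # 1,000 tokens ≈ 750 words
    return round((by_chars + by_words) / 2)

# A 1,000-word document lands in the low thousands of tokens:
doc = "word " * 1000
print(estimate_tokens(doc))  # → 1292
```

Real token counts vary by model and content, so treat this as a ballpark, not a bill.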
Why Tokens Matter to You
Every token you send or receive has cost, speed, and quality implications: pricing is typically per token, generation time grows with token count, and tokens spent on filler dilute the model's attention on what matters.
Context Windows — The Model's Working Memory
What Is a Context Window?
The context window is the total amount of text (in tokens) that an LLM can "see" at once — your prompt, any system instructions, conversation history, and the response it's generating. Everything must fit inside this window.
Think of it like a whiteboard with a fixed size. Once it's full, the model can't see anything beyond it.
Context Window Sizes (2025)
| Model | Context Window | Rough Equivalent |
|---|---|---|
| GPT-4o | 128K tokens | ~96,000 words (~300 pages) |
| Claude 3.5 Sonnet | 200K tokens | ~150,000 words (~500 pages) |
| Gemini 2.5 Pro | 1M+ tokens | ~750,000 words (several books) |
| LLaMA 3 (default) | 8K tokens | ~6,000 words (~20 pages) |
| Mistral 7B | 32K tokens | ~24,000 words (~80 pages) |
Context Window Tradeoffs
- Bigger is not always better — Models can struggle to find relevant information in very long contexts ("lost in the middle" problem)
- Cost scales linearly — Filling a 128K context window costs 16x more than an 8K window
- Latency increases — Processing longer contexts takes more time before the first token appears
- Quality may degrade — Attention is spread thinner across longer inputs
Practical tip: Don't dump everything into the context window because you can. Give the model only what it needs. Shorter, focused contexts generally produce better results than massive context dumps.
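One way to apply this tip in code is to keep only as much recent conversation history as fits a token budget. A hedged sketch: `count_tokens` below is the crude ~4-characters-per-token approximation rather than a real tokenizer, and the budget value is an arbitrary example:

```python
def count_tokens(text: str) -> int:
    return len(text) // 4  # crude ~4 chars/token approximation

def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break                    # older messages no longer fit
        kept.append(msg)
        used += cost
    return list(reversed(kept))      # restore chronological order

# Three messages of ~5000, ~2000, and ~1000 tokens; a 4000-token budget
# keeps the two newest and drops the oldest.
history = ["a" * 20000, "b" * 8000, "c" * 4000]
print([len(m) for m in trim_history(history, budget=4000)])  # → [8000, 4000]
```

Production systems often combine this with summarizing the dropped messages, but the principle is the same: focused context beats a full window.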
Temperature — Controlling Randomness
What Is Temperature?
When an LLM predicts the next token, it calculates a probability for every possible token. Temperature controls how the model chooses among these possibilities: a low temperature sharpens the distribution so the most likely tokens dominate (predictable, repeatable output), while a high temperature flattens it so less likely tokens get picked more often (varied, sometimes surprising output).
Temperature 0.3–0.7 is the sweet spot for most production tasks. Run the same prompt at different temperatures and the difference is obvious: at 0 you get nearly identical output every time, while near 1.0 the wording changes on every run.
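Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch with invented logits for three candidate tokens (real models score tens of thousands):

```python
import math

def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities via softmax, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
print(apply_temperature(logits, 0.2))  # low temp: the top token dominates (~99%)
print(apply_temperature(logits, 1.5))  # high temp: the distribution flattens
```

Dividing by a small temperature exaggerates the gaps between scores; dividing by a large one shrinks them, which is exactly why low temperatures feel deterministic and high ones feel creative.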
When to Adjust Temperature
| Task | Recommended Temperature | Why |
|---|---|---|
| Code generation | 0.0–0.2 | Code needs to be correct, not creative |
| Factual Q&A | 0.0–0.3 | Facts should be consistent |
| Summarization | 0.3–0.5 | Needs accuracy with natural phrasing |
| General conversation | 0.5–0.8 | Natural and varied |
| Creative writing | 0.7–1.0 | Variety and surprise are the point |
| Brainstorming | 0.8–1.2 | Want unexpected ideas |
Top-p (Nucleus Sampling)
Top-p is another way to control randomness. Instead of adjusting how "spread out" the probabilities are (temperature), top-p limits which tokens are even considered:
- top-p = 1.0 — Consider all tokens (no filtering)
- top-p = 0.9 — Only consider the smallest set of top tokens whose probabilities add up to 90%
- top-p = 0.1 — Only consider the very top tokens (very focused)
In practice, most people adjust either temperature or top-p, not both. If you're just starting out, stick with temperature and leave top-p at 1.0.
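The filtering step can be sketched directly: sort tokens by probability, keep the smallest set whose cumulative probability reaches p, then renormalize so the survivors sum to 1. The token names and probabilities here are invented for illustration:

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize the kept probabilities to sum to 1."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: pr / total for t, pr in kept.items()}

probs = {"the": 0.5, "a": 0.3, "an": 0.15, "xylophone": 0.05}
print(top_p_filter(probs, 0.9))  # keeps "the", "a", "an"; drops the long tail
print(top_p_filter(probs, 0.1))  # keeps only "the"
```

Note how top-p adapts to the distribution: when the model is confident, few tokens survive; when it is uncertain, more do. Temperature, by contrast, reshapes every distribution the same way.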
Other Parameters You'll Encounter
- Max tokens — The maximum number of tokens the model will generate in its response. Setting this too low cuts the response short; too high wastes budget if you're paying per token.
- Stop sequences — Strings that tell the model to stop generating. Useful for structured output: "stop when you see \n\n" or "stop at END".
- Frequency penalty — Reduces repetition by penalizing tokens that have already appeared. Useful when the model gets stuck in loops.
- Presence penalty — Encourages the model to talk about new topics by penalizing any token that has appeared at all (regardless of frequency).
- System prompt — Not technically a "parameter" but acts like one. It sets the model's behavior, persona, and constraints for the entire conversation.
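Most chat-completion APIs expose these settings as fields on the request. A generic sketch of how they fit together; the field names follow a common OpenAI-style convention and the model name is a placeholder, so check your provider's documentation for the exact spelling:

```python
request = {
    "model": "example-model",  # placeholder, not a real model name
    "messages": [
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize tokenization in two sentences."},
    ],
    "temperature": 0.4,        # summarization: accuracy with natural phrasing
    "top_p": 1.0,              # leave at 1.0 when steering with temperature
    "max_tokens": 150,         # cap the response length (and the cost)
    "stop": ["\n\n"],          # stop generating at the first blank line
    "frequency_penalty": 0.2,  # discourage repeated tokens
    "presence_penalty": 0.0,   # no extra push toward new topics
}
```

The system prompt rides along in `messages` rather than as a top-level parameter, which is why it behaves like a setting even though it is just text.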
Putting It All Together
When you send a message to an LLM, here's the full picture:
- Your prompt, the system prompt, and the conversation history are broken into tokens.
- Everything, including room for the response, must fit inside the context window.
- The model generates the response one token at a time, with temperature and top-p shaping each choice.
- Generation stops when it hits max tokens, a stop sequence, or a natural ending.
Every message goes through this loop, and understanding it helps you optimize cost, speed, and quality.
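The whole loop fits in a toy end-to-end sketch: a fake "model" that scores a tiny vocabulary, wired through temperature scaling, a max-token cap, and a stop token. Everything here is invented for illustration; real models score enormous vocabularies and condition on the context:

```python
import math
import random

def sample(logits: dict[str, float], temperature: float) -> str:
    """Pick one token: softmax over logits, scaled by temperature."""
    m = max(l / temperature for l in logits.values())
    weights = {t: math.exp(l / temperature - m) for t, l in logits.items()}
    r, acc = random.random() * sum(weights.values()), 0.0
    for token, w in weights.items():
        acc += w
        if acc >= r:
            return token
    return token  # floating-point fallback: return the last token

def generate(score_fn, max_tokens: int, stop: str, temperature: float) -> str:
    """Token-by-token generation with a max-token cap and a stop token."""
    output = []
    for _ in range(max_tokens):
        token = sample(score_fn(output), temperature)
        if token == stop:
            break
        output.append(token)
    return " ".join(output)

# A fake "model": always proposes the same three-token distribution.
fake_model = lambda ctx: {"tokens": 2.0, "matter": 1.5, ".": 1.0}
print(generate(fake_model, max_tokens=5, stop=".", temperature=0.3))
```

The output varies run to run (that's the temperature at work), but it always respects the cap and the stop token, which is the shape of every real generation loop.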
Checklist: Do You Understand This?
- Can you estimate how many tokens are in a 1,000-word document?
- What happens if your prompt + response exceeds the context window?
- What temperature would you use for code generation vs. brainstorming?
- Why is a 128K context window not always better than 8K?
- What does "max tokens" control?