🧠 All Things AI
Beginner

Tokens, Context Windows & Parameters

Every time you use an LLM, three things shape the interaction: how your text is broken into tokens, how much text fits in the context window, and how parameters like temperature control the output. Understand these and you understand the invisible constraints of every AI conversation.

Tokens — The Atoms of LLMs

What Are Tokens?

LLMs don't read text the way you do — word by word. They break text into tokens, which are chunks of characters. A token might be a whole word, part of a word, a single character, or even a punctuation mark.

Example tokenization:

"Tokenization is fascinating!"

→ ["Token", "ization", " is", " fascinating", "!"]

= 5 tokens

Rules of Thumb

  • 1 token ≈ 4 characters in English (or about ¾ of a word)
  • 1,000 tokens ≈ 750 words — a useful approximation for estimation
  • Common words are usually 1 token: "the", "is", "and"
  • Uncommon words get split: "tokenization" → "token" + "ization"
  • Numbers are often multiple tokens: "2024" might be "20" + "24"
  • Non-English languages typically use more tokens per word
  • Code is often more token-dense than prose
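The rules of thumb above are easy to turn into a quick estimator. A minimal sketch (function names are illustrative, and the heuristic is deliberately rough — it overestimates the 5-token example above):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token rule."""
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate using the ~750 words per 1,000 tokens rule."""
    return round(word_count * 1000 / 750)

# A 1,000-word document is roughly 1,333 tokens.
print(estimate_tokens_from_words(1000))           # → 1333
# The heuristic says 7 for the example above (the real tokenizer said 5).
print(estimate_tokens("Tokenization is fascinating!"))  # → 7
```

For exact counts you need the model's actual tokenizer; these approximations are only for back-of-the-envelope budgeting.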

Why Tokens Matter to You

  • Cost — API pricing is per token (input + output). More tokens = more cost.
  • Speed — More output tokens = longer wait. Generation speed is measured in tokens/sec.
  • Context limits — Prompt + response must fit within the context window (in tokens).
  • Quality — Very long prompts can degrade quality as attention spreads thin.

Every token you send or receive has cost, speed, and quality implications
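The cost and speed implications can be sketched with a small calculator. The prices and throughput below are purely illustrative placeholders, not any provider's real rates:

```python
# Hypothetical numbers for illustration only; real per-token rates
# and streaming speeds vary by provider and model.
PRICE_PER_1M_INPUT = 2.50    # USD per 1M input tokens (illustrative)
PRICE_PER_1M_OUTPUT = 10.00  # USD per 1M output tokens (illustrative)
TOKENS_PER_SECOND = 50       # illustrative output streaming speed

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Total USD cost of one request: input and output billed separately."""
    return (input_tokens * PRICE_PER_1M_INPUT
            + output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000

def wait_seconds(output_tokens: int) -> float:
    """Rough time to stream the full response."""
    return output_tokens / TOKENS_PER_SECOND

print(round(request_cost(10_000, 2_000), 4))  # → 0.045
print(wait_seconds(2_000))                    # → 40.0
```

Note that output tokens often cost several times more than input tokens, which is why verbose responses dominate the bill.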

Context Windows — The Model's Working Memory

What Is a Context Window?

The context window is the total amount of text (in tokens) that an LLM can "see" at once — your prompt, any system instructions, conversation history, and the response it's generating. Everything must fit inside this window.

Think of it like a whiteboard with a fixed size. Once it's full, the model can't see anything beyond it.

Context Window Sizes (2025)

| Model | Context Window | Rough Equivalent |
| --- | --- | --- |
| GPT-4o | 128K tokens | ~96,000 words (~300 pages) |
| Claude 3.5 Sonnet | 200K tokens | ~150,000 words (~500 pages) |
| Gemini 2.5 Pro | 1M+ tokens | ~750,000 words (several books) |
| LLaMA 3 (default) | 8K tokens | ~6,000 words (~20 pages) |
| Mistral 7B | 32K tokens | ~24,000 words (~80 pages) |

Context Window Tradeoffs

  • Bigger is not always better — Models can struggle to find relevant information in very long contexts ("lost in the middle" problem)
  • Cost scales linearly — Filling a 128K context window costs 16x more than an 8K window
  • Latency increases — Processing longer contexts takes more time before the first token appears
  • Quality may degrade — Attention is spread thinner across longer inputs

Practical tip: Don't dump everything into the context window because you can. Give the model only what it needs. Shorter, focused contexts generally produce better results than massive context dumps.
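A budget check like the one implied here is easy to sketch. This uses the ~4 chars/token approximation from earlier (the function and its defaults are illustrative, not any real API):

```python
def fits_in_context(system: str, history: list[str], message: str,
                    max_response_tokens: int, window: int = 8_000) -> bool:
    """Check whether system prompt + history + message, plus space
    reserved for the response, fit inside the context window.
    Uses the rough ~4 characters per token approximation."""
    def toks(text: str) -> int:
        return len(text) // 4 + 1

    used = toks(system) + sum(toks(m) for m in history) + toks(message)
    return used + max_response_tokens <= window

# A short prompt easily fits an 8K window with 1,000 tokens reserved.
print(fits_in_context("You are helpful.", [], "Hi", 1_000))  # → True
# A ~10K-token prompt does not.
print(fits_in_context("x" * 40_000, [], "Hi", 1_000))        # → False
```

In a real application you would count tokens with the model's actual tokenizer and trim or summarize old history when the check fails.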

Temperature — Controlling Randomness

What Is Temperature?

When an LLM predicts the next token, it calculates a probability for every possible token. Temperature controls how the model chooses among these possibilities:

  • Code generation: 0–0.2
  • Factual Q&A: 0.3
  • Summarization: 0.5
  • Conversation: 0.7
  • Creative writing: 1.0
  • Brainstorming: 1.2

At temperature 0, sampling is deterministic: the model always picks the most probable token. At temperature 1.5 and above, output becomes very random and increasingly chaotic.

Temperature 0.3–0.7 is the sweet spot for most production tasks

Same prompt at different temperatures:

temp=0: "The best programming language for beginners is Python."
temp=0.7: "Python is widely considered the best starting language, though JavaScript is also a strong choice."
temp=1.5: "Honestly? Scratch. No, wait — maybe Lua. Actually, have you considered learning to talk to ducks first?"
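The mechanism behind these examples can be sketched in a few lines: divide each token's score by the temperature, convert to probabilities, then sample. The logit values here are toy numbers, not real model output:

```python
import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float) -> str:
    """Toy sketch of temperature sampling over next-token logits."""
    if temperature == 0:  # greedy: always the single most probable token
        return max(logits, key=logits.get)
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = {t: l / temperature for t, l in logits.items()}
    z = max(scaled.values())  # subtract max for numerical stability
    exps = {t: math.exp(l - z) for t, l in scaled.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    r = random.random()
    cum = 0.0
    for token, p in probs.items():
        cum += p
        if r <= cum:
            return token
    return token  # guard against floating-point rounding

logits = {" Python": 4.0, " JavaScript": 2.5, " Scratch": 0.5}
print(sample_with_temperature(logits, 0))  # → " Python" (always)
```

At temperature 0 the answer never varies; at higher temperatures the lower-probability tokens get picked more and more often, which is exactly the behavior shown in the three examples above.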

When to Adjust Temperature

| Task | Recommended Temperature | Why |
| --- | --- | --- |
| Code generation | 0–0.2 | Code needs to be correct, not creative |
| Factual Q&A | 0–0.3 | Facts should be consistent |
| Summarization | 0.3–0.5 | Needs accuracy with natural phrasing |
| General conversation | 0.5–0.8 | Natural and varied |
| Creative writing | 0.7–1.0 | Variety and surprise are the point |
| Brainstorming | 0.8–1.2 | Want unexpected ideas |

Top-p (Nucleus Sampling)

Top-p is another way to control randomness. Instead of adjusting how "spread out" the probabilities are (temperature), top-p limits which tokens are even considered:

  • top-p = 1.0 — Consider all tokens (no filtering)
  • top-p = 0.9 — Consider only the smallest set of top tokens whose cumulative probability reaches 90%
  • top-p = 0.1 — Consider only a handful of the very top tokens (very focused)

In practice, most people adjust either temperature or top-p, not both. If you're just starting out, stick with temperature and leave top-p at 1.0.

Other Parameters You'll Encounter

  • Max tokens — The maximum number of tokens the model will generate in its response. Setting this too low cuts the response short; too high wastes budget if you're paying per token.
  • Stop sequences — Strings that tell the model to stop generating. Useful for structured output: "stop when you see \n\n" or "stop at END".
  • Frequency penalty — Reduces repetition by penalizing tokens that have already appeared. Useful when the model gets stuck in loops.
  • Presence penalty — Encourages the model to talk about new topics by penalizing any token that has appeared at all (regardless of frequency).
  • System prompt — Not technically a "parameter" but acts like one. It sets the model's behavior, persona, and constraints for the entire conversation.

Putting It All Together

When you send a message to an LLM, here's the full picture:

  1. Tokenize — text → token IDs
  2. Assemble the context window — system prompt + history + your message
  3. Forward pass — the model processes all tokens through its layers
  4. Sample — temperature + top-p select the next token
  5. Repeat — until max tokens or a stop sequence
  6. Bill — input tokens + output tokens

Every message goes through this loop — understanding it helps you optimize cost, speed, and quality
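The sample/repeat part of this loop can be sketched as a decoding function. The `model_step` callable stands in for a real model's next-token prediction, and the stop-sequence default is just an example:

```python
def generate(prompt_tokens: list[str], model_step,
             max_tokens: int = 50, stop: tuple[str, ...] = ("\n\n",)) -> str:
    """Toy decoding loop: sample one token at a time until either
    max_tokens is reached or a stop sequence appears in the output."""
    out: list[str] = []
    for _ in range(max_tokens):
        next_token = model_step(prompt_tokens + out)  # sampling happens here
        out.append(next_token)
        text = "".join(out)
        if any(s in text for s in stop):
            # Trim the stop sequence (and anything after it) from the result.
            for s in stop:
                text = text.split(s)[0]
            return text
    return "".join(out)

# A fake "model" that emits a fixed reply followed by a stop marker.
reply = iter(["Hello", ",", " world", "\n\n", "never reached"])
print(generate([], lambda ctx: next(reply)))  # → Hello, world
```

Real APIs run this loop server-side; max tokens and stop sequences are simply the two exit conditions you get to configure.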

Checklist: Do You Understand This?

  • Can you estimate how many tokens are in a 1,000-word document?
  • What happens if your prompt + response exceeds the context window?
  • What temperature would you use for code generation vs. brainstorming?
  • Why is a 128K context window not always better than 8K?
  • What does "max tokens" control?