Beginner

AI Glossary

Plain-English definitions of the terms you'll encounter most often in AI. Each definition focuses on what the concept means in practice — not the mathematical formalism.

A

Agent / Agentic AI: An AI system that takes actions in the world — calling tools, running code, browsing the web, writing files — rather than just generating text. An agent has a goal and decides what steps to take to reach it.
Alignment: The challenge of making AI systems behave in ways that match human intentions and values. Alignment work includes RLHF, Constitutional AI, and interpretability research.
Attention Mechanism: The core innovation of the Transformer architecture. Attention lets each token in a sequence look at every other token and decide how relevant they are, enabling the model to track long-range context and relationships.
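
To make the Attention Mechanism entry concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It assumes the token vectors are reused directly as queries, keys, and values; real Transformers first pass them through learned projection matrices, which are omitted here. The three 2-dimensional token vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every token to this query token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted mix of the value vectors
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical token vectors, reused as queries, keys, and values
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` says how much one token attends to every token in the sequence, and each output is the corresponding blend of value vectors.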

B

Backpropagation: The algorithm used to train neural networks. It measures the error in the model's output, then propagates that error backwards through the network to work out how much each weight contributed to it. An optimiser then nudges the weights to reduce the error. This cycle is repeated millions of times during training.
Benchmark: A standardised test used to compare AI models. Examples: MMLU (knowledge), SWE-bench (coding), GPQA Diamond (scientific reasoning), HumanEval (code generation). Results should always be read with scepticism — models are often tuned to perform well on popular benchmarks.
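
The Backpropagation entry above can be illustrated with a toy example. This is a sketch under simplifying assumptions: a "network" of just two weights chained together, squared error as the loss, and plain gradient descent as the optimiser. The function name, data, and learning rate are all hypothetical.

```python
def train_two_layer(xs, ys, lr=0.01, steps=500):
    """Fit pred = w2 * (w1 * x) by gradient descent, with gradients from the chain rule."""
    w1, w2 = 0.5, 0.5
    for _ in range(steps):
        grad_w1 = grad_w2 = 0.0
        for x, y in zip(xs, ys):
            # Forward pass
            hidden = w1 * x
            pred = w2 * hidden
            # Backward pass: start from the error and work back layer by layer
            d_pred = 2 * (pred - y)      # derivative of squared error w.r.t. pred
            grad_w2 += d_pred * hidden   # how much w2 contributed
            d_hidden = d_pred * w2       # error passed back to the hidden value
            grad_w1 += d_hidden * x      # how much w1 contributed
        # Update step: move each weight against its average gradient
        w1 -= lr * grad_w1 / len(xs)
        w2 -= lr * grad_w2 / len(xs)
    return w1, w2

# Toy data following y = 2x, so the product w1 * w2 should approach 2
w1, w2 = train_two_layer([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Real networks do the same thing with millions or billions of weights, and frameworks compute the backward pass automatically.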

C

Chain-of-Thought (CoT): A prompting technique where you ask the model to work through a problem step by step before giving an answer. Improves accuracy on reasoning, maths, and multi-step tasks. Adding a phrase such as 'think step by step' to the prompt is a common way to elicit it.
Claude Code: Anthropic's agentic coding tool — a CLI (and IDE extensions) that gives Claude access to your file system, terminal, and git. Used for autonomous coding tasks that span multiple files.
Computer Use: The ability of an AI model to control a computer — clicking, typing, reading the screen — as if it were a human user. Anthropic's Claude and similar models can operate desktop and browser environments.
Constitutional AI: Anthropic's technique for training AI to follow a set of principles ('the constitution') by having the AI critique and revise its own outputs according to those principles, reducing the need for human labelling of harmful content.
Context Caching: A feature offered by some AI providers (e.g., Anthropic, Google) that stores a frequently reused portion of a prompt (e.g., a large system prompt or document) in a cache. Subsequent requests that hit the cache are billed at a steep discount for those cached tokens: up to roughly 90% less, depending on the provider and pricing tier.
Context Window: The maximum amount of text (measured in tokens) a model can read and hold in 'working memory' at once. Models with larger context windows can process longer documents, more chat history, or bigger codebases in a single call.
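
As a rough worked example of the Context Caching entry above, the sketch below compares the input cost of one request with and without a cache hit. The price of $3.00 per million tokens, the token counts, and the `prompt_cost` helper are all hypothetical. It also ignores any surcharge a provider may apply when the cache is first written.

```python
def prompt_cost(cached_tokens, fresh_tokens, price_per_million, cache_discount=0.0):
    """Input cost in dollars; cached tokens are billed at (1 - cache_discount) of the base price."""
    cached_price = price_per_million * (1.0 - cache_discount)
    return (cached_tokens * cached_price + fresh_tokens * price_per_million) / 1_000_000

# Hypothetical numbers: a 50,000-token document reused on every request,
# plus a 1,000-token question, at an assumed $3.00 per million input tokens.
without_cache = prompt_cost(50_000, 1_000, price_per_million=3.00)
with_cache = prompt_cost(50_000, 1_000, price_per_million=3.00, cache_discount=0.90)
```

Under these assumptions the request costs about $0.153 without the cache and about $0.018 with it, because only the short question is billed at the full rate.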

D

Distillation: Training a smaller model (the student) to mimic the outputs of a larger, more powerful model (the teacher). The resulting small model is cheaper and faster to run while retaining much of the teacher's capability.

E

Embedding: A numerical representation of text (or images, audio) as a vector — a list of numbers. Similar meanings produce similar vectors. Embeddings are used to compare semantic similarity, power vector databases, and fuel RAG pipelines.
Evaluation (Eval): The process of measuring an AI system's performance on specific tasks. Evals can be automated (LLM-as-judge, unit tests on outputs), human-rated, or benchmark-based. Good evals are the foundation of reliable AI product development.
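
To illustrate the Embedding entry above, here is a small sketch of cosine similarity, one common way to compare two vectors. The three-dimensional vectors are invented for the example; real embeddings come from a trained model and usually have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, not the output of a real embedding model
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "invoice": [0.1, 0.2, 0.95],
}
cat_vs_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
cat_vs_invoice = cosine_similarity(embeddings["cat"], embeddings["invoice"])
```

With these toy values, "cat" scores much closer to "kitten" than to "invoice", which is the behaviour that semantic search and RAG retrieval rely on.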

F

Few-Shot Learning: Providing the model with a small number of examples in the prompt to demonstrate the task format or style. 'Few-shot' typically means roughly 2–10 examples. Contrast with zero-shot (no examples) and one-shot (exactly one example).
Fine-tuning: Continuing to train a pre-trained model on a smaller, task-specific dataset to improve its performance on that task. Less data than pre-training, much cheaper, and produces a specialised model. LoRA is the dominant efficient fine-tuning technique.
Foundation Model: A large model trained on vast amounts of general data, designed to be adapted for many different downstream tasks. GPT-4, Claude, Gemini, and Llama are all foundation models. The term emphasises that they serve as the foundation for more specific applications.
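
The Few-Shot Learning entry above boils down to how the prompt is assembled. The sketch below builds such a prompt as a plain string. The `build_few_shot_prompt` helper, the "Review:" / "Sentiment:" layout, and the example reviews are all assumptions for illustration; no particular provider's API is implied.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final item is left unanswered for the model to complete
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

examples = [
    ("Loved it, would buy again.", "positive"),
    ("Broke after two days.", "negative"),
    ("Does the job, nothing special.", "neutral"),
]
instruction = "Classify the sentiment of each review."
few_shot_prompt = build_few_shot_prompt(instruction, examples, "Arrived late but works well.")
zero_shot_prompt = build_few_shot_prompt(instruction, [], "Arrived late but works well.")
```

Passing an empty list of examples gives the zero-shot version of the same prompt, which makes it easy to compare the two approaches.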

G

GraphRAG: A RAG architecture that uses a knowledge graph for retrieval instead of (or in addition to) vector search. Excels at multi-hop reasoning and global synthesis across large corpora. Microsoft's GraphRAG uses hierarchical community detection to enable corpus-wide queries.
Grounding: Connecting an AI model's outputs to verifiable facts or sources. Grounding techniques include RAG (attaching retrieved documents), citations, tool use (real-time web search), and structured data lookups. Reduces hallucination.
Guardrails: Rules, filters, and safety layers that constrain what an AI model will do. Can be prompt-level (system prompt instructions), model-level (trained safety behaviour), or infrastructure-level (output classifiers, topic blockers).

H

Hallucination: When an AI model confidently produces false, fabricated, or nonsensical content. Not a bug — it's a consequence of how language models work (they generate plausible text, not factually verified text). Mitigated by grounding, RAG, and careful evaluation.

I

Inference: Running a trained model to generate outputs — the production-time use of a model. Inference is what happens when you send a prompt to Claude or GPT-4 and receive a response. Contrast with training, which is the computationally expensive process of building the model.

K

Knowledge Graph: A structured representation of information as nodes (entities) and edges (typed relationships). Unlike the fixed tables of a relational database, a knowledge graph models arbitrary relationships and supports multi-hop traversal queries.
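
A minimal sketch of the idea, assuming the graph is stored as (subject, relation, object) triples and traversed with a breadth-first search. The entities and relation names are invented; production systems typically use a graph database and a query language instead.

```python
from collections import deque

# Hypothetical facts as (subject, relation, object) triples
triples = [
    ("Ada", "works_at", "Acme"),
    ("Acme", "located_in", "Berlin"),
    ("Berlin", "capital_of", "Germany"),
    ("Ada", "knows", "Grace"),
]

def build_graph(triples):
    """Index the triples by subject so outgoing edges are easy to follow."""
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, []).append((relation, obj))
    return graph

def find_path(graph, start, goal):
    """Breadth-first search returning the hops from start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [(node, relation, neighbour)]))
    return None

graph = build_graph(triples)
path = find_path(graph, "Ada", "Germany")
```

Here `path` links Ada to Germany through three typed edges, which is the kind of multi-hop question that is awkward to answer with keyword or vector search alone.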

L

Large Language Model (LLM): A neural network trained on vast text data to predict and generate text. 'Large' refers to the number of parameters (billions to trillions). GPT-4, Claude, Gemini, and Llama are all LLMs.
Latency: The time between sending a request to an AI model and receiving a response. It is often reported as time-to-first-token (TTFT), the delay before the first token of the response arrives. Critical for real-time applications like voice assistants and chat interfaces. Larger models are generally slower.
LoRA (Low-Rank Adaptation): A widely used technique for efficient fine-tuning. Instead of updating all model weights, LoRA freezes them, adds small low-rank adapter matrices to specific layers, and trains only those. This cuts the number of trainable parameters by orders of magnitude and substantially reduces training memory. Many fine-tuned models in production use LoRA or a variant of it.
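
The saving from LoRA can be checked with simple arithmetic. The sketch below assumes a single square weight matrix of size 4096 × 4096 and an adapter rank of 8; both numbers are illustrative choices, not taken from any specific model.

```python
def full_params(d_out, d_in):
    """Trainable values when the whole weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable values in the two adapter matrices: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

d = 4096   # assumed layer width
rank = 8   # assumed adapter rank
full = full_params(d, d)
adapter = lora_params(d, d, rank)
reduction = full / adapter
```

For this one layer, the adapter has 65,536 trainable values instead of 16,777,216, a 256-fold reduction. The overall saving for a model depends on which layers receive adapters and on the rank chosen.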

M

MCP (Model Context Protocol): An open standard released by Anthropic in November 2024 for connecting AI models to external tools, data sources, and services. MCP defines a common protocol so any AI assistant can talk to any MCP-compatible server (file systems, databases, APIs, etc.).
Mixture of Experts (MoE): A model architecture where only a subset of the model's parameters (the 'experts') are activated for each token, rather than the full network. This allows very large total parameter counts while keeping per-token compute manageable. Llama 4 and DeepSeek V3/V4 use MoE.
Multi-Agent System: An architecture where multiple AI agents collaborate — orchestrator agents delegate tasks to specialist sub-agents, which report results back. Enables parallelism, specialisation, and tasks too complex for a single context window.
Multimodal: A model that can process and/or generate multiple types of data — text, images, audio, video. GPT-4o, Claude 3 Opus, and Gemini are multimodal models. 'Natively multimodal' means the model was trained on mixed modalities from the start, not retrofitted.
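
The routing idea in the Mixture of Experts entry above can be sketched in a few lines. This toy version assumes the gate scores are given directly and that each "expert" is a simple function on a number. In a real model the scores come from a learned router and the experts are neural network layers.

```python
import math

def moe_layer(x, gate_scores, experts, k=2):
    """Run only the k highest-scoring experts and blend their outputs."""
    # Pick the indices of the top-k experts by gate score
    top = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    # Softmax over the chosen experts' scores gives the mixing weights
    exps = [math.exp(gate_scores[i]) for i in top]
    total = sum(exps)
    output = sum((e / total) * experts[i](x) for e, i in zip(exps, top))
    return output, top

# Four hypothetical experts; only two are evaluated per input
experts = [
    lambda x: 2 * x,
    lambda x: x + 1,
    lambda x: -x,
    lambda x: x * x,
]
gate_scores = [0.1, 2.0, -1.0, 2.0]  # assumed router output for this input
output, chosen = moe_layer(3.0, gate_scores, experts, k=2)
```

Experts 1 and 3 are selected with equal weight, so the other two experts are never evaluated for this input. That skipped work is where the compute saving comes from.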

N

Neural Network: A computational structure loosely inspired by the brain: layers of interconnected nodes (neurons) with learned weights. The Transformer architecture used by virtually all modern LLMs is a type of neural network.

O

Ontology: A formal definition of the types of entities and relationships in a domain. In knowledge graphs, an ontology acts as the schema, specifying allowed node types, edge types, and constraints. A light ontology guides LLM extraction toward consistent structure.
Overfitting: When a model learns the training data too well, including its noise and quirks, and loses the ability to generalise to new examples. A classic machine-learning failure mode: near-perfect on the training set, poor on anything it hasn't seen.

P

Parameters / Weights: The numerical values inside a neural network that are learned during training. A model described as '70B parameters' has 70 billion such values. More parameters generally means more capability — but also more compute cost and memory.
Pre-training: The first and most expensive phase of building a language model: training on massive amounts of text (web pages, books, code) to predict the next token. Pre-training is what gives a model general language understanding. Fine-tuning happens afterwards.
Prompt Engineering: The practice of crafting inputs to AI models to get better outputs — through techniques like system prompts, few-shot examples, chain-of-thought instructions, and structured output requests. Part skill, part science.

Q

Quantization: Reducing the numerical precision of model weights (e.g., from 16- or 32-bit floats to 8- or 4-bit integers) to shrink model size and speed up inference. Many 4-bit quantized models fit on consumer GPUs. Some quality is lost, but the tradeoff is often worth it for local deployment.
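
A minimal sketch of the idea, assuming the simplest scheme: symmetric 4-bit quantization with a single shared scale for the whole list of weights. Practical quantization methods are more sophisticated (per-group scales, calibration data), and the weights below are made up.

```python
def quantize_4bit(weights):
    """Symmetric 4-bit quantization: map floats to integers in [-8, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.70, -0.42, 0.13, 0.0, -0.66]  # hypothetical weights
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is now stored as a small integer plus one shared scale. The restored values are close to the originals but not identical, which is the quality loss the entry refers to.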

R

RAG (Retrieval-Augmented Generation): An architecture that retrieves relevant documents from an external store (vector database, graph, or search index) and injects them into the model's context before generating a response. RAG grounds the model in specific knowledge without fine-tuning.
Reasoning Model: A model class that spends compute 'thinking' before answering — generating an internal chain of thought that reasons through the problem step by step. OpenAI's o-series, Claude 3.7 Sonnet extended thinking, and DeepSeek R1 are reasoning models.
RLHF (Reinforcement Learning from Human Feedback): A training technique where humans rate model outputs, and those ratings are used to train a reward model, which then guides the LLM to produce outputs humans prefer. RLHF is a key part of what turns base models into helpful, harmless assistants.
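
The retrieve-then-generate flow in the RAG entry above can be sketched without any external services. This example assumes a crude word-overlap score as a stand-in for vector search, and it stops at building the prompt rather than calling a model. The documents and helper names are hypothetical.

```python
def tokenize(text):
    """Lowercase the text, strip basic punctuation, and split into a set of words."""
    return set(text.lower().replace("?", "").replace(".", "").split())

def retrieve(query, documents, top_k=1):
    """Rank documents by how many words they share with the query (a stand-in for vector search)."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & tokenize(doc)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents):
    """Inject the retrieved documents into the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

documents = [
    "Refunds are processed within 14 days of receiving the returned item.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes 3 to 5 business days.",
]
prompt = build_prompt("How long do refunds take?", documents)
```

The resulting prompt contains only the refund policy as context, so the model is steered towards answering from that text rather than from memory.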

S

SLM (Small Language Model): A language model with far fewer parameters than frontier LLMs — typically under 10B. Can run on consumer hardware or on-device. Examples: Phi-4 (Microsoft), Gemini Nano, Llama 3.2 3B. Often fine-tuned for specific narrow tasks.
Softmax: A mathematical function that converts raw model scores (logits) into a probability distribution summing to 1. Used at the output layer to express how likely each token is to come next.
System Prompt: Instructions given to a model before the conversation begins — defining its role, constraints, tone, and behaviour. The system prompt is typically invisible to end users and is set by the application developer. Central to building AI products.

T

Temperature: A parameter controlling how random the model's outputs are. Temperature 0 = effectively deterministic (the model always picks the most probable next token, though some APIs can still show small run-to-run differences). Higher temperature = more creative and varied but less reliable. Typical values: 0 (code/data tasks), 0.7 (general use), 1.0+ (creative writing).
Test-Time Compute: The idea of allocating more compute at inference time (not just during training) to improve output quality. Reasoning models use test-time compute to 'think' — generating internal reasoning steps before producing an answer.
Token: The basic unit a language model reads and writes. A token is roughly 0.75 words in English — 'unbelievable' might be 3–4 tokens, a short sentence 10–15. Models are priced per 1,000 or 1,000,000 tokens. Context window size is measured in tokens.
Tool Calling (Function Calling): The ability of a model to decide to invoke an external function or API — search the web, run code, query a database — and incorporate the result into its response. The foundation of agentic AI behaviour.
Top-p (Nucleus Sampling): A sampling technique that restricts the model to the smallest set of tokens whose cumulative probability exceeds p. Top-p 0.9 means only tokens covering the top 90% of probability mass are considered. Often used alongside temperature.
Transformer: The neural network architecture that powers virtually all modern LLMs. Introduced in 'Attention Is All You Need' (2017). Its key innovation is the attention mechanism, which allows the model to process all tokens in parallel and model long-range dependencies.
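
The Temperature and Top-p entries above both act on the model's next-token probabilities, and a short sketch shows how. The four logits are invented, and the function names are hypothetical. A temperature of exactly 0 is not handled here, since implementations usually treat it as "always pick the top token" rather than dividing by zero.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p, renormalised."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for four candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
normal = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
nucleus = top_p_filter(normal, p=0.9)
```

Lower temperature concentrates probability on the top token, and higher temperature spreads it out. With p = 0.9, the least likely of the four tokens is dropped before sampling.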

V

Vector Database: A database designed to store and query embeddings (vectors). Similarity search finds the vectors closest to a query vector, usually with approximate nearest-neighbour algorithms such as HNSW or IVF (available in libraries like FAISS). Examples: Pinecone, Weaviate, Chroma, pgvector (Postgres). The storage layer for most RAG systems.

Z

Zero-Shot Learning: Asking a model to perform a task without providing any examples — relying entirely on the model's pre-training. Modern LLMs are surprisingly capable zero-shot. Add few-shot examples when zero-shot quality is insufficient.