Intermediate

Embedding Models

An embedding model converts text into a dense numerical vector that represents its meaning. Similar texts produce similar vectors. Choosing the right embedding model affects retrieval quality, cost, and storage requirements — independently of which LLM you use to generate answers.

What an Embedding Is

Figure: two inputs, "How do I reset my password?" and "I forgot my login credentials", pass through an embedding model (text-embedding-3-small) into a vector space, yielding [0.12, -0.87, 0.34, …] and [0.11, -0.85, 0.36, …] (1536 dimensions each). Similar sentences produce close vectors, which enables semantic search.

An embedding is a fixed-length vector of floating-point numbers, typically a few hundred to a few thousand dimensions (384 to 3072 for the models covered here). The embedding model is trained to place semantically similar texts close together in this vector space and dissimilar texts far apart. This is what enables similarity search: embed the query, then find the stored vectors closest to the query vector.
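
As a rough illustration of "find the stored vectors closest to the query vector", the sketch below ranks a few stored vectors by cosine similarity using brute force. The 3-dimensional vectors, the document ids, and the helper names (cosine_similarity, top_k) are made up for the example; a real system would use full-size embeddings and a vector database rather than a Python loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, stored, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional vectors (hypothetical); real embeddings are far longer.
stored = {
    "forgot-credentials": [0.11, -0.85, 0.36],
    "shipping-times": [0.90, 0.20, -0.10],
    "refund-policy": [-0.40, 0.10, 0.80],
}
query = [0.12, -0.87, 0.34]  # stands in for "How do I reset my password?"

results = top_k(query, stored, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The "forgot-credentials" vector ranks first because it points in almost the same direction as the query; the other two point elsewhere and score much lower.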

Embeddings capture meaning, not exact wording. "How do I reset my password?" and "I forgot my login credentials" will have similar embeddings even though the only words they share are "I" and "my". This is the core advantage over keyword search.
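
To see why keyword matching struggles with this pair, the sketch below measures keyword overlap between the two sentences. The stop-word list and the helper names (keywords, jaccard) are assumptions made for this illustration, not part of any particular search engine.

```python
import re

# Hypothetical minimal stop-word list, chosen only for this illustration.
STOP_WORDS = {"how", "do", "i", "my", "the", "a", "to"}

def keywords(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def jaccard(a, b):
    """Shared keywords divided by total distinct keywords."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

query = keywords("How do I reset my password?")
doc = keywords("I forgot my login credentials")

print(sorted(query))
print(sorted(doc))
print(jaccard(query, doc))
```

Once stop words are removed, the two sentences have no keywords in common, so a purely lexical match scores them as unrelated even though they describe the same problem.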

Embedding Model Options

OpenAI — text-embedding-3-small / text-embedding-3-large

Dimensions: 1536 (small), 3072 (large), or fewer via the API's dimensions parameter
Cost: $0.020 per million tokens (small), $0.130 per million tokens (large)
Strengths: Strong general performance, widely supported by vector databases, easy to get started with
Best for: Most production RAG use cases; default choice if already using OpenAI infrastructure
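
Dimension reduction can be requested from the API directly, but the underlying operation is simple enough to sketch by hand: keep the leading components and rescale to unit length. The function name shorten_embedding and the toy vector are hypothetical, and the approach only preserves quality for models trained to tolerate truncation, which is an assumption you should verify for any other model.

```python
import math

def shorten_embedding(vector, target_dims):
    """Keep the first target_dims components and rescale to unit length."""
    if not 0 < target_dims <= len(vector):
        raise ValueError("target_dims must be between 1 and len(vector)")
    truncated = vector[:target_dims]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in truncated]

full = [0.5, 0.5, 0.5, 0.5]  # toy unit-length vector standing in for a real embedding
short = shorten_embedding(full, 2)
print(len(short), short)
```

Renormalising matters because similarity scores computed with a plain dot product assume unit-length vectors; a truncated vector is shorter than 1 until it is rescaled.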

Cohere — embed-english-v3.0 / embed-multilingual-v3.0

Dimensions: 1024
Cost: $0.100 per million tokens
Strengths: Strong performance on retrieval benchmarks; multilingual model covers 100+ languages; native int8/binary quantisation support reduces storage
Best for: Multilingual knowledge bases; storage-constrained deployments

Open-source — bge, e5, nomic-embed (self-hosted)

Popular models: BAAI/bge-small-en-v1.5 (384 dims), intfloat/e5-large-v2 (1024 dims), nomic-ai/nomic-embed-text-v1.5 (768 dims)
Cost: Inference compute only — no per-token API fees
Strengths: No data leaves your infrastructure; competitive quality on MTEB benchmarks; smaller models run on CPU
Best for: Air-gapped environments, privacy-sensitive data, cost-sensitive high-volume use cases
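
A back-of-the-envelope estimate makes the per-token prices above easier to compare. The sketch below uses the prices quoted in this section; the corpus size (1 million chunks of 500 tokens) and the function name embedding_cost_usd are hypothetical, and self-hosted models are left out because their cost depends on your own compute.

```python
# Prices in USD per million tokens, copied from the listings in this section.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.020,
    "text-embedding-3-large": 0.130,
    "embed-english-v3.0": 0.100,
    "embed-multilingual-v3.0": 0.100,
}

def embedding_cost_usd(model, num_chunks, avg_tokens_per_chunk):
    """One-off API cost of embedding a corpus with the given model."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

# Hypothetical corpus: 1 million chunks averaging 500 tokens each.
for model in PRICE_PER_MILLION_TOKENS:
    cost = embedding_cost_usd(model, 1_000_000, 500)
    print(f"{model}: ${cost:.2f}")
```

For this hypothetical corpus the one-off embedding cost comes to roughly $10, $65, and $50 respectively. Remember that re-embedding after a model switch incurs the same cost again.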

Choosing by Task

General English Q&A over documents: text-embedding-3-small — good quality, low cost, easy to use
Multilingual support: cohere/embed-multilingual-v3.0 — purpose-built for cross-language retrieval
Privacy/on-prem requirement: bge-small-en-v1.5 or nomic-embed-text-v1.5 — self-hosted, no external API calls
Code search: a dedicated code model such as voyage-code-2 (text-embedding-3-large handles code reasonably well, but dedicated code models outperform it on code retrieval tasks)
Domain-specific (medical, legal): Consider domain-adapted open-source models fine-tuned on domain corpora — general models may miss specialised terminology

Dimensionality and Storage Implications

Each vector is stored as an array of 32-bit (4-byte) floats, so raw storage is dimensions × 4 bytes per vector, before any index overhead. Per million chunks:

384 dimensions (bge-small): ~1.5 GB per million vectors
1024 dimensions (Cohere, e5-large): ~4 GB per million vectors
1536 dimensions (text-embedding-3-small): ~6 GB per million vectors
3072 dimensions (text-embedding-3-large): ~12 GB per million vectors
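
The figures in this list follow from a single multiplication, sketched below. The function name raw_vector_storage_gb is hypothetical, gigabytes are taken as 10^9 bytes, and the result covers only the raw vectors: index structures and metadata in a real vector database add to it.

```python
BYTES_PER_FLOAT32 = 4

def raw_vector_storage_gb(num_vectors, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector payload in decimal gigabytes (10**9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9

models = [
    ("bge-small", 384),
    ("Cohere / e5-large", 1024),
    ("text-embedding-3-small", 1536),
    ("text-embedding-3-large", 3072),
]
for name, dims in models:
    print(f"{name}: {raw_vector_storage_gb(1_000_000, dims):.2f} GB per million vectors")
```

Passing bytes_per_value=1 gives the corresponding figure for int8-quantised vectors.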

Higher dimensions generally improve retrieval quality but increase storage and search latency. For most use cases under 10M chunks, the difference is not significant. At larger scale, consider int8 quantisation (offered natively by Cohere models and supported by vector databases such as Qdrant and Weaviate), which stores each value in 1 byte instead of 4 and so cuts storage by 4x with minimal quality loss.
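
To make the 4x saving concrete, here is a minimal sketch of symmetric scalar quantisation using only the Python standard library. It is a generic scheme chosen for illustration, not the specific method used by Cohere, Qdrant, or Weaviate, and the function names and toy vector are hypothetical.

```python
from array import array

def quantize_int8(vector):
    """Symmetric scalar quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs else 1.0
    codes = array("b", (round(x / scale) for x in vector))  # "b" = signed 8-bit
    return codes, scale

def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

vector = [0.12, -0.87, 0.34, 0.05]  # toy vector standing in for a real embedding
codes, scale = quantize_int8(vector)
restored = dequantize_int8(codes, scale)

float32_bytes = array("f", vector).itemsize * len(vector)
int8_bytes = codes.itemsize * len(codes)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(f"float32: {float32_bytes} bytes, int8: {int8_bytes} bytes")
print(f"largest reconstruction error: {max_error:.5f}")
```

Each value loses at most half a quantisation step of precision, which is why the effect on similarity rankings is usually small. The per-vector scale factor is a small extra cost that this sketch does not count.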

Re-Embedding: When to Update Your Index

You must use the same embedding model for ingestion and query time — mixing models produces incompatible vector spaces. This means:

Switching to a new embedding model requires re-embedding all documents and rebuilding the index — plan for this as a maintenance operation
Set a policy: re-embed when the current model is deprecated or when retrieval quality on your own evaluation set drops below a threshold you have defined
Keep the original document text in storage — you can always re-embed from text, but losing the source text requires re-ingesting from scratch
Version your index: maintain a staging index for testing a new embedding model before cutting over production
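
One way to enforce these rules in code is to record the embedding model's name alongside the index and reject vectors from any other model. The sketch below is an assumption-laden illustration: VectorIndex, rebuild_index, fake_embed, and the model names are all hypothetical, and the in-memory dictionary stands in for a real vector database.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Hypothetical in-memory index that records which model produced its vectors."""
    model_name: str
    vectors: dict = field(default_factory=dict)

    def _check_model(self, model_name):
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got vector from {model_name!r}"
            )

    def add(self, doc_id, vector, model_name):
        self._check_model(model_name)
        self.vectors[doc_id] = vector

    def check_query(self, model_name):
        """Call before searching to confirm the query used the same model."""
        self._check_model(model_name)

def rebuild_index(source_texts, embed, new_model_name):
    """Re-embed every stored source text into a fresh staging index."""
    staging = VectorIndex(model_name=new_model_name)
    for doc_id, text in source_texts.items():
        staging.add(doc_id, embed(text), new_model_name)
    return staging

def fake_embed(text):
    """Placeholder embedder for the example; a real one would call a model."""
    return [float(len(text)), float(text.count(" "))]

source_texts = {"doc-1": "reset your password", "doc-2": "shipping times"}
production = VectorIndex(model_name="model-v1")
staging = rebuild_index(source_texts, fake_embed, "model-v2")
print(staging.model_name, len(staging.vectors))
```

Because rebuild_index works from the stored source text, the staging index can be built and evaluated while production keeps serving queries, and the cutover is a swap of which index is live.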

Checklist: Do You Understand This?

Embedding = dense vector capturing semantic meaning — similar texts produce similar vectors regardless of exact wording
Same embedding model must be used for ingestion and query time — different models are incompatible
Default choice: text-embedding-3-small (OpenAI) for general English; Cohere multilingual for multi-language; bge/nomic for self-hosted
Higher dimensions = generally better quality but more storage; int8 quantisation reduces storage 4x with minimal quality loss
Switching models requires re-embedding everything — keep source text in storage to enable future re-indexing