Intermediate

Embedding Models

An embedding model converts text into a dense numerical vector that represents its meaning. Similar texts produce similar vectors. Choosing the right embedding model affects retrieval quality, cost, and storage requirements — independently of which LLM you use to generate answers.

What an Embedding Is

[Figure: two input sentences, "How do I reset my password?" and "I forgot my login credentials", pass through the embedding model text-embedding-3-small and land close together in vector space: [0.12, -0.87, 0.34, …] and [0.11, -0.85, 0.36, …], each with 1536 dimensions. Similar sentences produce close vectors, which enables semantic search.]

An embedding is a fixed-length vector of floating-point numbers, typically 384 to 3072 dimensions for the models covered here. The embedding model is trained to place semantically similar texts close together in this vector space and dissimilar texts far apart. This is what enables similarity search: embed the query, then find the stored vectors closest to the query vector, usually measured by cosine similarity or dot product.

Embeddings capture meaning, not exact wording. "How do I reset my password?" and "I forgot my login credentials" will have similar embeddings even though they share no keywords. This is the core advantage over keyword search.
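
As a minimal sketch of the similarity search described above, the snippet below ranks stored vectors by cosine similarity to a query vector. The 3-dimensional vectors and the helper names (cosine_similarity, nearest) are invented for illustration; a real system would get its vectors from an embedding model and usually delegate the search to a vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec, stored):
    """Return (text, score) pairs sorted from most to least similar."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, made up for this example; real embeddings have hundreds of dimensions.
stored = {
    "I forgot my login credentials": [0.11, -0.85, 0.36],
    "What are your opening hours?": [0.90, 0.20, -0.10],
}
query = [0.12, -0.87, 0.34]  # stands in for "How do I reset my password?"

ranked = nearest(query, stored)
for text, score in ranked:
    print(f"{score:+.3f}  {text}")
```

The password-related sentence ranks first because its vector points in almost the same direction as the query, even though the two sentences share no keywords.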

Embedding Model Options

OpenAI — text-embedding-3-small / text-embedding-3-large

  • Dimensions: 1536 (small), 3072 (large) — or lower with dimension reduction
  • Cost: $0.020 per million tokens (small), $0.130 per million tokens (large)
  • Strengths: Strong general performance, widely supported by vector databases, easy to get started with
  • Best for: Most production RAG use cases; default choice if already using OpenAI infrastructure
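
The "dimension reduction" mentioned above can be requested from the API through the `dimensions` parameter of the embeddings endpoint. The same effect can be approximated on vectors you already hold by truncating and re-normalising them, as sketched below. This only makes sense for models trained to tolerate truncation (the text-embedding-3 family is documented as such); the helper name shorten_embedding and the toy vector are assumptions made for this example.

```python
import math


def shorten_embedding(vec, target_dims):
    """Keep the first target_dims values and rescale the result to unit length."""
    truncated = vec[:target_dims]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in truncated]


# Toy 3-dimensional vector, invented for illustration.
full = [3.0, 4.0, 12.0]
short = shorten_embedding(full, 2)
print(short)  # [0.6, 0.8]
```

Every vector in an index must be shortened to the same length, and queries must be shortened the same way before searching.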

Cohere — embed-english-v3.0 / embed-multilingual-v3.0

  • Dimensions: 1024
  • Cost: $0.100 per million tokens
  • Strengths: Strong performance on retrieval benchmarks; multilingual model covers 100+ languages; native int8/binary quantisation support reduces storage
  • Best for: Multilingual knowledge bases; storage-constrained deployments

Open-source — bge, e5, nomic-embed (self-hosted)

  • Popular models: BAAI/bge-small-en-v1.5 (384 dims), intfloat/e5-large-v2 (1024 dims), nomic-ai/nomic-embed-text-v1.5 (768 dims)
  • Cost: Inference compute only — no per-token API fees
  • Strengths: No data leaves your infrastructure; competitive quality on MTEB benchmarks; smaller models run on CPU
  • Best for: Air-gapped environments, privacy-sensitive data, cost-sensitive high-volume use cases
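
To compare the hosted options on cost, a rough estimate is total tokens divided by one million, times the listed price. The sketch below uses the per-million-token prices quoted in this section; the function name and the workload (one million chunks of 500 tokens each) are assumptions chosen for illustration, and prices should be checked against the providers' current pricing pages.

```python
# Prices in USD per million tokens, as listed in this section.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.020,
    "text-embedding-3-large": 0.130,
    "embed-english-v3.0": 0.100,
    "embed-multilingual-v3.0": 0.100,
}


def embedding_cost_usd(model, num_chunks, avg_tokens_per_chunk):
    """Estimate the one-off API cost of embedding a corpus."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Hypothetical workload: 1M chunks averaging 500 tokens each (500M tokens total).
for model in PRICE_PER_MILLION_TOKENS:
    cost = embedding_cost_usd(model, num_chunks=1_000_000, avg_tokens_per_chunk=500)
    print(f"{model}: ${cost:.2f}")
```

For this workload the estimate comes to about $10 for text-embedding-3-small, $65 for text-embedding-3-large, and $50 for either Cohere model. Self-hosted models replace this line item with compute cost, which depends on your hardware.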

Choosing by Task

  • General English Q&A over documents: text-embedding-3-small — good quality, low cost, easy to use
  • Multilingual support: cohere/embed-multilingual-v3.0 — purpose-built for cross-language retrieval
  • Privacy/on-prem requirement: bge-small-en-v1.5 or nomic-embed-text-v1.5 — self-hosted, no external API calls
  • Code search: prefer a code-specific embedding model. text-embedding-3-large handles code reasonably well, but dedicated code models (for example voyage-code-2) tend to do better on code retrieval tasks
  • Domain-specific (medical, legal): Consider domain-adapted open-source models fine-tuned on domain corpora — general models may miss specialised terminology

Dimensionality and Storage Implications

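By default, each vector is stored as an array of 32-bit (4-byte) floats, so raw vector storage is dimensions × 4 bytes × number of vectors, before any index overhead. Storage cost per million chunks: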

  • 384 dimensions (bge-small): ~1.5 GB per million vectors
  • 1024 dimensions (Cohere, e5-large): ~4 GB per million vectors
  • 1536 dimensions (text-embedding-3-small): ~6 GB per million vectors
  • 3072 dimensions (text-embedding-3-large): ~12 GB per million vectors
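
The figures above follow from simple arithmetic: dimensions, times bytes per value, times number of vectors. The sketch below reproduces them; the function name is an assumption for this example, and the result covers raw vector data only, not index structures or metadata.

```python
def raw_vector_storage_gb(dimensions, num_vectors, bytes_per_value=4):
    """Raw vector storage in GB (10^9 bytes), excluding index overhead and metadata."""
    return dimensions * bytes_per_value * num_vectors / 1e9


for dims in (384, 1024, 1536, 3072):
    gb = raw_vector_storage_gb(dims, num_vectors=1_000_000)
    print(f"{dims} dims: {gb:.1f} GB per million vectors")
```

Passing bytes_per_value=1 gives the int8 figure for the same model, which is a quarter of the float32 size.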

Higher-dimensional models generally retrieve better, but they increase storage and search latency. For most use cases under 10M chunks, this difference is not significant. At larger scale, consider int8 quantisation (available natively in Cohere models and supported by Qdrant and Weaviate), which stores each value in 1 byte instead of 4 and so reduces storage by 4x with minimal quality loss.
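
To make the 4x figure concrete, here is a simplified sketch of int8 scalar quantisation using one scale factor per vector. It is not the scheme used by Cohere, Qdrant, or Weaviate, which typically calibrate value ranges over the whole dataset; the function names and the toy vector are assumptions made for illustration.

```python
from array import array


def quantise_int8(vec):
    """Map floats to integers in [-127, 127] using a single per-vector scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale


def dequantise(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


vec = [0.12, -0.87, 0.34, 0.05]  # toy vector, invented for this example
codes, scale = quantise_int8(vec)
restored = dequantise(codes, scale)

float_bytes = len(array("f", vec).tobytes())   # 4 bytes per value
int8_bytes = len(array("b", codes).tobytes())  # 1 byte per value
print(f"float32: {float_bytes} bytes, int8: {int8_bytes} bytes")

max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(f"largest reconstruction error: {max_error:.5f}")
```

The reconstruction error per value is bounded by half the scale step, which is why similarity rankings usually change little after quantisation.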

Re-Embedding: When to Update Your Index

You must use the same embedding model for ingestion and query time — mixing models produces incompatible vector spaces. This means:

  • Switching to a new embedding model requires re-embedding all documents and rebuilding the index — plan for this as a maintenance operation
  • Set a policy: re-embed when the current model is deprecated or when retrieval quality on your own evaluation set drops below an agreed threshold
  • Keep the original document text in storage — you can always re-embed from text, but losing the source text requires re-ingesting from scratch
  • Version your index: maintain a staging index for testing a new embedding model before cutting over production
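
One way to enforce these rules in code is to record the model name and dimensionality on the index itself and reject anything that does not match. The sketch below is an in-memory illustration only: VectorIndex, rebuild_index, and the fake embedding functions are hypothetical names, not part of any vector database API.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """In-memory stand-in for an index that remembers which model produced it."""
    model_name: str
    dimensions: int
    vectors: dict = field(default_factory=dict)

    def add(self, doc_id, vector, model_name):
        # Refuse vectors from a different model or with the wrong length.
        if model_name != self.model_name:
            raise ValueError(f"index uses {self.model_name}, got {model_name}")
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dims, got {len(vector)}")
        self.vectors[doc_id] = vector


def rebuild_index(source_texts, embed_fn, model_name, dimensions):
    """Re-embed every stored document into a fresh staging index."""
    staging = VectorIndex(model_name, dimensions)
    for doc_id, text in source_texts.items():
        staging.add(doc_id, embed_fn(text), model_name)
    return staging


# Hypothetical stand-ins for real embedding calls, for illustration only.
def fake_embed_v1(text):
    return [float(len(text)), 1.0]


def fake_embed_v2(text):
    return [float(len(text)), 1.0, 0.0]


source_texts = {"doc-1": "How do I reset my password?", "doc-2": "Opening hours"}

production = rebuild_index(source_texts, fake_embed_v1, "model-v1", dimensions=2)
staging = rebuild_index(source_texts, fake_embed_v2, "model-v2", dimensions=3)

try:
    production.add("doc-3", fake_embed_v2("new doc"), "model-v2")
except ValueError as err:
    print(f"rejected: {err}")
```

Because the source texts are kept, the staging index can be built and evaluated alongside production, and the cutover becomes a pointer swap rather than an in-place migration.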

Checklist: Do You Understand This?

  • Embedding = dense vector capturing semantic meaning — similar texts produce similar vectors regardless of exact wording
  • Same embedding model must be used for ingestion and query time — different models are incompatible
  • Default choice: text-embedding-3-small (OpenAI) for general English; Cohere multilingual for multi-language; bge/nomic for self-hosted
  • Higher dimensions = more quality but more storage; int8 quantisation reduces storage 4x with minimal quality loss
  • Switching models requires re-embedding everything — keep source text in storage to enable future re-indexing

Page built: 01 Jun 2026