Metadata & Filtering
Metadata transforms pure semantic similarity search into scoped, precise retrieval. Without metadata, you can only ask "which chunks are semantically similar to this query?" With metadata, you can ask "which chunks from this document, written after this date, in this category, are similar to this query?"
What Metadata to Attach
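Attach metadata at ingestion time. Many vector databases let you update metadata on stored chunks later, but backfilling means revisiting every chunk and often re-parsing the source documents to recover values such as section or author. Standard fields worth capturing: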
- source: The source document or URL; enables attribution in answers ("According to the Employee Handbook...") and source-scoped retrieval
- date / created_at: Document creation or last-updated date; enables date range filtering ("only documents updated in 2025")
- author: Who wrote the document; useful for attributed content
- section / heading: The document section the chunk belongs to; enables section-scoped retrieval
- document_type: Category of document (policy, faq, tutorial, legal, product_spec); enables type-scoped retrieval
- chunk_index: The position of this chunk within the document; useful for reassembling context from neighboring chunks or for skipping known positions such as a title-only first chunk
- tenant_id / user_id: For multi-tenant systems; scopes retrieval to the correct tenant and prevents cross-tenant leakage
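The fields above can be attached in one pass while chunks are being prepared for embedding. The following is a minimal sketch, not a reference implementation: `ChunkRecord`, `build_chunk_records`, the `source#index` id format, and the sample text are all hypothetical, and the chunking step itself is assumed to have already produced (heading, text) pairs.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ChunkRecord:
    """One chunk plus the metadata stored next to its vector (hypothetical type)."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def build_chunk_records(source, document_type, created_at, sections, tenant_id=None):
    """Turn (heading, text) pairs from one document into records with metadata.

    `sections` comes from whatever chunker you use; chunking is out of scope here.
    """
    records = []
    for chunk_index, (heading, text) in enumerate(sections):
        metadata = {
            "source": source,
            "document_type": document_type,
            "created_at": created_at.isoformat(),  # ISO 8601 date string
            "section": heading,
            "chunk_index": chunk_index,
            "total_chunks": len(sections),
        }
        if tenant_id is not None:
            metadata["tenant_id"] = tenant_id
        records.append(
            ChunkRecord(id=f"{source}#{chunk_index}", text=text, metadata=metadata)
        )
    return records


# Placeholder document content, for illustration only
records = build_chunk_records(
    source="policies/leave-policy-2025.pdf",
    document_type="policy",
    created_at=date(2025, 1, 15),
    sections=[
        ("Overview", "This policy covers..."),
        ("Annual Leave Entitlement", "Employees receive..."),
    ],
    tenant_id="acme",
)
```

Each record can then be embedded and upserted together with its `metadata` dict, so no chunk reaches the index without its fields.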
Metadata Schema Design
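Design your metadata schema before building the index. Adding a field later means backfilling it across every stored chunk, and in some databases also declaring a new filterable field. Principles:
- Only index metadata fields you will actually filter on or show to users. Storage is cheap, but every extra field adds ingestion work and schema clutter
- Use consistent value formats (dates as ISO 8601 strings, categories as lowercase snake_case enums) so filters work reliably. If you need date range filters, check how your database compares values: Pinecone, for example, applies $gt/$gte/$lt/$lte only to numbers, so store a numeric timestamp alongside the ISO string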
- Keep cardinality in mind: high-cardinality fields (URLs, full filenames) are good for exact lookups; low-cardinality fields (categories, document types) are good for broad filtering
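One way to enforce consistent value formats is to normalize metadata just before it is written. This is a sketch under stated assumptions: the helper names, the `ENUM_FIELDS` and `DATE_FIELDS` sets, and the accepted input date formats are illustrative choices, not part of any library.

```python
import re
from datetime import date, datetime

ENUM_FIELDS = {"document_type", "department"}  # assumed category-like fields
DATE_FIELDS = {"created_at", "updated_at"}     # assumed date fields


def to_snake_case(value: str) -> str:
    """Lowercase a label and collapse runs of non-alphanumerics into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", value.strip().lower()).strip("_")


def to_iso_date(value) -> str:
    """Accept a date, a datetime, or a string in a few formats; return YYYY-MM-DD."""
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return value.date().isoformat()
    if isinstance(value, date):
        return value.isoformat()
    # Assumed input formats; day-first is a choice, adjust to your data
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


def normalize_metadata(raw: dict) -> dict:
    """Return a copy of `raw` with enum and date fields in canonical form."""
    clean = {}
    for key, value in raw.items():
        if key in ENUM_FIELDS:
            clean[key] = to_snake_case(value)
        elif key in DATE_FIELDS:
            clean[key] = to_iso_date(value)
        else:
            clean[key] = value
    return clean


normalized = normalize_metadata(
    {"document_type": "Product Spec", "department": " HR ",
     "updated_at": "March 1, 2025", "chunk_index": 3}
)
```

Raising on an unrecognized date, rather than storing it as-is, keeps malformed values out of the index where they would silently fail to match filters.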
# Example metadata schema for a company knowledge base
{
    "source": "policies/leave-policy-2025.pdf",
    "document_type": "policy",
    "section": "Annual Leave Entitlement",
    "created_at": "2025-01-15",
    "updated_at": "2025-03-01",
    "department": "hr",
    "chunk_index": 3,
    "total_chunks": 12
}
Using Metadata for Filtered Search
Filtered search combines a vector similarity query with a metadata filter. The filter runs either as a pre-filter (narrow the candidate set before similarity search) or as a post-filter (retrieve the top results, then discard those that fail the filter, which can leave fewer results than requested):
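# Example: Pinecone filtered query
# Assumes `index` is a Pinecone index handle and `query_embedding` is a list of floats
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "document_type": {"$eq": "policy"},
        "department": {"$eq": "hr"},
        # Pinecone range operators compare numbers only, so this assumes a
        # numeric `updated_at_ts` field (Unix seconds) stored next to `updated_at`
        "updated_at_ts": {"$gte": 1735689600}  # 2025-01-01T00:00:00Z
    },
    include_metadata=True
)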
# Example: Chroma filtered query
# Assumes `collection` is a Chroma collection and `query_embedding` is a list of floats
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=10,
    where={
        "$and": [
            {"document_type": {"$eq": "policy"}},
            {"department": {"$eq": "hr"}}
        ]
    }
)
Expose these filter parameters to Claude as part of the retrieval tool definition. When a user asks "what does the 2025 HR policy say about annual leave?", the tool call (or your own query-parsing code) should extract "2025" and "HR" and apply the corresponding date and department filters before the similarity search.
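A retrieval tool that carries filter parameters might look like the sketch below. The tool definition follows the name / description / input_schema shape used for tool definitions in the Anthropic Messages API. Everything else is an assumption for illustration: the tool name, the property names, the integer `updated_year` field (used because range operators in Chroma and Pinecone compare numbers), and the `build_where` helper that turns the tool input into a Chroma-style `where` clause.

```python
# Hypothetical tool definition exposing optional metadata filters to the model
SEARCH_TOOL = {
    "name": "search_knowledge_base",
    "description": (
        "Search the company knowledge base. Set a filter only when the "
        "user states that attribute explicitly."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Natural-language search query"},
            "document_type": {
                "type": "string",
                "enum": ["policy", "faq", "tutorial", "legal", "product_spec"],
            },
            "department": {"type": "string", "description": "Lowercase code, e.g. hr"},
            "updated_year": {
                "type": "integer",
                "description": "Only documents updated in or after this year",
            },
        },
        "required": ["query"],
    },
}


def build_where(tool_input: dict):
    """Translate tool input into a Chroma-style where clause, or None if unfiltered."""
    conditions = []
    for field_name in ("document_type", "department"):
        if field_name in tool_input:
            conditions.append({field_name: {"$eq": tool_input[field_name]}})
    if "updated_year" in tool_input:
        conditions.append({"updated_year": {"$gte": tool_input["updated_year"]}})
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]  # a single condition needs no $and wrapper
    return {"$and": conditions}


# Input the model might produce for the annual leave question above
tool_input = {
    "query": "annual leave entitlement",
    "document_type": "policy",
    "department": "hr",
    "updated_year": 2025,
}
where = build_where(tool_input)
```

Keeping the translation in your own code, rather than letting the model write raw filter syntax, means tenant or permission filters can be added server-side regardless of what the model requests.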
How Metadata Improves Precision
Without filters, a similarity search over a large knowledge base returns the most semantically similar chunks across all documents. This means:
- An HR policy question may return results from a sales playbook that happens to use similar language
- A question about 2025 procedures may return results from an outdated 2022 policy
- A question from tenant A may surface tenant B's confidential data if the two have overlapping content
Metadata filters make retrieval deterministic for known attributes — the search space narrows before similarity scoring begins.
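The effect of narrowing the search space first can be seen in a small in-memory sketch. The two-dimensional vectors, chunk ids, and helper functions here are toy values invented for illustration; a real vector database does this work internally.

```python
import math

# Toy corpus: sales chunks happen to sit closest to the query direction
CHUNKS = [
    {"id": "sales-1", "vector": [0.9, 0.1], "metadata": {"department": "sales"}},
    {"id": "sales-2", "vector": [0.8, 0.2], "metadata": {"department": "sales"}},
    {"id": "hr-1", "vector": [0.7, 0.3], "metadata": {"department": "hr"}},
    {"id": "hr-2", "vector": [0.1, 0.9], "metadata": {"department": "hr"}},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def matches(metadata, filters):
    """True when every filter key equals the chunk's metadata value."""
    return all(metadata.get(key) == value for key, value in filters.items())


def top_k(chunks, query, k):
    """Return the k chunks most similar to the query."""
    return sorted(chunks, key=lambda c: cosine(c["vector"], query), reverse=True)[:k]


def pre_filter_search(query, filters, k):
    """Filter first, then rank only the surviving candidates."""
    candidates = [c for c in CHUNKS if matches(c["metadata"], filters)]
    return [c["id"] for c in top_k(candidates, query, k)]


def post_filter_search(query, filters, k):
    """Rank everything, keep the top k, then drop chunks that fail the filter."""
    return [c["id"] for c in top_k(CHUNKS, query, k) if matches(c["metadata"], filters)]


query = [1.0, 0.0]
pre = pre_filter_search(query, {"department": "hr"}, k=2)
post = post_filter_search(query, {"department": "hr"}, k=2)
```

With this data, `pre` holds both HR chunks, while `post` is empty because the two sales chunks fill the unfiltered top 2 and are then discarded.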
Updating Metadata Without Re-Embedding
In most vector databases, you can update metadata fields without changing the vector. Pinecone, Qdrant, and Weaviate all support metadata-only updates. This means:
- Updating a document status, adding a tag, or correcting a date does not require re-embedding the text
- Re-embed only when the text itself changes, because the stored vector no longer represents the new text
- Design your ingestion pipeline to update metadata fields independently from text embeddings where possible
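One way to keep metadata updates independent of embeddings is to store a hash of the chunk text and compare it on every upsert. The sketch below uses an in-memory dict as a stand-in for the vector database and a placeholder `fake_embed` function; both are assumptions for illustration. In a real pipeline the metadata-only branch would map to the store's own update call, such as Pinecone's `index.update(..., set_metadata=...)` or Qdrant's `set_payload`.

```python
import hashlib

store = {}        # chunk id -> {"vector", "text_hash", "metadata"}; stand-in for a vector DB
embed_calls = []  # ids that triggered an embedding call, to make the cost visible


def fake_embed(text):
    """Placeholder for a real embedding model call."""
    return [float(len(text))]


def upsert_chunk(chunk_id, text, metadata):
    """Re-embed only when the text changed; otherwise update metadata in place."""
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    existing = store.get(chunk_id)
    if existing is not None and existing["text_hash"] == text_hash:
        existing["metadata"] = dict(metadata)  # vector left untouched
        return "metadata_only"
    embed_calls.append(chunk_id)
    store[chunk_id] = {
        "vector": fake_embed(text),
        "text_hash": text_hash,
        "metadata": dict(metadata),
    }
    return "embedded"


# Placeholder text, for illustration only
first = upsert_chunk("leave#3", "Employees accrue leave monthly.", {"status": "draft"})
second = upsert_chunk("leave#3", "Employees accrue leave monthly.", {"status": "published"})
third = upsert_chunk("leave#3", "Employees accrue leave quarterly.", {"status": "published"})
```

Only the first and third calls embed, since the second changes the status alone.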
Checklist: Do You Understand This?
- Attach metadata at ingestion; backfilling it later means touching every stored chunk
- Standard fields: source, date / created_at, author, section, document_type, chunk_index, tenant_id / user_id
- Use consistent formats: ISO dates, lowercase enums — filters break on inconsistent values
- Filtered search: combine vector similarity with metadata filter to scope retrieval to relevant documents
- Metadata-only updates don't require re-embedding — only re-embed when the text changes