Intermediate

Metadata & Filtering

Metadata transforms pure semantic similarity search into scoped, precise retrieval. Without metadata, you can only ask "which chunks are semantically similar to this query?" With metadata, you can ask "which chunks from this document, written after this date, in this category, are similar to this query?"

What Metadata to Attach

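Attach metadata at ingestion time. Most vector databases let you add or change metadata on an existing record, but backfilling a missing field means going back to the source documents and updating every chunk already in the index, so it is far cheaper to capture it up front. Standard fields worth capturing: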

source: The source document or URL. Enables attribution in answers ("According to the Employee Handbook...") and source-scoped retrieval
date / created_at: Document creation or last-updated date. Enables date range filtering ("only documents updated in 2025")
author: Who wrote the document. Enables author attribution and author-scoped retrieval
section / heading: The document section the chunk belongs to. Enables section-scoped retrieval
document_type: Category of document (policy, FAQ, tutorial, legal, product-spec). Enables type-scoped retrieval
chunk_index: The position of this chunk within the document. Useful for fetching neighboring chunks and reassembling context in the original order
tenant_id / user_id: For multi-tenant systems, scopes retrieval to the correct tenant so one tenant's data never appears in another tenant's results
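
As a minimal sketch of attaching these fields at ingestion time, the function below builds one record per chunk. The helper name `build_chunk_records`, the record layout, and the ID scheme are illustrative assumptions, not any particular vector database's API.

```python
from datetime import date


def build_chunk_records(chunks, source, document_type, updated_at, tenant_id):
    """Pair each text chunk with the metadata needed for later filtering.

    Hypothetical helper: the record layout is an assumption for illustration.
    """
    records = []
    for i, text in enumerate(chunks):
        records.append({
            "id": f"{source}#{i}",  # stable ID: source path plus chunk position
            "text": text,
            "metadata": {
                "source": source,
                "document_type": document_type,
                "updated_at": updated_at.isoformat(),  # ISO 8601 date string
                "chunk_index": i,
                "total_chunks": len(chunks),
                "tenant_id": tenant_id,
            },
        })
    return records


records = build_chunk_records(
    chunks=["Employees accrue leave monthly.", "Unused leave carries over."],
    source="policies/leave-policy-2025.pdf",
    document_type="policy",
    updated_at=date(2025, 3, 1),
    tenant_id="acme",
)
```

Each record can then be embedded and upserted together with its metadata, so no chunk reaches the index without the fields you plan to filter on.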

Metadata Schema Design

Design your metadata schema before building the index. Adding a field later means backfilling it on every record already indexed. Principles:

Only index metadata fields you will actually filter on or show to users. Metadata storage is cheap, but unused fields are clutter and make the schema harder to keep consistent
Use consistent value formats (dates as ISO 8601 strings, categories as lowercase snake_case enums) so filters work reliably. If you need date range filters, also store a numeric timestamp, because some stores (Pinecone among them) apply range operators only to numbers
Keep cardinality in mind: high-cardinality fields (URLs, full filenames) are good for exact lookups; low-cardinality fields (categories, document types) are good for broad filtering
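
One way to enforce consistent formats is to normalize every metadata dict before it is written. The sketch below is an assumption-laden example: the allowed `document_type` values and the function names are hypothetical, and a real pipeline would validate more fields.

```python
import re
from datetime import date

# Hypothetical enum of accepted categories, already in snake_case.
ALLOWED_DOCUMENT_TYPES = {"policy", "faq", "tutorial", "legal", "product_spec"}


def to_snake_case(value):
    """Lowercase a label and collapse runs of non-alphanumerics into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", value.strip().lower()).strip("_")


def normalize_metadata(raw):
    """Return a copy of raw with consistent formats, or raise ValueError."""
    meta = dict(raw)
    meta["document_type"] = to_snake_case(meta["document_type"])
    if meta["document_type"] not in ALLOWED_DOCUMENT_TYPES:
        raise ValueError(f"unknown document_type: {meta['document_type']!r}")
    # date.fromisoformat raises ValueError for strings such as "03/01/2025".
    meta["updated_at"] = date.fromisoformat(meta["updated_at"]).isoformat()
    return meta


clean = normalize_metadata({"document_type": "Product-Spec", "updated_at": "2025-03-01"})
```

Rejecting bad values at write time is usually easier than discovering later that a filter on "policy" silently misses chunks tagged "Policy".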

# Example metadata schema for a company knowledge base
# (updated_at_ts is updated_at as a Unix timestamp, for numeric range filters)
{
  "source": "policies/leave-policy-2025.pdf",
  "document_type": "policy",
  "section": "Annual Leave Entitlement",
  "created_at": "2025-01-15",
  "updated_at": "2025-03-01",
  "updated_at_ts": 1740787200,
  "department": "hr",
  "chunk_index": 3,
  "total_chunks": 12
}

Using Metadata for Filtered Search

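Filtered search combines a vector similarity query with a metadata filter. The filter runs either as a pre-filter (narrow the candidate set before similarity search) or as a post-filter (retrieve, then filter results). Post-filtering can return fewer results than requested, or none, when most of the nearest neighbors fail the filter: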

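# Example: Pinecone filtered query
# Assumes `index` is a connected Pinecone index and `query_embedding` is the
# embedded user query. Pinecone's range operators ($gt, $gte, $lt, $lte) only
# accept numbers, so the date filter uses a numeric Unix timestamp field
# (updated_at_ts, an assumed field name) rather than the ISO date string.
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "document_type": {"$eq": "policy"},
        "department": {"$eq": "hr"},
        "updated_at_ts": {"$gte": 1735689600},  # 2025-01-01T00:00:00Z
    },
    include_metadata=True,
)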

# Example: Chroma filtered query
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=10,
    where={
        "$and": [
            {"document_type": {"$eq": "policy"}},
            {"department": {"$eq": "hr"}}
        ]
    }
)
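
To make the pre-filter versus post-filter distinction concrete, here is a brute-force toy in plain Python. It is only a sketch: it uses a dot product as the similarity score and scans every record, whereas real vector databases use approximate nearest-neighbor indexes. All names and data are made up for illustration.

```python
def dot(a, b):
    """Dot product, used here as a stand-in similarity score."""
    return sum(x * y for x, y in zip(a, b))


def pre_filter_search(records, query, predicate, top_k):
    """Filter first, then rank only the matching records."""
    candidates = [r for r in records if predicate(r["metadata"])]
    ranked = sorted(candidates, key=lambda r: dot(r["vector"], query), reverse=True)
    return ranked[:top_k]


def post_filter_search(records, query, predicate, top_k):
    """Rank everything, keep top_k, then drop results that fail the filter."""
    ranked = sorted(records, key=lambda r: dot(r["vector"], query), reverse=True)
    return [r for r in ranked[:top_k] if predicate(r["metadata"])]


def is_hr(metadata):
    return metadata["department"] == "hr"


records = [
    {"id": "sales-1", "vector": [0.9, 0.1], "metadata": {"department": "sales"}},
    {"id": "sales-2", "vector": [0.8, 0.2], "metadata": {"department": "sales"}},
    {"id": "hr-1", "vector": [0.7, 0.3], "metadata": {"department": "hr"}},
    {"id": "hr-2", "vector": [0.1, 0.9], "metadata": {"department": "hr"}},
]
query = [1.0, 0.0]

pre = pre_filter_search(records, query, is_hr, top_k=2)    # hr-1, hr-2
post = post_filter_search(records, query, is_hr, top_k=2)  # empty list
```

Here the two nearest records are both sales documents, so post-filtering with `top_k=2` leaves nothing, while pre-filtering still returns the two HR chunks.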

Expose the filter parameters on the retrieval tool you give Claude, so the model can fill them in from the user's question. When a user asks "what does the 2025 HR policy say about annual leave?", the tool call should carry "2025" and "HR" as structured arguments, and your code should translate them into date and department filters before running the similarity search.
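
A minimal sketch of that flow follows. The tool definition uses the `name` / `description` / `input_schema` shape of Anthropic's tool-use format; the tool name, its parameters, the department enum, and the `updated_at_ts` metadata field (a Unix timestamp) are all assumptions made for this example. The output uses Pinecone-style filter operators.

```python
from datetime import datetime, timezone

# Hypothetical retrieval tool exposed to the model.
search_tool = {
    "name": "search_knowledge_base",
    "description": "Search company documents, optionally scoped by department and year.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "department": {"type": "string", "enum": ["hr", "sales", "engineering"]},
            "year": {"type": "integer"},
        },
        "required": ["query"],
    },
}


def year_bounds(year):
    """Return (start, end) Unix timestamps covering a UTC calendar year."""
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    return int(start.timestamp()), int(end.timestamp())


def build_filter(tool_input):
    """Translate optional tool arguments into a metadata filter, or None."""
    conditions = []
    if "department" in tool_input:
        conditions.append({"department": {"$eq": tool_input["department"]}})
    if "year" in tool_input:
        start, end = year_bounds(tool_input["year"])
        conditions.append({"updated_at_ts": {"$gte": start}})
        conditions.append({"updated_at_ts": {"$lt": end}})
    if not conditions:
        return None  # no scoping requested: plain similarity search
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}


metadata_filter = build_filter(
    {"query": "annual leave entitlement", "department": "hr", "year": 2025}
)
```

The model only proposes the arguments; the translation into a filter stays in your code, where you can validate values before they reach the database.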

How Metadata Improves Precision

Without filters, a similarity search over a large knowledge base returns the most semantically similar chunks across all documents. This means:

An HR policy question may return results from a sales playbook that happens to use similar language
A question about 2025 procedures may return results from an outdated 2022 policy
A question from tenant A may surface tenant B's confidential data if the two have overlapping content

Metadata filters make retrieval deterministic for known attributes: with pre-filtering, the search space narrows before similarity scoring begins, so chunks outside the filter cannot be returned no matter how similar they are.

Updating Metadata Without Re-Embedding

In most vector databases, you can update metadata fields without changing the vector. Pinecone, Qdrant, and Weaviate all support metadata-only updates. This means:

Updating a document status, adding a tag, or correcting a date does not require re-embedding the text
Re-embed only when the text itself changes or when you switch embedding models, because the vector is computed from the text, not from the metadata
Design your ingestion pipeline to update metadata fields independently from text embeddings where possible
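
One way to keep the two paths separate is to store a hash of each chunk's text and compare it on every sync. The sketch below is an assumption about how such a pipeline might be organized; `plan_update` and the stored-record layout are hypothetical.

```python
import hashlib


def text_hash(text):
    """Fingerprint chunk text so changes can be detected without re-embedding."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(stored, new_text, new_metadata):
    """Decide what an incoming version of a chunk requires.

    stored is the previously indexed state:
    {"text_hash": ..., "metadata": {...}} (assumed layout).
    Returns "reembed", "metadata_only", or "noop".
    """
    if text_hash(new_text) != stored["text_hash"]:
        return "reembed"          # text changed: compute a new vector
    if new_metadata != stored["metadata"]:
        return "metadata_only"    # same text: update fields, keep the vector
    return "noop"


stored = {
    "text_hash": text_hash("Employees accrue leave monthly."),
    "metadata": {"status": "draft"},
}
action = plan_update(stored, "Employees accrue leave monthly.", {"status": "published"})
```

For the "metadata_only" case you would then call the store's own update operation, for example Pinecone's `index.update(id=..., set_metadata=...)` or Qdrant's `set_payload`, and skip the embedding call entirely.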

Checklist: Do You Understand This?

Attach metadata at ingestion; backfilling it later means updating every record already in the index
Standard fields: source, date, document_type, section, tenant_id, chunk_index
Use consistent formats: ISO dates, lowercase enums — filters break on inconsistent values
Filtered search: combine vector similarity with metadata filter to scope retrieval to relevant documents
Metadata-only updates don't require re-embedding — only re-embed when the text changes