Intermediate

Building Knowledge Graphs with AI

Before LLMs, building a knowledge graph required manual annotation, custom NLP pipelines, and domain experts. Now you can point an LLM at a corpus of documents and have a queryable graph in hours. The core pipeline has six phases, each with tool choices that trade off speed, flexibility, and control.

The 6-Phase Pipeline

1. Data Acquisition & Preparation

Gather source material: documents, PDFs, web pages, database exports, code. Chunk text appropriately: 512–1024 token chunks with overlap work well for extraction.
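
As a rough illustration of overlapping chunks, here is a minimal sketch. The function name `chunk_tokens`, the 64-token overlap, and the whitespace "tokenizer" are all assumptions made for the example; a real pipeline would count tokens with the tokenizer of the model it calls.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Whitespace splitting is only a stand-in for a real tokenizer.
text = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_tokens(text.split(), chunk_size=512, overlap=64)
print([len(c) for c in chunks])
```

The overlap means an entity mention that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some repeated LLM input.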

2. Entity Extraction

Identify nodes from text. LLMs offer maximum flexibility for domain-specific entities; spaCy gives fast deterministic extraction for standard types; GLiNER handles zero-shot custom types.

3. Relationship Extraction

Identify edges between extracted entities. REBEL extracts entities and their relationships together as triples. LLMs handle complex contextual relationships. spacy-llm brings LLM prompting into spaCy pipelines, so you can combine both approaches.
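
To make prompt-guided extraction concrete, here is a minimal sketch of asking for triples as JSON and validating the reply. Everything named here is an assumption for illustration: the prompt wording, the `Triple` class, and `fake_llm`, which returns a canned answer in place of a real model call.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


PROMPT_TEMPLATE = (
    "Extract (subject, predicate, object) triples from the text below. "
    'Reply with a JSON list of objects with keys "subject", "predicate", "object".\n\n'
    "Text: {text}"
)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns a canned reply."""
    return json.dumps([
        {"subject": "Alice", "predicate": "WORKS_AT", "object": "Acme"},
        {"subject": "Acme", "predicate": "LOCATED_IN", "object": "Berlin"},
        {"subject": "Alice", "predicate": "", "object": "Berlin"},  # malformed
    ])


def extract_triples(text, llm=fake_llm):
    """Call the LLM, parse its JSON reply, and keep only complete triples."""
    raw = llm(PROMPT_TEMPLATE.format(text=text))
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # the model did not return valid JSON
    if not isinstance(items, list):
        return []
    triples = []
    for item in items:
        if not isinstance(item, dict):
            continue
        s, p, o = (str(item.get(k, "")).strip()
                   for k in ("subject", "predicate", "object"))
        if s and p and o:  # drop triples with any empty field
            triples.append(Triple(s, p, o))
    return triples


triples = extract_triples("Alice works at Acme, which is based in Berlin.")
print(triples)
```

Treating the model's reply as untrusted input and dropping malformed items keeps bad triples out of the graph instead of leaving them to be cleaned up later.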

4. Entity Resolution & Linking

Deduplicate and canonicalise: 'OpenAI', 'Open AI', and 'openai.com' are the same entity. Use Coreferee for pronoun resolution, Wikidata/DBpedia for external linking, or LLM clustering for proprietary graphs.

5. Graph Storage

Persist nodes, edges, and properties to a graph database. Neo4j with its LLM Graph Builder is the standard starting point. Add vector embeddings alongside the graph for hybrid retrieval.
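
As one way to picture the write step, the sketch below turns a triple into a Cypher `MERGE` statement with parameters. The `Entity` label, the `name` property, and the helper name `to_merge_statement` are assumptions for the example, not a required schema. Cypher does not accept a parameter for the relationship type, so the sketch sanitises the predicate before interpolating it.

```python
import re


def to_merge_statement(subject, predicate, obj):
    """Build an idempotent Cypher MERGE statement plus its parameters."""
    # Relationship types cannot be passed as parameters, so restrict the
    # predicate to letters, digits, and underscores before interpolating it.
    rel_type = re.sub(r"[^A-Za-z0-9_]", "_", predicate.strip()).upper()
    if not rel_type or rel_type[0].isdigit():
        raise ValueError(f"cannot derive a relationship type from {predicate!r}")
    query = (
        "MERGE (a:Entity {name: $subject}) "
        "MERGE (b:Entity {name: $object}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )
    return query, {"subject": subject, "object": obj}


query, params = to_merge_statement("Alice", "works at", "Acme")
print(query)
print(params)
```

With the official Neo4j Python driver, a pair like this would typically be passed to `session.run(query, params)`. Because `MERGE` matches before it creates, re-running the same triple does not create duplicate nodes or edges.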

6. Querying & Integration

Expose the graph to applications. Use LangChain's GraphCypherQAChain to translate natural language → Cypher, LlamaIndex's KnowledgeGraphQueryEngine, or build raw Cypher queries for precision.

Extraction Tool Choices

Tool | Phase | Approach | Best For
spaCy | Entity extraction | Deterministic NER: fast, no LLM cost | Standard entity types (person, org, location) at scale
GLiNER | Entity extraction | Zero-shot learning for custom entity types | Domain-specific entities without training data
REBEL | Relationship extraction | Transformer extracting entity + relationship triples simultaneously | Efficient relation extraction from long text
LLM (any) | Entity + relationship | Prompt-guided extraction: maximum flexibility | Complex domains, custom schemas, contextual relationships
Coreferee | Entity resolution | Pronoun and reference resolution (“she” → Alice) | Narrative text, unstructured documents

Framework Approaches

Three practical patterns for building LLM-assisted knowledge graphs, ordered from fastest-to-start to most flexible:

Neo4j LLM Graph Builder

No-code, browser-based tool. Upload documents (PDF, web pages, YouTube transcripts), choose your LLM, and the builder extracts entities and relationships automatically. Free tier on AuraDB.

Fastest path, but least control over the extraction schema.

LangChain LLMGraphTransformer

Python library. Define allowed node types and relationship types, pass documents, and LangChain calls your LLM to extract graph triples. Writes directly to Neo4j via the LangChain graph store interface.

Good balance of speed and control. Schema-guided.
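
To show what "schema-guided" means in practice, here is a small plain-Python sketch that keeps only triples whose node types and relationship type are on an allow-list. It is not LangChain's implementation; `TypedTriple`, `filter_to_schema`, and the example types are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class TypedTriple(NamedTuple):
    subject: str
    subject_type: str
    predicate: str
    obj: str
    obj_type: str


# Assumed schema for the example: two node types, two relationship types.
ALLOWED_NODES = {"Person", "Organization"}
ALLOWED_RELATIONSHIPS = {"WORKS_AT", "FOUNDED"}


def filter_to_schema(triples, allowed_nodes, allowed_relationships):
    """Keep triples whose endpoints and predicate all match the schema."""
    return [
        t for t in triples
        if t.subject_type in allowed_nodes
        and t.obj_type in allowed_nodes
        and t.predicate in allowed_relationships
    ]


candidates = [
    TypedTriple("Alice", "Person", "WORKS_AT", "Acme", "Organization"),
    TypedTriple("Alice", "Person", "LIKES", "Acme", "Organization"),
    TypedTriple("Acme", "Organization", "FOUNDED", "1999", "Date"),
]
kept = filter_to_schema(candidates, ALLOWED_NODES, ALLOWED_RELATIONSHIPS)
print(kept)
```

Constraining the schema up front trades recall for a cleaner graph: off-schema facts are dropped rather than stored under ad hoc types.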

LlamaIndex PropertyGraph

Extracts triplets (Subject, Predicate, Object) and stores them alongside original text chunks. Enables hybrid retrieval: vector search to find starting nodes, then graph traversal for context.

Best hybrid RAG integration. More setup required.
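
The hybrid retrieval idea can be sketched without any framework: rank nodes by vector similarity to pick starting points, then walk the graph a few hops to collect context. The toy two-dimensional vectors, the node names, and the function `hybrid_retrieve` are assumptions for the example, not LlamaIndex's API.

```python
import math
from collections import deque


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query_vec, node_vecs, edges, top_k=1, hops=1):
    """Vector search for seed nodes, then breadth-first graph expansion."""
    # Step 1: vector search picks the nodes most similar to the query.
    ranked = sorted(node_vecs,
                    key=lambda n: cosine(query_vec, node_vecs[n]),
                    reverse=True)
    seeds = ranked[:top_k]

    # Step 2: treat edges as undirected and expand up to `hops` steps.
    neighbours = {}
    for s, _, o in edges:
        neighbours.setdefault(s, set()).add(o)
        neighbours.setdefault(o, set()).add(s)
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))

    # Return the triples whose endpoints were both reached.
    context = [e for e in edges if e[0] in seen and e[2] in seen]
    return seeds, context


node_vecs = {"Alice": [1.0, 0.0], "Acme": [0.0, 1.0],
             "Berlin": [0.7, 0.7], "Bob": [-1.0, 0.0]}
edges = [("Alice", "WORKS_AT", "Acme"),
         ("Acme", "LOCATED_IN", "Berlin"),
         ("Bob", "LIVES_IN", "Berlin")]
seeds, context = hybrid_retrieve([0.9, 0.1], node_vecs, edges, top_k=1, hops=2)
print(seeds, context)
```

The hop limit is the main tuning knob: more hops pull in more context but also more loosely related facts.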

Microsoft GraphRAG

Microsoft Research released GraphRAG in 2024 to address the limits of naive vector RAG on complex synthesis tasks. It doesn't require a predefined schema: it uses an LLM to extract entities and relationships, then applies the Leiden algorithm to cluster them into communities. Each community gets an LLM-generated summary. At query time, both local search (specific entities) and global search (community summaries) are available.
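
The shape of that index (communities, each with a summary) can be sketched in a few lines. This is not GraphRAG's code: connected components are used here as a deliberately crude substitute for Leiden clustering, and `summarise` is a hypothetical placeholder for the LLM call that writes each summary.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into components (a simple stand-in for Leiden)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components


def summarise(members):
    """Hypothetical placeholder for an LLM-written community summary."""
    return "Community of " + ", ".join(members)


def build_community_index(edges):
    """Map community id -> members and summary, ready for global search."""
    return {
        i: {"members": members, "summary": summarise(members)}
        for i, members in enumerate(connected_components(edges))
    }


graph_edges = [("Alice", "Acme"), ("Acme", "Berlin"), ("Carol", "Initech")]
index = build_community_index(graph_edges)
for community_id, entry in index.items():
    print(community_id, entry["summary"])
```

A global query would read the summaries rather than individual chunks; a local query would start from a specific entity and its neighbours.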

What makes it powerful

  • Hierarchical community summaries enable global synthesis queries
  • 86% accuracy on multi-hop tasks vs 32% for vector RAG
  • Handles “what are the main themes across this entire corpus?”, a question that is impossible for vector search

The cost caveat

  • Full indexing costs $20–$500 per corpus (many LLM calls to generate summaries)
  • LazyGraphRAG (June 2025) reduces this to under $5 by deferring summaries to query time
  • Static summaries require reindexing when new data is added

Entity Resolution: The Hard Part

Extraction is tractable. Resolution is the part most tutorials skip. The same real-world entity appears under many names: “GPT-4”, “GPT4”, “gpt-4-turbo”, “OpenAI's flagship model”. Without resolution, your graph fragments into duplicate nodes and broken relationship chains.

Practical approaches:

  • Rule-based normalisation: lowercase, strip punctuation, canonical name lists
  • Embedding similarity clustering: embed entity mentions, cluster near-duplicates, assign canonical ID
  • Wikidata/DBpedia linking: anchor entities to external knowledge bases for ground truth canonicalisation
  • LLM deduplication: ask the LLM “are these referring to the same entity?” for ambiguous cases
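
The first two approaches can be combined in a short sketch: normalise each mention, then greedily attach it to an existing canonical entity when the two are similar enough. As an assumption for the example, string similarity from the standard library's `difflib` stands in for embedding similarity, and the 0.75 threshold and function names are illustrative rather than recommended values.

```python
import re
from difflib import SequenceMatcher


def normalise(mention):
    """Rule-based normalisation: lowercase and strip non-alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", mention.lower())


def resolve(mentions, threshold=0.75):
    """Map each mention to a canonical name (the first-seen surface form)."""
    canon_keys = []   # normalised key of each canonical entity
    canon_names = []  # surface form chosen as the canonical name
    mapping = {}
    for mention in mentions:
        key = normalise(mention)
        for i, existing in enumerate(canon_keys):
            similar = SequenceMatcher(None, key, existing).ratio() >= threshold
            if key == existing or similar:
                mapping[mention] = canon_names[i]
                break
        else:
            # No close match: this mention starts a new canonical entity.
            canon_keys.append(key)
            canon_names.append(mention)
            mapping[mention] = mention
    return mapping


mentions = ["GPT-4", "GPT4", "gpt 4", "OpenAI", "OpenAI Inc.", "Neo4j", "neo4j"]
mapping = resolve(mentions)
for mention, canonical in mapping.items():
    print(f"{mention!r} -> {canonical!r}")
```

Pairs that land near the threshold are the ones worth sending to an LLM or a human for a same-entity check, since pure string or embedding similarity cannot tell a rename from a genuinely different entity.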

Checklist: Do You Understand This?

  • The 6 phases: data prep → entity extraction → relationship extraction → entity resolution → graph storage → query integration
  • Neo4j LLM Graph Builder is the fastest no-code path; LangChain and LlamaIndex give more control
  • Entity resolution (deduplication and canonicalisation) is the step most implementations skip, and the source of most graph quality issues
  • Microsoft GraphRAG uses Leiden community detection + LLM summaries to enable global synthesis queries
  • LazyGraphRAG (2025) reduces the $20–$500 indexing cost to under $5 by deferring summaries to query time

Page built: 01 Jun 2026