Building Knowledge Graphs with AI
Before LLMs, building a knowledge graph required manual annotation, custom NLP pipelines, and domain experts. Now you can point an LLM at a corpus of documents and have a queryable graph in hours. The core pipeline has six phases, each with tool choices that trade off speed, flexibility, and control.
The 6-Phase Pipeline
Gather source material: documents, PDFs, web pages, database exports, code. Chunk the text before extraction: chunks of 512–1024 tokens with some overlap work well.
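Overlapping chunking can be sketched in a few lines. This is a minimal illustration, not a production splitter: the function name `chunk_tokens`, the 64-token overlap, and the whitespace split used in place of a real tokenizer are all assumptions made for the example.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Assumption: whitespace splitting is a rough stand-in for a real tokenizer.
text = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_tokens(text.split(), chunk_size=512, overlap=64)
print([len(c) for c in chunks])
```

The overlap means an entity mention that straddles a chunk boundary still appears whole in at least one chunk, at the cost of extracting some text twice.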
Identify nodes from text. LLMs offer maximum flexibility for domain-specific entities; spaCy gives fast deterministic extraction for standard types; GliNER handles zero-shot custom types.
Identify edges between extracted entities. REBEL extracts entity-relationship triples simultaneously. LLMs handle complex contextual relationships. spacy-llm combines both.
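Prompt-guided extraction with an LLM might look roughly like the sketch below. The prompt wording, the JSON shape, and `call_llm` are all hypothetical: `call_llm` returns canned output here so the example runs without a model, and a real pipeline would replace it with a call to whichever LLM API is in use.

```python
import json

# Hypothetical prompt; doubled braces are literal braces after .format().
PROMPT_TEMPLATE = (
    "Extract entities and relationships from the text below. "
    'Reply with JSON only, shaped as {{"triples": [{{"subject": "...", '
    '"predicate": "...", "object": "..."}}]}}.\n\nText:\n{chunk}'
)


def call_llm(prompt):
    """Hypothetical stand-in for a real model call; returns canned JSON."""
    return json.dumps({"triples": [
        {"subject": "OpenAI", "predicate": "DEVELOPED", "object": "GPT-4"},
        {"subject": "GPT-4", "predicate": "", "object": "ChatGPT"},  # malformed
    ]})


def extract_triples(chunk, llm=call_llm):
    """Ask the model for triples and keep only the well-formed ones."""
    raw = llm(PROMPT_TEMPLATE.format(chunk=chunk))
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # the model replied with something other than JSON
    if not isinstance(payload, dict):
        return []
    triples = []
    for item in payload.get("triples", []):
        if not isinstance(item, dict):
            continue
        parts = tuple(
            str(item.get(key, "")).strip()
            for key in ("subject", "predicate", "object")
        )
        if all(parts):  # drop triples with an empty field
            triples.append(parts)
    return triples


triples = extract_triples("OpenAI developed GPT-4, which powers ChatGPT.")
print(triples)
```

Validating the model's reply before it reaches the graph matters because LLM output is not guaranteed to be valid JSON or to fill every field.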
Deduplicate and canonicalise: 'OpenAI', 'Open AI', and 'openai.com' are the same entity. Use Coreferee for pronoun resolution, Wikidata/DBpedia for external linking, or LLM clustering for proprietary graphs.
Persist nodes, edges, and properties to a graph database. Neo4j with its LLM Graph Builder is the standard starting point. Add vector embeddings alongside the graph for hybrid retrieval.
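As a sketch of the storage step, the helper below turns one triple into a parameterised Cypher `MERGE` statement. The helper name `merge_statement`, the `Entity` label, and the `name` property are assumptions for illustration. Cypher does not accept labels or relationship types as parameters, so the sketch validates them before interpolating.

```python
import re

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def merge_statement(subject, predicate, obj, label="Entity"):
    """Build an idempotent Cypher MERGE and its parameter dict for one triple."""
    # Labels and relationship types cannot be Cypher parameters, so they
    # are checked against a strict pattern before going into the query text.
    rel_type = predicate.strip().upper().replace(" ", "_")
    for ident in (label, rel_type):
        if not _IDENT.fullmatch(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    query = (
        f"MERGE (s:{label} {{name: $subject}}) "
        f"MERGE (o:{label} {{name: $object}}) "
        f"MERGE (s)-[:{rel_type}]->(o)"
    )
    return query, {"subject": subject, "object": obj}


query, params = merge_statement("OpenAI", "developed", "GPT-4")
print(query)
print(params)
```

With the official Neo4j Python driver, the pair would typically be passed to `session.run(query, params)`. Using `MERGE` rather than `CREATE` keeps repeated ingestion runs from duplicating nodes and edges.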
Expose the graph to applications. Use LangChain's GraphCypherQAChain to translate natural language into Cypher, LlamaIndex's KnowledgeGraphQueryEngine, or write raw Cypher queries for precision.
Extraction Tool Choices
| Tool | Phase | Approach | Best For |
|---|---|---|---|
| spaCy | Entity extraction | Deterministic NER: fast, no LLM cost | Standard entity types (person, org, location) at scale |
| GliNER | Entity extraction | Zero-shot learning for custom entity types | Domain-specific entities without training data |
| REBEL | Relationship extraction | Transformer extracting entity + relationship triples simultaneously | Efficient relation extraction from long text |
| LLM (any) | Entity + relationship | Prompt-guided extraction: maximum flexibility | Complex domains, custom schemas, contextual relationships |
| Coreferee | Entity resolution | Pronoun and reference resolution ("she" → Alice) | Narrative text, unstructured documents |
Framework Approaches
Three practical patterns for building LLM-assisted knowledge graphs, ordered from fastest-to-start to most flexible:
Neo4j LLM Graph Builder
No-code, browser-based tool. Upload documents (PDF, web pages, YouTube transcripts), choose your LLM, and the builder extracts entities and relationships automatically. Free tier on AuraDB.
Fastest path, least control over extraction schema.
LangChain LLMGraphTransformer
Python library. Define allowed node types and relationship types, pass documents, and LangChain calls your LLM to extract graph triples. Writes directly to Neo4j via the LangChain graph store interface.
Good balance of speed and control. Schema-guided.
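The idea behind schema guidance can be illustrated without LangChain. The sketch below is not LLMGraphTransformer's API: the names `ALLOWED_NODES`, `ALLOWED_RELATIONSHIPS`, and `filter_to_schema` are hypothetical, and the candidate triples are hand-written stand-ins for LLM output.

```python
# Hypothetical schema: which node types and typed relationships are allowed.
ALLOWED_NODES = {"Company", "Model", "Person"}
ALLOWED_RELATIONSHIPS = {
    ("Company", "DEVELOPED", "Model"),
    ("Person", "WORKS_AT", "Company"),
}


def filter_to_schema(candidates):
    """Keep only triples whose node types and relationship fit the schema."""
    kept = []
    for (s_name, s_type), rel, (o_name, o_type) in candidates:
        types_ok = s_type in ALLOWED_NODES and o_type in ALLOWED_NODES
        if types_ok and (s_type, rel, o_type) in ALLOWED_RELATIONSHIPS:
            kept.append(((s_name, s_type), rel, (o_name, o_type)))
    return kept


# Hand-written stand-ins for what an LLM might propose.
candidates = [
    (("OpenAI", "Company"), "DEVELOPED", ("GPT-4", "Model")),
    (("GPT-4", "Model"), "LOCATED_IN", ("San Francisco", "City")),
    (("Alice", "Person"), "WORKS_AT", ("OpenAI", "Company")),
]
graph_triples = filter_to_schema(candidates)
print(graph_triples)
```

Constraining the schema trades recall for consistency: off-schema facts are dropped, but the resulting graph is easier to query because every edge has a predictable shape.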
LlamaIndex PropertyGraph
Extracts triplets (Subject, Predicate, Object) and stores them alongside original text chunks. Enables hybrid retrieval: vector search to find starting nodes, then graph traversal for context.
Best hybrid RAG integration. More setup required.
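The hybrid retrieval pattern (vector search for starting nodes, then graph traversal) can be sketched in plain Python. This is an illustration of the pattern, not LlamaIndex code: the three-dimensional vectors are toy values standing in for real embeddings, and `hybrid_retrieve` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors; real embeddings would come from an embedding model.
node_vectors = {
    "GPT-4": [0.9, 0.1, 0.0],
    "OpenAI": [0.7, 0.3, 0.1],
    "Neo4j": [0.0, 0.2, 0.9],
}
edges = [
    ("OpenAI", "DEVELOPED", "GPT-4"),
    ("GPT-4", "POWERS", "ChatGPT"),
    ("Neo4j", "STORES", "Graphs"),
]


def hybrid_retrieve(query_vector, top_k=1, hops=1):
    """Vector search picks start nodes; graph traversal gathers context."""
    ranked = sorted(
        node_vectors,
        key=lambda name: cosine(query_vector, node_vectors[name]),
        reverse=True,
    )
    start = ranked[:top_k]
    frontier, seen, context = set(start), set(start), []
    for _ in range(hops):
        next_frontier = set()
        for s, rel, o in edges:
            if s in frontier or o in frontier:
                if (s, rel, o) not in context:
                    context.append((s, rel, o))
                next_frontier.update({s, o} - seen)
        seen |= next_frontier
        frontier = next_frontier
    return start, context


start_nodes, context = hybrid_retrieve([1.0, 0.0, 0.0])
print(start_nodes, context)
```

The traversal surfaces "ChatGPT", which has no vector of its own in this toy data, showing how graph edges can reach context that similarity search alone would miss.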
Microsoft GraphRAG
Microsoft Research released GraphRAG in 2024 to address the limits of naive vector RAG on complex synthesis tasks. It doesn't require a predefined schema: it uses an LLM to extract entities and relationships, then applies the Leiden algorithm to cluster them into communities. Each community gets an LLM-generated summary. At query time, both local search (specific entities) and global search (community summaries) are available.
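The shape of this index can be sketched with the standard library. Two substitutions are made here and should be read as assumptions: connected components stand in for the Leiden algorithm (which finds finer, hierarchical communities), and `summarise` is a hypothetical placeholder for the LLM-written summary. The `local_search` and `global_search` functions are simplified to show only which part of the index each mode reads.

```python
from collections import defaultdict

# Toy entity graph; in GraphRAG these edges would come from LLM extraction.
edges = [
    ("OpenAI", "GPT-4"), ("GPT-4", "ChatGPT"),
    ("Neo4j", "Cypher"), ("Cypher", "AuraDB"),
]


def build_adjacency(edge_list):
    """Undirected adjacency sets for every node in the edge list."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def connected_components(adjacency):
    """Stand-in for Leiden: group nodes that are reachable from one another."""
    communities, seen = [], set()
    for node in adjacency:
        if node in seen:
            continue
        stack, members = [node], set()
        while stack:
            current = stack.pop()
            if current not in members:
                members.add(current)
                stack.extend(adjacency[current] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


def summarise(members):
    """Hypothetical placeholder for an LLM-generated community summary."""
    return "Community covering: " + ", ".join(members)


adjacency = build_adjacency(edges)
communities = connected_components(adjacency)
summaries = [summarise(members) for members in communities]


def local_search(entity):
    """Local search (simplified): look at one entity's direct neighbours."""
    return sorted(adjacency.get(entity, set()))


def global_search():
    """Global search (simplified): read every community summary."""
    return list(summaries)


print(local_search("GPT-4"))
print(global_search())
```

Because the summaries are computed once at indexing time, a corpus-wide question only needs to read a handful of summaries rather than every chunk.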
What makes it powerful
- Hierarchical community summaries enable global synthesis queries
- 86% accuracy on multi-hop tasks vs 32% for vector RAG
- Handles questions like "What are the main themes across this entire corpus?", which vector search alone cannot answer
The cost caveat
- Full indexing costs $20–$500 per corpus (many LLM calls to generate summaries)
- LazyGraphRAG (June 2025) reduces this to under $5 by deferring summaries to query time
- Static summaries require reindexing when new data is added
Entity Resolution: The Hard Part
Extraction is tractable. Resolution is the part most tutorials skip. The same real-world entity appears under many names: "GPT-4", "GPT4", "gpt-4-turbo", "OpenAI's flagship model". Without resolution, your graph fragments into duplicate nodes and broken relationship chains.
Practical approaches:
- Rule-based normalisation: lowercase, strip punctuation, canonical name lists
- Embedding similarity clustering: embed entity mentions, cluster near-duplicates, assign canonical ID
- Wikidata/DBpedia linking: anchor entities to external knowledge bases for ground truth canonicalisation
- LLM deduplication: ask the LLM "are these referring to the same entity?" for ambiguous cases
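The first two approaches can be combined in a small sketch. Several things here are assumptions for illustration: the alias table is hypothetical, the 0.85 threshold is arbitrary, and `difflib.SequenceMatcher` string similarity stands in for embedding similarity so that the example runs with the standard library alone.

```python
import re
from difflib import SequenceMatcher

# Hypothetical canonical name list mapping known aliases to one key.
CANONICAL_ALIASES = {"openaicom": "openai"}


def normalise(mention):
    """Rule-based step: lowercase, strip punctuation and whitespace."""
    key = re.sub(r"[^a-z0-9]", "", mention.lower())
    return CANONICAL_ALIASES.get(key, key)


def resolve(mentions, threshold=0.85):
    """Greedy clustering: attach each mention to the first similar cluster."""
    clusters = []  # list of (canonical_key, [original mentions])
    for mention in mentions:
        key = normalise(mention)
        for canonical, members in clusters:
            # String similarity stands in for embedding similarity here.
            if SequenceMatcher(None, key, canonical).ratio() >= threshold:
                members.append(mention)
                break
        else:
            clusters.append((key, [mention]))
    return {canonical: members for canonical, members in clusters}


resolved = resolve(["OpenAI", "Open AI", "openai.com", "GPT-4", "GPT4", "Neo4j"])
print(resolved)
```

Each key in the result would become the canonical node ID, with the original mentions kept as aliases. Pairs that fall just below the threshold are the natural candidates to hand to an LLM for a same-entity check.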
Checklist: Do You Understand This?
- The 6 phases: data prep → entity extraction → relationship extraction → entity resolution → graph storage → query integration
- Neo4j LLM Graph Builder is the fastest no-code path; LangChain and LlamaIndex give more control
- Entity resolution (deduplication and canonicalisation) is the step most implementations skip, and it is the source of most graph quality issues
- Microsoft GraphRAG uses Leiden community detection + LLM summaries to enable global synthesis queries
- LazyGraphRAG (2025) reduces the $20–$500 indexing cost to under $5 by deferring summaries to query time