Intermediate

Building Knowledge Graphs with AI

Before LLMs, building a knowledge graph required manual annotation, custom NLP pipelines, and domain experts. Now you can point an LLM at a corpus of documents and have a queryable graph in hours. The core pipeline has six phases — each with tool choices that trade off speed, flexibility, and control.

The 6-Phase Pipeline

Data Acquisition & Preparation

Gather source material: documents, PDFs, web pages, database exports, code. Chunk text appropriately — 512–1024 token chunks with overlap work well for extraction.
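
The overlapping-chunk idea can be sketched as a sliding window. This is a minimal illustration, not a production splitter: the function name `chunk_tokens`, the 64-token overlap default, and the use of whitespace splitting in place of a real tokenizer are all assumptions made for the example.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace splitting stands in for a real tokenizer (an assumption).
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
```

Each chunk repeats the tail of the previous one, so an entity mention that straddles a boundary still appears whole in at least one chunk.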

Entity Extraction

Identify nodes from text. LLMs offer maximum flexibility for domain-specific entities; spaCy gives fast deterministic extraction for standard types; GLiNER handles zero-shot custom types.

Relationship Extraction

Identify edges between extracted entities. REBEL extracts entities and relations together, emitting triples in a single pass. LLMs handle complex contextual relationships. spacy-llm combines both approaches by running LLM-backed components inside a spaCy pipeline.
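
As a rough sketch of prompt-guided extraction, the snippet below asks a model for entities and triples as JSON and discards triples whose endpoints were never extracted. The prompt wording, the JSON shape, and the `call_llm` callable are hypothetical; `fake_llm` is a stand-in so the example runs without a model.

```python
import json

# Hypothetical prompt wording; tune it for your model and domain.
EXTRACTION_PROMPT = (
    "Extract entities and relationships from the text below.\n"
    'Reply with JSON only, shaped like {{"entities": [{{"name": "...", "type": "..."}}], '
    '"triples": [["subject", "predicate", "object"]]}}.\n\n'
    "Text:\n{chunk}"
)


def extract_graph(chunk, call_llm):
    """Ask an LLM for entities and triples, keeping only well-formed triples."""
    raw = call_llm(EXTRACTION_PROMPT.format(chunk=chunk))
    data = json.loads(raw)  # raises ValueError if the model returned non-JSON
    entities = {e["name"]: e["type"] for e in data.get("entities", [])}
    triples = [
        tuple(t)
        for t in data.get("triples", [])
        # Drop triples whose endpoints were never extracted as entities.
        if len(t) == 3 and t[0] in entities and t[2] in entities
    ]
    return entities, triples


def fake_llm(prompt):
    """Stand-in for a real model call so the sketch runs offline."""
    return json.dumps({
        "entities": [
            {"name": "OpenAI", "type": "Organization"},
            {"name": "GPT-4", "type": "Model"},
        ],
        "triples": [
            ["OpenAI", "DEVELOPED", "GPT-4"],
            ["OpenAI", "FOUNDED_BY", "Unknown Person"],
        ],
    })


entities, triples = extract_graph("OpenAI developed GPT-4.", fake_llm)
```

Validating endpoints at this stage keeps dangling edges out of the graph; a real pipeline would probably also retry or repair malformed JSON.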

Entity Resolution & Linking

Deduplicate and canonicalise — 'OpenAI', 'Open AI', and 'openai.com' are the same entity. Use Coreferee for pronoun resolution, Wikidata/DBpedia for external linking, or LLM clustering for proprietary graphs.
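
A minimal sketch of rule-based canonicalisation for the example above. The `ALIASES` table and both function names are hypothetical, and the normaliser keeps ASCII letters and digits only, which is an assumption that would not suit multilingual data.

```python
import re

# Hypothetical alias table mapping normalised keys to canonical names.
ALIASES = {
    "openai": "OpenAI",
    "openaicom": "OpenAI",
}


def normalise(mention):
    """Lowercase and strip everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", mention.casefold())


def canonical_name(mention):
    """Return the canonical name if the mention is a known alias."""
    return ALIASES.get(normalise(mention), mention.strip())


resolved = [canonical_name(m) for m in ["OpenAI", "Open AI", "openai.com", "Anthropic"]]
```

Unknown mentions pass through unchanged, so this step can run before a more expensive resolver without losing anything.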

Graph Storage

Persist nodes, edges, and properties to a graph database. Neo4j with its LLM Graph Builder is the standard starting point. Add vector embeddings alongside the graph for hybrid retrieval.
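
One way to persist an extracted triple is an idempotent Cypher `MERGE` statement. The sketch below only builds the query text and parameters; the `name` property, the label names, and the function name are assumptions for illustration. Labels and relationship types are interpolated into the query text here, so they are validated first, while entity names travel as parameters.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def merge_triple_query(subj_label, rel_type, obj_label):
    """Build a Cypher statement that upserts two nodes and the edge between them."""
    for ident in (subj_label, rel_type, obj_label):
        # Reject anything that is not a plain identifier before interpolating it.
        if not _IDENT.match(ident):
            raise ValueError(f"unsafe Cypher identifier: {ident!r}")
    return (
        f"MERGE (s:{subj_label} {{name: $subject}}) "
        f"MERGE (o:{obj_label} {{name: $object}}) "
        f"MERGE (s)-[:{rel_type}]->(o)"
    )


query = merge_triple_query("Organization", "DEVELOPED", "Model")
params = {"subject": "OpenAI", "object": "GPT-4"}
```

The `query` and `params` pair can then be handed to whichever Neo4j client you use; because every clause is a `MERGE`, re-running the same triple does not create duplicates.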

Querying & Integration

Expose the graph to applications. Use LangChain's GraphCypherQAChain for natural language → Cypher, LlamaIndex's KnowledgeGraphQueryEngine, or build raw Cypher queries for precision.
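
For the raw-Cypher option, a hand-written query might look like the sketch below, which returns the names of nodes within a few hops of a starting entity. The `name` property and the function name are assumptions; the hop bound is inlined as a validated literal, while the entity name and result limit are passed as parameters.

```python
def neighbourhood_query(max_hops=2):
    """Build a Cypher query returning nodes within max_hops of a named node."""
    # The hop bound is written into the query text, so validate it strictly.
    if type(max_hops) is not int or not 1 <= max_hops <= 5:
        raise ValueError("max_hops must be an integer between 1 and 5")
    return (
        "MATCH (start {name: $name})"
        f"-[*1..{max_hops}]-(neighbour) "
        "RETURN DISTINCT neighbour.name AS name LIMIT $limit"
    )


query = neighbourhood_query(2)
params = {"name": "OpenAI", "limit": 25}
```

Writing the query by hand trades convenience for predictability: the traversal depth and returned fields are fixed, rather than left to an LLM-generated Cypher string.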

Extraction Tool Choices

Tool	Phase	Approach	Best For
spaCy	Entity extraction	Deterministic NER — fast, no LLM cost	Standard entity types (person, org, location) at scale
GLiNER	Entity extraction	Zero-shot learning for custom entity types	Domain-specific entities without training data
REBEL	Relationship extraction	Transformer extracting entity + relationship triples simultaneously	Efficient relation extraction from long text
LLM (any)	Entity + relationship	Prompt-guided extraction — maximum flexibility	Complex domains, custom schemas, contextual relationships
Coreferee	Entity resolution	Pronoun and reference resolution (“she” → Alice)	Narrative text, unstructured documents

Framework Approaches

Three practical patterns for building LLM-assisted knowledge graphs, ordered from fastest-to-start to most flexible:

Neo4j LLM Graph Builder

No-code, browser-based tool. Upload documents (PDF, web pages, YouTube transcripts), choose your LLM, and the builder extracts entities and relationships automatically. Free tier on AuraDB.

Fastest path — least control over extraction schema

LangChain LLMGraphTransformer

Python library. Define allowed node types and relationship types, pass documents, and LangChain calls your LLM to extract graph triples. Writes directly to Neo4j via the LangChain graph store interface.

Good balance of speed and control; extraction is schema-guided.
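
To make "schema-guided" concrete, here is a library-free sketch of what restricting extraction to allowed node and relationship types amounts to. It is not LangChain's implementation; the type names and the function `filter_to_schema` are hypothetical.

```python
# Hypothetical schema: the only node and relationship types the graph accepts.
ALLOWED_NODES = {"Person", "Organization", "Model"}
ALLOWED_RELATIONSHIPS = {"WORKS_AT", "DEVELOPED"}


def filter_to_schema(entities, triples):
    """Keep only entities and triples whose types appear in the schema."""
    kept_entities = {
        name: etype for name, etype in entities.items() if etype in ALLOWED_NODES
    }
    kept_triples = [
        (s, p, o)
        for s, p, o in triples
        if p in ALLOWED_RELATIONSHIPS and s in kept_entities and o in kept_entities
    ]
    return kept_entities, kept_triples


raw_entities = {"OpenAI": "Organization", "GPT-4": "Model", "2023": "Date"}
raw_triples = [
    ("OpenAI", "DEVELOPED", "GPT-4"),
    ("GPT-4", "RELEASED_IN", "2023"),
]
schema_entities, schema_triples = filter_to_schema(raw_entities, raw_triples)
```

A fixed schema keeps the graph queryable with predictable Cypher, at the cost of silently dropping facts the schema did not anticipate.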

LlamaIndex PropertyGraph

LlamaIndex's PropertyGraphIndex extracts triplets (subject, predicate, object) and stores them alongside the original text chunks. This enables hybrid retrieval: vector search finds starting nodes, then graph traversal gathers context.

Best hybrid RAG integration. More setup required.
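
The hybrid retrieval pattern can be sketched without any framework: rank nodes by vector similarity to pick seeds, then walk the graph outward from them. This is not LlamaIndex's implementation; the toy two-dimensional vectors, the function names, and the `top_k` and `hops` defaults are assumptions.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query_vec, node_vecs, edges, top_k=1, hops=1):
    """Vector search for seed nodes, then breadth-first expansion over edges."""
    ranked = sorted(node_vecs, key=lambda n: cosine(query_vec, node_vecs[n]), reverse=True)
    seeds = ranked[:top_k]
    # Treat edges as undirected for context gathering.
    adjacency = {}
    for s, _, o in edges:
        adjacency.setdefault(s, set()).add(o)
        adjacency.setdefault(o, set()).add(s)
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seeds, seen


# Toy embeddings; a real system would use an embedding model.
node_vecs = {"OpenAI": [1.0, 0.0], "GPT-4": [0.6, 0.8], "Paris": [0.0, 1.0]}
edges = [("OpenAI", "DEVELOPED", "GPT-4"), ("Paris", "CAPITAL_OF", "France")]
seeds, context = hybrid_retrieve([1.0, 0.0], node_vecs, edges)
```

The vector step finds where to start; the traversal step pulls in connected facts that would not necessarily rank highly on similarity alone.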

Microsoft GraphRAG

Microsoft Research released GraphRAG in 2024 to address the limits of naive vector RAG on complex synthesis tasks. It doesn't require a predefined schema — it uses an LLM to extract entities and relationships, then applies the Leiden algorithm to cluster them into communities. Each community gets an LLM-generated summary. At query time, both local search (specific entities) and global search (community summaries) are available.
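
The shape of that index can be illustrated with a heavily simplified sketch. It is not GraphRAG's code: connected components stand in for Leiden clustering, the communities are flat rather than hierarchical, and `fake_summarise` replaces the LLM summariser. All function names are hypothetical.

```python
def find_communities(edges):
    """Group nodes into connected components (a crude stand-in for Leiden)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, _, o in edges:
        parent[find(s)] = find(o)
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=lambda g: sorted(g))


def build_index(edges, summarise):
    """Pair each community with a summary produced by the given callable."""
    return [(members, summarise(sorted(members))) for members in find_communities(edges)]


def local_search(entity, edges):
    """Local search: facts that mention one specific entity."""
    return [t for t in edges if entity in (t[0], t[2])]


def global_search(index):
    """Global search: every community summary, ready to be synthesised."""
    return [summary for _, summary in index]


def fake_summarise(members):
    """Stand-in for an LLM-written community summary."""
    return "Community about: " + ", ".join(members)


edges = [
    ("OpenAI", "DEVELOPED", "GPT-4"),
    ("Microsoft", "INVESTED_IN", "OpenAI"),
    ("Leiden", "LOCATED_IN", "Netherlands"),
]
index = build_index(edges, fake_summarise)
```

Because summaries are computed once at indexing time, global queries read a handful of summaries instead of the whole corpus, which is also why adding data means recomputing them.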

What makes it powerful

Hierarchical community summaries enable global synthesis queries
86% accuracy on multi-hop tasks vs 32% for vector RAG
Handles “what are the main themes across this entire corpus?”, a question that top-k vector search alone answers poorly because no single chunk contains the answer

The cost caveat

Full indexing costs $20–$500 per corpus (many LLM calls to generate summaries)
LazyGraphRAG (announced by Microsoft Research in November 2024) reduces this to under $5 by deferring summaries to query time
Static summaries require reindexing when new data is added

Entity Resolution — The Hard Part

Extraction is tractable. Resolution is the part most tutorials skip. The same real-world entity appears under many names: “GPT-4”, “GPT4”, “gpt-4-turbo”, “OpenAI's flagship model”. Without resolution, your graph fragments into duplicate nodes and broken relationship chains.

Practical approaches:

Rule-based normalisation — lowercase, strip punctuation, canonical name lists
Embedding similarity clustering — embed entity mentions, cluster near-duplicates, assign canonical ID
Wikidata/DBpedia linking — anchor entities to external knowledge bases for ground truth canonicalisation
LLM deduplication — ask the LLM “are these referring to the same entity?” for ambiguous cases
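
The embedding-similarity approach from the list above can be sketched as pairwise comparison plus union-find. The 0.9 threshold, the function names, and the toy vectors are assumptions; `embed` is any callable that maps a mention to a vector, and a real pipeline would pass an embedding model here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_mentions(mentions, embed, threshold=0.9):
    """Map each mention to the first-seen mention of its similarity cluster."""
    vectors = [embed(m) for m in mentions]
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as root so the first mention wins.
                    parent[max(ri, rj)] = min(ri, rj)
    return {m: mentions[find(i)] for i, m in enumerate(mentions)}


# Toy vectors standing in for real embeddings.
TOY_VECTORS = {"GPT-4": [1.0, 0.0], "GPT4": [0.98, 0.2], "Claude": [0.0, 1.0]}
canonical = cluster_mentions(list(TOY_VECTORS), TOY_VECTORS.get)
```

The pairwise loop is quadratic in the number of mentions, so larger graphs would need blocking or an approximate nearest-neighbour index; pairs that land near the threshold are natural candidates for the LLM check.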

Checklist: Do You Understand This?

The 6 phases: data prep → entity extraction → relationship extraction → entity resolution → graph storage → query integration
Neo4j LLM Graph Builder is the fastest no-code path; LangChain and LlamaIndex give more control
Entity resolution (deduplication and canonicalisation) is the step most implementations skip — and the source of most graph quality issues
Microsoft GraphRAG uses Leiden community detection + LLM summaries to enable global synthesis queries
LazyGraphRAG reduces the $20–$500 indexing cost to under $5 by deferring summaries to query time