From neo4j-skills
Ingests PDFs, HTML, plain text, Markdown, and JSON into Neo4j as knowledge graphs using chunking, LLM entity/relation extraction (neo4j-graphrag SimpleKGPipeline), apoc.load.json, or LangChain/LlamaIndex loaders. Covers RAG schema design.
npx claudepluginhub neo4j-contrib/neo4j-skillsThis skill is limited to using the following tools:
- Ingesting PDFs, HTML, plain text, Markdown into Neo4j as a knowledge graph
Imports structured CSV/JSON/Parquet data into Neo4j using LOAD CSV, CALL IN TRANSACTIONS, neo4j-admin bulk import, APOC procedures, and driver batching. Guides method selection, constraints, validation for migrations and large datasets.
Design and build knowledge graphs for modeling complex relationships, semantic search, and knowledge bases. Guides ontology design, entity relationships, and graph database selection.
Designs knowledge graphs from unstructured data. Guides model selection (LPG, RDF, hypergraph, temporal), schema/ontology design, entity/relation extraction pipelines, and layered architecture for RAG or analytics.
Share bugs, ideas, or general feedback.
:Chunk nodes with embeddingsSimpleKGPipeline (neo4j-graphrag) programmaticallyapoc.load.jsonneo4j-import-skillneo4j-graphrag-skillneo4j-vector-search-skillneo4j-cypher-skill| Situation | Approach |
|---|---|
| No code; drag-and-drop UX wanted | LLM Graph Builder web UI |
| Programmatic pipeline; PDFs/text | SimpleKGPipeline (neo4j-graphrag) |
| JSON / REST API responses | apoc.load.json or Python + UNWIND |
| LangChain already in stack | Neo4jGraph + document loader |
| LlamaIndex already in stack | Neo4jQueryEngine / Neo4jVectorStore |
| Chunk-only (no entity extraction) | Manual chunking + MERGE pattern |
pip install neo4j-graphrag # includes SimpleKGPipeline
pip install neo4j-graphrag[openai] # + OpenAI LLM/embedder
pip install neo4j-graphrag[anthropic] # + Anthropic Claude
pip install neo4j-graphrag[google] # + Vertex AI / Gemini
# spaCy entity resolver (Python <= 3.13 only — unsupported on 3.14+):
pip install neo4j-graphrag[nlp]
Requires: neo4j>=6.0.0, Python>=3.10, Neo4j>=5.18.1 (Aura>=5.18.0).
Schema controls what the LLM extracts. Define before pipeline construction.
# Option A — Simple string lists (LLM infers descriptions)
entities = ["Person", "Organization", "Location", "Product", "Event"]
relations = ["WORKS_AT", "LOCATED_IN", "KNOWS", "MENTIONS", "PART_OF"]
patterns = [
("Person", "WORKS_AT", "Organization"),
("Organization", "LOCATED_IN", "Location"),
("Person", "KNOWS", "Person"),
("Article", "MENTIONS", "Organization"),
]
# Option B — Rich schema (better extraction quality)
from neo4j_graphrag.experimental.components.schema import (
SchemaBuilder, SchemaEntity, SchemaRelation
)
schema = SchemaBuilder().create_schema_from_dict({
"entities": {
"Person": {"description": "A human individual", "properties": {"name": "str", "role": "str"}},
"Organization": {"description": "A company or institution", "properties": {"name": "str", "industry": "str"}},
},
"relations": {
"WORKS_AT": {"description": "Employment relationship"},
},
"patterns": [("Person", "WORKS_AT", "Organization")],
})
# Option C — Auto-extract schema from text (no constraints)
schema = "EXTRACTED" # LLM infers types; noisier output
schema = "FREE" # No schema guidance; most noise
Use Option B for production; Option A for prototyping; "EXTRACTED" only for exploration.
import asyncio
from neo4j import GraphDatabase
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
driver = GraphDatabase.driver(
"neo4j+s://xxxx.databases.neo4j.io",
auth=("neo4j", "password")
)
llm = OpenAILLM(
model_name="gpt-4o",
model_params={"temperature": 0, "response_format": {"type": "json_object"}},
)
embedder = OpenAIEmbeddings() # OPENAI_API_KEY from env
pipeline = SimpleKGPipeline(
llm=llm,
driver=driver,
embedder=embedder,
entities=entities, # from Step 1
relations=relations,
patterns=patterns,
from_file=True, # False → pass text= instead of file_path=
on_error="IGNORE", # RAISE to surface extraction failures
perform_entity_resolution=True,
neo4j_database="neo4j", # omit to use default
)
LLM alternatives (same interface):
AnthropicLLM(model_name="claude-3-5-sonnet-20241022")VertexAILLM(model_name="gemini-1.5-pro-002")OllamaLLM(model_name="llama3") — local; no API key needed# From PDF or Markdown file:
result = asyncio.run(pipeline.run_async(
file_path="report.pdf",
document_metadata={"source": "Q4 report", "year": 2025},
))
# From raw text:
result = asyncio.run(pipeline.run_async(
text=document_text,
))
# Batch — process multiple files:
async def ingest_all(paths):
for p in paths:
await pipeline.run_async(file_path=str(p))
asyncio.run(ingest_all(list(pdf_dir.glob("*.pdf"))))
document_metadata dict is stored as properties on the :Document node.
Default splitter: FixedSizeSplitter(chunk_size=300, chunk_overlap=50).
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
splitter = FixedSizeSplitter(
chunk_size=512, # tokens; 300–512 typical for GPT-4o
chunk_overlap=50, # ~10% of chunk_size; preserves boundary context
approximate=True, # respect sentence/word boundaries when possible
)
pipeline = SimpleKGPipeline(
...,
text_splitter=splitter,
)
Chunking guidance:
| Document type | chunk_size | chunk_overlap |
|---|---|---|
| Dense technical text | 256–512 | 50–80 |
| Narrative / news articles | 512–1024 | 80–128 |
| Legal / financial docs | 256–384 | 40–64 |
Rule: chunk must fit within LLM context for extraction + within embedding model limits. GPT-4o: 128k context; text-embedding-3-small: 8191 tokens. Never set chunk_size > 2048.
Merge duplicate extracted entities after pipeline run.
from neo4j_graphrag.experimental.components.resolver import (
SinglePropertyExactMatchResolver, # identical name → merge
FuzzyMatchResolver, # Levenshtein similarity; needs rapidfuzz
SpaCySemanticMatchResolver, # cosine similarity; needs neo4j-graphrag[nlp]
)
# Exact match (fastest; good baseline)
resolver = SinglePropertyExactMatchResolver(driver)
asyncio.run(resolver.run())
# Fuzzy match (handles typos / alternate spellings)
from neo4j_graphrag.experimental.components.resolver import FuzzyMatchResolver
resolver = FuzzyMatchResolver(driver, threshold=0.9)
asyncio.run(resolver.run())
# Scope resolution to specific labels only:
resolver = SinglePropertyExactMatchResolver(
driver,
filter_query="WHERE n:Organization OR n:Person",
)
asyncio.run(resolver.run())
Run resolvers after ingestion, not inline — bulk merges are faster.
Pipeline always produces this lexical graph layer:
(:Document {id, fileName, status, ...metadata})
-[:HAS_CHUNK]->
(:Chunk {id, text, index, embedding, ...})
-[:NEXT_CHUNK]-> ← linked list for ordered traversal
(:Chunk {...})
(:Chunk)-[:FROM_DOCUMENT]->(:Document) ← back-pointer
Entity extraction adds:
(:Chunk)-[:MENTIONS]->(:Person {name, ...})
(:Chunk)-[:MENTIONS]->(:Organization {name, ...})
(:Person)-[:WORKS_AT]->(:Organization)
Verify after ingestion:
CYPHER 25
MATCH (d:Document)-[:HAS_CHUNK]->(c:Chunk)
RETURN d.fileName, count(c) AS chunks LIMIT 10;
MATCH (c:Chunk)-[:MENTIONS]->(e)
RETURN labels(e)[0] AS type, count(*) AS cnt ORDER BY cnt DESC LIMIT 20;
Use when: non-developers need to ingest docs; rapid prototyping; no Python environment.
Hosted: https://llm-graph-builder.neo4jlabs.com/
Local (Docker):
git clone https://github.com/neo4j-labs/llm-graph-builder
cd llm-graph-builder
# Set OPENAI_API_KEY (or other provider keys) in .env
docker-compose up
# Opens at http://localhost:8080
Supported sources: PDF, plain text, Markdown, images, web pages, YouTube transcripts, S3/GCS bucket uploads.
LLM providers: OpenAI, Gemini, Claude, Llama3, Diffbot, Qwen.
Limitations: best with long-form English text; poor on tabular data (use neo4j-import-skill for CSV/Excel); visual diagrams not extracted.
Use when source is JSON from REST APIs, S3, or file exports.
CYPHER 25
CALL apoc.load.json("https://example.com/articles.json") YIELD value
UNWIND value.articles AS article
CALL (article) {
MERGE (d:Document {id: article.id})
SET d.title = article.title, d.url = article.url, d.publishedAt = article.publishedAt
FOREACH (tag IN article.tags |
MERGE (t:Tag {name: tag})
MERGE (d)-[:HAS_TAG]->(t)
)
} IN TRANSACTIONS OF 1000 ROWS
Local file: apoc.load.json("file:///import/data.json"). File must be in $NEO4J_HOME/import/ or APOC allowlist configured.
Check APOC available: RETURN apoc.version(). APOC is included on all Aura tiers.
from langchain_community.graphs import Neo4jGraph
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from neo4j import GraphDatabase
graph = Neo4jGraph(
url="neo4j+s://xxxx.databases.neo4j.io",
username="neo4j",
password="password",
)
loader = PyPDFLoader("report.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)
embedder = OpenAIEmbeddings()
driver = GraphDatabase.driver(url, auth=("neo4j", "password"))
for i, chunk in enumerate(chunks):
emb = embedder.embed_query(chunk.page_content)
driver.execute_query(
"""
MERGE (doc:Document {id: $doc_id})
SET doc.source = $source
CREATE (c:Chunk {id: $chunk_id, text: $text, embedding: $emb, index: $idx})
CREATE (doc)-[:HAS_CHUNK]->(c)
""",
doc_id=chunk.metadata.get("source", "unknown"),
source=chunk.metadata.get("source"),
chunk_id=f"chunk-{i}",
text=chunk.page_content,
emb=emb,
idx=i,
)
For entity extraction with LangChain: use LLMGraphTransformer (from langchain_experimental.graph_transformers). Produces same :Document/:Chunk/entity pattern.
CYPHER 25
// Prevent duplicate documents
CREATE CONSTRAINT doc_id_unique IF NOT EXISTS
FOR (d:Document) REQUIRE d.id IS UNIQUE;
// Prevent duplicate chunks
CREATE CONSTRAINT chunk_id_unique IF NOT EXISTS
FOR (c:Chunk) REQUIRE c.id IS UNIQUE;
// Entity deduplication
CREATE CONSTRAINT person_name_unique IF NOT EXISTS
FOR (p:Person) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT org_name_unique IF NOT EXISTS
FOR (o:Organization) REQUIRE o.name IS UNIQUE;
// Vector index for chunk embeddings (adjust dims for your model)
CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
FOR (c:Chunk) ON c.embedding
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};
// Poll until index ONLINE:
// SHOW INDEXES YIELD name, state WHERE state <> 'ONLINE'
Do not start ingestion until all indexes are ONLINE:
SHOW INDEXES YIELD name, state WHERE state <> 'ONLINE';
If rows returned: wait, then re-run. ONLINE = safe to ingest.
| Error | Cause | Fix |
|---|---|---|
| LLM extracts node types not in schema | Schema too loose or "EXTRACTED" mode | Define explicit entities + patterns; use Option B schema |
MissingEmbedderError | embedder= omitted | Always pass embedder= even if not doing vector search — pipeline stores embeddings on Chunk nodes |
| Zero entities extracted | LLM context overflow | Reduce chunk_size; switch to model with larger context |
| Duplicate entity nodes after ingestion | Entity resolution not run | Run SinglePropertyExactMatchResolver after bulk ingest |
apoc.load.json permission denied | APOC allowlist not configured | Add URL to apoc.import.file.enabled=true and dbms.security.allow_csv_import_from_file_urls=true |
| Chunking loses sentence mid-way | approximate=False (default) cuts at exact token count | Set approximate=True in FixedSizeSplitter |
chunk_size too large → LLM timeouts | Extraction prompt + chunk exceeds context | Keep chunk_size ≤ 512 for GPT-4o extraction; ≤ 2048 absolute max |
SpaCySemanticMatchResolver fails on Python 3.14 | spaCy not supported on 3.14+ | Use FuzzyMatchResolver or downgrade to Python 3.13 |
neo4j-driver package not found | Deprecated package name since 6.0 | Use neo4j package: pip install neo4j>=6.0.0 |
chunk_size within embedding model limit (≤2048; ≤512 for extraction)chunk_overlap set to 10–15% of chunk_sizeDocument→HAS_CHUNK→Chunk pattern used (enables graph traversal in retrieval)document_metadata populated with source identifierapoc.version() confirmed if using apoc.load.json.env has API keys; .env in .gitignoreMATCH (d:Document)-[:HAS_CHUNK]->(c:Chunk) RETURN count(c)MATCH (c:Chunk)-[:MENTIONS]->(e) RETURN labels(e)[0], count(*)entities/relations/potential_schema deprecated since 1.7.1. Use schema=GraphSchema(...):
from neo4j_graphrag.experimental.components.schema import (
GraphSchema, NodeType, RelationshipType, PropertyType
)
schema = GraphSchema(
node_types=[
NodeType(label="Person", properties=[PropertyType(name="name", type="STRING")]),
NodeType(label="Organization", properties=[PropertyType(name="name", type="STRING")]),
],
relationship_types=[RelationshipType(label="WORKS_AT")],
patterns=[("Person", "WORKS_AT", "Organization")],
)
pipeline = SimpleKGPipeline(llm=llm, driver=driver, embedder=embedder, schema=schema)
schema="FREE" (no guidance) or schema="EXTRACTED" (LLM infers) — exploration only, noisier output.
Override default lexical layer labels (keep defaults unless integrating with existing graph):
from neo4j_graphrag.experimental.components.types import LexicalGraphConfig
# All fields have sensible defaults — only override what differs from your graph's conventions
config = LexicalGraphConfig(
document_node_label="Article", # default: "Document"
chunk_node_label="Passage", # default: "Chunk"
node_to_chunk_relationship_type="HAS_ENTITY", # default: "MENTIONS"
chunk_text_property="content", # default: "text"
)
pipeline = SimpleKGPipeline(..., lexical_graph_config=config)
Default file_loader auto-dispatches by extension (.pdf→PdfLoader, .md→MarkdownLoader).
Supports fsspec URIs (s3://, gcs://). Subclass DataLoader for HTML/web/custom formats:
from neo4j_graphrag.experimental.components.data_loader import DataLoader
from neo4j_graphrag.experimental.components.types import DocumentInfo, LoadedDocument
class WebPageLoader(DataLoader):
async def run(self, filepath, metadata=None):
import httpx
text = httpx.get(filepath).text # strip HTML in real impl
return LoadedDocument(text=text,
document_info=DocumentInfo(path=filepath, metadata=metadata))
pipeline = SimpleKGPipeline(..., file_loader=WebPageLoader(), from_file=True)
Chunking strategy by use-case and full resolver config: references/kg-construction.md.
Load on demand: