Creates and manages Neo4j vector indexes for ANN/kNN similarity search on node and relationship embeddings, using the SEARCH clause (2026.01+) or db.index.vector.queryNodes() (2025.x); configures HNSW and quantization; batch-updates embeddings.
npx claudepluginhub neo4j-contrib/neo4j-skills

This skill covers:
- Creating a vector index (`CREATE VECTOR INDEX`) on nodes or relationships
- Querying with the SEARCH clause (2026.01+) or db.index.vector.queryNodes() (2025.x)
- Batch-updating embeddings on existing nodes and relationships

Delegate to sibling skills:
- GraphRAG retrieval pipelines (neo4j-graphrag Python package: retriever selection, retrieval_query fragments, LLM/embedder wiring, LangChain/LlamaIndex integration) → neo4j-graphrag-skill
- Full-text search (db.index.fulltext.queryNodes) → neo4j-cypher-skill
- Graph algorithms → neo4j-gds-skill
- General Cypher authoring → neo4j-cypher-skill

Check the server version first — it drives syntax choice:
CALL dbms.components() YIELD versions RETURN versions[0] AS neo4j_version
| Version | Use |
|---|---|
| 2026.01 or higher | SEARCH clause (in-index filtering, preferred) |
| 2025.x | db.index.vector.queryNodes() procedure (deprecated 2026.04 — switch to SEARCH once on 2026.x) |
Node index (single label):
CYPHER 25
CREATE VECTOR INDEX chunk_embedding IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
OPTIONS {
indexConfig: {
`vector.dimensions`: 1536,
`vector.similarity_function`: 'cosine',
`vector.quantization.enabled`: true,
`vector.hnsw.m`: 16,
`vector.hnsw.ef_construction`: 100
}
}
Node index with filterable properties [2026.01+] — WITH declares which properties can be used in SEARCH ... WHERE:
CYPHER 25
CREATE VECTOR INDEX chunk_embedding IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
WITH [c.source, c.lang, c.published_year] // stored as metadata; filterable in SEARCH WHERE
OPTIONS { indexConfig: { `vector.dimensions`: 1536, `vector.similarity_function`: 'cosine' } }
Multi-label index with filterable properties [2026.01+]:
CYPHER 25
CREATE VECTOR INDEX doc_embedding IF NOT EXISTS
FOR (n:Document|Article) ON (n.embedding)
WITH [n.author, n.published_year, n.lang]
OPTIONS { indexConfig: { `vector.dimensions`: 1536, `vector.similarity_function`: 'cosine' } }
Relationship index:
CYPHER 25
CREATE VECTOR INDEX rel_embedding IF NOT EXISTS
FOR ()-[r:HAS_CHUNK]-() ON (r.embedding)
OPTIONS { indexConfig: { `vector.dimensions`: 768, `vector.similarity_function`: 'cosine' } }
WITH property types — only scalar types allowed: INTEGER, FLOAT, STRING, BOOLEAN, DATE, ZONED DATETIME, LOCAL DATETIME, ZONED TIME, LOCAL TIME, DURATION. Not allowed: LIST, POINT, or the vector property itself.
Index config reference:
| Parameter | Type | Default | Notes |
|---|---|---|---|
| vector.dimensions | INTEGER 1–4096 | none | Required; must match embedding model exactly |
| vector.similarity_function | STRING | 'cosine' | 'cosine' or 'euclidean' |
| vector.quantization.enabled | BOOLEAN | true | Reduces storage; slight accuracy tradeoff; needs vector-2.0 provider (5.18+) |
| vector.hnsw.m | INTEGER 1–512 | 16 | HNSW graph connections per node; higher = better recall, more memory |
| vector.hnsw.ef_construction | INTEGER 1–3200 | 100 | Build-time candidate list size; higher = better recall, slower build |
Similarity function choice:
| Use case | Function |
|---|---|
| Normalized embeddings (OpenAI, Cohere, Voyage, Google) | 'cosine' |
| Unnormalized / raw distance matters | 'euclidean' |
Index builds asynchronously — do NOT query until ONLINE:
SHOW VECTOR INDEXES YIELD name, state, populationPercent
WHERE name = 'chunk_embedding'
RETURN name, state, populationPercent
Poll every 5s until state = 'ONLINE' and populationPercent = 100.0. If state = 'FAILED' → stop, check logs.
Shell poll (cypher-shell):
until cypher-shell -u neo4j -p "$NEO4J_PASSWORD" \
"SHOW VECTOR INDEXES YIELD name, state WHERE name='chunk_embedding' RETURN state" \
| grep -q ONLINE; do
sleep 5
done
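Alternatively, block from Cypher instead of polling — a sketch assuming the db.awaitIndex procedure is available on your deployment (present in Neo4j 5.x; verify on 2025.x/2026.x):
CYPHER 25
// Blocks until the index reaches ONLINE or the timeout (in seconds) expires
CALL db.awaitIndex('chunk_embedding', 300)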
Batch UNWIND pattern (use for > 100 nodes — never one node per transaction):
from neo4j import GraphDatabase
from openai import OpenAI

# uri / user / password come from your environment or config
driver = GraphDatabase.driver(uri, auth=(user, password))
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_batch(texts: list[str]) -> list[list[float]]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return [r.embedding for r in response.data]

def store_embeddings(records: list[dict], batch_size: int = 500):
    expected_dim = 1536  # must match vector.dimensions in the index
    texts = [r["text"] for r in records]
    embeddings = embed_batch(texts)
    for emb in embeddings:
        assert len(emb) == expected_dim, f"Dim mismatch: {len(emb)} != {expected_dim}"
    rows = [{"id": r["id"], "embedding": emb}
            for r, emb in zip(records, embeddings)]
    # One UNWIND per batch: a single round-trip sets batch_size embeddings
    for i in range(0, len(rows), batch_size):
        driver.execute_query(
            "UNWIND $rows AS row "
            "MATCH (c:Chunk {id: row.id}) "
            "SET c.embedding = row.embedding",
            rows=rows[i:i + batch_size],
        )
❌ Never create the index after embeddings are already stored — always create the index first. ✅ Create index → poll until ONLINE → ingest embeddings.

Basic SEARCH query [2026.01+]:
CYPHER 25
MATCH (c:Chunk)
SEARCH c IN (
VECTOR INDEX chunk_embedding
FOR $queryEmbedding
LIMIT 10
) SCORE AS score
RETURN c.text, score
ORDER BY score DESC
With in-index filter [2026.01+] — properties must be declared in WITH at index creation:
// Index must have been created with: WITH [c.source, c.lang, c.published_year]
CYPHER 25
MATCH (c:Chunk)
SEARCH c IN (
VECTOR INDEX chunk_embedding
FOR $queryEmbedding
WHERE c.source = $source AND c.lang = 'en' AND c.published_year >= 2024
LIMIT 10
) SCORE AS score
RETURN c.text, c.source, score
ORDER BY score DESC
Filtering strategy — choose one:
| Strategy | When to use | Tradeoff |
|---|---|---|
| In-index WHERE [2026.01+] | Filters on pre-declared WITH properties; known at index design time | Fast, consistent latency; properties must be declared upfront |
| Post-filter (MATCH + procedure) | Arbitrary Cypher predicates, graph traversal, OR/NOT | Full flexibility; may over-fetch then discard |
| Pre-filter (MATCH first, then SEARCH; sketch below) | Small known candidate set; exact nearest-neighbor within subset | Deterministic; slow on large candidate sets |
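A minimal pre-filter sketch using the ad-hoc similarity function documented further down — exact, index-free ranking over a small candidate set (assumes the Chunk/embedding naming and $queryEmbedding parameter used throughout):
CYPHER 25
// Pre-filter: narrow candidates first, then rank exactly — no index involved
MATCH (c:Chunk {source: $source})
WHERE c.embedding IS NOT NULL
RETURN c.text,
       vector.similarity.cosine(c.embedding, $queryEmbedding) AS score
ORDER BY score DESC
LIMIT 10
This trades index speed for exactness — per the table's tradeoff column, keep the candidate set small.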
In-index WHERE hard limits [2026.01+]:
- Filter properties must be declared in WITH [...] at index creation — undeclared properties silently fall back to post-filtering
- Predicate property types: INTEGER, FLOAT, STRING, BOOLEAN, and temporal types — not VECTOR/LIST/POINT
- AND-only predicates — move OR/NOT to an outer WHERE

Procedure fallback with post-filter (2025.x):
CYPHER 25
CALL db.index.vector.queryNodes('chunk_embedding', 50, $queryEmbedding)
YIELD node AS c, score
WHERE c.source = $source // post-filter: fetch more, then filter
RETURN c.text, score
ORDER BY score DESC LIMIT 10
Relationship index procedure:
CYPHER 25
CALL db.index.vector.queryRelationships('rel_embedding', 5, $queryEmbedding)
YIELD relationship AS r, score
RETURN r.text, score
SEARCH clause hard limits (all versions):
- Index name must be a literal string ($indexName parameter not allowed)

Vector search as entry point, then graph hop:
CYPHER 25
MATCH (c:Chunk)
SEARCH c IN (
VECTOR INDEX chunk_embedding
FOR $queryEmbedding
LIMIT 10
) SCORE AS score
MATCH (c)<-[:HAS_CHUNK]-(a:Article)
OPTIONAL MATCH (a)-[:MENTIONS]->(org:Organization)
RETURN c.text, a.title, score, collect(DISTINCT org.name) AS organizations
ORDER BY score DESC
For full retrieval_query pipelines, HybridCypherRetriever, or neo4j-graphrag library → delegate to neo4j-graphrag-skill.
Embedding model reference:
| Provider / Model | Dimensions | Similarity | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | cosine | Default; reducible to 256–1536 via dimensions= param |
| OpenAI text-embedding-3-large | 3072 | cosine | Reducible to 256–3072 |
| OpenAI text-embedding-ada-002 | 1536 | cosine | Legacy; prefer 3-small |
| Cohere embed-v3 (English) | 1024 | cosine | Use input_type='search_document' at ingest, 'search_query' at query |
| Voyage voyage-3-large | 1024 | cosine | High quality; needs voyage-ai package |
| Google text-embedding-004 | 768 | cosine | Via Vertex AI |
| Ollama nomic-embed-text | 768 | cosine | Local dev/testing |
| Ollama mxbai-embed-large | 1024 | cosine | Local; production-quality |
vector.dimensions must exactly match model output — no auto-truncation.
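A quick audit before querying — a sketch that groups stored embeddings by dimension so mismatches and missing values surface early (size() works on LIST-typed embedding properties; for typed VECTOR properties use vector_dimension_count(), shown below):
CYPHER 25
// Anything other than a single row with dims = 1536 means trouble at query time
MATCH (c:Chunk)
RETURN size(c.embedding) AS dims, count(*) AS nodes
ORDER BY nodes DESC
Rows with dims = null are nodes whose embedding was never set.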
Ad-hoc similarity (not for kNN search — use index for that):
MATCH (a:Chunk {id: $id1}), (b:Chunk {id: $id2})
RETURN vector.similarity.cosine(a.embedding, b.embedding) AS sim
// vector.similarity.euclidean(a, b) — same signature, 0–1 range
// vector_distance (2025.10+) — metrics: EUCLIDEAN, EUCLIDEAN_SQUARED, MANHATTAN, COSINE, DOT, HAMMING
// Returns distance (lower = more similar, inverse of similarity)
RETURN vector_distance(a.embedding, b.embedding, 'COSINE') AS dist
// vector_dimension_count (2025.10+)
RETURN vector_dimension_count(n.embedding) AS dims
// vector_norm (2025.20+) — metrics: EUCLIDEAN, MANHATTAN
RETURN vector_norm(n.embedding, 'EUCLIDEAN') AS norm
Convert LIST to typed VECTOR:
// vector(value, dimension, coordinateType)
// coordinateType: FLOAT64, FLOAT32, INTEGER8/16/32/64
WITH vector([1.0, 2.0, 3.0], 3, 'FLOAT32') AS v
RETURN vector_dimension_count(v)
// Show all vector indexes with config
SHOW VECTOR INDEXES YIELD name, state, populationPercent,
labelsOrTypes, properties, indexConfig
RETURN name, state, populationPercent, labelsOrTypes, properties, indexConfig;
// Drop (node data unchanged — only index structure removed)
DROP INDEX chunk_embedding IF EXISTS;
// No ALTER VECTOR INDEX — to change dimensions or similarity function:
// 1. DROP INDEX old_index IF EXISTS
// 2. CREATE VECTOR INDEX new_index ... with new OPTIONS
// 3. Re-generate all embeddings with new model
// 4. Poll until ONLINE
| Error | Cause | Fix |
|---|---|---|
| IllegalArgumentException: Index dimension mismatch | Stored embedding dim ≠ vector.dimensions | Fix embedding generation; drop + recreate index with correct dim |
| Search returns incomplete results | Index still POPULATING | Poll until state = 'ONLINE' |
| Unknown procedure db.index.vector.queryNodes | Neo4j < 5.11 | No vector index support below 5.11; upgrade |
| SEARCH clause not available | Neo4j < 2026.01 | Use queryNodes() procedure |
| OR/NOT not allowed in SEARCH WHERE | SEARCH in-index filter restriction | Move complex predicates to outer WHERE after SEARCH |
| Zero results from correct query | Wrong similarity function or all-zeros embedding | Verify with vector.similarity.cosine(); check embed call succeeded (diagnostic below) |
| Score always 1.0 | All-zeros or identical vectors | Embedding generation failed; add dimension assertion before ingest |
| vector.quantization.enabled option rejected | Provider vector-1.0 (Neo4j < 5.18) | Omit quantization option or upgrade to 5.18+ |
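A diagnostic sketch for the zero-results / score-1.0 rows — counts missing embeddings and flags all-zero vectors (assumes the Chunk naming above; vector_norm needs 2025.20+):
CYPHER 25
// Missing embeddings cause zero results; zero-norm vectors cause degenerate scores
MATCH (c:Chunk)
RETURN count(*) AS total,
       count(c.embedding) AS with_embedding,
       sum(CASE WHEN c.embedding IS NOT NULL
                 AND vector_norm(c.embedding, 'EUCLIDEAN') = 0.0
            THEN 1 ELSE 0 END) AS zero_vectors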
Checklist:
- vector.dimensions matches embedding model output exactly
- Similarity function matches the model ('cosine' for normalized, 'euclidean' for distance-based)
- state = 'ONLINE' before first query
- SEARCH clause on Neo4j >= 2026.01 (preferred); procedure fallback only on 2025.x (deprecated 2026.04)
- In-index WHERE uses AND-only predicates on scalar types

Generate embeddings at query time without external Python code. Use ai.text.embed() — the current API since [2025.12]:
// Syntax (requires CYPHER 25)
CYPHER 25
// ai.text.embed(resource :: STRING, provider :: STRING, configuration :: MAP) :: VECTOR
Provider strings are lowercase ('openai', 'vertexai', 'bedrock-titan', 'azure-openai'). Full provider config → neo4j-genai-plugin-skill.
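To see which providers your server actually exposes, a quick check using ai.text.embed.providers() — the [2025.12] replacement for the deprecated listing procedure (see the deprecation list further down):
CYPHER 25
// Lists the embedding providers the GenAI plugin registers on this server
CALL ai.text.embed.providers()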
Full query pattern — embed at query time, search immediately (procedure fallback for 2025.x):
CYPHER 25
WITH ai.text.embed(
"What are good open source projects",
"openai",
{ token: $openaiKey, model: 'text-embedding-3-small' }) AS userEmbedding
CALL db.index.vector.queryNodes('chunk_embedding', 6, userEmbedding) // deprecated 2026.04
YIELD node AS c, score
RETURN c.text, score
ORDER BY score DESC
With SEARCH clause (2026.01+):
CYPHER 25
WITH ai.text.embed("my query", "openai", { token: $openaiKey, model: 'text-embedding-3-small' }) AS userEmbedding
MATCH (c:Chunk)
SEARCH c IN (VECTOR INDEX chunk_embedding FOR userEmbedding LIMIT 6) SCORE AS score
RETURN c.text, score
ORDER BY score DESC
❌ Never pass API key as literal string in production — use $param or apoc.static.get().
✅ Use $openaiKey parameter; inject via driver params dict.
Rule: Use the same model at ingest time and query time — embeddings from different models are not comparable.
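One way to enforce this rule (the gotcha table below suggests storing the model name as node metadata) — a sketch with a hypothetical embedding_model property:
CYPHER 25
// At ingest: stamp each node with the model that produced its vector
// (embedding_model is a hypothetical property name — pick your own convention)
UNWIND $rows AS row
MATCH (c:Chunk {id: row.id})
SET c.embedding = row.embedding,
    c.embedding_model = 'text-embedding-3-small'
At query time, check c.embedding_model against the model used for the query embedding before trusting scores.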
Deprecated (still works, but do not use in new code):
- genai.vector.encode() → use ai.text.embed() [2025.12]
- genai.vector.encodeBatch() → use CALL ai.text.embedBatch() [2025.12]
- genai.vector.listEncodingProviders() → use CALL ai.text.embed.providers() [2025.12]

For the full ai.text.* reference (completion, structured output, chat, tokenization) → neo4j-genai-plugin-skill.
Set vector property via Cypher (e.g. during LOAD CSV or MERGE pipeline):
LOAD CSV WITH HEADERS FROM 'https://example.com/data.csv' AS row
MERGE (q:Question {text: row.question})
WITH q, row
CALL db.create.setNodeVectorProperty(q, 'embedding', apoc.convert.fromJsonList(row.question_embedding))
Use when embedding is already in CSV/JSON form as a string — apoc.convert.fromJsonList() converts "[0.1,0.2,...]" to LIST<FLOAT>.
For Python-generated embeddings, use the Python UNWIND batch pattern (Step 3) instead.
Existing table (Step 1) gives the basic rule. Additional guidance from course patterns:
Choose based on training loss function:
- Normalized-output / cosine-trained models → 'cosine'
- Distance-trained models where raw magnitudes matter → 'euclidean'
- If unsure → 'cosine' (all major hosted APIs use it)

Common pitfall — wrong similarity function:
❌ Created index with 'euclidean' but model outputs L2-normalized vectors
→ scores are mathematically correct but rankings differ from expected cosine order
→ no error thrown; wrong results silently returned
✅ Verify: run vector.similarity.cosine(a.embedding, b.embedding) manually on known
similar pairs — score should be > 0.9 for near-duplicate text
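A complementary normalization check — if stored vectors are L2-normalized, their Euclidean norm is ≈ 1.0 and 'cosine' is the right function (sketch assuming vector_norm, 2025.20+, from the function reference above):
CYPHER 25
// Sample a few norms; ~1.0 everywhere means the model L2-normalizes its output
MATCH (c:Chunk) WHERE c.embedding IS NOT NULL
WITH c LIMIT 5
RETURN c.id AS id, vector_norm(c.embedding, 'EUCLIDEAN') AS l2_norm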
Sanity check query after index creation:
MATCH (c:Chunk) WITH c LIMIT 2
WITH collect(c) AS nodes
RETURN vector.similarity.cosine(nodes[0].embedding, nodes[1].embedding) AS cosine_check,
vector.similarity.euclidean(nodes[0].embedding, nodes[1].embedding) AS euclidean_check
If both return null → embeddings not set. If cosine returns 1.0 → identical vectors (embed call failed).
| Gotcha | Detail | Fix |
|---|---|---|
| Index not ONLINE at ingest time | Inserting nodes before index exists is valid — index auto-populates. But querying during POPULATING returns partial results | Always poll state = 'ONLINE' before first query |
| Wrong dimensions — silent failure | Stored vector dim ≠ vector.dimensions → IllegalArgumentException at query time, not at ingest time | Assert len(emb) == expected_dim before every SET c.embedding |
| Different models at ingest vs query | No error; cosine scores ~0.3–0.5 for clearly similar text | Use same model string/version for both; store model name as node metadata |
| Missing model at query | ai.text.embed returns null silently if provider config wrong | Test encode call standalone; check CYPHER 25 RETURN ai.text.embed(...) before embedding into pipeline |
| Large single-transaction ingest | One transaction for 10k nodes → OOM or timeout | Use UNWIND $rows ... CALL IN TRANSACTIONS OF 500 ROWS (sketch below) or the Python batch loop |
| Chunk overlap not set | Adjacent chunks with no overlap → context at boundaries lost → poor recall for cross-paragraph queries | Set chunk_overlap ≥ 10% of chunk_size |
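A minimal Cypher-side version of that batching fix. CALL ... IN TRANSACTIONS must run in an implicit (auto-commit) transaction — e.g. cypher-shell or a driver auto-commit session, not inside an explicit transaction:
CYPHER 25
// Commits every 500 rows instead of holding one huge transaction open
UNWIND $rows AS row
CALL (row) {
  MATCH (c:Chunk {id: row.id})
  SET c.embedding = row.embedding
} IN TRANSACTIONS OF 500 ROWS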