Search everything...

Skill

graphify

Builds knowledge graphs from folders of code, docs, papers, or images with community clustering, generating interactive HTML, GraphRAG JSON, audit reports, and Neo4j exports.

Neo4j

npx claudepluginhub adelaidasofia/ai-brain-starter

Configuration

Arguments: [subfolder path to process, e.g. Notes/ or Journals/ — for very large vaults (1000+ files) subset instead of running on the full vault]

Tool Access

This skill uses the workspace's default tool permissions.

Preview

Turn any folder of files into a navigable knowledge graph with community detection, an honest audit trail, and three outputs: interactive HTML, GraphRAG-ready JSON, and a plain-language GRAPH_REPORT.md.

Supporting Assets

LESSONS.mdOPTIMIZATIONS.mdRUNBOOK.mdreferences/INTEGRATIONS.mdreferences/QUERY_COMMANDS.mdreferences/UPDATE_MODES.mdscripts/graphify_canonicalize.pyscripts/graphify_chunk.pyscripts/graphify_prep.pyscripts/graphify_stage_finish.pyscripts/graphify_stage_select.py

SKILL.md

Similar Skills

graphify-knowledge-graph

Builds queryable knowledge graphs from codebases, docs, papers, and images via /graphify in AI coding assistants. Outputs interactive HTML viz, reports, JSON, and cache for exploration and querying.

aradotso-trending-skills-37

learn

Ingests content from Confluence, Google Docs, GitHub repos, remote URLs, or local files (DOCX, PDF, etc.) into Second Brain vault. Converts to Markdown via docling, runs graphify extraction, persists entities.

11 tools

bedrock

understand

7.9k

Analyzes codebase to produce knowledge-graph.json for interactive dashboard exploring architecture, components, and relationships

20 files

understand-anything

Stats

Stars10

Forks0

Last CommitApr 21, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

graphify | ai-brain-starter | ClaudePluginHub

Back to Skills

Skill

graphify

From ai-brain-starter

Builds knowledge graphs from folders of code, docs, papers, or images with community clustering, generating interactive HTML, GraphRAG JSON, audit reports, and Neo4j exports.

npx claudepluginhub adelaidasofia/ai-brain-starter

Configuration

Arguments: [subfolder path to process, e.g. Notes/ or Journals/ — for very large vaults (1000+ files) subset instead of running on the full vault]

Tool Access

This skill uses the workspace's default tool permissions.

Preview

Supporting Assets

SKILL.md

/graphify

⚡ Before running on a corpus larger than ~50 files, READ OPTIMIZATIONS.md. The wrapper scripts in scripts/ (dedupe, regex preflight, word-balanced chunking, label canonicalization, cache integration) typically cut LLM token cost by 80–92% and produce a higher-quality graph. The single most important step is calling graphify_canonicalize.py --cache after merging — without it, the next --update run repays the entire cost. Skip these wrappers and a 1,500-file vault will burn ~10M LLM tokens for the same graph that costs ~1M with them.

Usage

/graphify                                             # full pipeline on current directory → Obsidian vault
/graphify <path>                                      # full pipeline on specific path
/graphify <path> --mode deep                          # thorough extraction, richer INFERRED edges
/graphify <path> --update                             # incremental - re-extract only new/changed files
/graphify <path> --directed                            # build directed graph (preserves edge direction: source→target)
/graphify <path> --whisper-model medium                # use a larger Whisper model for better transcription accuracy (base|small|medium|large)
/graphify <path> --cluster-only                       # rerun clustering on existing graph
/graphify <path> --no-viz                             # skip visualization, just report + JSON
/graphify <path> --html                               # (HTML is generated by default - this flag is a no-op)
/graphify <path> --svg                                # also export graph.svg (embeds in Notion, GitHub)
/graphify <path> --graphml                            # export graph.graphml (Gephi, yEd)
/graphify <path> --neo4j                              # generate graphify-out/cypher.txt for Neo4j
/graphify <path> --neo4j-push bolt://localhost:7687   # push directly to Neo4j
/graphify <path> --mcp                                # start MCP stdio server for agent access
/graphify <path> --watch                              # watch folder, auto-rebuild on code changes (no LLM needed)
/graphify <path> --wiki                               # build agent-crawlable wiki (index.md + one article per community)
/graphify <path> --obsidian --obsidian-dir ~/vaults/my-project  # write vault to custom path (e.g. existing vault)
/graphify add <url>                                   # fetch URL, save to ./raw, update graph
/graphify add <url> --author "Name"                   # tag who wrote it
/graphify add <url> --contributor "Name"              # tag who added it to the corpus
/graphify query "<question>"                          # BFS traversal - broad context
/graphify query "<question>" --dfs                    # DFS - trace a specific path
/graphify query "<question>" --budget 1500            # cap answer at N tokens
/graphify path "AuthModule" "Database"                # shortest path between two concepts
/graphify explain "SwinTransformer"                   # plain-language explanation of a node

What graphify is for

graphify is built around Andrej Karpathy's /raw folder workflow: drop anything into a folder - papers, tweets, screenshots, code, notes - and get a structured knowledge graph that shows you what you didn't know was connected.

Three things it does that Claude alone cannot:

Persistent graph - relationships are stored in graphify-out/graph.json and survive across sessions. Ask questions weeks later without re-reading everything.
Honest audit trail - every edge is tagged EXTRACTED, INFERRED, or AMBIGUOUS. You know what was found vs invented.
Cross-document surprise - community detection finds connections between concepts in different files that you would never think to ask about directly.

Use it for:

A codebase you're new to (understand architecture before touching anything)
A reading list (papers + tweets + notes → one navigable graph)
A research corpus (citation graph + concept graph in one)
Your personal /raw folder (drop everything in, let it grow, query it)

CRITICAL: graph.json edge key

graphify.export.to_json() uses networkx's node_link_data format. Edges are stored under the key "links", NOT "edges". If you ever write a custom script that loads graph.json and reads edges directly, use:

edges = graph.get('links', graph.get('edges', []))  # 'links' first — 'edges' is never set by to_json

Failing to do this gives you 0 edges silently and will corrupt any merge.

The skill's own pipeline uses internal temp files (.graphify_*.json) that DO use "edges" — that is fine because those files are written/read by the same code. This warning only applies to loading graph.json directly.

What You Must Do When Invoked

If no path was given, use . (current directory). Do not ask the user for a path.

Follow these steps in order. Do not skip steps.

Step 1 - Ensure graphify is installed

# Detect the correct Python interpreter (handles pipx, venv, system installs)
GRAPHIFY_BIN=$(which graphify 2>/dev/null)
if [ -n "$GRAPHIFY_BIN" ]; then
    PYTHON=$(head -1 "$GRAPHIFY_BIN" | tr -d '#!')
    case "$PYTHON" in
        *[!a-zA-Z0-9/_.-]*) PYTHON="python3" ;;
    esac
else
    PYTHON="python3"
fi
mkdir -p graphify-out

# Install path with real error surfacing. On failure, STOP and tell the user —
# do not continue with a broken interpreter. Historical bug: pip errors were
# swallowed by tail -3 + silent || chaining and the skill kept going.
if ! "$PYTHON" -c "import graphify" 2>/dev/null; then
    echo "graphify not importable — attempting install..."
    INSTALL_LOG="graphify-out/.graphify_install.log"
    if ! "$PYTHON" -m pip install graphifyy -q 2>"$INSTALL_LOG"; then
        # PEP 668 environments (macOS system python, Debian 12+) need the flag
        if ! "$PYTHON" -m pip install graphifyy -q --break-system-packages 2>>"$INSTALL_LOG"; then
            echo ""
            echo "ERROR: could not install graphify. Last install output:"
            echo "---"
            tail -20 "$INSTALL_LOG"
            echo "---"
            echo ""
            echo "Common fixes:"
            echo "  • Network / corporate proxy: check your pip config"
            echo "  • Permissions: try 'pipx install graphifyy' instead"
            echo "  • Python version: graphify needs Python 3.10+"
            echo ""
            echo "Stopping. Do not re-run /graphify until the import works:"
            echo "  $PYTHON -c 'import graphify'"
            exit 2
        fi
    fi
    # Verify install actually took
    if ! "$PYTHON" -c "import graphify" 2>/dev/null; then
        echo "ERROR: graphify installed but still not importable by $PYTHON."
        echo "This usually means a venv / pyenv mismatch. Run:"
        echo "  which python3 && $PYTHON -c 'import sys; print(sys.path)'"
        exit 2
    fi
    echo "graphify installed successfully."
fi

# Write interpreter path for all subsequent steps
"$PYTHON" -c "import sys; open('graphify-out/.graphify_python', 'w').write(sys.executable)"

If the import succeeds on first try, print nothing and move straight to Step 2. If an install ran, the success line is the only output. If install failed, the skill has already exited — do not continue.

In every subsequent bash block, replace python3 with $(cat graphify-out/.graphify_python) to use the correct interpreter.

Step 2 - Detect files

$(cat graphify-out/.graphify_python) -c "
import json
from graphify.detect import detect
from pathlib import Path
result = detect(Path('INPUT_PATH'))
print(json.dumps(result))
" > graphify-out/.graphify_detect.json

Replace INPUT_PATH with the actual path the user provided. Do NOT cat or print the JSON - read it silently and present a clean summary instead:

Corpus: X files · ~Y words
  code:     N files (.py .ts .go ...)
  docs:     N files (.md .txt ...)
  papers:   N files (.pdf ...)
  images:   N files

Then act on it:

If total_files is 0: stop with "No supported files found in [path]."
If skipped_sensitive is non-empty: mention file count skipped, not the file names.
If total_words > 2,000,000 OR total_files > 200: show the warning and the top 5 subdirectories by file count, then ask which subfolder to run on. Wait for the user's answer before proceeding.
Otherwise: proceed directly to Step 3 - no need to ask anything.

Step 3 - Extract entities and relationships

Before starting: note whether --mode deep was given. You must pass DEEP_MODE=true to every subagent in Step B2 if it was. Track this from the original invocation - do not lose it.

This step has two parts: structural extraction (deterministic, free) and semantic extraction (Claude, costs tokens).

Run Part A (AST) and Part B (semantic) in parallel. Dispatch all semantic subagents AND start AST extraction in the same message. Both can run simultaneously since they operate on different file types. Merge results in Part C as before.

Note: Parallelizing AST + semantic saves 5-15s on large corpora. AST is deterministic and fast; start it while subagents are processing docs/papers.

Part A - Structural extraction for code files

For any code files detected, run AST extraction in parallel with Part B subagents:

$(cat graphify-out/.graphify_python) -c "
import sys, json
from graphify.extract import collect_files, extract
from pathlib import Path
import json

code_files = []
detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text())
for f in detect.get('files', {}).get('code', []):
    code_files.extend(collect_files(Path(f)) if Path(f).is_dir() else [Path(f)])

if code_files:
    result = extract(code_files)
    Path('graphify-out/.graphify_ast.json').write_text(json.dumps(result, indent=2))
    print(f'AST: {len(result[\"nodes\"])} nodes, {len(result[\"edges\"])} edges')
else:
    Path('graphify-out/.graphify_ast.json').write_text(json.dumps({'nodes':[],'edges':[],'input_tokens':0,'output_tokens':0}))
    print('No code files - skipping AST extraction')
"

Part B - Semantic extraction (parallel subagents)

Fast path: If detection found zero docs, papers, and images (code-only corpus), skip Part B entirely and go straight to Part C. AST handles code - there is nothing for semantic subagents to do.

MANDATORY: You MUST use the Agent tool here. Reading files yourself one-by-one is forbidden - it is 5-10x slower. If you do not use the Agent tool you are doing this wrong.

Before dispatching subagents, print a timing estimate:

Load total_words and file counts from graphify-out/.graphify_detect.json
Estimate agents needed: ceil(uncached_non_code_files / 22) (chunk size is 20-25)
Estimate time: ~45s per agent batch (they run in parallel, so total ≈ 45s × ceil(agents/parallel_limit))
Print: "Semantic extraction: ~N files → X agents, estimated ~Ys"

Step B0 - Check extraction cache first

Before dispatching any subagents, check which files already have cached extraction results:

$(cat graphify-out/.graphify_python) -c "
import json
from graphify.cache import check_semantic_cache
from pathlib import Path

detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text())
all_files = [f for files in detect['files'].values() for f in files]

cached_nodes, cached_edges, cached_hyperedges, uncached = check_semantic_cache(all_files)

if cached_nodes or cached_edges or cached_hyperedges:
    Path('graphify-out/.graphify_cached.json').write_text(json.dumps({'nodes': cached_nodes, 'edges': cached_edges, 'hyperedges': cached_hyperedges}))
Path('graphify-out/.graphify_uncached.txt').write_text('\n'.join(uncached))
print(f'Cache: {len(all_files)-len(uncached)} files hit, {len(uncached)} files need extraction')
"

Only dispatch subagents for files listed in graphify-out/.graphify_uncached.txt. If all files are cached, skip to Part C directly.

Step B1 - Split into chunks

Load files from graphify-out/.graphify_uncached.txt. Split into chunks of 20-25 files each. Each image gets its own chunk (vision needs separate context). When splitting, group files from the same directory together so related artifacts land in the same chunk and cross-file relationships are more likely to be extracted.

Step B1b - Recovery check (compaction guard)

Before dispatching, check if chunk files from a previous interrupted run already exist:

ls graphify-out/.graphify_chunk_*.json 2>/dev/null | wc -l

If files exist: print "Found N chunk files from previous run — skipping dispatch for those chunks." Skip to Step B3 to merge what's already on disk. Only re-dispatch chunks whose file is missing.

Step B2 - Dispatch ALL subagents in a single message

Call the Agent tool multiple times IN THE SAME RESPONSE - one call per chunk. This is the only way they run in parallel. If you make one Agent call, wait, then make another, you are doing it sequentially and defeating the purpose.

Concrete example for 3 chunks:

[Agent tool call 1: files 1-15]
[Agent tool call 2: files 16-30]  
[Agent tool call 3: files 31-45]

All three in one message. Not three separate messages.

Rate-limit awareness (cold-start users on low API tiers). If many subagents return errors mentioning rate_limit, 429, overloaded_error, or retry later, do NOT pretend the extraction succeeded. Surface the issue plainly:

"Several subagents failed with rate-limit errors. Your Anthropic API tier is capping concurrent requests below what this vault needs. Options:

Wait 1 minute and re-run /graphify --update — graphify caches completed chunks on disk, so only the failed ones re-dispatch.

Split the corpus: /graphify Notes/ then /graphify Journals/ in separate runs.

Raise your API tier at https://console.anthropic.com/settings/limits — Tier 2+ is enough for ~1000-file vaults."

Do not try to workaround by reading files yourself. The cache handles partial runs gracefully. The user makes the tier decision.

Each subagent receives this exact prompt (substitute FILE_LIST, CHUNK_NUM, TOTAL_CHUNKS, and DEEP_MODE):

You are a graphify extraction subagent. Read the files listed and extract a knowledge graph fragment.
Output ONLY valid JSON matching the schema below - no explanation, no markdown fences, no preamble.

Files (chunk CHUNK_NUM of TOTAL_CHUNKS):
FILE_LIST

Rules:
- EXTRACTED: relationship explicit in source (import, call, citation, "see §3.2")
- INFERRED: reasonable inference (shared data structure, implied dependency)
- AMBIGUOUS: uncertain - flag for review, do not omit

Code files: focus on semantic edges AST cannot find (call relationships, shared data, arch patterns).
  Do not re-extract imports - AST already has those.
Doc/paper files: extract named concepts, entities, citations. Also extract rationale — sections that explain WHY a decision was made, trade-offs chosen, or design intent. These become nodes with `rationale_for` edges pointing to the concept they explain.
Image files: use vision to understand what the image IS - do not just OCR.
  UI screenshot: layout patterns, design decisions, key elements, purpose.
  Chart: metric, trend/insight, data source.
  Tweet/post: claim as node, author, concepts mentioned.
  Diagram: components and connections.
  Research figure: what it demonstrates, method, result.
  Handwritten/whiteboard: ideas and arrows, mark uncertain readings AMBIGUOUS.

DEEP_MODE (if --mode deep was given): be aggressive with INFERRED edges - indirect deps,
  shared assumptions, latent couplings. Mark uncertain ones AMBIGUOUS instead of omitting.

Semantic similarity: if two concepts in this chunk solve the same problem or represent the same idea without any structural link (no import, no call, no citation), add a `semantically_similar_to` edge marked INFERRED with a confidence_score reflecting how similar they are (0.6-0.95). Examples:
- Two functions that both validate user input but never call each other
- A class in code and a concept in a paper that describe the same algorithm
- Two error types that handle the same failure mode differently
Only add these when the similarity is genuinely non-obvious and cross-cutting. Do not add them for trivially similar things.

Hyperedges: if 3 or more nodes clearly participate together in a shared concept, flow, or pattern that is not captured by pairwise edges alone, add a hyperedge to a top-level `hyperedges` array. Examples:
- All classes that implement a common protocol or interface
- All functions in an authentication flow (even if they don't all call each other)
- All concepts from a paper section that form one coherent idea
Use sparingly — only when the group relationship adds information beyond the pairwise edges. Maximum 3 hyperedges per chunk.

If a file has YAML frontmatter (--- ... ---), copy source_url, captured_at, author,
  contributor onto every node from that file.

confidence_score is REQUIRED on every edge - never omit it, never use 0.5 as a default:
- EXTRACTED edges: confidence_score = 1.0 always
- INFERRED edges: reason about each edge individually.
  Direct structural evidence (shared data structure, clear dependency): 0.8-0.9.
  Reasonable inference with some uncertainty: 0.6-0.7.
  Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.
- AMBIGUOUS edges: 0.1-0.3

Output exactly this JSON (no other text):
{"nodes":[{"id":"filestem_entityname","label":"Human Readable Name","file_type":"code|document|paper|image","source_file":"relative/path","source_location":null,"source_url":null,"captured_at":null,"author":null,"contributor":null}],"edges":[{"source":"node_id","target":"node_id","relation":"calls|implements|references|cites|conceptually_related_to|shares_data_with|semantically_similar_to|rationale_for","confidence":"EXTRACTED|INFERRED|AMBIGUOUS","confidence_score":1.0,"source_file":"relative/path","source_location":null,"weight":1.0}],"hyperedges":[{"id":"snake_case_id","label":"Human Readable Label","nodes":["node_id1","node_id2","node_id3"],"relation":"participate_in|implement|form","confidence":"EXTRACTED|INFERRED","confidence_score":0.75,"source_file":"relative/path"}],"input_tokens":0,"output_tokens":0}

**After outputting the JSON, write it to disk as your final action** — this protects against data loss if the parent session undergoes context compaction before collecting results. Use the Write tool (or Bash) to save to `graphify-out/.graphify_chunk_CHUNK_NUM.json` (replace CHUNK_NUM with the actual chunk number substituted into this prompt). If the directory does not exist, create it first.

Step B3 - Collect, cache, and merge

Wait for all subagents. Then collect results from disk (chunk files written by each subagent), not from the agent return values in memory. This is compaction-safe: if Claude compacts context between subagent completion and this step, the data is already on disk.

$(cat graphify-out/.graphify_python) -c "
import json
from pathlib import Path

chunk_files = sorted(Path('graphify-out').glob('.graphify_chunk_*.json'))
all_nodes, all_edges, all_hyperedges, in_tok, out_tok = [], [], [], 0, 0
failed = []

for f in chunk_files:
    try:
        data = json.loads(f.read_text())
        if 'nodes' not in data and 'edges' not in data:
            raise ValueError('missing nodes/edges keys')
        all_nodes.extend(data.get('nodes', []))
        all_edges.extend(data.get('edges', []))
        all_hyperedges.extend(data.get('hyperedges', []))
        in_tok += data.get('input_tokens', 0)
        out_tok += data.get('output_tokens', 0)
    except Exception as e:
        failed.append(f.name)
        print(f'Warning: skipping {f.name}: {e}')

if failed:
    print(f'Failed chunks ({len(failed)}): {failed}')
if len(failed) > len(chunk_files) // 2:
    raise SystemExit('More than half the chunks failed. Stopping.')

result = {'nodes': all_nodes, 'edges': all_edges, 'hyperedges': all_hyperedges,
          'input_tokens': in_tok, 'output_tokens': out_tok}
Path('graphify-out/.graphify_semantic_new.json').write_text(json.dumps(result))
print(f'Collected {len(all_nodes)} nodes, {len(all_edges)} edges from {len(chunk_files)} chunks ({len(failed)} failed)')
"

If more than half the chunks have no file on disk (subagent never wrote), stop and tell the user.

Save new results to cache:

$(cat graphify-out/.graphify_python) -c "
import json
from graphify.cache import save_semantic_cache
from pathlib import Path

new = json.loads(Path('graphify-out/.graphify_semantic_new.json').read_text()) if Path('graphify-out/.graphify_semantic_new.json').exists() else {'nodes':[],'edges':[],'hyperedges':[]}
# root= is REQUIRED - default Path('.') breaks if CWD is not vault root
saved = save_semantic_cache(new.get('nodes', []), new.get('edges', []), new.get('hyperedges', []), root=Path.cwd())
print(f'Cached {saved} files')
"

Merge cached + new results into graphify-out/.graphify_semantic.json:

$(cat graphify-out/.graphify_python) -c "
import json
from pathlib import Path

cached = json.loads(Path('graphify-out/.graphify_cached.json').read_text()) if Path('graphify-out/.graphify_cached.json').exists() else {'nodes':[],'edges':[],'hyperedges':[]}
new = json.loads(Path('graphify-out/.graphify_semantic_new.json').read_text()) if Path('graphify-out/.graphify_semantic_new.json').exists() else {'nodes':[],'edges':[],'hyperedges':[]}

all_nodes = cached['nodes'] + new.get('nodes', [])
all_edges = cached['edges'] + new.get('edges', [])
all_hyperedges = cached.get('hyperedges', []) + new.get('hyperedges', [])
seen = set()
deduped = []
for n in all_nodes:
    if n['id'] not in seen:
        seen.add(n['id'])
        deduped.append(n)

merged = {
    'nodes': deduped,
    'edges': all_edges,
    'hyperedges': all_hyperedges,
    'input_tokens': new.get('input_tokens', 0),
    'output_tokens': new.get('output_tokens', 0),
}
Path('graphify-out/.graphify_semantic.json').write_text(json.dumps(merged, indent=2))
print(f'Extraction complete - {len(deduped)} nodes, {len(all_edges)} edges ({len(cached[\"nodes\"])} from cache, {len(new.get(\"nodes\",[]))} new)')
"

Clean up temp files: rm -f graphify-out/.graphify_cached.json graphify-out/.graphify_uncached.txt graphify-out/.graphify_semantic_new.json graphify-out/.graphify_chunk_*.json

Part C - Merge AST + semantic into final extraction

$(cat graphify-out/.graphify_python) -c "
import sys, json
from pathlib import Path

ast = json.loads(Path('graphify-out/.graphify_ast.json').read_text())
sem = json.loads(Path('graphify-out/.graphify_semantic.json').read_text())

# Merge: AST nodes first, semantic nodes deduplicated by id
seen = {n['id'] for n in ast['nodes']}
merged_nodes = list(ast['nodes'])
for n in sem['nodes']:
    if n['id'] not in seen:
        merged_nodes.append(n)
        seen.add(n['id'])

merged_edges = ast['edges'] + sem['edges']
merged_hyperedges = sem.get('hyperedges', [])
merged = {
    'nodes': merged_nodes,
    'edges': merged_edges,
    'hyperedges': merged_hyperedges,
    'input_tokens': sem.get('input_tokens', 0),
    'output_tokens': sem.get('output_tokens', 0),
}
Path('graphify-out/.graphify_extract.json').write_text(json.dumps(merged, indent=2))
total = len(merged_nodes)
edges = len(merged_edges)
print(f'Merged: {total} nodes, {edges} edges ({len(ast[\"nodes\"])} AST + {len(sem[\"nodes\"])} semantic)')
"

Step 4 - Build graph, cluster, analyze, generate outputs

mkdir -p graphify-out
$(cat graphify-out/.graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.cluster import cluster, score_all
from graphify.analyze import god_nodes, surprising_connections, suggest_questions
from graphify.report import generate
from graphify.export import to_json
from pathlib import Path

extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
detection  = json.loads(Path('graphify-out/.graphify_detect.json').read_text())

G = build_from_json(extraction)
communities = cluster(G)
cohesion = score_all(G, communities)
tokens = {'input': extraction.get('input_tokens', 0), 'output': extraction.get('output_tokens', 0)}
gods = god_nodes(G)
surprises = surprising_connections(G, communities)
labels = {cid: 'Community ' + str(cid) for cid in communities}
# Placeholder questions - regenerated with real labels in Step 5
questions = suggest_questions(G, communities, labels)

report = generate(G, communities, cohesion, labels, gods, surprises, detection, tokens, 'INPUT_PATH', suggested_questions=questions)
Path('graphify-out/GRAPH_REPORT.md').write_text(report)
to_json(G, communities, 'graphify-out/graph.json')

analysis = {
    'communities': {str(k): v for k, v in communities.items()},
    'cohesion': {str(k): v for k, v in cohesion.items()},
    'gods': gods,
    'surprises': surprises,
    'questions': questions,
}
Path('graphify-out/.graphify_analysis.json').write_text(json.dumps(analysis, indent=2))
if G.number_of_nodes() == 0:
    print('ERROR: Graph is empty - extraction produced no nodes.')
    print('Possible causes: all files were skipped, binary-only corpus, or extraction failed.')
    raise SystemExit(1)
print(f'Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges, {len(communities)} communities')
"

If this step prints ERROR: Graph is empty, stop and tell the user what happened - do not proceed to labeling or visualization.

Replace INPUT_PATH with the actual path.

Step 5 - Label communities

Read graphify-out/.graphify_analysis.json. For each community key, look at its node labels and write a 2-5 word plain-language name (e.g. "Attention Mechanism", "Training Pipeline", "Data Loading").

Then regenerate the report and save the labels for the visualizer:

$(cat graphify-out/.graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.cluster import score_all
from graphify.analyze import god_nodes, surprising_connections, suggest_questions
from graphify.report import generate
from pathlib import Path

extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
detection  = json.loads(Path('graphify-out/.graphify_detect.json').read_text())
analysis   = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())

G = build_from_json(extraction)
communities = {int(k): v for k, v in analysis['communities'].items()}
cohesion = {int(k): v for k, v in analysis['cohesion'].items()}
tokens = {'input': extraction.get('input_tokens', 0), 'output': extraction.get('output_tokens', 0)}

# LABELS - replace these with the names you chose above
labels = LABELS_DICT

# Regenerate questions with real community labels (labels affect question phrasing)
questions = suggest_questions(G, communities, labels)

report = generate(G, communities, cohesion, labels, analysis['gods'], analysis['surprises'], detection, tokens, 'INPUT_PATH', suggested_questions=questions)
Path('graphify-out/GRAPH_REPORT.md').write_text(report)
Path('graphify-out/.graphify_labels.json').write_text(json.dumps({str(k): v for k, v in labels.items()}))
print('Report updated with community labels')
"

Replace LABELS_DICT with the actual dict you constructed (e.g. {0: "Attention Mechanism", 1: "Training Pipeline"}). Replace INPUT_PATH with the actual path.

Step 6 - Generate Obsidian vault (opt-in) + HTML

Generate HTML always (unless --no-viz). Obsidian vault only if --obsidian was explicitly given — skip it otherwise, it generates one file per node.

If --obsidian was given:

If --obsidian-dir <path> was also given, use that path as the vault directory. Otherwise default to graphify-out/obsidian.

$(cat graphify-out/.graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.export import to_obsidian, to_canvas
from pathlib import Path

extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
analysis   = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())
labels_raw = json.loads(Path('graphify-out/.graphify_labels.json').read_text()) if Path('graphify-out/.graphify_labels.json').exists() else {}

G = build_from_json(extraction)
communities = {int(k): v for k, v in analysis['communities'].items()}
cohesion = {int(k): v for k, v in analysis['cohesion'].items()}
labels = {int(k): v for k, v in labels_raw.items()}

obsidian_dir = 'OBSIDIAN_DIR'  # replace with --obsidian-dir value, or 'graphify-out/obsidian' if not given

n = to_obsidian(G, communities, obsidian_dir, community_labels=labels or None, cohesion=cohesion)
print(f'Obsidian vault: {n} notes in {obsidian_dir}/')

to_canvas(G, communities, f'{obsidian_dir}/graph.canvas', community_labels=labels or None)
print(f'Canvas: {obsidian_dir}/graph.canvas - open in Obsidian for structured community layout')
print()
print(f'Open {obsidian_dir}/ as a vault in Obsidian.')
print('  Graph view   - nodes colored by community (set automatically)')
print('  graph.canvas - structured layout with communities as groups')
print('  _COMMUNITY_* - overview notes with cohesion scores and dataview queries')
"

Generate the HTML graph (always, unless --no-viz):

$(cat graphify-out/.graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.export import to_html
from pathlib import Path

extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
analysis   = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())
labels_raw = json.loads(Path('graphify-out/.graphify_labels.json').read_text()) if Path('graphify-out/.graphify_labels.json').exists() else {}

G = build_from_json(extraction)
communities = {int(k): v for k, v in analysis['communities'].items()}
labels = {int(k): v for k, v in labels_raw.items()}

if G.number_of_nodes() > 5000:
    print(f'Graph has {G.number_of_nodes()} nodes - too large for HTML viz. Use Obsidian vault instead.')
else:
    to_html(G, communities, 'graphify-out/graph.html', community_labels=labels or None)
    print('graph.html written - open in any browser, no server needed')
"

Step 7 - Neo4j export (only if --neo4j or --neo4j-push flag)

If --neo4j - generate a Cypher file for manual import:

$(cat graphify-out/.graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.export import to_cypher
from pathlib import Path

G = build_from_json(json.loads(Path('graphify-out/.graphify_extract.json').read_text()))
to_cypher(G, 'graphify-out/cypher.txt')
print('cypher.txt written - import with: cypher-shell < graphify-out/cypher.txt')
"

If --neo4j-push <uri> - push directly to a running Neo4j instance. Ask the user for credentials if not provided:

$(cat graphify-out/.graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.cluster import cluster
from graphify.export import push_to_neo4j
from pathlib import Path

extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
analysis   = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())
G = build_from_json(extraction)
communities = {int(k): v for k, v in analysis['communities'].items()}

result = push_to_neo4j(G, uri='NEO4J_URI', user='NEO4J_USER', password='NEO4J_PASSWORD', communities=communities)
print(f'Pushed to Neo4j: {result[\"nodes\"]} nodes, {result[\"edges\"]} edges')
"

Replace NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD with actual values. Default URI is bolt://localhost:7687, default user is neo4j. Uses MERGE - safe to re-run without creating duplicates.

Step 7b - SVG export (only if --svg flag)

$(cat graphify-out/.graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.export import to_svg
from pathlib import Path

extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
analysis   = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())
labels_raw = json.loads(Path('graphify-out/.graphify_labels.json').read_text()) if Path('graphify-out/.graphify_labels.json').exists() else {}

G = build_from_json(extraction)
communities = {int(k): v for k, v in analysis['communities'].items()}
labels = {int(k): v for k, v in labels_raw.items()}

to_svg(G, communities, 'graphify-out/graph.svg', community_labels=labels or None)
print('graph.svg written - embeds in Obsidian, Notion, GitHub READMEs')
"

Step 7c - GraphML export (only if --graphml flag)

$(cat graphify-out/.graphify_python) -c "
import json
from graphify.build import build_from_json
from graphify.export import to_graphml
from pathlib import Path

extraction = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
analysis   = json.loads(Path('graphify-out/.graphify_analysis.json').read_text())

G = build_from_json(extraction)
communities = {int(k): v for k, v in analysis['communities'].items()}

to_graphml(G, communities, 'graphify-out/graph.graphml')
print('graph.graphml written - open in Gephi, yEd, or any GraphML tool')
"

Step 7d - MCP server (only if --mcp flag)

python3 -m graphify.serve graphify-out/graph.json

This starts a stdio MCP server that exposes tools: query_graph, get_node, get_neighbors, get_community, god_nodes, graph_stats, shortest_path. Add to Claude Desktop or any MCP-compatible agent orchestrator so other agents can query the graph live.

To configure in Claude Desktop, add to claude_desktop_config.json:

{
  "mcpServers": {
    "graphify": {
      "command": "python3",
      "args": ["-m", "graphify.serve", "/absolute/path/to/graphify-out/graph.json"]
    }
  }
}

Step 8 - Token reduction benchmark (only if total_words > 5000)

If total_words from graphify-out/.graphify_detect.json is greater than 5,000, run:

$(cat graphify-out/.graphify_python) -c "
import json
from graphify.benchmark import run_benchmark, print_benchmark
from pathlib import Path

detection = json.loads(Path('graphify-out/.graphify_detect.json').read_text())
result = run_benchmark('graphify-out/graph.json', corpus_words=detection['total_words'])
print_benchmark(result)
"

Print the output directly in chat. If total_words <= 5000, skip silently - the graph value is structural clarity, not token compression, for small corpora.

Step 9 - Save manifest, update cost tracker, clean up, and report

$(cat graphify-out/.graphify_python) -c "
import json
from pathlib import Path
from datetime import datetime, timezone
from graphify.detect import save_manifest

# Save manifest for --update
detect = json.loads(Path('graphify-out/.graphify_detect.json').read_text())
save_manifest(detect['files'])

# Update cumulative cost tracker
extract = json.loads(Path('graphify-out/.graphify_extract.json').read_text())
input_tok = extract.get('input_tokens', 0)
output_tok = extract.get('output_tokens', 0)

cost_path = Path('graphify-out/cost.json')
if cost_path.exists():
    cost = json.loads(cost_path.read_text())
else:
    cost = {'runs': [], 'total_input_tokens': 0, 'total_output_tokens': 0}

cost['runs'].append({
    'date': datetime.now(timezone.utc).isoformat(),
    'input_tokens': input_tok,
    'output_tokens': output_tok,
    'files': detect.get('total_files', 0),
})
cost['total_input_tokens'] += input_tok
cost['total_output_tokens'] += output_tok
cost_path.write_text(json.dumps(cost, indent=2))

print(f'This run: {input_tok:,} input tokens, {output_tok:,} output tokens')
print(f'All time: {cost[\"total_input_tokens\"]:,} input, {cost[\"total_output_tokens\"]:,} output ({len(cost[\"runs\"])} runs)')
"
rm -f graphify-out/.graphify_detect.json graphify-out/.graphify_extract.json graphify-out/.graphify_ast.json graphify-out/.graphify_semantic.json graphify-out/.graphify_analysis.json graphify-out/.graphify_labels.json
rm -f graphify-out/.needs_update 2>/dev/null || true

Tell the user (omit the obsidian line unless --obsidian was given):

Graph complete. Outputs in PATH_TO_DIR/graphify-out/

  graph.html            - interactive graph, open in browser
  GRAPH_REPORT.md       - audit report
  graph.json            - raw graph data
  obsidian/             - Obsidian vault (only if --obsidian was given)

Replace PATH_TO_DIR with the actual absolute path of the directory that was processed.

Then paste these sections from GRAPH_REPORT.md directly into the chat:

God Nodes
Surprising Connections
Suggested Questions

Do NOT paste the full report - just those three sections. Keep it concise.

Then immediately offer to explore. Pick the single most interesting suggested question from the report - the one that crosses the most community boundaries or has the most surprising bridge node - and ask:

"The most interesting question this graph can answer: [question]. Want me to trace it?"

If the user says yes, run /graphify query "[question]" on the graph and walk them through the answer using the graph structure - which nodes connect, which community boundaries get crossed, what the path reveals. Keep going as long as they want to explore. Each answer should end with a natural follow-up ("this connects to X - want to go deeper?") so the session feels like navigation, not a one-shot report.

The graph is the map. Your job after the pipeline is to be the guide.

Alternative Modes

Incremental update (--update): See UPDATE_MODES.md
Re-cluster only (--cluster-only): See UPDATE_MODES.md

Query Commands

Query (/graphify query): See QUERY_COMMANDS.md
Path (/graphify path): See QUERY_COMMANDS.md
Explain (/graphify explain): See QUERY_COMMANDS.md

Integrations

Add URL (/graphify add): See INTEGRATIONS.md
Watch mode (--watch): See INTEGRATIONS.md
Git hook: See INTEGRATIONS.md
CLAUDE.md integration: See INTEGRATIONS.md

Honesty Rules

Never invent an edge. If unsure, use AMBIGUOUS.
Never skip the corpus check warning.
Always show token cost in the report.
Never hide cohesion scores behind symbols - show the raw number.
Never run HTML viz on a graph with more than 5,000 nodes without warning the user.