npx claudepluginhub nicholasglazer/gnosis-mcp
Evaluates RAG pipeline retrieval (Recall@k, Precision@k, MRR, NDCG@k) and generation (faithfulness, relevance) quality separately, and guides error analysis, synthetic QA/adversarial dataset building, and chunking optimization. For AI features that use retrieval: search, knowledge bases, or document QA.
Every RAG system has a distribution where it breaks. The only way to know whether gnosis-mcp's defaults are right for your corpus is to measure them against your queries. This skill runs that measurement.
/gnosis:tune # Quick sweep (5 chunk sizes, keyword mode)
/gnosis:tune full # Extended sweep (keyword + hybrid + rerank on/off)
/gnosis:tune --golden ./q.jsonl # Use a specific golden-query file
You need:
A corpus — anywhere on disk. Will be passed to gnosis-mcp ingest.
A golden-query file — one JSON object per line:
{"query": "how does our auth work", "expected_paths": ["docs/auth", "architecture/auth"]}
{"query": "stripe webhook failure runbook", "expected_paths": ["runbooks/stripe"]}
expected_paths is matched as a case-insensitive substring of the returned file_path, which is generous enough that you don't need to memorize exact paths.
20 hand-written queries is enough to get signal. 50 is plenty.
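The matching rule can be sketched in a few lines (the real scoring lives in tests/bench/bench_real_corpus.py; this is only an illustration of the substring semantics described above):

```python
# Illustration only: a returned file_path counts as a hit when any
# expected path appears in it as a case-insensitive substring.
def is_hit(file_path, expected_paths):
    lowered = file_path.lower()
    return any(exp.lower() in lowered for exp in expected_paths)

print(is_hit("Docs/Auth/overview.md", ["docs/auth", "architecture/auth"]))  # True
print(is_hit("runbooks/deploy.md", ["runbooks/stripe"]))                    # False
```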
Runs gnosis-mcp ingest five times with different chunk sizes, scoring each against your golden set. Keyword mode only (fast, with no embedding cost).
# Default path if none given: ./docs + ./golden.jsonl
CORPUS=${CORPUS:-./docs}
GOLDEN=${GOLDEN:-./golden.jsonl}
for size in 1000 1500 2000 2500 3000; do
uv run --with 'gnosis-mcp[embeddings] @ .' \
python tests/bench/bench_real_corpus.py \
--corpus "$CORPUS" --golden "$GOLDEN" \
--modes keyword --chunk-size $size \
--out bench-results/tune-chunk${size}.json
done
(If gnosis-mcp was installed via pip rather than from source, run bench_real_corpus.py from the repo you cloned and drop the uv run --with '... @ .' prefix.)
Expected runtime: 5–10 minutes per size on a typical laptop (ingest is the dominant cost; scoring itself is sub-second).
Output table:
chunk chars │ nDCG@10 │ MRR │ Hit@5 │ p95 │ ingest
────────────┼─────────┼────────┼────────┼──────────┼────────
1000 │ 0.8557 │ 0.8067 │ 0.92 │ 30 ms │ 592 s
1500 │ 0.8529 │ 0.7967 │ 0.92 │ 7 ms │ 234 s
2000 │ 0.8702 │ 0.7933 │ 0.92 │ 7 ms │ 210 s ← peak
2500 │ 0.8602 │ 0.7880 │ 0.92 │ 7 ms │ 195 s
3000 │ 0.8459 │ 0.7880 │ 0.92 │ 7 ms │ 182 s
Pick the chunk size with the best nDCG@10 that doesn't blow your ingest-time budget. Persist it:
# In your shell profile / .env
export GNOSIS_MCP_CHUNK_SIZE=2000
If multiple sizes tie at the top (plateau), pick the larger one — same quality, fewer chunks, faster ingest.
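If you'd rather not eyeball the table, a short script can rank the sweep output files. Assumption: each tune-chunk*.json exposes a top-level numeric "ndcg@10" key; inspect one file first, since the real key name in bench_real_corpus.py's output may differ.

```python
import glob
import json

# ASSUMPTION: each result file has a top-level numeric "ndcg@10" field;
# check one file first, since the real key name may differ.
def rank_results(pattern="bench-results/tune-chunk*.json"):
    scores = []
    for path in glob.glob(pattern):
        with open(path) as f:
            scores.append((json.load(f)["ndcg@10"], path))
    return sorted(scores, reverse=True)  # best nDCG@10 first
```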
/gnosis:tune full adds hybrid mode and an optional reranker A/B test (~30-40 minutes, depending on corpus size and whether [reranking] is installed).
for size in 1500 2000 3000; do
for mode in keyword hybrid hybrid+rerank; do
uv run --with 'gnosis-mcp[embeddings,reranking] @ .' \
python tests/bench/bench_real_corpus.py \
--corpus "$CORPUS" --golden "$GOLDEN" \
--modes "$mode" --chunk-size $size \
--out "bench-results/tune-${mode//+/_}-${size}.json"
done
done
nDCG delta between keyword and hybrid: if < 0.02, your corpus is vocabulary-matched (queries share exact terms with docs). Dense retrieval adds latency without lift — leave hybrid off.
nDCG delta when the reranker is added: on dev docs specifically, we measure -27 nDCG (!). If you see a similar drop, your corpus has the same distribution problem; set GNOSIS_MCP_RERANK_ENABLED=false. If it's +3-10 on your corpus, keep it on.
Chunk-size peak location: tells you the typical topic-coherent block length in your corpus. API-reference-heavy corpora peak lower (1000-1500 chars). Long-form prose peaks higher (2500-3000).
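The keyword-vs-hybrid check is just a subtraction over the result files. A sketch, assuming each result JSON exposes a top-level numeric "ndcg@10" key (check your files; the real key may differ):

```python
import json

# ASSUMPTION: bench result files contain a top-level "ndcg@10" number.
def ndcg(path):
    with open(path) as f:
        return json.load(f)["ndcg@10"]

def hybrid_lift(keyword_file, hybrid_file, threshold=0.02):
    """Return (lift, keep_hybrid). A lift under ~0.02 suggests a
    vocabulary-matched corpus, so hybrid should stay off."""
    lift = ndcg(hybrid_file) - ndcg(keyword_file)
    return lift, lift >= threshold
```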
If you've been using gnosis-mcp for a while, real queries are already logged. Generate a seed golden file from them:
# group+count ranks queries by frequency; jq handles JSON escaping
sqlite3 ~/.local/share/gnosis-mcp/docs.db \
  "SELECT query FROM search_access_log
   WHERE timestamp > date('now', '-30 days')
   GROUP BY query ORDER BY count(*) DESC LIMIT 30;" \
  | jq -Rc '{query: ., expected_paths: []}' > golden-seed.jsonl
Then fill in expected_paths by hand; five minutes of work gives you a golden set grounded in how you actually use the system.
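Before pointing the tune at the seeded file, it's cheap to verify that every line parses and actually got its expected_paths filled in. A minimal sketch (filename assumed from the pipeline above):

```python
import json

def unfilled_queries(path="golden-seed.jsonl"):
    """Return queries whose expected_paths is still empty."""
    missing = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)  # raises on malformed JSON
            if not row.get("expected_paths"):
                missing.append(row["query"])
    return missing
```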
After picking a chunk size and a reranker/hybrid choice, persist them so every gnosis-mcp ingest (manual or via --watch) uses the tuned settings:
# Persistent across processes
export GNOSIS_MCP_CHUNK_SIZE=2000
export GNOSIS_MCP_RERANK_ENABLED=false # if tune showed it hurts
# Re-ingest with the new config
gnosis-mcp ingest ./docs --embed --wipe
Re-tune if you change the embedding model (GNOSIS_MCP_EMBED_MODEL): the semantic side of hybrid now behaves differently. Otherwise: tune once per year. It's not free, but it's not frequent.
/gnosis:ingest — re-ingest after you've picked the winner