Runs `gnosis-mcp eval` in ~10 lines of wrapper logic. Reports retrieval
quality against your golden queries, compares to the last known numbers,
and points at /gnosis:tune if anything looks regressed.
Different from /gnosis:tune: tune sweeps configurations looking
for the best chunk size / embedder / rerank combo. Eval just reports
current numbers. Run eval often (after each ingest, whenever the
corpus changes). Run tune occasionally (after corpus shape change,
new embedder, weekend of experimentation).
```
/gnosis:eval        # full run — print numbers + interpret + compare
/gnosis:eval quick  # numbers only, no interpretation
/gnosis:eval save   # save current result as new baseline
/gnosis:eval diff   # compare current to last saved baseline
```
```
gnosis-mcp eval --json
```

Parses to:

```json
{
  "queries": 10,
  "hit_at_5": 1.000,
  "mrr": 0.950,
  "mean_precision_at_5": 0.668,
  "ndcg_at_10": 0.871
}
```
If the command errors (no [embeddings] extra, no golden file, empty
DB), explain the exact cause and the fix. Don't proceed with empty
numbers — fail loudly.
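A minimal sketch of the parse-then-fail-loudly step, assuming the JSON shape shown above. `parse_eval` and its error messages are illustrative, not part of the gnosis-mcp API:

```python
import json

# Keys we expect in `gnosis-mcp eval --json` output (per the sample above).
REQUIRED = ("queries", "hit_at_5", "mrr", "mean_precision_at_5", "ndcg_at_10")

def parse_eval(raw: str) -> dict:
    """Parse eval output; refuse to continue with empty or partial numbers."""
    result = json.loads(raw)
    missing = [k for k in REQUIRED if k not in result]
    if missing:
        raise SystemExit(
            f"eval output missing {missing} -- check the [embeddings] extra, "
            "the golden file, and that the DB is not empty"
        )
    if result["queries"] == 0:
        raise SystemExit("eval ran against zero golden queries -- refusing to report zeros")
    return result

raw = '{"queries": 10, "hit_at_5": 1.0, "mrr": 0.95, "mean_precision_at_5": 0.668, "ndcg_at_10": 0.871}'
metrics = parse_eval(raw)
print(metrics["mrr"])  # 0.95
```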
Report as a compact table:
| Metric | Value | Meaning |
|---|---|---|
| Hit@5 | 0.92 | 9 of 10 queries find the right doc in the top 5 results |
| MRR | 0.79 | On average the right doc ranks ~1.3 in the list (1 / 0.79) |
| nDCG@10 | 0.87 | Ranking quality — 1.0 is perfect, random baseline is ~0.15 |
| Precision@5 | 0.67 | Of the top 5 results, 67% are relevant |
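The metric definitions in the table can be checked against a toy example. These helper functions are illustrative reimplementations for intuition, not gnosis-mcp internals:

```python
import math

def hit_at_k(ranked, relevant, k=5):
    # 1 if any relevant doc appears in the top k, else 0
    return int(any(d in relevant for d in ranked[:k]))

def reciprocal_rank(ranked, relevant):
    # 1/rank of the first relevant doc; 0 if none retrieved
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def precision_at_k(ranked, relevant, k=5):
    # fraction of the top k that is relevant
    return sum(d in relevant for d in ranked[:k]) / k

def ndcg_at_k(ranked, relevant, k=10):
    # binary-relevance nDCG: DCG over the ranking, normalized by the ideal DCG
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7", "d2", "d9"]   # retriever output, best first
relevant = {"d1", "d2"}                    # golden answers for this query
print(hit_at_k(ranked, relevant))          # 1   -> a relevant doc is in the top 5
print(reciprocal_rank(ranked, relevant))   # 0.5 -> first relevant doc at rank 2
print(precision_at_k(ranked, relevant))    # 0.4 -> 2 of the top 5 are relevant
print(round(ndcg_at_k(ranked, relevant), 3))
```

Averaging `reciprocal_rank` over all golden queries gives MRR, which is why 1 / MRR approximates the typical rank of the right doc.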
Interpretation thresholds (rough heuristics, corpus-dependent): when numbers
look weak, point at /gnosis:tune to sweep chunk sizes (/gnosis:tune covers
both experiments).

Baseline lives at `~/.local/share/gnosis-mcp/eval-baseline.json`.
If baseline exists (default + diff modes):

| Metric | Now | Last (2026-04-18) | Δ |
|---|---|---|---|
| Hit@5 | 0.9200 | 0.9200 | — |
| MRR | 0.7933 | 0.7933 | — |
| nDCG@10 | 0.8702 | 0.8702 | — |
| P@5 | 0.6680 | 0.6680 | — |
Flag any Δ worse than -2 points in red and explicitly ask the user if the regression is expected. "Expected" reasons: new corpus shape, different chunk size, swapped embedder. "Not expected": accidental regression → investigate before committing the drop.
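The flagging rule above can be sketched as a simple diff against the saved baseline. The function name and the 2-point (0.02) threshold encoding are assumptions for illustration:

```python
def diff_against_baseline(current: dict, baseline: dict, threshold: float = 0.02):
    """Return (metric, now, last, delta, regressed) rows; 2 points == 0.02."""
    rows = []
    for key in ("hit_at_5", "mrr", "ndcg_at_10", "mean_precision_at_5"):
        now, last = current[key], baseline[key]
        delta = now - last
        # Regressed only if the drop exceeds the threshold; small jitter passes.
        rows.append((key, now, last, delta, delta < -threshold))
    return rows

current  = {"hit_at_5": 0.88, "mrr": 0.79, "ndcg_at_10": 0.87, "mean_precision_at_5": 0.67}
baseline = {"hit_at_5": 0.92, "mrr": 0.79, "ndcg_at_10": 0.87, "mean_precision_at_5": 0.67}
for metric, now, last, delta, regressed in diff_against_baseline(current, baseline):
    flag = "REGRESSED -- expected?" if regressed else "ok"
    print(f"{metric:22s} {now:.4f} {last:.4f} {delta:+.4f}  {flag}")
```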
If no baseline exists (default mode only — first-time run):

```
No baseline found. Save current numbers as baseline? (y/N)
```
On y, write the JSON to ~/.local/share/gnosis-mcp/eval-baseline.json
and confirm. On N, skip — user runs /gnosis:eval save later when
they're happy with a specific state.
save mode: unconditionally overwrite the baseline with current
numbers. Useful after an intentional improvement (e.g., /gnosis:tune
found a better chunk size, user re-ingested, wants to lock in the new
baseline).
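Save mode amounts to an unconditional overwrite of the baseline file. A minimal sketch; `save_baseline` is an illustrative helper, only the path comes from this doc:

```python
import json
from pathlib import Path

def save_baseline(metrics: dict, path: Path) -> None:
    """Overwrite the baseline with the current numbers, creating parents as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2))

# Baseline location used by /gnosis:eval per this doc:
baseline_path = Path.home() / ".local/share/gnosis-mcp/eval-baseline.json"
```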
Based on the numbers, point at the next action:
| State | Next action |
|---|---|
| Everything green, no regression | "Numbers look healthy. Run /gnosis:ingest <path> --prune after any doc change to keep them that way." |
| Hit@5 dropped > 2 points | "Regression. Run /gnosis:status diag to rule out corruption, then git diff any recent ingest or config changes." |
| Hit@5 healthy but MRR weak | "Ranking is the weak link. /gnosis:tune can test title-prepending — sometimes worth +3 MRR points on keyword-saturated corpora." |
| Hit@5 < 0.70 | "Chunk size or corpus shape issue. /gnosis:tune will sweep chunk sizes 1000-4000 and report the optimum — ~10 min runtime." |
| No regression but first time | "Run /gnosis:eval save to lock these as your baseline. Future runs will diff against this." |
For external-benchmark numbers, use `tests/bench/bench_beir.py` instead of
/gnosis:eval. That's external validity; eval is in-distribution validity.
End-to-end runs live in `tests/bench/bench_mcp_e2e.py`; use
`scripts/bench-embedders.sh` for the full matrix.

See docs/how-we-measure-search.md
for a plain-English explainer of Hit@K / MRR / nDCG (if that doc
exists in this corpus; otherwise point at the gnosismcp.com version:
https://gnosismcp.com/docs/how-we-measure-search).