npx claudepluginhub haabe/mycelium --plugin mycelium

This skill uses the workspace's default tool permissions.
Benchmark the agent's performance against defined scenarios. Adapted from n-trax eval system.
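- `run <category/name>`: load the scenario from `.claude/evals/scenarios/<category>/<name>.yml` and write the result to `.claude/evals/results/<timestamp>-<name>.json`.
- `run-all [category]`: run the scenarios matching `.claude/evals/scenarios/**/*.yml`, skip any with `status: retired`, and update `.claude/evals/pass-history.json` with each result.
- `run-split <optimization|holdout>`: from `.claude/evals/scenarios/**/*.yml`, run only the scenarios with a `split` field matching the requested set, skip any with `status: retired`, and update `.claude/evals/pass-history.json` with each result.
- `report`: summarize the results in `.claude/evals/results/` as a table:

| Category | Pass Rate | Avg Iterations | Avg Time | Notes |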
|-------------|-----------|----------------|----------|-------|
| discovery | ... | ... | ... | |
| delivery | ... | ... | ... | |
| integration | ... | ... | ... | |
| **Overall** | ... | ... | ... | |
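
As a rough illustration of how the `report` table could be filled in, the sketch below aggregates a list of result records per category. The result schema is defined in `.claude/evals/schema.md` and is not reproduced here, so the field names `category`, `passed`, `iterations`, and `duration_s` are hypothetical stand-ins, as are the helper names `summarize` and `render_row`.

```python
from collections import defaultdict


def _row(label, items):
    """Aggregate one group of results into a single report row."""
    n = len(items)
    return {
        "category": label,
        "pass_rate": sum(1 for r in items if r["passed"]) / n,
        "avg_iterations": sum(r["iterations"] for r in items) / n,
        "avg_time_s": sum(r["duration_s"] for r in items) / n,
    }


def summarize(results):
    """Return one row per category plus an Overall row (empty list if no results).

    Assumes each result is a dict with the hypothetical keys
    category, passed, iterations, and duration_s.
    """
    if not results:
        return []
    by_category = defaultdict(list)
    for result in results:
        by_category[result["category"]].append(result)
    rows = [_row(category, items) for category, items in by_category.items()]
    rows.append(_row("**Overall**", results))
    return rows


def render_row(row):
    """Format a row in the same column order as the report table."""
    return (
        f"| {row['category']} | {row['pass_rate']:.0%} "
        f"| {row['avg_iterations']:.1f} | {row['avg_time_s']:.1f}s | |"
    )
```

Keeping aggregation separate from rendering means the same rows could feed a different output format if the table layout changes.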

- `prune`: read `.claude/evals/pass-history.json` and flag evals whose `last_5` is all-pass (saturated) or all-fail (broken), or that are stale. To retire an eval, set `status: retired` in the scenario YAML, update `pass-history.json`, and log the change in `.claude/harness/decision-log.md`.
- `mine`: Analyze audit logs to propose new eval scenarios from observed failure patterns.
Read `.claude/state/change-log.jsonl` (last 100 entries) and `.claude/state/diamond-state-audit.jsonl` (all entries). For each `session_id`, identify:
a. Correction clusters: 3+ edits to same file in one session (agent struggled)
b. Skill friction: edits to .claude/skills/*/SKILL.md during a session (instructions unclear)
c. Missing test coverage: 5+ files changed with no test file edits

Tag each proposed scenario with `source: trace-mining` and the originating `session_id`.

See `.claude/evals/schema.md` §Trace Mining Heuristics for pattern-to-eval mappings.
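
The three patterns above can be sketched as a small pass over change-log entries. This is a minimal illustration, not the skill's actual implementation: it assumes each JSONL entry has already been parsed into a dict with `session_id` and `file` keys (hypothetical field names), and it treats any path containing "test" as a test file, which is also an assumption.

```python
from collections import Counter, defaultdict
from fnmatch import fnmatchcase

SKILL_GLOB = ".claude/skills/*/SKILL.md"


def mine_patterns(entries):
    """Return (session_id, pattern, path) tuples for the three heuristics.

    Assumes each entry is a dict with hypothetical keys session_id and file.
    """
    files_by_session = defaultdict(list)
    for entry in entries:
        files_by_session[entry["session_id"]].append(entry["file"])

    findings = []
    for session_id, files in files_by_session.items():
        edits_per_file = Counter(files)

        # a. Correction clusters: 3+ edits to the same file in one session.
        for path, count in edits_per_file.items():
            if count >= 3:
                findings.append((session_id, "correction_cluster", path))

        # b. Skill friction: any edit to a SKILL.md during the session.
        for path in edits_per_file:
            if fnmatchcase(path, SKILL_GLOB):
                findings.append((session_id, "skill_friction", path))

        # c. Missing test coverage: 5+ distinct files changed, none of them
        #    a test file (test detection here is a naive substring check).
        distinct = set(files)
        if len(distinct) >= 5 and not any("test" in p.lower() for p in distinct):
            findings.append((session_id, "missing_test_coverage", None))
    return findings
```

Each finding keeps the originating `session_id`, so a proposed scenario can carry it alongside `source: trace-mining`.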
See .claude/evals/schema.md for YAML scenario and JSON result formats.
After writing each result JSON (step 8 of run), also update .claude/evals/pass-history.json:
- Increment `runs`, and `passes` (if passed), for the eval
- Append `true`/`false` to `last_5` (trim to keep only last 5)
- Update the `last_run` timestamp

Place YAML files in `.claude/evals/scenarios/<category>/`. Define `task_prompt`, `success_criteria`, and `budget`. Set `split`, `status`, and `source` fields per schema.
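
The pass-history bookkeeping, together with the saturated/broken check that `prune` applies to `last_5`, might look roughly like the sketch below. It assumes `pass-history.json` is a JSON object keyed by eval name and that `last_run` is an ISO 8601 UTC string; neither detail is confirmed by the text above. Requiring a full window of five runs before classifying is likewise an assumption, and the function names are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_result(history, eval_name, passed, now=None):
    """Apply one run's outcome to the in-memory history and return its entry."""
    entry = history.setdefault(
        eval_name, {"runs": 0, "passes": 0, "last_5": [], "last_run": None}
    )
    entry["runs"] += 1
    if passed:
        entry["passes"] += 1
    # Append the outcome, then trim so only the five most recent remain.
    entry["last_5"] = (entry["last_5"] + [bool(passed)])[-5:]
    entry["last_run"] = (now or datetime.now(timezone.utc)).isoformat()
    return entry


def classify(last_5):
    """Return 'saturated', 'broken', or None for a last_5 window."""
    if len(last_5) < 5:  # assumption: only judge a full window of five
        return None
    if all(last_5):
        return "saturated"
    if not any(last_5):
        return "broken"
    return None


def update_history_file(path, eval_name, passed):
    """Load the history file (or start empty), record the result, write it back."""
    path = Path(path)
    history = json.loads(path.read_text()) if path.exists() else {}
    entry = record_result(history, eval_name, passed)
    path.write_text(json.dumps(history, indent=2))
    return entry
```

For example, `update_history_file(".claude/evals/pass-history.json", "discovery/example", True)` would record one passing run for a hypothetical eval named `discovery/example`.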