From agenthub
Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
npx claudepluginhub flight505/claude-skills-jesper --plugin agenthub
How this skill is triggered (by the user, by Claude, or both):
Slash command
/agenthub:eval
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.
/hub:eval # Eval latest session using configured criteria
/hub:eval 20260317-143022 # Eval specific session
/hub:eval --judge # Force LLM judge mode (ignore metric config)
Run the evaluation command in each agent's worktree:
python {skill_path}/scripts/result_ranker.py \
--session {session-id} \
--eval-cmd "{eval_cmd}" \
--metric {metric} --direction {direction}
Output:
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1
Winner: agent-2 (142ms)
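The ranking step above can be sketched in a few lines of Python. This is a minimal illustration, not the code in result_ranker.py: the `Result` class, `rank_results`, and `format_table` are hypothetical names, and the 180ms baseline is only inferred from the sample output (142ms with a delta of -38ms implies 180ms). It assumes the delta is the agent's metric minus the baseline, and that `--direction` chooses whether lower or higher values rank first.

```python
from dataclasses import dataclass


@dataclass
class Result:
    agent: str
    metric: float  # value parsed from the eval command's output (assumed numeric)
    files: int     # number of files the agent changed


def rank_results(results, baseline, direction="lower"):
    """Sort results best-first and attach each agent's delta from the baseline."""
    if direction not in ("lower", "higher"):
        raise ValueError("direction must be 'lower' or 'higher'")
    ordered = sorted(results, key=lambda r: r.metric, reverse=(direction == "higher"))
    # Delta is metric minus baseline, so a negative delta is an improvement
    # when lower is better.
    return [(rank, r, r.metric - baseline) for rank, r in enumerate(ordered, start=1)]


def format_table(ranked, unit="ms"):
    """Render ranked results as a fixed-width text table."""
    lines = [f"{'RANK':<6}{'AGENT':<10}{'METRIC':<10}{'DELTA':<10}FILES"]
    for rank, r, delta in ranked:
        metric_text = f"{r.metric:g}{unit}"
        delta_text = f"{delta:+g}{unit}"  # explicit sign, e.g. -38ms or +10ms
        lines.append(f"{rank:<6}{r.agent:<10}{metric_text:<10}{delta_text:<10}{r.files}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        Result("agent-1", 165, 3),
        Result("agent-2", 142, 2),
        Result("agent-3", 190, 1),
    ]
    ranked = rank_results(sample, baseline=180, direction="lower")
    print(format_table(ranked))
    print(f"Winner: {ranked[0][1].agent} ({ranked[0][1].metric:g}ms)")
```

With the sample values, the sketch orders the agents as agent-2, agent-1, agent-3 with deltas of -38ms, -15ms, and +10ms, matching the table above.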
For each agent:
Get the diff: git diff {base_branch}...{agent_branch}
Read the result file: .agenthub/board/results/agent-{i}-result.md
Present rankings with justification.
Example LLM judge output for a content task:
RANK  AGENT    VERDICT                           WORD COUNT
1     agent-1  Strong narrative, clear CTA       1480
2     agent-3  Good data points, weak intro      1520
3     agent-2  Generic tone, no differentiation  1350
Winner: agent-1 (strongest narrative arc and call-to-action)
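One way to picture the judge step is as a single prompt that bundles each agent's diff and result notes. The sketch below is an assumption for illustration only: the source does not show the skill's actual judge prompt, and `build_judge_prompt` and its wording are hypothetical. The diff and notes are expected to be collected beforehand, for example from the git diff output and the agent's result file.

```python
def build_judge_prompt(task, submissions):
    """Assemble one prompt asking a judge model to rank agent submissions.

    `submissions` maps an agent name to a (diff_text, result_notes) pair.
    The prompt layout is hypothetical, not the skill's real format.
    """
    if not submissions:
        raise ValueError("no submissions to judge")
    parts = [
        f"Task: {task}",
        "Rank the submissions below from best to worst.",
        "Give a one-line verdict for each and name a winner with justification.",
    ]
    # Sort by agent name so the prompt is deterministic across runs.
    for agent in sorted(submissions):
        diff_text, notes = submissions[agent]
        parts.append(f"\n=== {agent} ===")
        parts.append(f"Result notes:\n{notes.strip()}")
        parts.append(f"Diff:\n{diff_text.strip()}")
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_judge_prompt(
        "Write a product launch post",
        {
            "agent-1": ("+ Intro paragraph...", "Focused on narrative."),
            "agent-2": ("+ Feature list...", "Focused on brevity."),
        },
    )
    print(prompt)
```

Keeping every submission in one prompt lets the judge compare them side by side, at the cost of a longer context when diffs are large.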
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
/hub:merge to merge the winner
/hub:merge {session-id} --agent {winner} to be explicit