Skill

eval

Install
1
Install the plugin
$
npx claudepluginhub alirezarezvani/claude-skills --plugin agenthub

Want just this skill?

Add to a custom plugin, then install with one command.

Description

Evaluate and rank agent results by metric or LLM judge for an AgentHub session.

Tool Access

This skill uses the workspace's default tool permissions.

Skill Content

/hub:eval — Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

Usage

/hub:eval                           # Eval latest session using configured criteria
/hub:eval 20260317-143022           # Eval specific session
/hub:eval --judge                   # Force LLM judge mode (ignore metric config)

What It Does

Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}

Output:

RANK  AGENT       METRIC      DELTA      FILES
1     agent-2     142ms       -38ms      2
2     agent-1     165ms       -15ms      3
3     agent-3     190ms       +10ms      1

Winner: agent-2 (142ms)

LLM Judge Mode (no eval command, or --judge flag)

For each agent:

  1. Get the diff: git diff {base_branch}...{agent_branch}
  2. Read the agent's result post from .agenthub/board/results/agent-{i}-result.md
  3. Compare all diffs and rank by:
    • Correctness — Does it solve the task?
    • Simplicity — Fewer lines changed is better (when equal correctness)
    • Quality — Clean execution, good structure, no regressions

Present rankings with justification.

Example LLM judge output for a content task:

RANK  AGENT    VERDICT                               WORD COUNT
1     agent-1  Strong narrative, clear CTA            1480
2     agent-3  Good data points, weak intro           1520
3     agent-2  Generic tone, no differentiation       1350

Winner: agent-1 (strongest narrative arc and call-to-action)

Hybrid Mode

  1. Run metric evaluation first
  2. If top agents are within 10% of each other, use LLM judge to break ties
  3. Present both metric and qualitative rankings

After Eval

  1. Update session state:
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
  1. Tell the user:
    • Ranked results with winner highlighted
    • Next step: /hub:merge to merge the winner
    • Or /hub:merge {session-id} --agent {winner} to be explicit
Stats
Stars5871
Forks688
Last CommitMar 17, 2026
Actions

Similar Skills

cache-components

Expert guidance for Next.js Cache Components and Partial Prerendering (PPR). **PROACTIVE ACTIVATION**: Use this skill automatically when working in Next.js projects that have `cacheComponents: true` in their next.config.ts/next.config.js. When this config is detected, proactively apply Cache Components patterns and best practices to all React Server Component implementations. **DETECTION**: At the start of a session in a Next.js project, check for `cacheComponents: true` in next.config. If enabled, this skill's patterns should guide all component authoring, data fetching, and caching decisions. **USE CASES**: Implementing 'use cache' directive, configuring cache lifetimes with cacheLife(), tagging cached data with cacheTag(), invalidating caches with updateTag()/revalidateTag(), optimizing static vs dynamic content boundaries, debugging cache issues, and reviewing Cache Component implementations.

138.4k