dspy-evaluation-harness | dspy-agent-skills | ClaudePluginHub

Skill

dspy-evaluation-harness

From dspy-agent-skills

Builds DSPy evaluation harnesses with rich-feedback metrics for GEPA optimization. Use for writing metric functions, dspy.Evaluate, dev/val splits, optimizer debugging, CI evals.

$

npx claudepluginhub intertwine/dspy-agent-skills --plugin dspy-agent-skills

Tool Access

This skill uses the workspace's default tool permissions.

Preview

The metric is usually more important than the program. For `dspy.GEPA` especially, the quality of **textual feedback** in your metric determines whether optimization converges.

Supporting Assets

example_metric.pyreference.md

SKILL.md

Similar Skills

dspy-evaluation-suite

50

Evaluates DSPy programs using dspy.evaluate.Evaluate with built-in metrics like answer_exact_match and SemanticF1 or custom metrics, via parallel execution for performance measurement, baselines, and comparisons.

4 tools

dspy-gepa-optimizer

206

Optimizes DSPy programs using dspy.GEPA reflective optimizer with rich metric feedback and Pareto frontier. Use when optimizing/compiling DSPy modules with metric and train/val sets.

3 files

dspy-agent-skills

dspy-ruby

35

Builds type-safe LLM apps in Ruby with DSPy.rb using signatures, modules, agents, tools, and prompt optimization. Useful for predictable AI features, agent systems, and LLM testing.

8 files

compound-engineering

Stats

Stars206

Forks15

Last CommitApr 28, 2026

Actions

View Source View Plugin View on GitHub View README

Tags

dspy-evaluation

gepa-optimization

optimizer-debugging

Help us improve

Share bugs, ideas, or general feedback.

DSPy Evaluation Harness (3.2.x)

The metric is usually more important than the program. For dspy.GEPA especially, the quality of textual feedback in your metric determines whether optimization converges.

Two rules

Return a dspy.Prediction(score=..., feedback=...), not a dict. dspy.Evaluate's parallel executor aggregates scores via sum, which breaks on dict outputs (TypeError: unsupported operand type(s) for +: 'int' and 'dict'). dspy.Prediction supports __float__/__add__ and is what GEPA's adapter natively unwraps. A bare float still works for pure dspy.Evaluate scoring, but GEPA needs the score+feedback pair.
Separate valset. Never optimize and evaluate on the same examples. Optimizers overfit fast.

Canonical rich-feedback metric

import dspy

def rich_metric(gold: dspy.Example, pred: dspy.Prediction, trace=None,
                pred_name: str | None = None, pred_trace=None):
    # 1. Compute sub-scores — multi-axis beats scalar
    correctness = 1.0 if _normalize(pred.answer) == _normalize(gold.answer) else 0.0
    cited = _has_citation(pred.answer, gold.sources) if hasattr(gold, "sources") else 1.0
    concise = 1.0 if len(pred.answer.split()) <= 50 else 0.5
    score = 0.6 * correctness + 0.25 * cited + 0.15 * concise

    # 2. Write feedback that teaches the optimizer
    parts = []
    if correctness < 1.0:
        parts.append(
            f"Answer mismatch. Predicted: {pred.answer!r}. Expected: {gold.answer!r}. "
            f"Likely cause: reasoning skipped the units/quantity in the question."
        )
    if cited < 1.0:
        parts.append("Did not ground the claim in the provided sources. Quote a source fragment.")
    if concise < 1.0:
        parts.append("Answer exceeded 50 words — tighten to one sentence.")
    if not parts:
        parts.append("Correct, grounded, and concise.")
    feedback = " ".join(parts)

    return dspy.Prediction(score=score, feedback=feedback)

Canonical harness

evaluator = dspy.Evaluate(
    devset=valset,
    metric=rich_metric,
    num_threads=8,
    display_progress=True,
    display_table=10,          # pretty-print first 10 rows
    provide_traceback=True,    # surface exceptions, don't swallow them
    max_errors=5,
    failure_score=0.0,
    save_as_json="eval_runs/baseline.json",
)
result = evaluator(program)
print("Overall:", result.score)
for example_result in result.results[:3]:
    print(example_result)

dspy.Evaluate returns an EvaluationResult with .score (aggregate float) and .results (list of (example, pred, score) tuples).

Dataset hygiene

Size: 20–50 examples is enough for GEPA's reflective loop; 100–500 for MIPROv2-style bootstrapping.
Split: hand-curate two disjoint sets — trainset (for optimization) and valset (for metric-on-optimized-program). A test set you never look at during development is gold.
Representativeness beats size. Include edge cases, ambiguity, adversarial inputs.
Build dspy.Example(...).with_inputs("question", "context") — the with_inputs call marks which fields are inputs vs. gold outputs.

trainset = [
    dspy.Example(question="…", answer="…").with_inputs("question"),
    ...
]

Multi-axis metrics (recommended)

Combine correctness, faithfulness, format adherence, latency, and cost. Each axis should be a 0–1 float with a written definition. Weight them explicitly; don't hide weights inside magic numbers — make them constants so optimizers can be told to trade off.

CI-ready eval suite

# tests/test_dspy_eval.py
import dspy, pytest
from my_program import program, valset, rich_metric

@pytest.fixture(scope="module")
def evaluator():
    return dspy.Evaluate(devset=valset, metric=rich_metric, num_threads=8,
                         display_progress=False, provide_traceback=True)

def test_program_meets_threshold(evaluator):
    result = evaluator(program)
    assert result.score >= 0.75, f"Regression: {result.score:.3f}"

Run offline in CI with a cached LM (dspy.LM(..., cache=True)) + pre-populated DSPY_CACHEDIR.

Tracing & observability

track_usage=True on dspy.configure accumulates token counts on predictions (pred.get_lm_usage()).
MLflow: import mlflow; mlflow.dspy.autolog() → traces every prediction.
W&B: pass use_wandb=True to dspy.GEPA to log Pareto fronts.
Always log eval results to a versioned JSON file (save_as_json=...) so you can diff runs.

Anti-patterns

Scalar-only metrics (float but no feedback) when using GEPA — wasted signal.
return {"score": s, "feedback": f} (dict) — crashes dspy.Evaluate's parallel aggregator. Use dspy.Prediction(score=s, feedback=f).
Exact-match metrics on open-ended generation tasks — use semantic or LM-as-judge scoring.
Evaluating on the trainset — optimistic by 10–30 points.
Silently swallowing exceptions (provide_traceback=False) — you'll blame the LM for a KeyError.
Changing the metric mid-experiment without re-baselining — prior numbers become incomparable.

Next

Feed this metric into dspy-gepa-optimizer.
Full harness pattern → reference.md.
Runnable example → example_metric.py.