From dspy-agent-skills
Builds DSPy evaluation harnesses with rich-feedback metrics for GEPA optimization. Use when writing metrics, calling dspy.Evaluate, splitting datasets, or debugging optimizer convergence.
Slash command
/dspy-agent-skills:dspy-evaluation-harness

When to use
User mentions `dspy.Evaluate`, a "metric", a devset/valset/trainset, evaluation, scoring, or asks why their GEPA optimization isn't converging (almost always: their metric is too thin).
The metric is usually more important than the program. For `dspy.GEPA` especially, the quality of **textual feedback** in your metric determines whether optimization converges.
A metric should return `dspy.Prediction(score=..., feedback=...)`, not a dict. `dspy.Evaluate`'s parallel executor aggregates scores via sum, which breaks on dict outputs (`TypeError: unsupported operand type(s) for +: 'int' and 'dict'`). `dspy.Prediction` supports `__float__`/`__add__` and is what GEPA's adapter natively unwraps. A bare float still works for pure `dspy.Evaluate` scoring, but GEPA needs the score+feedback pair.

    import dspy

    # _normalize and _has_citation are helpers you supply; they are not part of DSPy.
    def rich_metric(gold: dspy.Example, pred: dspy.Prediction, trace=None,
                    pred_name: str | None = None, pred_trace=None):
        # 1. Compute sub-scores: multi-axis beats scalar
        correctness = 1.0 if _normalize(pred.answer) == _normalize(gold.answer) else 0.0
        cited = _has_citation(pred.answer, gold.sources) if hasattr(gold, "sources") else 1.0
        concise = 1.0 if len(pred.answer.split()) <= 50 else 0.5
        score = 0.6 * correctness + 0.25 * cited + 0.15 * concise

        # 2. Write feedback that teaches the optimizer
        parts = []
        if correctness < 1.0:
            parts.append(
                f"Answer mismatch. Predicted: {pred.answer!r}. Expected: {gold.answer!r}. "
                f"Likely cause: reasoning skipped the units/quantity in the question."
            )
        if cited < 1.0:
            parts.append("Did not ground the claim in the provided sources. Quote a source fragment.")
        if concise < 1.0:
            parts.append("Answer exceeded 50 words; tighten to one sentence.")
        if not parts:
            parts.append("Correct, grounded, and concise.")
        feedback = " ".join(parts)

        return dspy.Prediction(score=score, feedback=feedback)

    evaluator = dspy.Evaluate(
        devset=valset,
        metric=rich_metric,
        num_threads=8,
        display_progress=True,
        display_table=10,          # pretty-print first 10 rows
        provide_traceback=True,    # surface exceptions, don't swallow them
        max_errors=5,
        failure_score=0.0,
        save_as_json="eval_runs/baseline.json",
    )
    result = evaluator(program)
    print("Overall:", result.score)
    for example_result in result.results[:3]:
        print(example_result)

Calling the evaluator returns an `EvaluationResult` with `.score` (aggregate float) and `.results` (list of `(example, pred, score)` tuples).
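
Because `.results` is a list of `(example, pred, score)` tuples, the lowest-scoring rows are easy to pull out for inspection. A small sketch: `worst_examples` is a hypothetical helper, and it assumes each score converts with `float()`, which holds for a bare float and, per the note on `__float__`, for `dspy.Prediction`.

```python
def worst_examples(results, k=5):
    """Return the k lowest-scoring (example, pred, score) tuples, worst first."""
    return sorted(results, key=lambda row: float(row[2]))[:k]


# Stand-in rows; in practice pass `result.results` from the evaluator call.
rows = [("q1", "a1", 0.9), ("q2", "a2", 0.25), ("q3", "a3", 0.6)]
for example, pred, score in worst_examples(rows, k=2):
    print(f"{float(score):.2f}  {example!r} -> {pred!r}")
```

Reading the feedback on these rows is usually the fastest way to see whether the metric or the program is at fault.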
- Split your data into a `trainset` (for optimization) and a `valset` (for running the metric on the optimized program). A test set you never look at during development is gold.
- Build each example with `dspy.Example(...).with_inputs("question", "context")`; the `with_inputs` call marks which fields are inputs vs. gold outputs.

    trainset = [
        dspy.Example(question="…", answer="…").with_inputs("question"),
        ...
    ]
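
A seeded shuffle keeps the train/val/test split reproducible across runs. A minimal sketch: `split_dataset` and the 60/20/20 fractions are illustrative assumptions, not a DSPy API or a recommended ratio.

```python
import random


def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of `examples` and return (train, val, test) lists."""
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must leave room for a test set")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # local RNG, global random state untouched
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out
    return train, val, test


train, val, test = split_dataset(range(10))
```

Fixing the seed matters because a split that changes between runs lets test examples drift into the sets you tune against.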
Combine correctness, faithfulness, format adherence, latency, and cost. Each axis should be a 0–1 float with a written definition. Weight them explicitly; don't hide weights inside magic numbers — make them constants so optimizers can be told to trade off.
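
One way to keep the weights visible is a module-level mapping that the metric reads from. A sketch under assumed names: the three axes and their values mirror the `rich_metric` example, while `AXIS_WEIGHTS` and `weighted_score` are hypothetical.

```python
AXIS_WEIGHTS = {"correctness": 0.6, "cited": 0.25, "concise": 0.15}


def weighted_score(sub_scores: dict[str, float],
                   weights: dict[str, float] = AXIS_WEIGHTS) -> float:
    """Combine 0-1 sub-scores into one 0-1 score using explicit weights."""
    if set(sub_scores) != set(weights):
        raise ValueError(f"axes mismatch: {sorted(sub_scores)} vs {sorted(weights)}")
    total = sum(weights.values())  # normalize so the weights need not sum to 1
    return sum(weights[axis] * sub_scores[axis] for axis in weights) / total
```

With the weights in one place, changing the trade-off is a one-line diff, and the axis check catches a sub-score that was added to the metric but never weighted.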

    # tests/test_dspy_eval.py
    import dspy
    import pytest

    from my_program import program, valset, rich_metric

    @pytest.fixture(scope="module")
    def evaluator():
        return dspy.Evaluate(devset=valset, metric=rich_metric, num_threads=8,
                             display_progress=False, provide_traceback=True)

    def test_program_meets_threshold(evaluator):
        result = evaluator(program)
        assert result.score >= 0.75, f"Regression: {result.score:.3f}"

Run offline in CI with a cached LM (`dspy.LM(..., cache=True)`) and a pre-populated `DSPY_CACHEDIR`.
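
For the CI setup, the cache location has to be in the environment before DSPy loads. A sketch, assuming DSPy reads `DSPY_CACHEDIR` when it is first imported; `use_dspy_cache` is a hypothetical helper, and the directory you pass is up to you.

```python
import os
from pathlib import Path


def use_dspy_cache(cache_dir: str) -> Path:
    """Point DSPY_CACHEDIR at `cache_dir`, creating it if needed.

    Assumption: call this before the first `import dspy` in the process.
    """
    path = Path(cache_dir).resolve()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["DSPY_CACHEDIR"] = str(path)
    return path
```

Calling it at the top of `conftest.py`, ahead of any test module that imports `dspy`, is one place it could live.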
- `track_usage=True` on `dspy.configure` accumulates token counts on predictions (`pred.get_lm_usage()`).
- `import mlflow; mlflow.dspy.autolog()` traces every prediction.
- Pass `use_wandb=True` to `dspy.GEPA` to log Pareto fronts.
- Persist results (`save_as_json=...`) so you can diff runs.

- `return {"score": s, "feedback": f}` (dict): crashes `dspy.Evaluate`'s parallel aggregator. Use `dspy.Prediction(score=s, feedback=f)`.
- Hiding exceptions (`provide_traceback=False`): you'll blame the LM for a `KeyError`.

See also: `dspy-gepa-optimizer`.

Install:

    npx claudepluginhub intertwine/dspy-agent-skills --plugin dspy-agent-skills