From dspy-agent-skills
Builds DSPy evaluation harnesses with rich-feedback metrics for GEPA optimization. Use for writing metric functions, dspy.Evaluate, dev/val splits, optimizer debugging, CI evals.
`npx claudepluginhub intertwine/dspy-agent-skills --plugin dspy-agent-skills`

This skill uses the workspace's default tool permissions.
Evaluates DSPy programs with dspy.evaluate.Evaluate, using built-in metrics such as answer_exact_match and SemanticF1 or custom metrics, running examples in parallel, for performance measurement, baselines, and comparisons.
Optimizes DSPy programs using the dspy.GEPA reflective optimizer with rich metric feedback and a Pareto frontier. Use when optimizing/compiling DSPy modules with a metric and train/val sets.
Builds type-safe LLM apps in Ruby with DSPy.rb using signatures, modules, agents, tools, and prompt optimization. Useful for predictable AI features, agent systems, and LLM testing.
The metric is usually more important than the program. For `dspy.GEPA` especially, the quality of **textual feedback** in your metric determines whether optimization converges.
Return `dspy.Prediction(score=..., feedback=...)`, not a dict. `dspy.Evaluate`'s parallel executor aggregates scores via sum, which breaks on dict outputs (`TypeError: unsupported operand type(s) for +: 'int' and 'dict'`). `dspy.Prediction` supports `__float__`/`__add__` and is what GEPA's adapter natively unwraps. A bare float still works for pure `dspy.Evaluate` scoring, but GEPA needs the score+feedback pair.

```python
import dspy

def rich_metric(gold: dspy.Example, pred: dspy.Prediction, trace=None,
                pred_name: str | None = None, pred_trace=None):
    # 1. Compute sub-scores — multi-axis beats scalar
    correctness = 1.0 if _normalize(pred.answer) == _normalize(gold.answer) else 0.0
    cited = _has_citation(pred.answer, gold.sources) if hasattr(gold, "sources") else 1.0
    concise = 1.0 if len(pred.answer.split()) <= 50 else 0.5
    score = 0.6 * correctness + 0.25 * cited + 0.15 * concise

    # 2. Write feedback that teaches the optimizer
    parts = []
    if correctness < 1.0:
        parts.append(
            f"Answer mismatch. Predicted: {pred.answer!r}. Expected: {gold.answer!r}. "
            f"Likely cause: reasoning skipped the units/quantity in the question."
        )
    if cited < 1.0:
        parts.append("Did not ground the claim in the provided sources. Quote a source fragment.")
    if concise < 1.0:
        parts.append("Answer exceeded 50 words — tighten to one sentence.")
    if not parts:
        parts.append("Correct, grounded, and concise.")
    feedback = " ".join(parts)

    return dspy.Prediction(score=score, feedback=feedback)
```
```python
evaluator = dspy.Evaluate(
    devset=valset,
    metric=rich_metric,
    num_threads=8,
    display_progress=True,
    display_table=10,          # pretty-print first 10 rows
    provide_traceback=True,    # surface exceptions, don't swallow them
    max_errors=5,
    failure_score=0.0,
    save_as_json="eval_runs/baseline.json",
)

result = evaluator(program)
print("Overall:", result.score)
for example_result in result.results[:3]:
    print(example_result)
```
`dspy.Evaluate` returns an `EvaluationResult` with `.score` (aggregate float) and `.results` (a list of `(example, pred, score)` tuples).
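For spot-checking failures, those tuples can be unpacked directly. A minimal sketch, assuming the `question`/`answer` field names used in the examples above:

```python
# Inspect the lowest-scoring rows; each entry is (example, prediction, score).
for ex, pred, score in sorted(result.results, key=lambda r: r[2])[:5]:
    print(f"[{score:.2f}] Q: {ex.question!r} -> A: {pred.answer!r}")
```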
Split your data into a trainset (for optimization) and a valset (for scoring the optimized program with your metric). A test set you never look at during development is gold. Build examples with `dspy.Example(...).with_inputs("question", "context")` — the `with_inputs` call marks which fields are inputs vs. gold outputs.

```python
trainset = [
    dspy.Example(question="…", answer="…").with_inputs("question"),
    ...
]
```
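A minimal split sketch; the sizes, seed, and the `examples` list (assumed to hold `dspy.Example` objects built as above) are illustrative:

```python
import random

random.Random(0).shuffle(examples)   # deterministic shuffle
trainset = examples[:200]            # fed to the optimizer
valset = examples[200:300]           # scored with the metric after optimization
testset = examples[300:]             # never inspected until the end
```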
Combine correctness, faithfulness, format adherence, latency, and cost. Each axis should be a 0–1 float with a written definition. Weight them explicitly; don't hide the weights inside magic numbers — make them named constants so you can tell the optimizer what to trade off.
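One way to keep the weights out of magic numbers is to lift them into module-level constants; a minimal sketch (names and values are illustrative):

```python
# Illustrative axis weights; adjust per experiment instead of editing the metric body.
W_CORRECTNESS = 0.6
W_FAITHFULNESS = 0.25
W_CONCISENESS = 0.15

def combined_score(correctness: float, faithfulness: float, conciseness: float) -> float:
    return (W_CORRECTNESS * correctness
            + W_FAITHFULNESS * faithfulness
            + W_CONCISENESS * conciseness)
```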
```python
# tests/test_dspy_eval.py
import dspy
import pytest

from my_program import program, valset, rich_metric


@pytest.fixture(scope="module")
def evaluator():
    return dspy.Evaluate(devset=valset, metric=rich_metric, num_threads=8,
                         display_progress=False, provide_traceback=True)


def test_program_meets_threshold(evaluator):
    result = evaluator(program)
    assert result.score >= 0.75, f"Regression: {result.score:.3f}"
```
Run offline in CI with a cached LM (`dspy.LM(..., cache=True)`) and a pre-populated `DSPY_CACHEDIR`.
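A minimal configuration sketch; the model name is illustrative, and it assumes the CI job sets `DSPY_CACHEDIR` and restores that directory from a CI cache before the tests run:

```python
# conftest.py (sketch): configure a cached LM so CI evals replay from disk.
# DSPY_CACHEDIR is expected to be set by the CI job, e.g.
#   DSPY_CACHEDIR=.dspy_cache pytest tests/
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini", cache=True))
```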
- `track_usage=True` on `dspy.configure` accumulates token counts on predictions (`pred.get_lm_usage()`).
- `import mlflow; mlflow.dspy.autolog()` traces every prediction.
- Pass `use_wandb=True` to `dspy.GEPA` to log Pareto fronts.
- Persist eval results (`save_as_json=...`) so you can diff runs.
- Don't return `{"score": s, "feedback": f}` (a dict) — it crashes `dspy.Evaluate`'s parallel aggregator. Use `dspy.Prediction(score=s, feedback=f)`.
- Don't disable tracebacks (`provide_traceback=False`) — you'll blame the LM for a `KeyError`.
- For the optimization step itself, see `dspy-gepa-optimizer`.
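For reference, a hedged sketch of how a metric like `rich_metric` typically plugs into `dspy.GEPA`; the argument values are illustrative, and the `dspy-gepa-optimizer` skill covers this step in detail:

```python
# Sketch only: GEPA reflects on the textual feedback emitted by rich_metric.
optimizer = dspy.GEPA(
    metric=rich_metric,
    auto="light",                             # illustrative budget preset
    reflection_lm=dspy.LM("openai/gpt-4o"),   # illustrative reflection model
)
optimized = optimizer.compile(program, trainset=trainset, valset=valset)
```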