From pm-copilot
Use this skill when the user asks to "prevent regressions in AI quality", "regression testing for AI", "how do I know if a prompt change broke something", "before/after evaluation for model changes", "catch quality regressions", or wants to set up a process that catches when a model update, prompt change, or system change has degraded AI output quality compared to before.
npx claudepluginhub productfculty-aipm/pm-copilot-by-product-faculty

This skill uses the workspace's default tool permissions.
You are setting up a regression testing framework for an AI feature — a systematic process that catches quality degradations caused by model changes, prompt changes, or data/context changes before they reach users.
Framework: Hamel Husain + Shreya Shankar (Building eval systems, 2025), software testing principles applied to AI.
Read memory/user-profile.md for the AI feature being protected. Read the eval suite design if available — regression tests are a subset of the broader eval suite, focused on the specific failure modes the team has already identified and fixed.
AI regressions can be caused by model changes (e.g., a provider-side model update), prompt changes, or data/context changes (shifts in retrieved context or input distribution).
The regression test suite should catch all three, not just obvious, deliberate edits.
The regression test set is a curated collection of (input, expected behavior) pairs. "Expected behavior" means the output should PASS a specific eval.
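A test case like this can be represented as a small record pairing the input with the eval it must pass. A minimal sketch (the field names, the example case, and the word-count eval are illustrative, not part of the framework):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RegressionCase:
    """One curated (input, expected behavior) pair."""
    case_id: str
    input_text: str
    eval_name: str                    # which specific eval this case must pass
    eval_fn: Callable[[str], bool]    # returns True if the output passes the eval

# Illustrative case: a summarization output must stay under a length limit
case = RegressionCase(
    case_id="summary-too-long-001",
    input_text="Summarize this 5,000-word report...",
    eval_name="summary_under_150_words",
    eval_fn=lambda output: len(output.split()) <= 150,
)
```

In practice `eval_fn` would wrap whatever eval harness the team already runs; the point is that each case binds an input to one concrete, checkable expectation.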
Sources for the test set:
Size guidance:
When to run:
Pass/fail definition: The test suite passes if: (1) every individual test case passes its specific eval, AND (2) the aggregate pass rate doesn't drop by more than [threshold]% from the baseline.
Set the threshold based on the feature's criticality:
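One way to make the criticality-to-threshold mapping explicit is a small config table. The tier names and percentages below are assumptions for illustration only; tune them to your feature's actual risk tolerance:

```python
# Illustrative criticality tiers -- the thresholds are assumptions,
# not prescriptions from the framework.
REGRESSION_THRESHOLDS = {
    "critical": 0.00,      # e.g. safety-sensitive output: any drop blocks
    "standard": 0.02,      # typical product feature: tolerate a 2% drop
    "experimental": 0.05,  # beta feature: tolerate a 5% drop
}

def threshold_for(criticality: str) -> float:
    """Look up the allowed pass-rate drop for a feature tier."""
    return REGRESSION_THRESHOLDS[criticality]
```

Keeping the mapping in one place means the gate's strictness is a reviewed config change rather than a number buried in pipeline code.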
When a regression is detected, produce a report:
What changed: Which eval(s) failed? What does the failure pattern look like?
Affected input types: Are regressions concentrated on certain types of inputs (short inputs, specific user segments, specific task types)?
Severity: How many test cases failed? What's the regression % vs. baseline?
Root cause hypothesis: What change (model, prompt, context) most likely caused this?
Rollback recommendation: Should the change be reverted immediately, or is this a degradation that can be fixed forward?
Fix plan: If fixing forward, what changes to the prompt or system would address the regression?
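The report sections above can be captured as a structured record so every regression is written up the same way. A sketch, with field names mirroring the sections (the `summary` helper and example values are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class RegressionReport:
    """One field per report section above; contents are free text."""
    what_changed: str
    affected_input_types: list[str]
    severity: str                    # e.g. "12/200 cases failed, -6% vs. baseline"
    root_cause_hypothesis: str       # model, prompt, or context change
    rollback_recommendation: str     # "revert" or "fix forward"
    fix_plan: str

    def summary(self) -> str:
        return f"[{self.severity}] {self.what_changed} -> {self.rollback_recommendation}"

report = RegressionReport(
    what_changed="summary_under_150_words eval failing",
    affected_input_types=["long documents"],
    severity="12/200 failed",
    root_cause_hypothesis="prompt change dropped the length instruction",
    rollback_recommendation="revert",
    fix_plan="restore length constraint, add it to the system prompt tests",
)
```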
Connect regression tests to the deployment pipeline:
```python
# Regression gate in the deployment pipeline. Assumes a run_eval(test_case,
# eval_suite) helper that returns a result object with a .passed attribute.
class DeploymentBlockedError(Exception):
    """Raised to fail the pipeline when quality drops past the threshold."""

def run_regression_gate(eval_suite, test_cases, baseline_pass_rate, threshold=0.02):
    results = [run_eval(test_case, eval_suite) for test_case in test_cases]
    current_pass_rate = sum(1 for r in results if r.passed) / len(results)
    regression = baseline_pass_rate - current_pass_rate
    if regression > threshold:
        failing = [r for r in results if not r.passed]
        raise DeploymentBlockedError(
            f"Regression detected: {regression:.1%} quality drop vs. baseline. "
            f"Blocking deployment. Review {len(failing)} failing cases."
        )
    return {"pass_rate": current_pass_rate, "regression": regression, "status": "PASS"}
```
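The gate's arithmetic can be sanity-checked without running any real evals by feeding it stub pass/fail outcomes. The 95% baseline, 90% current run, and 2% threshold below are illustrative numbers, not recommendations:

```python
# Simulate the gate's decision with stub outcomes (no real evals run).
baseline_pass_rate = 0.95
stub_results = [True] * 90 + [False] * 10   # 90% pass on this run

current_pass_rate = sum(stub_results) / len(stub_results)
regression = baseline_pass_rate - current_pass_rate   # ~5% drop

threshold = 0.02
blocked = regression > threshold   # 5% > 2%, so this deployment is blocked
print(f"pass_rate={current_pass_rate:.0%} regression={regression:.1%} blocked={blocked}")
```

A dry run like this is a cheap way to verify the threshold math before wiring the gate into CI.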
When a regression is flagged:
Produce: