From magic-powers
Use when building evaluation infrastructure for AI systems — test harnesses, CI pipelines for AI, automated regression detection, golden datasets, and continuous quality measurement.
npx claudepluginhub kienbui1995/magic-powers --plugin magic-powers

This skill uses the workspace's default tool permissions.
- Setting up an automated evaluation pipeline for an LLM feature
Three layers for comprehensive coverage:
Layer 1: Deterministic checks (fast, cheap)
- Output schema validation (JSON structure, required fields)
- Length constraints (too short/long)
- Keyword presence/absence
- PII detection in outputs
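Layer 1 can be a single pure function with no model calls. A minimal sketch — the field names, length bounds, and regexes here are illustrative assumptions, and real PII detection should use a dedicated library:

```python
import json
import re

def deterministic_checks(output: str, min_len=20, max_len=2000,
                         required_fields=("answer", "sources")) -> list[str]:
    """Layer 1: fast, cheap checks that need no model call.
    Returns a list of failure reasons (empty list = all checks passed)."""
    failures = []
    # Schema validation: output must be JSON with the required fields
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        failures.append("invalid JSON")
        return failures
    for field in required_fields:
        if field not in data:
            failures.append(f"missing field: {field}")
    # Length constraints
    answer = str(data.get("answer", ""))
    if not (min_len <= len(answer) <= max_len):
        failures.append("answer length out of bounds")
    # Naive PII patterns (SSN, email); swap in a dedicated detector in production
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b|[\w.+-]+@[\w-]+\.\w+", answer):
        failures.append("possible PII in output")
    return failures
```

Because these checks are deterministic and cheap, they can run on every output before any model-as-judge call is spent.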
Layer 2: Model-as-judge (flexible, moderate cost)
- Use GPT-4/Claude to score outputs on rubric
- Criteria: accuracy, helpfulness, safety, hallucination
- Score 1-5 with reasoning
- Compare to baseline (previous prompt/model)
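A model-as-judge scorer is essentially a rubric prompt plus response parsing. A sketch with the model call abstracted behind a `call_model` callable — the rubric wording and the JSON reply shape are assumptions, not the skill's actual interface:

```python
import json

RUBRIC = """Score the RESPONSE to the QUESTION from 1-5 on each criterion:
accuracy, helpfulness, safety, hallucination (5 = no hallucination).
Reply as JSON: {"scores": {...}, "reasoning": "..."}"""

def judge(question: str, response: str, call_model) -> dict:
    """Layer 2: model-as-judge. `call_model` is any callable that sends a
    prompt to GPT-4/Claude and returns the raw text reply."""
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\nRESPONSE: {response}"
    reply = json.loads(call_model(prompt))
    scores = reply["scores"]
    reply["mean_score"] = sum(scores.values()) / len(scores)
    return reply
```

Keeping the model call behind a callable makes the judge trivial to unit-test with a canned reply and easy to point at a baseline model for comparison runs.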
Layer 3: Human review (ground truth, expensive)
- Sample 5-10% of outputs weekly
- Focus on borderline model-as-judge scores (2-3 out of 5)
- Use disagreements to improve rubric
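The weekly human-review sample can be drawn so borderline judge scores are reviewed first, with the rest of the budget filled randomly. A sketch — the `(output_id, judge_score)` tuple shape is an assumption:

```python
import random

def select_for_human_review(results, sample_rate=0.075, lo=2, hi=3, seed=0):
    """Layer 3: sample ~5-10% of outputs weekly, prioritizing borderline
    judge scores (2-3 out of 5). `results` items are (output_id, judge_score)."""
    rng = random.Random(seed)
    borderline = [r for r in results if lo <= r[1] <= hi]
    clear = [r for r in results if not (lo <= r[1] <= hi)]
    budget = max(1, int(len(results) * sample_rate))
    # Fill the budget with borderline cases first, then random clear cases
    picked = borderline[:budget]
    remaining = budget - len(picked)
    if remaining > 0:
        picked += rng.sample(clear, min(remaining, len(clear)))
    return picked
```

The fixed seed keeps the weekly sample reproducible, which helps when disagreements are later used to refine the rubric.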
```yaml
# .github/workflows/ai-eval.yml
name: AI Eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run eval harness
        run: python eval/run_harness.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check pass rate
        run: python eval/check_thresholds.py --min-pass-rate 0.85
      - name: Upload results
        if: always()  # upload even when the threshold check fails
        uses: actions/upload-artifact@v3
        with:
          name: eval-results
          path: eval/results/
```
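The `check_thresholds.py` step can be a small script that exits non-zero when the pass rate regresses, which is what fails the CI job. A sketch, assuming the harness writes a JSON list of per-case results (that file format is an assumption):

```python
# eval/check_thresholds.py (sketch)
import argparse
import json
import sys

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-pass-rate", type=float, default=0.85)
    parser.add_argument("--results", default="eval/results/results.json")
    args = parser.parse_args(argv)
    with open(args.results) as f:
        results = json.load(f)  # assumed: [{"case_id": ..., "passed": bool}, ...]
    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(f"pass rate: {pass_rate:.2%} (threshold {args.min_pass_rate:.2%})")
    # A non-zero exit code fails the CI job when quality regresses
    sys.exit(0 if pass_rate >= args.min_pass_rate else 1)

if __name__ == "__main__":
    main()
```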
```python
# A/B test two prompts against the golden dataset
results = {
    "prompt_v1": run_eval(dataset, prompt_v1),
    "prompt_v2": run_eval(dataset, prompt_v2),
}
# Compare: accuracy, cost, latency, safety
winner = select_winner(results, primary_metric="accuracy", cost_constraint=1.2)
```
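One possible implementation of the `select_winner` helper above, assuming each entry in `results` is a dict of aggregate metrics that includes a `cost` field (both assumptions), with `cost_constraint` read as the maximum allowed cost ratio versus the first (baseline) variant:

```python
def select_winner(results, primary_metric="accuracy", cost_constraint=1.2):
    """Pick the variant with the best primary metric, rejecting challengers
    whose cost exceeds the baseline's by more than `cost_constraint` (ratio).
    `results` maps variant name -> dict of aggregate metrics; the first
    entry is treated as the baseline."""
    baseline_name, *challenger_names = results
    baseline = results[baseline_name]
    winner = baseline_name
    for name in challenger_names:
        candidate = results[name]
        better = candidate[primary_metric] > results[winner][primary_metric]
        affordable = candidate["cost"] <= baseline["cost"] * cost_constraint
        if better and affordable:
            winner = name
    return winner
```

Treating the first entry as the baseline means a challenger must beat it on the primary metric *and* stay within the cost budget, which avoids shipping an accuracy win that doubles spend.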
| Tool | Use case |
|---|---|
| LangSmith | Tracing + eval for LangChain apps |
| Braintrust | Eval platform, dataset management, CI integration |
| PromptFoo | Open-source prompt testing, CLI-first |
| Weights & Biases | Experiment tracking, model comparison |
| Ragas | RAG-specific evaluation (faithfulness, answer relevance) |
| pytest + custom | Lightweight eval for simple use cases |
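For the "pytest + custom" row, a lightweight eval can be an ordinary test file: golden cases as data, one assertion per case. A sketch with a canned `model_answer` stand-in where the real model client would go:

```python
# test_eval.py — runs under pytest (test_* functions are auto-collected)
GOLDEN_CASES = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]

def model_answer(question: str) -> str:
    """Stand-in for the real model call; replace with your client."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned[question]

def test_golden_cases():
    for question, expected in GOLDEN_CASES:
        # Substring match tolerates phrasing differences in longer answers
        assert expected in model_answer(question), question
```

This gets you CI integration for free via `pytest`; the trade-off versus dedicated eval platforms is that dataset versioning, scoring dashboards, and trace storage stay your problem.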
Avoid declaring "improvement" without statistical evidence:
```python
from dataclasses import dataclass
from statistics import mean, stdev
from scipy import stats

@dataclass
class StatResult:
    significant: bool
    p_value: float
    effect_size: float
    practical_significant: bool
    recommendation: str

def is_statistically_significant(baseline_scores, new_scores, alpha=0.05):
    """Two-sample t-test for eval score comparison."""
    t_stat, p_value = stats.ttest_ind(baseline_scores, new_scores)
    effect_size = (mean(new_scores) - mean(baseline_scores)) / stdev(baseline_scores)
    return StatResult(
        significant=p_value < alpha,
        p_value=p_value,
        effect_size=effect_size,  # Cohen's d, approximated with the baseline std
        practical_significant=abs(effect_size) > 0.2,  # at least a small effect
        recommendation="ship" if (p_value < alpha and effect_size > 0.2) else "no change",
    )
```
Minimum sample sizes for reliable conclusions:
- Small effect (d = 0.2): n ≥ 197 per group
- Medium effect (d = 0.5): n ≥ 52 per group
- Large effect (d = 0.8): n ≥ 26 per group
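These minimums depend on the test design and the power you target, so treat the numbers above as rough guides. For an unpaired two-sample comparison at alpha = 0.05 (two-sided) and 80% power, the standard normal-approximation formula is easy to compute directly:

```python
import math

def n_per_group(d: float, z_alpha: float = 1.96, z_beta: float = 0.8416) -> int:
    """Required sample size per group for a two-sample comparison,
    via the normal approximation n = 2 * (z_alpha + z_beta)^2 / d^2.
    Defaults: alpha = 0.05 two-sided (z = 1.96), power = 0.80 (z = 0.8416)."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)
```

The 1/d² term is the key intuition: halving the effect size you want to detect quadruples the samples you need per group.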
Golden datasets degrade over time, so keep them fresh:
```python
from datetime import date

class EvalDatasetManager:
    def review_test_cases(self, dataset: Dataset, threshold_days=30):
        stale = [tc for tc in dataset if tc.last_reviewed_days > threshold_days]
        # Pass rate 0% = always fails (broken test or impossible);
        # 100% = always passes (too easy). Either way the case carries no signal.
        low_signal = [tc for tc in dataset if tc.pass_rate in (0.0, 1.0)]
        return DatasetHealthReport(
            stale_count=len(stale),
            low_signal_count=len(low_signal),
            action="review_and_update"
            if len(stale) + len(low_signal) > len(dataset) * 0.2
            else "ok",
        )

    def add_from_production_failures(self, dataset: Dataset, prod_failures: list[Failure]):
        """Convert production failures into eval test cases."""
        for failure in prod_failures:
            if failure.confirmed_bug:  # human verified
                dataset.add(TestCase(
                    input=failure.input,
                    expected=failure.expected_output,
                    source="production_failure",
                    added_date=date.today(),
                ))
```
Dataset health signals:
- More than 20% of tests always pass → too easy; add harder cases
- More than 10% of tests never pass → broken tests or a capability gap; investigate
Related skills: llm-evaluation (frameworks), llm-observability (production monitoring), and ai-safety-guardrails (to add safety checks as an eval layer).