From north-starr-genai
Generate executable pytest test files for AI outputs. Produces assertion-based tests for deterministic AI components (classification, extraction, routing, structured output) that run in CI/CD. Complements /eval-suite which produces statistical evaluation datasets for non-deterministic outputs.
npx claudepluginhub selcukyucel/north-starr-genai --plugin north-starr-genai
How this skill is triggered (by the user, by Claude, or both):
Slash command
/north-starr-genai:ai-test <prompt name, pipeline name, or eval suite path>
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
`/eval-suite` produces evaluation datasets (JSONL) scored by rubrics — ideal for generative, non-deterministic AI outputs where statistical thresholds matter. But many AI components produce **deterministic, structured outputs** (classification labels, extracted entities, routing decisions, structured JSON) where hard assertions are appropriate and valuable.
This skill generates executable pytest files with fixtures, assertions, and markers that developers can commit to the repo and run in CI/CD. It bridges the gap between eval datasets and automated testing.
Use this skill when:
Do NOT use this skill when:
/eval-suite + /prompt-test instead
Both can coexist: use /ai-test for deterministic contract tests (format, classification, routing) and /eval-suite for quality evaluation on the same component.
Before starting, read the root CLAUDE.md (if it exists). If ## How to Approach Tasks appears AFTER ## Tech Stack or ## Architecture, the file has an outdated section order. Silently run /genai-sync to fix it before continuing with this skill.
The user provides one of:
- .plans/EVAL-<name>/) to convert golden examples into executable tests
Actions:
- .plans/EVAL-<name>/): golden examples with expected outputs are the best source for test cases
- tests/, test_*.py, *_test.py) to match the project's test conventions
- CLAUDE.md, AGENTS.md) for test runner, framework, and conventions
Identify these properties:
- response_format)? If not deterministic, warn the user and suggest /eval-suite instead.
Based on the component type, select which assertion patterns apply:
| Output Type | Assertion Pattern | Example |
|---|---|---|
| Classification label | assert result.category == "billing" | Customer inquiry classifier |
| Structured JSON | assert result.field == value per field | Entity extraction, form parsing |
| Boolean decision | assert result.decision is True | Content moderation, eligibility check |
| Routing target | assert result.route == "escalate" | Ticket routing, intent detection |
| Enum/literal | assert result.status in ["approved", "rejected"] | Decision pipeline |
| Extraction with list | assert "entity" in result.entities | NER, keyword extraction |
| Numeric output | assert abs(result.score - expected) < tolerance | Scoring, rating systems |
Soft assertions (for semi-deterministic outputs):
Some fields are deterministic (category, required fields) while others vary (response text, explanation). Split assertions:
```python
# Hard assertions: must match exactly
assert result.category == "billing"
assert result.priority == "P2"

# Soft assertions: check properties, not exact text
assert len(result.response) > 10
assert "refund" in result.response.lower()
```
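The soft checks above can be gathered into one helper so that a failing test reports every violated property at once. This is a minimal sketch: `soft_check`, its parameters, and its messages are hypothetical names chosen for illustration, not part of the skill.

```python
def soft_check(text, min_length=10, must_contain=()):
    """Return a list of property violations; an empty list means the text passes.

    Hypothetical helper: checks length and keyword presence, never exact wording.
    """
    problems = []
    if len(text) <= min_length:
        problems.append(
            f"response has {len(text)} chars, expected more than {min_length}"
        )
    lowered = text.lower()
    for keyword in must_contain:
        # Case-insensitive substring match, mirroring `in result.response.lower()`.
        if keyword.lower() not in lowered:
            problems.append(f"response does not mention {keyword!r}")
    return problems


# Usage inside a test: the assertion message lists every violated property.
response = "We have issued a refund for the duplicate charge."
problems = soft_check(response, must_contain=["refund"])
assert not problems, problems
```

Hard assertions on fields such as `category` would stay as plain `assert` statements next to this call.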
Source 1: Existing eval suite (preferred)
If .plans/EVAL-<name>/golden.jsonl exists, convert golden examples:
- expected_output becomes a test case
- scoring field indicates which assertions to generate
- category field maps to pytest markers (@pytest.mark.happy_path, @pytest.mark.edge_case)
Source 2: Existing test data
Check for fixture files (events/, fixtures/, test_data/, tests/fixtures/) — reuse existing test inputs.
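For Source 1, the conversion of golden examples into test cases might look roughly like the sketch below. Only `expected_output` and `category` come from the description above; the `id` and `input` keys, the `parse_golden` name, and the choice to skip records without `expected_output` are assumptions made for illustration.

```python
import json


def parse_golden(jsonl_text):
    """Turn golden.jsonl content into a list of test-case dicts.

    Assumed record keys: "id" and "input" (hypothetical), plus
    "expected_output" and "category" as named in the eval suite description.
    """
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if "expected_output" not in record:
            continue  # assumption: nothing to assert on, so no test case
        cases.append(
            {
                # Keep the original ID so failures trace back to the eval suite.
                "id": record.get("id", f"line-{line_no}"),
                "input": record.get("input"),
                "expected": record["expected_output"],
                # Category name doubles as the pytest marker name.
                "marker": record.get("category"),
            }
        )
    return cases
```

The resulting list could then be fed to `pytest.mark.parametrize`, using each case's `id` as the test ID.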
Source 3: Generate from the component
If no test data exists, analyze the prompt and output schema to generate test cases:
For each test case, determine:
Before generating code, read the project's existing test setup:
Check for:
- tests/, test/, root-level test_*.py
- conftest.py, fixture files, factory functions
- test_*.py or *_test.py
- @pytest.mark.slow, @pytest.mark.integration
- .env loading, mock clients, test API keys
Defaults if no conventions found:
- tests/ai/test_<component_name>.py
- conftest.py in the test directory
Generate the test file(s) following the project's conventions. The output is real, executable Python code, not pseudocode or templates.
File structure:
```python
"""
AI output tests for <component name>.
Generated by /ai-test from <source: eval suite / manual / generated>.
Tests deterministic AI outputs with hard assertions.
For statistical quality evaluation, see /eval-suite and /prompt-test.
"""
import json
from pathlib import Path

import pytest

# Project-specific imports (adapt to the actual codebase)
from <module> import <function_or_class>


# --- Fixtures ---

@pytest.fixture
def ai_client():
    """Initialize the AI component under test."""
    # Adapt to the project's initialization pattern
    return <ComponentClass>()


@pytest.fixture
def load_event():
    """Load test event from fixture file."""
    def _load(filename: str) -> dict:
        with open(Path(__file__).parent / "fixtures" / filename) as f:
            return json.load(f)
    return _load


# --- Classification Tests ---

class TestClassification:
    """Test output classification accuracy."""

    @pytest.mark.happy_path
    def test_<category>_categorization(self, ai_client, load_event):
        """<Category> inputs should be classified as '<category>'."""
        event = load_event("<category>_test.json")
        result = ai_client.process(event["message"])
        assert result.category == "<expected_category>"
        assert len(result.response) > 10

    # ... one test per category/label


# --- Schema Validation Tests ---

class TestOutputSchema:
    """Test output format and schema compliance."""

    @pytest.mark.schema
    def test_output_has_required_fields(self, ai_client):
        """Every output must contain all required fields."""
        result = ai_client.process("Sample input")
        assert hasattr(result, "category")
        assert hasattr(result, "response")

    @pytest.mark.schema
    def test_category_is_valid_enum(self, ai_client):
        """Category must be one of the allowed values."""
        result = ai_client.process("Sample input")
        assert result.category in ["complaint", "feature_request", "billing", "other"]


# --- Edge Case Tests ---

class TestEdgeCases:
    """Test boundary and ambiguous inputs."""

    @pytest.mark.edge_case
    def test_ambiguous_input(self, ai_client):
        """Ambiguous input should still produce a valid category."""
        result = ai_client.process("<ambiguous input>")
        assert result.category in ["<valid_categories>"]
        assert len(result.response) > 0


# --- Regression Anchors ---

class TestRegressionAnchors:
    """Critical outputs that must not change between versions.

    Any failure here is a regression: investigate before shipping.
    """

    @pytest.mark.regression
    def test_<critical_case>(self, ai_client):
        """<Why this case is critical>."""
        result = ai_client.process("<critical input>")
        assert result.category == "<expected>"
```
Generation rules:
- test_billing_inquiry_classified_as_billing, not test_1
- @pytest.mark.happy_path, @pytest.mark.edge_case, @pytest.mark.regression, @pytest.mark.schema for selective running
For each test case, create the corresponding fixture file:
JSON fixture format:
```json
{
  "id": "billing-001",
  "message": "<the complete test input>",
  "expected": {
    "category": "billing",
    "response_contains": ["refund", "charge"]
  },
  "notes": "<why this test case exists>"
}
```
Write fixtures to the test directory (e.g., tests/ai/fixtures/). One fixture file per test case, or one file per category with multiple cases.
If converting from an existing eval suite, preserve the original IDs so test failures can be traced back to the eval suite.
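To illustrate how fixture IDs can surface in test failures, here is a small sketch that loads every fixture file and checks a category against the fixture's `expected` block. The `load_cases` and `check_category` names are hypothetical, and treating a multi-case file as a JSON list is an assumption, since the layout of such files is not specified above.

```python
import json
from pathlib import Path


def load_cases(fixture_dir):
    """Collect cases from every *.json file in fixture_dir.

    Assumption: a file holds either one case (a JSON object) or
    several cases (a JSON list of objects).
    """
    cases = []
    for path in sorted(Path(fixture_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        cases.extend(data if isinstance(data, list) else [data])
    return cases


def check_category(case, actual_category):
    """Hard assertion whose message carries the fixture ID for traceability."""
    expected = case["expected"]["category"]
    assert actual_category == expected, (
        f"{case['id']}: expected category {expected!r}, got {actual_category!r}"
    )
```

A failure then reads like `billing-001: expected category 'billing', got 'other'`, which points straight at the fixture and, if the ID was preserved, at the eval suite entry it came from.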
If the project doesn't have a conftest.py that handles AI test setup, generate one:
```python
"""Shared fixtures for AI output tests."""
import os

import pytest
from dotenv import load_dotenv


@pytest.fixture(scope="session", autouse=True)
def load_env():
    """Load environment variables for AI API calls."""
    load_dotenv()


@pytest.fixture(scope="session")
def require_api_key():
    """Skip tests if API key is not set."""
    # Not autouse: a test or fixture must request it for the skip to apply.
    if not os.getenv("OPENAI_API_KEY"):
        pytest.skip("OPENAI_API_KEY not set, skipping AI tests")
```
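Custom markers such as `happy_path` or `regression` trigger an unknown-mark warning in pytest unless they are registered, and they become errors under `--strict-markers`. One way to register them, sketched here as an addition to conftest.py, is the `pytest_configure` hook. The marker descriptions and the `AI_TEST_MARKERS` name are illustrative assumptions; the marker names are the ones used in this skill.

```python
# Marker name -> description shown by `pytest --markers`.
# Descriptions are illustrative; adjust them to the project.
AI_TEST_MARKERS = {
    "happy_path": "typical inputs with a clear expected output",
    "edge_case": "boundary or ambiguous inputs",
    "regression": "critical outputs that must not change between versions",
    "schema": "output format and schema compliance",
}


def pytest_configure(config):
    """pytest hook: register the custom markers so they are known to pytest."""
    for name, description in AI_TEST_MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

If the project already declares markers in its pytest configuration file, adding these entries there would be the more conventional place.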
Actions:
- tests/ai/test_<component_name>.py: the test code
- tests/ai/fixtures/<name>.json: test fixture files
- tests/ai/conftest.py: shared fixtures (if needed, don't overwrite existing)
- tests/ai/EVAL_MAPPING.md: maps test IDs to eval suite IDs for traceability
After writing the test files, attempt to run them:
- python -m py_compile tests/ai/test_<name>.py
- pytest tests/ai/test_<name>.py --collect-only (verifies pytest can discover the tests)
- pytest tests/ai/test_<name>.py -v (actually execute)
- @pytest.mark.flaky or increase tolerance)
```
AI Tests Generated: <component name>
─────────────────────────────────────
Directory: tests/ai/
Source: <eval suite / generated / manual>

Files:
  test_<name>.py   -- <N> tests (<N> happy path, <N> edge case, <N> regression, <N> schema)
  fixtures/        -- <N> fixture files
  conftest.py      -- shared fixtures (API key handling, env loading)
  EVAL_MAPPING.md  -- maps test IDs to eval suite (if applicable)

Test Results:
  Collected: <N>
  Passed: <N>
  Failed: <N>
  Skipped: <N>

Run with:
  pytest tests/ai/test_<name>.py -v   # all tests
  pytest tests/ai/ -m happy_path      # happy path only
  pytest tests/ai/ -m regression      # regression anchors only
  pytest tests/ai/ -m "not slow"      # skip slow tests

Next steps:
  1. Review generated tests and adjust assertions where needed
  2. Add to CI/CD pipeline
  3. Run /eval-suite for statistical quality evaluation (complements these tests)
  4. As you find production bugs, add regression tests with /ai-test
```
| Skill | Relationship |
|---|---|
| /eval-suite | Produces JSONL datasets that /ai-test can convert into executable tests. /eval-suite is for statistical evaluation; /ai-test is for hard assertions. Use both. |
| /prompt-test | Runs interactive evaluation using rubrics. /ai-test produces code that runs without human interaction. |
| /baseline | Captures performance metrics. /ai-test catches regressions with hard assertions. |
| /autoimprove | Iteratively optimizes prompts. Run /ai-test after each round to verify deterministic outputs didn't regress. |
- /eval-suite.
- response_format, JSON mode), schema validation tests are nearly free: always include them.
- @pytest.mark.regression before fixing. This ensures the fix sticks.