Skill

testing-llm

From ork

Provides LLM and AI testing patterns including mock responses, DeepEval/RAGAS evaluation, structured output validation, and agentic tests (generator, healer, planner). Use for testing AI features and evaluation pipelines.

Popularity

Parent stars: 172

Parent forks: 15

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/ork:testing-llm

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read, Glob, Grep, WebFetch, WebSearch

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Patterns and tools for testing LLM integrations, evaluating AI output quality, mocking responses for deterministic CI, and applying agentic test workflows (planner, generator, healer).

Supporting Files

checklists/llm-test-checklist.md
examples/llm-test-patterns.md
references/deepeval-ragas-api.md
references/generator-agent.md
references/healer-agent.md
references/planner-agent.md
rules/_sections.md
rules/llm-evaluation.md
rules/llm-mocking.md
test-cases.json

SKILL.md

183 lines · ~1.6k tokens

Stats

Language: TypeScript

Parent stars: 172

Parent forks: 15

Maintenance: Excellent

Last Commit: Mar 29, 2026


LLM & AI Testing Patterns

Patterns and tools for testing LLM integrations, evaluating AI output quality, mocking responses for deterministic CI, and applying agentic test workflows (planner, generator, healer).

Quick Reference

Area	File	Purpose
Rules	`rules/llm-evaluation.md`	DeepEval quality metrics, Pydantic schema validation, timeout testing
Rules	`rules/llm-mocking.md`	Mock LLM responses, VCR.py recording, custom request matchers
Reference	`references/deepeval-ragas-api.md`	Full API reference for DeepEval and RAGAS metrics
Reference	`references/generator-agent.md`	Transforms Markdown specs into Playwright tests
Reference	`references/healer-agent.md`	Auto-fixes failing tests (selectors, waits, dynamic content)
Reference	`references/planner-agent.md`	Explores app and produces Markdown test plans
Checklist	`checklists/llm-test-checklist.md`	Complete LLM testing checklist (setup, coverage, CI/CD)
Example	`examples/llm-test-patterns.md`	Full examples: mocking, structured output, DeepEval, VCR, golden datasets

When to Use This Skill

Testing code that calls LLM APIs (OpenAI, Anthropic, etc.)
Validating RAG pipeline output quality
Setting up deterministic LLM tests in CI
Building evaluation pipelines with quality gates
Applying agentic test patterns (plan -> generate -> heal)

LLM Mock Quick Start

Mock LLM responses for fast, deterministic unit tests:

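from unittest.mock import AsyncMock, patch
import pytest

# synthesize_findings and sample_findings come from the application under test.

@pytest.fixture
def mock_llm():
    # AsyncMock lets the test await the model call without touching the network.
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    # Patch the factory so the code under test receives the mock instead of a real model.
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
    assert result["summary"] is not None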

Key rule: NEVER call live LLM APIs in CI. Use mocks for unit tests, VCR.py for integration tests.
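A mock can also verify what was sent to the model, not only what came back. The following is a minimal, self-contained sketch using only the standard library. `summarize` and `run_mocked_summary` are hypothetical names standing in for application code, and the response shape simply mirrors the fixture above.

```python
import asyncio
from unittest.mock import AsyncMock


async def summarize(llm, text: str) -> str:
    """Hypothetical application function: sends a prompt, returns the content."""
    response = await llm(f"Summarize: {text}")
    return response["content"]


def run_mocked_summary() -> tuple[str, str]:
    """Run summarize() against a mock; return (result, prompt the mock received)."""
    mock_llm = AsyncMock(
        return_value={"content": "Mocked response", "confidence": 0.85}
    )
    result = asyncio.run(summarize(mock_llm, "quarterly report"))
    # The mock records every await, so the outgoing prompt can be asserted on too.
    mock_llm.assert_awaited_once_with("Summarize: quarterly report")
    return result, mock_llm.await_args.args[0]
```

Asserting on the outgoing prompt catches regressions in prompt construction that a check on the return value alone would miss.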

DeepEval Quality Quick Start

Validate LLM output quality with multi-dimensional metrics:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

def test_answer_quality():
    # retrieval_context is required by FaithfulnessMetric.
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
        retrieval_context=["Paris is the capital of France."],
    )

    # assert_test fails the test if any metric does not meet its threshold.
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
    ])

Quality Metrics Thresholds

Metric	Threshold	Purpose
Answer Relevancy	>= 0.7	Response addresses question
Faithfulness	>= 0.8	Output matches context
Hallucination	<= 0.3	No fabricated facts
Context Precision	>= 0.7	Retrieved contexts relevant
Context Recall	>= 0.7	All relevant contexts retrieved
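The thresholds in the table can be expressed as a small quality gate. This is a sketch under stated assumptions: the metric keys (`answer_relevancy`, `hallucination`, and so on) and the `failed_metrics` helper are hypothetical names, and the scores are assumed to have been computed already by whichever evaluation library is in use.

```python
import operator

# Thresholds copied from the table above: (comparison, limit).
# The dictionary keys are illustrative names, not a library's identifiers.
THRESHOLDS = {
    "answer_relevancy": (operator.ge, 0.7),
    "faithfulness": (operator.ge, 0.8),
    "hallucination": (operator.le, 0.3),
    "context_precision": (operator.ge, 0.7),
    "context_recall": (operator.ge, 0.7),
}


def failed_metrics(scores: dict[str, float]) -> list[str]:
    """Return the names of metrics that are missing or violate their threshold."""
    failures = []
    for name, (compare, limit) in THRESHOLDS.items():
        score = scores.get(name)
        # A missing score counts as a failure so gaps cannot pass silently.
        if score is None or not compare(score, limit):
            failures.append(name)
    return failures
```

A CI step could then fail the build whenever `failed_metrics(scores)` is non-empty. Note that Hallucination is the one metric where lower is better, which is why the comparison operator is stored alongside each limit.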

Structured Output Validation

Always validate LLM output with Pydantic schemas:

import pytest
from pydantic import BaseModel, Field

class LLMResponse(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    sources: list[str] = Field(default_factory=list)

# get_llm_response comes from the application under test.
@pytest.mark.asyncio
async def test_structured_output():
    result = await get_llm_response("test query")
    # model_validate raises ValidationError if the output violates the schema.
    parsed = LLMResponse.model_validate(result)
    assert 0 <= parsed.confidence <= 1.0

VCR.py for Integration Tests

Record and replay LLM API calls for deterministic integration tests:

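import os
import pytest

@pytest.fixture(scope="module")
def vcr_config():
    return {
        # In CI, replay existing cassettes only; locally, record any new requests.
        "record_mode": "none" if os.environ.get("CI") else "new_episodes",
        # Keep credentials out of recorded cassettes.
        "filter_headers": ["authorization", "x-api-key"],
    }

# llm_client comes from the application under test.
@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_llm_integration():
    response = await llm_client.complete("Say hello")
    assert "hello" in response.content.lower()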

Agentic Test Workflow

The three-agent pattern for end-to-end test automation:

Planner -> specs/*.md -> Generator -> tests/*.spec.ts -> Healer (auto-fix)

Planner (references/planner-agent.md): Explores your app, produces Markdown test plans from PRDs or natural language requests. Requires seed.spec.ts for app context.
Generator (references/generator-agent.md): Converts Markdown specs into Playwright tests. Actively validates selectors against the running app. Uses semantic locators (getByRole, getByLabel, getByText).
Healer (references/healer-agent.md): Automatically fixes failing tests by replaying failures, inspecting the DOM, and patching locators/waits. Max 3 healing attempts per test.
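The Healer's bounded retry can be pictured as a simple loop. This is an illustrative sketch only, not the agent's actual implementation: `heal`, `run_test`, and `apply_fix` are hypothetical names, and only the limit of 3 attempts comes from the description above.

```python
from typing import Callable, Optional

MAX_HEALING_ATTEMPTS = 3  # limit stated for the Healer above


def heal(
    run_test: Callable[[], Optional[str]],
    apply_fix: Callable[[str], None],
    max_attempts: int = MAX_HEALING_ATTEMPTS,
) -> tuple[bool, int]:
    """Run a test, applying one fix after each failure, up to max_attempts fixes.

    run_test returns None on success or a failure message otherwise.
    Returns (passed, number_of_fixes_applied).
    """
    fixes = 0
    while True:
        failure = run_test()
        if failure is None:
            return True, fixes
        if fixes >= max_attempts:
            # Give up and surface the failure instead of looping forever.
            return False, fixes
        apply_fix(failure)
        fixes += 1
```

Capping the attempts keeps a genuinely broken feature from being "healed" indefinitely. After the limit is reached, the failure is reported for a human to inspect.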

Edge Cases to Always Test

For every LLM integration, cover these paths:

Empty/null inputs -- empty strings, None values
Long inputs -- truncation behavior near token limits
Timeouts -- fail-open vs fail-closed behavior
Schema violations -- invalid structured output
Prompt injection -- adversarial input resistance
Unicode -- non-ASCII characters in prompts and responses

See checklists/llm-test-checklist.md for the complete checklist.
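The timeout case is the easiest to leave untested, so here is a minimal sketch of the fail-open versus fail-closed distinction using only `asyncio`. The names `call_llm_with_timeout` and `slow_llm` are hypothetical; a real test would wrap the application's own client call.

```python
import asyncio


async def call_llm_with_timeout(llm_call, timeout_s: float, fallback=None):
    """Await llm_call() with a deadline.

    If fallback is given, return it on timeout (fail-open);
    otherwise re-raise the timeout (fail-closed).
    """
    try:
        return await asyncio.wait_for(llm_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        if fallback is not None:
            return fallback
        raise


async def slow_llm():
    """Stand-in for a model call that takes longer than the test deadline."""
    await asyncio.sleep(0.5)
    return {"content": "too late"}
```

A test can then assert both behaviors: with a fallback the caller receives the default value, and without one the timeout propagates so the caller can reject the request.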

Anti-Patterns

Anti-Pattern	Correct Approach
Live LLM calls in CI	Mock for unit, VCR for integration
Random seeds	Fixed seeds or mocked responses
Single metric evaluation	3-5 quality dimensions
No timeout handling	Always set a short timeout (< 1s) in tests
Hardcoded API keys	Environment variables, filtered in VCR
Asserting only `is not None`	Schema validation + quality metrics

Related Skills

ork:testing-unit — Unit testing fundamentals, AAA pattern
ork:testing-integration — Integration testing for AI pipelines
ork:golden-dataset — Evaluation dataset management
