Sets up eval-based QA for Python LLM apps with pixie-qa: instruments code, builds golden datasets, runs end-to-end tests, iterates on failures.
You're building an automated QA pipeline that tests a Python application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via pixie test.
What you're testing is the app itself — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of assertEqual — but the thing under test is the app's code, not the LLM.
What's in scope: the app's entire code path from entry point to response — never mock or skip any part of it. What's out of scope: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — mock these to control inputs and reduce flakiness.
The deliverable is a working pixie test run with real scores — not a plan, not just instrumentation, not just a dataset.
This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline.
Run the following to keep the skill and package up to date. If any command fails or is blocked by the environment, continue — do not let failures here block the rest of the workflow.
Update the skill:
npx skills update
Upgrade the pixie-qa package:
Make sure the Python virtual environment is active and use the project's package manager:
# uv project (uv.lock exists):
uv add pixie-qa --upgrade
# poetry project (poetry.lock exists):
poetry add pixie-qa@latest
# pip / no lock file:
pip install --upgrade pixie-qa
Follow Steps 1–5 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.
Two modes:
- Setup: build the pipeline end-to-end (Steps 1–5) and report the results.
- Iteration: investigate and fix failures in an existing pipeline (Step 6).
If ambiguous: default to setup.
Read the source code to understand:
External dependencies and the interfaces the app defines for them (e.g., TranscriptionBackend, SynthesisBackend, StorageBackend). These are testability seams — you'll create mock implementations of these interfaces. If there's no clean interface, you'll use unittest.mock.patch at the module boundary.

Read references/understanding-app.md for detailed guidance on mapping data flows and the MEMORY.md template.
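Where a clean interface exists, the mock can be a small class that implements it and serves canned data. A minimal sketch, assuming a hypothetical TranscriptionBackend interface like the one named above (the real interface and method names come from the app's own code):

```python
# Hypothetical sketch: "TranscriptionBackend" and its transcribe() method stand in
# for whatever interface the app actually defines. The mock ignores the real
# external service and returns data the test controls.
class FakeTranscriptionBackend:
    def __init__(self, canned_transcript: str):
        self.canned_transcript = canned_transcript

    def transcribe(self, audio_chunk: bytes) -> str:
        # Return the transcript specified by the test's eval_input.
        return self.canned_transcript
```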
Write your findings to pixie_qa/MEMORY.md before moving on. Include the app's entry point, how it assembles context and routes requests, its external dependencies, and the mocking plan for each one.
Determine high-level, application-specific eval criteria:
Good criteria are specific to the app's purpose. Examples: "Routes the caller to the correct department", "Responses are grounded in the retrieved context", "Verifies caller identity before any transfer", "Responses are concise and phone-friendly".
Bad criteria are generic evaluator names dressed up as requirements. Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.
At this stage, don't pick evaluator classes or thresholds. That comes later in Step 5, after you've seen the real data shape.
Record the criteria in pixie_qa/MEMORY.md and continue.
Checkpoint: MEMORY.md written with app understanding + eval criteria. Proceed to Step 2.
Why this step: You need to see the actual data flowing through the app before you can build anything. This step serves two goals: deciding what to instrument so the data your evaluators need is captured, and producing a reference trace that shows the real data shape everything later is built on.
This is a normal app run with instrumentation — no mocks, no patches.
This is a reasoning step, not a coding step. Look at your eval criteria from Step 1 and your understanding of the codebase, and determine what data the evaluators will need:
Then decide which functions need @observe — so their inputs and outputs are captured in traces. Examples:
| Eval criterion | Data needed | What to instrument |
|---|---|---|
| "Routes to correct department" | The routing decision (which department was chosen) | The routing/dispatch function |
| "Responses grounded in retrieved context" | The retrieved documents + the final response | The retrieval function AND the response function |
| "Verifies caller identity before transfer" | Whether identity check happened, transfer decision | The identity verification function AND the transfer function |
| "Concise phone-friendly responses" | The final response text | The function that produces the LLM response |
LLM provider calls (OpenAI, Anthropic, etc.) are auto-captured — enable_storage() activates OpenInference instrumentors that automatically trace every LLM API call with full input messages, output messages, token usage, and model parameters. You do NOT need @observe on the function that calls client.chat.completions.create() just to see the LLM interaction.
Use @observe for application-level functions whose inputs, outputs, or intermediate states your evaluators need but that aren't visible from the LLM call alone. Examples: the app's entry-point function (to capture what the user sent and what the app returned), retrieval functions (to capture what context was fetched), routing functions (to capture dispatch decisions).
enable_storage() goes at application startup. Read references/instrumentation.md for the full rules, code patterns, and anti-patterns for adding instrumentation.
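A minimal sketch of what this instrumentation tends to look like. Only enable_storage and observe are pixie names (they match the imports listed at the end of this skill); the application functions are placeholders, and the bare-decorator style is an assumption, so treat references/instrumentation.md as authoritative:

```python
# Sketch only: handle_request / retrieve_context are placeholder names for this
# app's own functions; confirm exact decorator usage in references/instrumentation.md.
from pixie import enable_storage, observe

enable_storage()  # at application startup; also activates the OpenInference LLM instrumentors

@observe  # entry point: captures what the user sent and what the app returned
def handle_request(user_message: str) -> str:
    docs = retrieve_context(user_message)
    return generate_response(user_message, docs)

@observe  # retrieval: captures which documents were fetched, for groundedness criteria
def retrieve_context(user_message: str) -> list[str]:
    ...

def generate_response(user_message: str, docs: list[str]) -> str:
    # The LLM API call made in here is auto-traced by the instrumentors,
    # so no @observe is needed just to see the LLM interaction.
    ...
```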
Add @observe to the functions you identified in 2a. Then run the app normally — with its real external dependencies, or by manually interacting with it — to produce a reference trace. Do NOT mock or patch anything. This is an observation run.
If the app can't run without infrastructure you don't have (a real database, third-party service credentials, etc.), use the simplest possible approach to get it running — a local Docker container, a test account, or ask the user for help. The goal is one real trace.
uv run pixie trace list
uv run pixie trace last
Study the trace data carefully. This is your blueprint for everything that follows. Document:
- What does each @observe-decorated function receive as input and produce as output? Note the exact field names, types, and nesting.
- Check instrumentation completeness: for each eval criterion from Step 1, verify the trace contains the data needed to evaluate it. If not, add more @observe decorators and re-run.
Do not proceed until you understand the data shape and have confirmed the traces capture everything your evaluators need.
Checkpoint: Instrumentation added based on eval criteria. Reference trace captured with real data. For each criterion, confirm the trace contains the data needed to evaluate it. Proceed to Step 3.
Why this step: You need a function that test cases can call. Given an eval_input (app input + mock data for external dependencies), it starts the real application with external dependencies patched, sends the input through the app's real entry point, and returns the eval_output (app response + captured side-effects).
run_app(eval_input) → eval_output
Patch external dependencies — use the mocking plan from Step 1 item 4. For each external dependency, either inject a mock implementation of its interface (cleanest) or unittest.mock.patch the module-level client. The mock returns data from eval_input and captures side-effects for eval_output.
Call the app through its real entry point — the same way a real user or client would invoke it. Look at how the app is started: if it's a web server (FastAPI, Flask), use TestClient or HTTP requests. If it's a CLI, use subprocess. If it's a standalone function with no server or middleware, import and call it directly.
Collect the response — the app's output becomes eval_output, along with any side-effects captured by mock objects.
Read references/run-harness-patterns.md for concrete examples of entry point invocation for different app types.
Do NOT call an inner function like agent.respond() directly just because it's simpler. The whole point is to test the app's real code path — request handling, state management, prompt assembly, routing. When you call an inner function directly, you skip all of that, and the test has to reimplement it. Now you're testing test code, not app code.
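As a concrete illustration, here is a minimal run_app sketch for a FastAPI app. The module path myapp.main and the patched names are hypothetical; references/run-harness-patterns.md has the real patterns for each app type:

```python
# pixie_qa/scripts/run_app.py, sketched for a FastAPI app. "myapp.main",
# "lookup_customer", and "send_sms" are placeholder names; patch whichever
# external clients this app actually uses, at the point where it looks them up.
from unittest.mock import patch

from fastapi.testclient import TestClient

from myapp.main import app  # the real application, unmodified


def run_app(eval_input: dict) -> dict:
    sent_messages: list[str] = []  # side-effects captured from the mocks

    # External dependencies return data supplied by eval_input instead of
    # calling real services; everything inside the app stays real.
    with patch("myapp.main.lookup_customer", return_value=eval_input["customer_record"]), \
         patch("myapp.main.send_sms", side_effect=sent_messages.append):
        client = TestClient(app)
        resp = client.post("/chat", json={"message": eval_input["user_message"]})

    return {
        "response": resp.json(),         # the app's answer
        "sent_messages": sent_messages,  # captured side-effects for the evaluators
    }
```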
Take the eval_input from your Step 2 reference trace and feed it to the utility function. The outputs won't match word-for-word (non-deterministic), but verify that the eval_output has the same structure as the reference trace and that the run exercised the same code path (inspect the new trace to confirm).
If it fails after two attempts, stop and ask the user for help.
Checkpoint: Utility function implemented and verified. When fed the reference trace's eval_input, it produces eval_output with the same structure and exercises the same code path. Proceed to Step 4.
Why this step: The dataset is a collection of eval_input items (made up by you) that define the test scenarios. Each item may also carry case-specific expectations. The eval_output is NOT pre-populated in the dataset — it's produced at test time by the utility function from Step 3.
Before generating data, decide how each eval criterion from Step 1 will be checked.
Examine the reference trace from Step 2 and identify:
Case-specific expectations go in expected_output in the dataset item (e.g., "should mention the caller's appointment on Tuesday", "should route to billing department").

Create eval_input items that match the data shape from the reference trace:
Each dataset item contains:
- eval_input: the made-up input data (app input + external dependency data)
- expected_output: case-specific expectation text (optional — only for test cases with expectations beyond the universal criteria). This is a reference for evaluation, not an exact expected answer.

At test time, eval_output is produced by the utility function from Step 3 and is not stored in the dataset itself.
Read references/dataset-generation.md for the dataset creation API, data shape matching, expected_output strategy, and validation checklist.
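For orientation, this is roughly the item shape a build_dataset.py script produces, written here as plain JSON via Python. The field values are invented for illustration, and the actual dataset-creation API is defined in references/dataset-generation.md:

```python
# pixie_qa/scripts/build_dataset.py: illustrative shape only. Use the creation
# API documented in references/dataset-generation.md for the real script.
import json
from pathlib import Path

items = [
    {
        "eval_input": {  # app input + the data the mocked external dependencies will serve
            "user_message": "I need to move my appointment to Friday.",
            "customer_record": {"name": "A. Patel", "appointment": "Tuesday 10:00"},
        },
        # Optional and case-specific: a reference for the judge, not an exact answer.
        "expected_output": "Should mention the existing Tuesday appointment and offer Friday slots.",
    },
]

out = Path("pixie_qa/datasets/reschedule.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(items, indent=2))
```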
After building:
- Run build_dataset.py — don't just write it, run it.
- Check that expected_output values are specific and testable, not vague.

Checkpoint: Dataset created with diverse eval_inputs matching the reference trace's data shape. Proceed to Step 5.
Why this step: With the utility function built and the dataset ready, writing tests is straightforward — wire up the function, choose evaluators for each criterion, and run.
For each eval criterion from Step 1, decide how to evaluate it:
- If a built-in evaluator matches the criterion, use it (e.g., factual accuracy → FactualityEval, exact match → ExactMatchEval, RAG faithfulness → FaithfulnessEval).
- If no built-in fits, use create_llm_evaluator with a prompt that operationalizes the criterion.
- Check case-specific expectations against the expected_output from the dataset.

For open-ended LLM text, never use ExactMatchEval — LLM outputs are non-deterministic.
AnswerRelevancyEval is RAG-only — it requires a context value in the trace. Returns 0.0 without it. For general relevance without RAG, use create_llm_evaluator with a custom prompt.
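A sketch of such a custom judge, using only the name= and prompt_template= parameters shown in the evaluator table later in this skill. The {output} placeholder and the 0/1 scoring convention are assumptions; confirm the available template variables in references/eval-tests.md:

```python
# Sketch: the {output} placeholder and scoring convention are assumptions;
# see references/eval-tests.md for the real prompt_template variables.
from pixie import create_llm_evaluator

phone_conciseness = create_llm_evaluator(
    name="phone_conciseness",
    prompt_template=(
        "You are judging a reply spoken aloud by a phone assistant.\n"
        "Reply: {output}\n"
        "Score 1 if it is three sentences or fewer, in plain spoken language, "
        "with no markdown, bullet points, or URLs; otherwise score 0."
    ),
)
```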
Read references/eval-tests.md for the evaluator catalog, custom evaluator examples, and the test file boilerplate.
The test file wires together: a runnable (calls your utility function from Step 3), a reference to the dataset, and the evaluators you chose.
Read references/eval-tests.md for the exact assert_dataset_pass API, required parameter names, and common mistakes to avoid. Re-read the API reference immediately before writing test code — do not rely on earlier context.
Run with pixie test — not pytest:
uv run pixie test pixie_qa/tests/ -v
After running, verify the scorecard:
- Every test case recorded real evaluator scores; zero recorded evaluations usually means the runnable was never awaited (missing await).

A test that passes with no recorded evaluations is worse than a failing test — it gives false confidence. Debug until real scores appear.
Checkpoint: Tests run and produce real scores.
- Setup mode: Report results ("QA setup is complete. Tests show N/M passing.") and ask: "Want me to investigate the failures and iterate?" Stop here unless the user says yes.
- Iteration mode: Proceed directly to Step 6.
If the test errors out (import failures, missing keys), that's a setup bug — fix and re-run. But if tests produce real pass/fail scores, that's the deliverable.
Iteration mode only, or after the user confirmed in setup mode.
When tests fail, understand why — don't just adjust thresholds until things pass.
Read references/investigation.md for procedures and root-cause patterns.
The cycle: investigate root cause → fix (prompt, code, or eval config) → rebuild dataset if needed → re-run tests → repeat.
from pixie import enable_storage, observe, assert_dataset_pass, ScoreThreshold, last_llm_call
from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
Only from pixie import ... — never subpackages (pixie.storage, pixie.evals, etc.). There is no pixie.qa module.
uv run pixie test pixie_qa/tests/ -v # Run eval tests (NOT pytest)
uv run pixie trace list # List captured traces
uv run pixie trace last # Show most recent trace
uv run pixie trace show <id> --verbose # Show specific trace
uv run pixie dataset create <name> # Create a new dataset
pixie_qa/
MEMORY.md # your understanding and eval plan
datasets/ # golden datasets (JSON)
tests/ # eval test files (test_*.py)
scripts/ # run_app.py, build_dataset.py
All pixie files go here — not at the project root, not in a top-level tests/ directory.
| Output type | Evaluator | Notes |
|---|---|---|
| Open-ended text with reference answer | FactualityEval, ClosedQAEval | Best default for most apps |
| Open-ended text, no reference | AnswerRelevancyEval | RAG only — needs context in trace. Returns 0.0 without it. |
| Deterministic output | ExactMatchEval, JSONDiffEval | Never use for open-ended LLM text |
| RAG with retrieved context | FaithfulnessEval, ContextRelevancyEval | Requires context capture in instrumentation |
| Domain-specific quality | create_llm_evaluator(name=..., prompt_template=...) | Custom LLM-as-judge — use for app-specific criteria |
This file (SKILL.md) is loaded for the entire session. It contains the what and why — the reasoning, decision-making process, goals, and checkpoints for each step.
Reference files are loaded when executing a specific step. They contain the how — tactical API usage, code patterns, anti-patterns, troubleshooting, and ready-to-adapt examples.
When in doubt: if it's about deciding what to do, it's in SKILL.md. If it's about how to implement that decision, it's in a reference file.
| Reference | When to read |
|---|---|
references/understanding-app.md | Step 1 — investigating the codebase, MEMORY.md template |
references/instrumentation.md | Step 2 — @observe and enable_storage rules, code patterns, anti-patterns |
references/run-harness-patterns.md | Step 3 — examples of how to invoke different app types (web server, CLI, function) |
references/dataset-generation.md | Step 4 — crafting eval_input items, expected_output strategy, validation |
references/eval-tests.md | Step 5 — evaluator selection, test file pattern, assert_dataset_pass API |
references/investigation.md | Step 6 — failure analysis, root-cause patterns |
references/pixie-api.md | Any step — full CLI and Python API reference |