Sets up eval-based QA for Python LLM apps with pixie-qa: instruments code, builds golden datasets, runs end-to-end tests, iterates on failures.
You're building an automated QA pipeline that tests a Python application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via pixie test.
What you're testing is the app itself — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of assertEqual — but the thing under test is the app's code, not the LLM.
What's in scope: the app's entire code path from entry point to response — never mock or skip any part of it. What's out of scope: external data sources the app reads from (databases, caches, third-party APIs, voice streams) — mock these to control inputs and reduce flakiness.
The deliverable is a working pixie test run with real scores — not a plan, not just instrumentation, not just a dataset.
This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline.
Run the following to keep the skill and package up to date. If any command fails or is blocked by the environment, continue — do not let failures here block the rest of the workflow.
Update the skill:
npx skills update
Upgrade the pixie-qa package:
Make sure the Python virtual environment is active and use the project's package manager:
# uv project (uv.lock exists):
uv add pixie-qa --upgrade
# poetry project (poetry.lock exists):
poetry add pixie-qa@latest
# pip / no lock file:
pip install --upgrade pixie-qa
Follow Steps 1–5 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.
Two modes:
- Setup: build the pipeline end-to-end (Steps 1–5) and report the results.
- Iteration: investigate and fix failures in an existing pipeline (Step 6).
If ambiguous: default to setup.
Read the source code to understand:
External dependencies and the interfaces the app defines for them (e.g., TranscriptionBackend, SynthesisBackend, StorageBackend). These are testability seams — you'll create mock implementations of these interfaces. If there's no clean interface, you'll use unittest.mock.patch at the module boundary.

Read references/understanding-app.md for detailed guidance on mapping data flows and the MEMORY.md template.
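Where a clean interface exists, the mock can be a small class that implements it and serves canned data. A minimal sketch, assuming a hypothetical TranscriptionBackend interface like the one named above (the real interface and method names come from the app's own code):

```python
# Hypothetical sketch: "TranscriptionBackend" and its transcribe() method stand in
# for whatever interface the app actually defines. The mock ignores the real
# external service and returns data the test controls.
class FakeTranscriptionBackend:
    def __init__(self, canned_transcript: str):
        self.canned_transcript = canned_transcript

    def transcribe(self, audio_chunk: bytes) -> str:
        # Return the transcript specified by the test's eval_input.
        return self.canned_transcript
```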
Write your findings to pixie_qa/MEMORY.md before moving on. Include the app's entry point, how it assembles context and routes requests, its external dependencies, and the mocking plan for each one.
Determine high-level, application-specific eval criteria:
Good criteria are specific to the app's purpose. Examples: "Routes the caller to the correct department", "Responses are grounded in the retrieved context", "Verifies caller identity before any transfer", "Responses are concise and phone-friendly".
Bad criteria are generic evaluator names dressed up as requirements. Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.
At this stage, don't pick evaluator classes or thresholds. That comes later in Step 5, after you've seen the real data shape.
Record the criteria in pixie_qa/MEMORY.md and continue.
Checkpoint: MEMORY.md written with app understanding + eval criteria. Proceed to Step 2.
Why this step: You need to see the actual data flowing through the app before you can build anything. This step serves two goals: deciding what to instrument so the data your evaluators need is captured, and producing a reference trace that shows the real data shape everything later is built on.
This is a normal app run with instrumentation — no mocks, no patches.
This is a reasoning step, not a coding step. Look at your eval criteria from Step 1 and your understanding of the codebase, and determine what data the evaluators will need:
Then decide which functions need @observe — so their inputs and outputs are captured in traces. Examples:
| Eval criterion | Data needed | What to instrument |
|---|---|---|
| "Routes to correct department" | The routing decision (which department was chosen) | The routing/dispatch function |
| "Responses grounded in retrieved context" | The retrieved documents + the final response | The retrieval function AND the response function |
| "Verifies caller identity before transfer" | Whether identity check happened, transfer decision | The identity verification function AND the transfer function |
| "Concise phone-friendly responses" | The final response text | The function that produces the LLM response |
LLM provider calls (OpenAI, Anthropic, etc.) are auto-captured — enable_storage() activates OpenInference instrumentors that automatically trace every LLM API call with full input messages, output messages, token usage, and model parameters. You do NOT need @observe on the function that calls client.chat.completions.create() just to see the LLM interaction.
Use @observe for application-level functions whose inputs, outputs, or intermediate states your evaluators need but that aren't visible from the LLM call alone. Examples: the app's entry-point function (to capture what the user sent and what the app returned), retrieval functions (to capture what context was fetched), routing functions (to capture dispatch decisions).
enable_storage() goes at application startup. Read references/instrumentation.md for the full rules, code patterns, and anti-patterns for adding instrumentation.
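A minimal sketch of what this instrumentation tends to look like. Only enable_storage and observe are pixie names (they match the imports listed at the end of this skill); the application functions are placeholders, and the bare-decorator style is an assumption, so treat references/instrumentation.md as authoritative:

```python
# Sketch only: handle_request / retrieve_context are placeholder names for this
# app's own functions; confirm exact decorator usage in references/instrumentation.md.
from pixie import enable_storage, observe

enable_storage()  # at application startup; also activates the OpenInference LLM instrumentors

@observe  # entry point: captures what the user sent and what the app returned
def handle_request(user_message: str) -> str:
    docs = retrieve_context(user_message)
    return generate_response(user_message, docs)

@observe  # retrieval: captures which documents were fetched, for groundedness criteria
def retrieve_context(user_message: str) -> list[str]:
    ...

def generate_response(user_message: str, docs: list[str]) -> str:
    # The LLM API call made in here is auto-traced by the instrumentors,
    # so no @observe is needed just to see the LLM interaction.
    ...
```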
Add @observe to the functions you identified in 2a. Then run the app normally — with its real external dependencies, or by manually interacting with it — to produce a reference trace. Do NOT mock or patch anything. This is an observation run.
If the app can't run without infrastructure you don't have (a real database, third-party service credentials, etc.), use the simplest possible approach to get it running — a local Docker container, a test account, or ask the user for help. The goal is one real trace.
uv run pixie trace list
uv run pixie trace last
Study the trace data carefully. This is your blueprint for everything that follows. Document:
- What does each @observe-decorated function receive as input and produce as output? Note the exact field names, types, and nesting.
- Check instrumentation completeness: for each eval criterion from Step 1, verify the trace contains the data needed to evaluate it. If not, add more @observe decorators and re-run.
Do not proceed until you understand the data shape and have confirmed the traces capture everything your evaluators need.
Checkpoint: Instrumentation added based on eval criteria. Reference trace captured with real data. For each criterion, confirm the trace contains the data needed to evaluate it. Proceed to Step 3.
Why this step: You need a function that test cases can call. Given an eval_input (app input + mock data for external dependencies), it starts the real application with external dependencies patched, sends the input through the app's real entry point, and returns the eval_output (app response + captured side-effects).
run_app(eval_input) → eval_output
Patch external dependencies — use the mocking plan from Step 1 item 4. For each external dependency, either inject a mock implementation of its interface (cleanest) or unittest.mock.patch the module-level client. The mock returns data from eval_input and captures side-effects for eval_output.
Call the app through its real entry point — the same way a real user or client would invoke it. Look at how the app is started: if it's a web server (FastAPI, Flask), use TestClient or HTTP requests. If it's a CLI, use subprocess. If it's a standalone function with no server or middleware, import and call it directly.
Collect the response — the app's output becomes eval_output, along with any side-effects captured by mock objects.
Read references/run-harness-patterns.md for concrete examples of entry point invocation for different app types.
Do NOT call an inner function like agent.respond() directly just because it's simpler. The whole point is to test the app's real code path — request handling, state management, prompt assembly, routing. When you call an inner function directly, you skip all of that, and the test has to reimplement it. Now you're testing test code, not app code.
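As a concrete illustration, here is a minimal run_app sketch for a FastAPI app. The module path myapp.main and the patched names are hypothetical; references/run-harness-patterns.md has the real patterns for each app type:

```python
# pixie_qa/scripts/run_app.py, sketched for a FastAPI app. "myapp.main",
# "lookup_customer", and "send_sms" are placeholder names; patch whichever
# external clients this app actually uses, at the point where it looks them up.
from unittest.mock import patch

from fastapi.testclient import TestClient

from myapp.main import app  # the real application, unmodified


def run_app(eval_input: dict) -> dict:
    sent_messages: list[str] = []  # side-effects captured from the mocks

    # External dependencies return data supplied by eval_input instead of
    # calling real services; everything inside the app stays real.
    with patch("myapp.main.lookup_customer", return_value=eval_input["customer_record"]), \
         patch("myapp.main.send_sms", side_effect=sent_messages.append):
        client = TestClient(app)
        resp = client.post("/chat", json={"message": eval_input["user_message"]})

    return {
        "response": resp.json(),         # the app's answer
        "sent_messages": sent_messages,  # captured side-effects for the evaluators
    }
```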
Take the eval_input from your Step 2 reference trace and feed it to the utility function. The outputs won't match word-for-word (non-deterministic), but verify that the eval_output has the same structure as the reference trace and that the run exercised the same code path (inspect the new trace to confirm).
If it fails after two attempts, stop and ask the user for help.
Checkpoint: Utility function implemented and verified. When fed the reference trace's eval_input, it produces eval_output with the same structure and exercises the same code path. Proceed to Step 4.
Why this step: The dataset is a collection of eval_input items (made up by you) that define the test scenarios. Each item may also carry case-specific expectations. The eval_output is NOT pre-populated in the dataset — it's produced at test time by the utility function from Step 3.
Before generating data, decide how each eval criterion from Step 1 will be checked.
Examine the reference trace from Step 2 and identify:
Case-specific expectations go in expected_output in the dataset item (e.g., "should mention the caller's appointment on Tuesday", "should route to billing department").

Create eval_input items that match the data shape from the reference trace:
Each dataset item contains:
- eval_input: the made-up input data (app input + external dependency data)
- expected_output: case-specific expectation text (optional — only for test cases with expectations beyond the universal criteria). This is a reference for evaluation, not an exact expected answer.

At test time, eval_output is produced by the utility function from Step 3 and is not stored in the dataset itself.
Read references/dataset-generation.md for the dataset creation API, data shape matching, expected_output strategy, and validation checklist.
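For orientation, this is roughly the item shape a build_dataset.py script produces, written here as plain JSON via Python. The field values are invented for illustration, and the actual dataset-creation API is defined in references/dataset-generation.md:

```python
# pixie_qa/scripts/build_dataset.py: illustrative shape only. Use the creation
# API documented in references/dataset-generation.md for the real script.
import json
from pathlib import Path

items = [
    {
        "eval_input": {  # app input + the data the mocked external dependencies will serve
            "user_message": "I need to move my appointment to Friday.",
            "customer_record": {"name": "A. Patel", "appointment": "Tuesday 10:00"},
        },
        # Optional and case-specific: a reference for the judge, not an exact answer.
        "expected_output": "Should mention the existing Tuesday appointment and offer Friday slots.",
    },
]

out = Path("pixie_qa/datasets/reschedule.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(items, indent=2))
```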
After building:
- Run build_dataset.py — don't just write it, run it.
- Check that expected_output values are specific and testable, not vague.

Checkpoint: Dataset created with diverse eval_inputs matching the reference trace's data shape. Proceed to Step 5.
Why this step: With the utility function built and the dataset ready, writing tests is straightforward — wire up the function, choose evaluators for each criterion, and run.
For each eval criterion from Step 1, decide how to evaluate it:
- If a built-in evaluator matches the criterion, use it (e.g., factual accuracy → FactualityEval, exact match → ExactMatchEval, RAG faithfulness → FaithfulnessEval).
- If no built-in fits, use create_llm_evaluator with a prompt that operationalizes the criterion.
- Check case-specific expectations against the expected_output from the dataset.

For open-ended LLM text, never use ExactMatchEval — LLM outputs are non-deterministic.
AnswerRelevancyEval is RAG-only — it requires a context value in the trace. Returns 0.0 without it. For general relevance without RAG, use create_llm_evaluator with a custom prompt.
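A sketch of such a custom judge, using only the name= and prompt_template= parameters shown in the evaluator table later in this skill. The {output} placeholder and the 0/1 scoring convention are assumptions; confirm the available template variables in references/eval-tests.md:

```python
# Sketch: the {output} placeholder and scoring convention are assumptions;
# see references/eval-tests.md for the real prompt_template variables.
from pixie import create_llm_evaluator

phone_conciseness = create_llm_evaluator(
    name="phone_conciseness",
    prompt_template=(
        "You are judging a reply spoken aloud by a phone assistant.\n"
        "Reply: {output}\n"
        "Score 1 if it is three sentences or fewer, in plain spoken language, "
        "with no markdown, bullet points, or URLs; otherwise score 0."
    ),
)
```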
Read references/eval-tests.md for the evaluator catalog, custom evaluator examples, and the test file boilerplate.
The test file wires together: a runnable (calls your utility function from Step 3), a reference to the dataset, and the evaluators you chose.
Read references/eval-tests.md for the exact assert_dataset_pass API, required parameter names, and common mistakes to avoid. Re-read the API reference immediately before writing test code — do not rely on earlier context.
Run with pixie test — not pytest:
uv run pixie test pixie_qa/tests/ -v
After running, verify the scorecard:
- Every test case recorded real evaluator scores; zero recorded evaluations usually means the runnable was never awaited (missing await).

A test that passes with no recorded evaluations is worse than a failing test — it gives false confidence. Debug until real scores appear.
Checkpoint: Tests run and produce real scores.
- Setup mode: Report results ("QA setup is complete. Tests show N/M passing.") and ask: "Want me to investigate the failures and iterate?" Stop here unless the user says yes.
- Iteration mode: Proceed directly to Step 6.
If the test errors out (import failures, missing keys), that's a setup bug — fix and re-run. But if tests produce real pass/fail scores, that's the deliverable.
Iteration mode only, or after the user confirmed in setup mode.
When tests fail, understand why — don't just adjust thresholds until things pass.
Read references/investigation.md for procedures and root-cause patterns.
The cycle: investigate root cause → fix (prompt, code, or eval config) → rebuild dataset if needed → re-run tests → repeat.
from pixie import enable_storage, observe, assert_dataset_pass, ScoreThreshold, last_llm_call
from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
Only from pixie import ... — never subpackages (pixie.storage, pixie.evals, etc.). There is no pixie.qa module.
uv run pixie test pixie_qa/tests/ -v # Run eval tests (NOT pytest)
uv run pixie trace list # List captured traces
uv run pixie trace last # Show most recent trace
uv run pixie trace show <id> --verbose # Show specific trace
uv run pixie dataset create <name> # Create a new dataset
pixie_qa/
MEMORY.md # your understanding and eval plan
datasets/ # golden datasets (JSON)
tests/ # eval test files (test_*.py)
scripts/ # run_app.py, build_dataset.py
All pixie files go here — not at the project root, not in a top-level tests/ directory.
| Output type | Evaluator | Notes |
|---|---|---|
| Open-ended text with reference answer | FactualityEval, ClosedQAEval | Best default for most apps |
| Open-ended text, no reference | AnswerRelevancyEval | RAG only — needs context in trace. Returns 0.0 without it. |
| Deterministic output | ExactMatchEval, JSONDiffEval | Never use for open-ended LLM text |
| RAG with retrieved context | FaithfulnessEval, ContextRelevancyEval | Requires context capture in instrumentation |
| Domain-specific quality | create_llm_evaluator(name=..., prompt_template=...) | Custom LLM-as-judge — use for app-specific criteria |
This file (SKILL.md) is loaded for the entire session. It contains the what and why — the reasoning, decision-making process, goals, and checkpoints for each step.
Reference files are loaded when executing a specific step. They contain the how — tactical API usage, code patterns, anti-patterns, troubleshooting, and ready-to-adapt examples.
When in doubt: if it's about deciding what to do, it's in SKILL.md. If it's about how to implement that decision, it's in a reference file.
| Reference | When to read |
|---|---|
references/understanding-app.md | Step 1 — investigating the codebase, MEMORY.md template |
references/instrumentation.md | Step 2 — @observe and enable_storage rules, code patterns, anti-patterns |
references/run-harness-patterns.md | Step 3 — examples of how to invoke different app types (web server, CLI, function) |
references/dataset-generation.md | Step 4 — crafting eval_input items, expected_output strategy, validation |
references/eval-tests.md | Step 5 — evaluator selection, test file pattern, assert_dataset_pass API |
references/investigation.md | Step 6 — failure analysis, root-cause patterns |
references/pixie-api.md | Any step — full CLI and Python API reference |