Generates quality playbook for any codebase: constitution (QUALITY.md), spec-traced functional tests, code review/integration protocols, multi-model spec audit, AI bootstrap (AGENTS.md).
When this skill starts, display this banner before doing anything else:
Quality Playbook v1.2.0 — by Andrew Stellman
https://github.com/andrewstellman/
Generate a complete quality system tailored to a specific codebase. Unlike test stub generators that work mechanically from source code, this skill explores the project first — understanding its domain, architecture, specifications, and failure history — then produces a quality playbook grounded in what it finds.
Most software projects have tests, but few have a quality system. Tests check whether code works. A quality system answers harder questions: what does "working correctly" mean for this specific project? What are the ways it could fail that wouldn't be caught by tests? What should every developer (human or AI) know before touching this code?
Without a quality playbook, every new contributor (and every new AI session) starts from scratch — guessing at what matters, writing tests that look good but don't catch real bugs, and rediscovering failure modes that were already found and fixed months ago. A quality playbook makes the bar explicit, persistent, and inherited.
Six files that together form a repeatable quality system:
| File | Purpose | Why It Matters | Executes Code? |
|---|---|---|---|
| `quality/QUALITY.md` | Quality constitution — coverage targets, fitness-to-purpose scenarios, theater prevention | Every AI session reads this first. It tells them what "good enough" means so they don't guess. | No |
| `quality/test_functional.*` | Automated functional tests derived from specifications | The safety net. Tests tied to what the spec says should happen, not just what the code does. Use the project's language: `test_functional.py` (Python), `FunctionalSpec.scala` (Scala), `functional.test.ts` (TypeScript), `FunctionalTest.java` (Java), etc. | Yes |
| `quality/RUN_CODE_REVIEW.md` | Code review protocol with guardrails that prevent hallucinated findings | AI code reviews without guardrails produce confident but wrong findings. The guardrails (line numbers, grep before claiming, read bodies) often improve accuracy. | No |
| `quality/RUN_INTEGRATION_TESTS.md` | Integration test protocol — end-to-end pipeline across all variants | Unit tests pass, but does the system actually work end-to-end with real external services? | Yes |
| `quality/RUN_SPEC_AUDIT.md` | Council of Three multi-model spec audit protocol | No single AI model catches everything. Three independent models with different blind spots catch defects that any one alone would miss. | No |
| `AGENTS.md` | Bootstrap context for any AI session working on this project | The "read this first" file. Without it, AI sessions waste their first hour figuring out what's going on. | No |
Plus output directories: quality/code_reviews/, quality/spec_audits/, quality/results/.
The critical deliverable is the functional test file (named for the project's language and test framework conventions). The Markdown protocols are documentation for humans and AI agents. The functional tests are the automated safety net.
Point this skill at any codebase:
- "Generate a quality playbook for this project."
- "Update the functional tests — the quality playbook already exists."
- "Run the spec audit protocol."
If a quality playbook already exists (quality/QUALITY.md, functional tests, etc.), read the existing files first, then evaluate them against the self-check benchmarks in the verification phase. Don't assume existing files are complete — treat them as a starting point.
Spend the first phase understanding the project. The quality playbook must be grounded in this specific codebase — not generic advice.
Why explore first? The most common failure in AI-generated quality playbooks is producing generic content — coverage targets that could apply to any project, scenarios that describe theoretical failures, tests that exercise language builtins instead of project code. Exploration prevents this by forcing every output to reference something real: a specific function, a specific schema, a specific defensive code pattern. If you can't point to where something lives in the code, you're guessing — and guesses produce quality playbooks nobody trusts.
Scaling for large codebases: For projects with more than ~50 source files, don't try to read everything. Focus exploration on the 3–5 core modules (the ones that handle the primary data flow, the most complex logic, and the most failure-prone operations). Read representative tests from each subsystem rather than every test file. The goal is depth on what matters, not breadth across everything.
Before exploring code, ask the user one question:
"Do you have exported AI chat history from developing this project — Claude exports, Gemini takeouts, ChatGPT exports, Claude Code transcripts, or similar? If so, point me to the folder. The design discussions, incident reports, and quality decisions in those chats will make the generated quality playbook significantly better."
If the user provides a chat history folder:
Look for navigation aids — INDEX*, CONTEXT.md, README.md, or similar. If one exists, read it first: it will tell you what's there and how to find things.

This context is gold. A chat history where the developer discussed "why we chose this concurrency model" or "the time we lost 1,693 records in production" transforms generic scenarios into authoritative ones.
If the user doesn't have chat history, proceed normally — the skill works without it, just with less context.
Read the README, existing documentation, and build config (pyproject.toml / package.json / Cargo.toml). Answer:
Find the specifications. Specs are the source of truth for functional tests. Search in order: AGENTS.md/CLAUDE.md in root, specs/, docs/, spec/, design/, architecture/, adr/, then .md files in root. Record the paths.
If no formal spec documents exist, the skill still works — but you need to assemble requirements from other sources. In order of preference:
If an existing test asserts `process(x) == y`, that's a requirement.

When working from non-formal requirements, label each scenario and test with a requirement tag that includes a confidence tier and source:
- `[Req: formal — README §3]` — written by humans in a spec document. Authoritative.
- `[Req: user-confirmed — "must handle empty input"]` — stated by the user but not in a formal doc. Treat as authoritative.
- `[Req: inferred — from validate_input() behavior]` — deduced from code. Flag for user review.

Use this exact tag format in QUALITY.md scenarios, functional test documentation, and spec audit findings. It makes clear which requirements are authoritative and which need validation.
List source directories and their purposes. Read the main entry point, trace execution flow. Identify:
Read the existing test files — all of them for small/medium projects, or a representative sample from each subsystem for large ones. Identify: test count, coverage patterns, gaps, and any coverage theater (tests that look good but don't catch real bugs).
Critical: Record the import pattern. How do existing tests import project modules? Every language has its own conventions (Python sys.path manipulation, Java/Scala package imports, TypeScript relative paths or aliases, Go package/module paths, Rust use crate:: or use myproject::). You must use the exact same pattern in your functional tests — getting this wrong means every test fails with import/resolution errors. See references/functional_tests.md § "Import Pattern" for the full six-language matrix.
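For instance, if the existing suite manipulates sys.path before importing project modules (a common pytest pattern), a matching functional test file might start like the sketch below. The src/ layout here is an assumption for illustration — copy whatever pattern the project's own tests actually use.

```python
# quality/test_functional.py (hypothetical sketch) -- mirror the exact import
# pattern found in the project's existing tests. This sketch assumes a
# sys.path-prepend pattern with a src/ layout.
import sys
from pathlib import Path

# Prepend the project's src/ directory so project modules resolve the same
# way they do in the existing unit tests.
PROJECT_ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(PROJECT_ROOT / "src"))
```

Getting this one detail wrong fails every test at collection time, so verify it against a real existing test file before writing anything else.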
Identify integration test runners. Look for scripts or test files that exercise the system end-to-end against real external services (APIs, databases, etc.). Note their patterns — you'll need them for RUN_INTEGRATION_TESTS.md.
Walk each spec document section by section. For every section, ask: "What testable requirement does this state?" Record spec requirements without corresponding tests — these are the gaps the functional tests must close.
If using inferred requirements (from tests, types, or code behavior), tag each with its confidence tier using the [Req: tier — source] format defined in Step 1. Inferred requirements feed into QUALITY.md scenarios and should be flagged for user review in Phase 4.
Before writing any test, you must know exactly how each function is called. For every module you identified in Step 2:
- If the project ships sample data (pipelines/, fixtures/, test_data/, examples/), read it. Your test fixtures must match the real data shape exactly.
- Check the dependency manifest (requirements.txt, build.sbt, package.json, pom.xml/build.gradle, go.mod, Cargo.toml) to see what's actually available. Don't write tests that depend on library features that aren't installed. If a dependency might be missing, use the test framework's skip mechanism — see references/functional_tests.md § "Library version awareness" for framework-specific examples.

Record a function call map: for each function you plan to test, write down its name, module, parameters, and what it returns. This map prevents the most common test failure: calling functions with wrong arguments.
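The function call map can be as simple as a literal data structure kept alongside your notes. Everything below (module paths, signatures, return descriptions) is a hypothetical illustration of the format, not read from any real project:

```python
# Hypothetical function call map recorded during exploration. Each entry is
# copied from the function's actual def line, never guessed.
FUNCTION_CALL_MAP = {
    "parse_record": {
        "module": "ingest.parser",                      # where it lives
        "params": ["raw: str", "strict: bool = True"],  # from the def line
        "returns": "Record | None (None when unparseable)",
    },
    "save_state": {
        "module": "persistence",
        "params": ["state: dict", "path: str"],
        "returns": "None (raises OSError on write failure)",
    },
}

# Quick self-check: every entry names its module and parameters.
for name, info in FUNCTION_CALL_MAP.items():
    assert info["module"] and info["params"], name
```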
This is the most important step. Search for defensive code patterns — each one is evidence of a past failure or known risk.
Why this matters: Developers don't write try/except blocks, null checks, or retry logic for fun. Every piece of defensive code exists because someone got burned. A try/except around a JSON parse means malformed JSON happened in production. A null check on a field means that field was missing when it shouldn't have been. These patterns are the codebase whispering its history of failures. Each one becomes a fitness-to-purpose scenario and a boundary test.
Read references/defensive_patterns.md for the systematic search approach, grep patterns, and how to convert findings into fitness-to-purpose scenarios and boundary tests.
Minimum bar: at least 2–3 defensive patterns per core source file. If you find fewer, you're skimming — read function bodies, not just signatures.
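As a sketch of the conversion: a try/except around a JSON parse becomes a boundary test that feeds the malformed input directly. The `load_state` function below is a hypothetical stand-in for whatever defensive code you actually find:

```python
import json

def load_state(text: str) -> dict:
    # Hypothetical project function: the except around json.loads is the
    # "skeleton" -- evidence that malformed state files occurred in the past.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {}

def test_load_state_survives_malformed_json():
    # Boundary test derived directly from the defensive pattern above.
    assert load_state("{not valid json") == {}
    assert load_state('{"done": [1, 2]}') == {"done": [1, 2]}

test_load_state_survives_malformed_json()
```

The defensive pattern tells you which input class already caused trouble; the boundary test pins that behavior down so it can't silently regress.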
If the project has any kind of state management — status fields, lifecycle phases, workflow stages, mode flags — trace the state machine completely. This catches a category of bugs that defensive pattern analysis alone misses: states that exist but aren't handled.
How to find state machines: Search for status/state fields in models, enums, or constants (e.g., status, state, phase, mode). Search for guards that check status before allowing actions (e.g., if status == "running", match self.state). Search for state transitions (assignments to status fields).
For each state machine you find:
A switch/match without a meaningful default, or an if/elif chain that doesn't cover all states, is a gap.

Why this matters: State machine gaps produce bugs that are invisible during normal operation but surface under stress or edge conditions — exactly when you need the system to work. A batch processor that can't be killed when it's in "stuck" status, or a watcher that never self-terminates after all work completes, or a UI that refuses to resume a "pending" run, are all symptoms of incomplete state handling. These bugs don't show up in defensive pattern analysis because the code isn't defending against them — it's simply not handling them at all.
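A minimal sketch of the exhaustiveness test this step produces, using a hypothetical RunStatus enum (the statuses and the can_kill guard are invented for illustration):

```python
from enum import Enum

class RunStatus(Enum):
    # Hypothetical lifecycle states found during exploration
    PENDING = "pending"
    RUNNING = "running"
    STUCK = "stuck"
    DONE = "done"

def can_kill(status: RunStatus) -> bool:
    # If this chain forgot STUCK, a stuck batch could never be killed --
    # exactly the invisible gap described above.
    if status in (RunStatus.RUNNING, RunStatus.STUCK):
        return True
    if status in (RunStatus.PENDING, RunStatus.DONE):
        return False
    raise ValueError(f"unhandled status: {status}")

def test_every_status_is_handled():
    # Exhaustiveness check: iterating the enum catches states that a guard
    # silently fails to cover, including states added later.
    for status in RunStatus:
        assert isinstance(can_kill(status), bool)

test_every_status_is_handled()
```

Iterating the full enum in a test turns "we forgot a state" from a production incident into an immediate test failure.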
If the project has a validation layer (Pydantic models in Python, JSON Schema, TypeScript interfaces/Zod schemas, Java Bean Validation annotations, Scala case class codecs), read the schema definitions now. For every field you found a defensive pattern for, record what the schema accepts vs. rejects.
Read references/schema_mapping.md for the mapping format and why this matters for writing valid boundary tests.
Every project has a different failure profile. This step uses two sources — not just code exploration, but your training knowledge of what goes wrong in similar systems.
From code exploration, ask:
From domain knowledge, ask:
Generate realistic failure scenarios from this knowledge. You don't need to have observed these failures — you know from training that they happen to systems of this type. Write them as architectural vulnerability analyses with specific quantities and consequences. Frame each as "this architecture permits the following failure mode" — not as a fabricated incident report. Use concrete numbers to make the severity non-negotiable: "If the process crashes mid-write during a 10,000-record batch, save_state() without an atomic rename pattern will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume without manual intervention." Then ground them in the actual code you explored: "Read persistence.py line ~340 (save_state): verify temp file + rename pattern."
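The verification target in that example — the temp file plus rename pattern — looks roughly like this sketch (the function and file names are assumptions carried over from the example above, not from any real project):

```python
import json
import os
import tempfile

def save_state(state: dict, path: str) -> None:
    # Atomic write sketch: write to a temp file in the same directory, fsync,
    # then rename. os.replace() is atomic on both POSIX and Windows, so a
    # crash mid-write leaves the previous state file intact instead of a
    # half-written one that raises JSONDecodeError on the next run.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

When the code you explored lacks this pattern, the scenario's "how to verify" step is simply: read the save function and check for it.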
Now write the six files. For each one, follow the structure below and consult the relevant reference file for detailed guidance.
Why six files instead of just tests? Tests catch regressions but don't prevent new categories of bugs. The quality constitution (QUALITY.md) tells future sessions what "correct" means before they start writing code. The protocols (RUN_*.md) provide structured processes for review, integration testing, and spec auditing that produce repeatable results — instead of leaving quality to whatever the AI feels like checking. Together, these files create a quality system where each piece reinforces the others: scenarios in QUALITY.md map to tests in the functional test file, which are verified by the integration protocol, which is audited by the Council of Three.
quality/QUALITY.md — Quality ConstitutionRead references/constitution.md for the full template and examples.
The constitution has six sections:
Scenario voice is critical. Write "What happened" as architectural vulnerability analyses with specific quantities, cascade consequences, and detection difficulty — not as abstract specifications. "Because save_state() lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume. At scale, this risks silent loss of 1,693+ records with no detection mechanism." An AI session reading that will not argue the standard down. Use your knowledge of similar systems to generate realistic failure scenarios, then ground them in the actual code you explored. Scenarios come from both code exploration AND domain knowledge about what goes wrong in systems like this.
Every scenario's "How to verify" must map to at least one test in the functional test file.
This is the most important deliverable. Read references/functional_tests.md for the complete guide.
Organize the tests into three logical groups (classes, describe blocks, modules, or whatever the test framework uses):
Key rules:
- Verify each function's def line — parameter names, types, defaults. Read real data files from the project to understand data shapes. Do not guess at function parameters or fixture structures.
- If a test body is just pass or the assertion is trivial (assert isinstance(x, list)), delete it. A test that doesn't exercise project code inflates the count and creates false confidence.

quality/RUN_CODE_REVIEW.md

Read references/review_protocols.md for the template.
Key sections: bootstrap files, focus areas mapped to architecture, and these mandatory guardrails:
Phase 2: Regression tests. After the review produces BUG findings, write regression tests in quality/test_regression.* that reproduce each bug. Each test should fail on the current implementation, confirming the bug is real. Report results as a confirmation table (BUG CONFIRMED / FALSE POSITIVE / NEEDS INVESTIGATION). See references/review_protocols.md for the full regression test protocol.
quality/RUN_INTEGRATION_TESTS.md

Read references/review_protocols.md for the template.
Must include: safety constraints, pre-flight checks, test matrix with specific pass criteria, an execution UX section, and a structured reporting format. Cover happy path, cross-variant consistency, output correctness, and component boundaries.
All commands must use relative paths. The generated protocol should include a "Working Directory" section at the top stating that all commands run from the project root using relative paths. Never generate commands that cd to an absolute path — this breaks when the protocol is run from a different machine or directory. Use ./scripts/, ./pipelines/, ./quality/, etc.
Include an Execution UX section. When someone tells an AI agent to "run the integration tests," the agent needs to know how to present its work. The protocol should specify three phases: (1) show the plan as a numbered table before running anything, (2) report one-line progress updates as each test runs (✓/✗/⧗), (3) show a summary table with pass/fail counts and a recommendation. See references/review_protocols.md section "Execution UX" for the template and examples. Without this, the agent dumps raw output or stays silent — neither is useful.
This protocol must exercise real external dependencies. If the project talks to APIs, databases, or external services, the integration test protocol runs real end-to-end executions against those services — not just local validation checks. Design the test matrix around the project's actual execution modes and external dependencies. Look for API keys, provider abstractions, and existing integration test scripts during exploration and build on them.
Derive quality gates from the code, not generic checks. Read validation rules, schema enums, and generation logic during exploration. Turn them into per-pipeline quality checks with specific fields and acceptable value ranges. "All units validated" is not enough — the protocol must verify domain-specific correctness.
Script parallelism, don't just describe it. Group runs so independent executions (different providers) run concurrently. Include actual bash commands with & and wait. One run per provider at a time to avoid rate limits.
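A hedged sketch of the grouping — `run_one` is a stand-in for the project's real pipeline command, and the provider names are placeholders, not assumptions about any real setup:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for the project's real pipeline invocation (hypothetical).
run_one() { echo "provider=$1 mode=$2"; }

# Sequential within a provider (avoids rate limits),
# parallel across providers (they are independent).
run_provider() {
  local provider="$1"; shift
  for mode in "$@"; do
    run_one "$provider" "$mode"
  done
}

run_provider alpha batch stream &
run_provider beta  batch stream &
wait   # block until every background provider group finishes
echo "all provider runs complete"
```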
Calibrate unit counts to the project. Read chunk_size or equivalent config. Use enough units to span at least 2 chunks and enough to verify distribution checks. Typically 10–30 for integration testing.
Deep post-run verification. Don't stop at "process completed." Verify log files, manifest state, output data existence, sample record content, and any existing quality check scripts — for every run.
Find and use existing verification tools. Search for existing scripts that verify output quality (e.g., integration_checks.py, validation scripts, quality gate functions). If they exist, call them from the protocol. If the project has a TUI or dashboard, include TUI verification commands (e.g., --dump flags) in the post-run checklist.
Build a Field Reference Table before writing quality gates. This is the most important step for protocol accuracy. AI models confidently write wrong field names even after reading schemas — document_id becomes doc_id, sentiment_score becomes sentiment, float 0-1 becomes int 0-100. The fix is procedural: re-read each schema file IMMEDIATELY before writing each table row. Do not rely on what you read earlier in the conversation — your memory of field names drifts over thousands of tokens. Copy field names character-for-character from the file contents. Include ALL fields from each schema (if the schema has 8 fields, the table has 8 rows). See references/review_protocols.md section "The Field Reference Table" for the full process and format. Do not skip this step — it prevents the single most common protocol inaccuracy.
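One way to make the table machine-checkable is to mirror it as data and gate records against it. The field names and types below are hypothetical examples of exactly the drift the step warns about:

```python
# Hypothetical Field Reference Table mirrored as data. Field names must be
# copied character-for-character from the schema file -- "document_id", not
# "doc_id"; a 0.0-1.0 float, not an int 0-100.
FIELD_TABLE = {
    "document_id": str,
    "sentiment_score": float,
}

def check_record(record: dict) -> list[str]:
    # Per-record quality gate derived from the table.
    problems = []
    for field, expected_type in FIELD_TABLE.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems

assert check_record({"document_id": "d1", "sentiment_score": 0.9}) == []
assert check_record({"doc_id": "d1"}) == [
    "missing field: document_id",
    "missing field: sentiment_score",
]
```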
quality/RUN_SPEC_AUDIT.md — Council of Three

Read references/spec_audit.md for the full protocol.
Three independent AI models audit the code against specifications. Why three? Because each model has different blind spots — in practice, different auditors catch different issues. Cross-referencing catches what any single model misses.
The protocol defines: a copy-pasteable audit prompt with guardrails, project-specific scrutiny areas, a triage process (merge findings by confidence level), and fix execution rules (small batches by subsystem, not mega-prompts).
AGENTS.md

If AGENTS.md already exists, update it — don't replace it. Add a Quality Docs section pointing to all generated files.
If creating from scratch: project description, setup commands, build & test commands, architecture overview, key design decisions, known quirks, and quality docs pointers.
Why a verification phase? AI-generated output can look polished and be subtly wrong. Tests that reference undefined fixtures report 0 failures but 16 errors — and "0 failures" sounds like success. Integration protocols can list field names that don't exist in the actual schemas. The verification phase catches these problems before the user discovers them, which is important because trust in a generated quality playbook is fragile — one wrong field name undermines confidence in everything else.
Before declaring done, check every benchmark. Read references/verification.md for the complete checklist.
The critical checks:
Run the functional tests with the project's runner (Python: pytest -v, Scala: sbt testOnly, Java: mvn test / gradle test, TypeScript: npx jest, Go: go test -v, Rust: cargo test) and check the summary. Errors from missing fixtures, failed imports, or unresolved dependencies count as broken tests. If you see setup errors, you forgot to create the fixture/setup file or referenced undefined test helpers.

If any benchmark fails, go back and fix it before proceeding.
After generating and verifying, present the results clearly and give the user control over what happens next. This phase has three parts: a scannable summary, drill-down on demand, and a menu of improvement paths.
Do not skip this phase. The autonomous output from Phases 1-3 is a solid starting point, but the user needs to understand what was generated, explore what matters to them, and choose how to improve it. A quality playbook is only useful if the people who own the project trust it and understand it. Dumping six files without explanation creates artifacts nobody reads.
Present a single table the user can scan in 10 seconds:
Here's what I generated:
| File | What It Does | Key Metric | Confidence |
|------|-------------|------------|------------|
| QUALITY.md | Quality constitution | 10 scenarios | ██████░░ Medium — grounded in code, but scenarios are inferred, not from real incidents |
| Functional tests | Automated tests | 47 passing | ████████ High — all tests pass, 35% cross-variant |
| RUN_CODE_REVIEW.md | Code review protocol | 8 focus areas | ████████ High — derived from architecture |
| RUN_INTEGRATION_TESTS.md | Integration test protocol | 9 runs × 3 providers | ██████░░ Medium — quality gates need threshold tuning |
| RUN_SPEC_AUDIT.md | Council of Three audit | 10 scrutiny areas | ████████ High — guardrails included |
| AGENTS.md | AI session bootstrap | Updated | ████████ High — factual |
Adapt the table to what you actually generated — the file names, metrics, and confidence levels will vary by project. The confidence column is the most important: it tells the user where to focus their attention.
Confidence levels:
After the table, add a "Quick Start" block with ready-to-copy prompts for executing each artifact:
To use these artifacts, start a new AI session and try one of these prompts:
• Run a code review:
"Read quality/RUN_CODE_REVIEW.md and follow its instructions to review [module or file]."
• Run the functional tests:
"[test runner command, e.g. pytest quality/ -v, mvn test -Dtest=FunctionalTest, etc.]"
• Run the integration tests:
"Read quality/RUN_INTEGRATION_TESTS.md and follow its instructions."
• Start a spec audit (Council of Three):
"Read quality/RUN_SPEC_AUDIT.md and follow its instructions using [model name]."
Adapt the test runner command and module names to the actual project. The point is to give the user copy-pasteable prompts — not descriptions of what they could do, but the actual text they'd type.
After the Quick Start block, add one line:
"You can ask me about any of these to see the details — for example, 'show me Scenario 3' or 'walk me through the integration test matrix.'"
When the user asks about a specific item, give a focused summary — not the whole file, but the key decisions and what you're uncertain about. Examples:
The user may go through several drill-downs before they're ready to improve anything. That's fine — let them explore at their own pace.
After the user has seen the summary (and optionally drilled into details), present the improvement options:
"Three ways to make this better:"
1. Review and harden individual items — Pick any scenario, test, or protocol section and I'll walk through it with you. Good for: tightening specific quality gates, fixing inferred scenarios, adding missing edge cases.
2. Guided Q&A — I'll ask you 3-5 targeted questions about things I couldn't infer from the code: incident history, expected distributions, cost tolerance, model preferences. Good for: filling knowledge gaps that make scenarios more authoritative.
3. Review development history — Point me to exported AI chat history (Claude, Gemini, ChatGPT exports, Claude Code transcripts) and I'll mine it for design decisions, incident reports, and quality discussions that should be in QUALITY.md. Good for: grounding scenarios in real project history instead of inference.
"You can do any combination of these, in any order. Which would you like to start with?"
Path 1: Review and harden. The user picks an item. Walk through it: show the current text, explain your reasoning, ask if it's accurate. Revise based on their feedback. Re-run tests if the functional tests change.
Path 2: Guided Q&A. Ask 3-5 questions derived from what you actually found during exploration. These categories cover the most common high-leverage gaps:
After the user answers, revise the generated files and re-run tests.
Path 3: Review development history. If the user provides a chat history folder:
If the user already provided chat history in Step 0, you've already mined it — but they may want to point you to specific conversations or ask you to dig deeper into a particular topic.
The user can cycle through these paths as many times as they want. Each pass makes the quality playbook more grounded. When they're satisfied, they'll move on naturally — there's no explicit "done" step.
The quality/ folder is separate from the project's unit test folder. Create the appropriate test setup for the project's language:
- Python: quality/conftest.py for pytest fixtures. If fixtures are defined inline (common with pytest's tmp_path pattern), prefer that over shared fixtures.
- Java: @BeforeEach/@BeforeAll setup methods, or a shared test utility class.
- Scala: a shared fixtures trait (e.g., trait FunctionalTestFixtures), or inline data builders.
- TypeScript: quality/setup.ts with beforeAll/beforeEach hooks, or inline test factories.
- Go: a _test.go file or a shared testutil_test.go. Use t.Helper() for test helpers. Go convention prefers inline test setup over shared fixtures.
- Rust: a #[cfg(test)] mod tests block, or a shared test_utils.rs module. Use builder patterns for test data.

Examine existing test files to understand how they set up test data. Whatever pattern the existing tests use, copy it. Study existing fixture patterns for realistic data shapes.
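As an illustration of the Python case, a hypothetical quality/conftest.py mirroring a tmp_path fixture pattern (the record shape is invented for illustration, not from any real project):

```python
# quality/conftest.py (hypothetical sketch)
import json

import pytest

@pytest.fixture
def sample_record_file(tmp_path):
    # Shape would be copied from a real file in the project's test_data/
    # directory; this particular shape is an invented example.
    record = {"document_id": "doc-001", "sentiment_score": 0.87}
    path = tmp_path / "record.json"
    path.write_text(json.dumps(record))
    return path
```

Whichever language applies, the fixture must reproduce real data shapes found during exploration, not guessed ones.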
Read these as you work through each phase:
| File | When to Read | Contains |
|---|---|---|
| references/defensive_patterns.md | Step 5 (finding skeletons) | Grep patterns, how to convert findings to scenarios |
| references/schema_mapping.md | Step 5b (schema types) | Field mapping format, mutation validity rules |
| references/constitution.md | File 1 (QUALITY.md) | Full template with section-by-section guidance |
| references/functional_tests.md | File 2 (functional tests) | Test structure, anti-patterns, cross-variant strategy |
| references/review_protocols.md | Files 3–4 (code review, integration) | Templates for both protocols |
| references/spec_audit.md | File 5 (Council of Three) | Full audit protocol, triage process, fix execution |
| references/verification.md | Phase 3 (verify) | Complete self-check checklist with all 13 benchmarks |