From pm-copilot
Use this skill when the user asks about "continuous improvement for AI", "AI quality flywheel", "how do we keep improving our AI feature", "closing the eval feedback loop", "systematic AI improvement process", or wants to build a repeating process that continuously improves AI product quality over time rather than doing one-off fixes.
You are helping the user design and operate a continuous improvement flywheel for their AI feature — a repeating cycle that systematically finds failures, fixes them, validates the fix, and builds institutional knowledge, compounding quality over time.
Framework: Hamel Husain and Shreya Shankar's work on building eval systems (2025), combined with kaizen (continuous improvement) principles applied to AI product quality.
Read memory/user-profile.md for the AI feature and any existing eval setup. Understand where the user is in the maturity journey: just starting, error analysis done, some evals in place, or trying to scale the process.
The continuous improvement flywheel has 6 stages. Each feeds the next.
**Stage 1 — Observe (Collect signals)**
- **What:** Gather signals about where the AI is failing.
- **Sources:** User feedback (thumbs down), support tickets, production sampling (a random 5% of outputs reviewed weekly), analytics (e.g. an engagement drop after certain output types).
- **Key question:** "What are users complaining about?"
- **Output:** Raw failure signals.
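The weekly production sampling can be sketched in a few lines. This is a minimal illustration, assuming logged outputs are available as a list of records; the record shape and the 100-output review cap are assumptions, not part of the framework:

```python
import random

def sample_for_review(outputs, rate=0.05, cap=100):
    """Draw a random sample of production outputs for weekly human review.

    `outputs` is assumed to be a list of dicts pulled from your logs
    (e.g. {"id": ..., "response": ...}); the exact shape depends on
    your logging setup. `cap` keeps the weekly review workload bounded.
    """
    k = min(cap, max(1, int(len(outputs) * rate)))
    return random.sample(outputs, k)

# Example: 2,000 logged outputs -> 5% sample, capped at 100
logs = [{"id": i, "response": f"output {i}"} for i in range(2000)]
batch = sample_for_review(logs)
print(len(batch))  # 100
```

Random sampling matters here: reviewing only flagged or complained-about outputs biases Stage 2 toward failures users bother to report.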
**Stage 2 — Analyze (Understand failure patterns)**
- **What:** Apply error analysis (open coding → axial coding) to the collected failures.
- **Run:** Every 4 weeks, or after any significant product change.
- **Key question:** "What are our top 3 failure categories?"
- **Output:** Ranked failure categories with frequency counts.
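Once axial coding has assigned a category label to each failure, ranking the top 3 with frequency counts is a one-liner with `collections.Counter`. The labels below are hypothetical examples, not a prescribed taxonomy:

```python
from collections import Counter

# Axial-coded labels assigned during error analysis (hypothetical data)
labels = [
    "hallucinated_citation", "wrong_tone", "hallucinated_citation",
    "truncated_answer", "wrong_tone", "hallucinated_citation",
    "ignored_instruction", "wrong_tone", "hallucinated_citation",
]

counts = Counter(labels)
top3 = counts.most_common(3)
for category, n in top3:
    print(f"{category}: {n}/{len(labels)} ({n / len(labels):.0%})")
```

Reporting each category as a share of all coded failures (not just a raw count) makes the monthly reviews comparable as sample sizes change.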
**Stage 3 — Fix (Address root causes)**
- **What:** For each top failure category, implement a fix.
- **Fix types:**
**Stage 4 — Evaluate (Validate the fix)**
- **What:** Run the regression test suite before deploying the fix.
- **Also run:** Human eval on 50 examples to confirm the fix didn't create new failures.
- **Key question:** "Did the fix work, and did it break anything else?"
- **Output:** Pass/fail result. Green light to deploy.
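The deploy gate can be expressed as a small predicate: every regression case must pass, and the 50-example human eval must clear a pass-rate bar. This is a sketch; the 90% human-eval threshold is an assumed value to tune, not part of the framework:

```python
def regression_gate(regression_results, human_eval_passes,
                    human_eval_total=50, human_threshold=0.9):
    """Green-light a deploy only if all regression cases pass AND the
    human eval pass rate clears the bar (0.9 is an assumed threshold)."""
    all_pass = all(regression_results)
    human_rate = human_eval_passes / human_eval_total
    return all_pass and human_rate >= human_threshold

# Example: suite passes and 46/50 human-eval examples pass -> deploy
print(regression_gate([True] * 40, human_eval_passes=46))               # True
# One regression failure blocks the deploy regardless of human eval
print(regression_gate([True] * 39 + [False], human_eval_passes=46))    # False
```

Making the gate binary and automated (e.g. wired into CI) is what keeps "run the suite before deploying" from quietly being skipped under deadline pressure.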
**Stage 5 — Deploy (Ship the fix)**
- **What:** Deploy the improved prompt/system. Measure the impact on production metrics.
- **Watch:** Did the fix show up in production? (Compare the production eval pass rate before and after.)
- **Key question:** "Did users experience the improvement?"
- **Output:** Validated improvement in production quality.
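The before/after comparison can be sketched as a pass-rate delta plus an approximate two-proportion z-score as a quick sanity check on sample size. The counts below are hypothetical, and this is a rough significance check, not a full statistical analysis:

```python
from math import sqrt

def pass_rate_delta(before_pass, before_n, after_pass, after_n):
    """Compare production eval pass rates before/after a fix.

    Returns the rate delta and an approximate two-proportion z-score
    (|z| > ~2 suggests the change is unlikely to be sampling noise).
    """
    p1, p2 = before_pass / before_n, after_pass / after_n
    pooled = (before_pass + after_pass) / (before_n + after_n)
    se = sqrt(pooled * (1 - pooled) * (1 / before_n + 1 / after_n))
    return p2 - p1, (p2 - p1) / se

# Hypothetical: 140/200 passing before the fix, 170/200 after
delta, z = pass_rate_delta(140, 200, 170, 200)
print(f"delta={delta:+.1%}, z={z:.2f}")  # delta=+15.0%, z=3.59
```

If the delta is positive but z is small, the right move is usually to sample more production outputs before declaring the fix validated.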
**Stage 6 — Learn (Build institutional knowledge)**
- **What:** Document the failure, the fix, and the lesson learned. Update the test suite with the new failure mode so it can never regress silently.
- **Key question:** "What do we now know that we didn't before?"
- **Output:** Updated regression test suite + an internal doc of what worked and what didn't.
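Updating the test suite can be as lightweight as appending the new failure mode to a JSONL regression file that Stage 4 reads on every run. The file path and record schema below are illustrative, not a fixed format:

```python
import json
from pathlib import Path

def add_regression_case(path, failure_input, expected_behavior, lesson):
    """Append a newly observed failure to the regression suite so it can
    never regress silently. The JSONL schema here is illustrative."""
    case = {
        "input": failure_input,
        "expected": expected_behavior,
        "lesson": lesson,
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
```

Keeping the lesson learned in the same record as the test input means the institutional knowledge travels with the suite instead of living only in a retro doc.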
Then: return to Stage 1 (Observe).
Set up a sustainable cadence:
| Activity | Frequency | Owner |
|---|---|---|
| Production sampling (50–100 outputs) | Weekly | PM or analyst |
| Error analysis on accumulated failures | Monthly | PM |
| Top-3 failure category review | Monthly | PM + engineering |
| Fix implementation | Sprint-based | Engineering |
| Regression test run | Every PR | Automated |
| Full human eval sweep | Quarterly | PM + domain expert |
| Flywheel retrospective | Quarterly | PM + engineering + design |
Track the flywheel's effectiveness:
Explain to the user why this compounds over time:
Produce: