From agent-artifex
Use when the user asks "what testing do we need?", "what are our testing gaps?", "we have some tests but are they enough?", "is our MCP server well-tested?", "what should we test next?", "audit our test coverage for AI", "we keep getting bad responses and don't know why", "our agent picks the wrong tool sometimes", "how is my tool design?", "are my descriptions good enough?", "review my error messages", "is my system prompt well-designed?", "audit my MCP server design", "what design gaps do I have?", or needs to diagnose AI design or testing gaps in an existing project. Also use when someone says "assess my testing", "review our test strategy for AI", or "assess my design".
`npx claudepluginhub flexion/claude-domestique --plugin agent-artifex`

This skill uses the workspace's default tool permissions.
Diagnostic skill that identifies design quality issues and testing gaps in an existing project. Produces a prioritized gap analysis with specific recommendations covering both how well the AI integration is designed and how well it is tested. For learning the concepts first, use agent-artifex:foundations or agent-artifex:learn. For applying design principles, use agent-artifex:design. For writing tests or implementing design fixes after assessment, use agent-artifex:implement.
Read these when you need precise definitions during assessment:
| Reference | When to read |
|---|---|
| agent-artifex/references/framework.md | Causal chain diagnostic flow, testing pyramid layers, two-tier model, impact/effort table |
| agent-artifex/references/metrics.md | Formulas for any metric (SR/AE/AS, Faithfulness/Completeness, CRR/WCR/DASR) |
| agent-artifex/references/rubric.md | Six-component rubric, labeling rules, structural detection markers |
| agent-artifex/references/evidence.md | Key numbers to cite when justifying recommendations |
Three passes: inventory, gap analysis, and prioritized recommendations.
- What are you building?
- What tests exist today?
- What problems are you experiencing?
- Have you followed design guidelines for your tool descriptions, error messages, and schemas?
- Do your parameter schemas include `.describe()` annotations beyond property names?

If you have access to the codebase, also check:

- The `tools/list` output. Note description length (< 3 sentences = likely smells), whether parameters have descriptions, whether tools cross-reference each other, and whether limitations are concrete or vague.
- Whether tools declare `outputSchema`. Without it, clients and LLMs can't validate returned data.
- Whether error responses contain stack traces (matching `/Error\s+at\s/`). Are error messages > 20 characters? Do they suggest recovery actions?

Assess the quality of the AI integration design across seven areas. For each, assign a status:
| Status | Meaning |
|---|---|
| Not designed | No intentional design effort in this area |
| Partially designed | Some attention paid but significant gaps remain |
| Well-designed | Follows established design principles; low risk of FM-facing issues |
| Signal | Indicates |
|---|---|
| Descriptions < 4 sentences | Missing substantive content — likely no Usage Guidelines or Limitations |
| No cross-references between similar tools | Missing inter-tool disambiguation |
| Limitations are vague ("may not work in all cases") | Worse than no limitations — degrades SR by up to 10pp |
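As a concrete reference point, here is a sketch of a description that would pass these checks; the tool names, domain, and wording are hypothetical, not taken from any particular server:

```typescript
// Hypothetical example of a description that covers purpose, usage
// guidance, concrete limitations, and a cross-reference to a sibling tool.
const searchInvoicesDescription = `
Searches invoices by customer, date range, or status and returns matching
invoice IDs with summary fields.

Usage guidelines: Use this tool when the user asks about existing invoices.
For creating or modifying invoices, use create_invoice instead. Prefer
filtering by customer_id over free-text customer names.

Limitations: Returns at most 50 results per call; use the cursor parameter
to page through larger result sets. Does not include line-item detail;
call get_invoice for a single invoice's full contents.
`.trim();
```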
| Signal | Indicates |
|---|---|
| No .describe() annotations beyond property names | Opaque Parameters (84.3% prevalence) |
| No output schemas declared | MCP spec violation; clients can't validate results |
| Tools with > 20 parameters | Out of model distribution — expect degradation |
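For example, a minimal sketch of annotated parameters and a declared output schema, assuming a zod-based TypeScript server (field names are illustrative, and how you attach the output schema depends on your SDK version):

```typescript
import { z } from "zod";

// Input schema: every parameter carries a .describe() annotation so the FM
// sees more than an opaque property name.
const searchInvoicesInput = z.object({
  customer_id: z
    .string()
    .describe("Stable customer identifier, e.g. 'cus_123'. Prefer this over free-text names."),
  status: z
    .enum(["draft", "open", "paid", "void"])
    .describe("Invoice lifecycle status to filter by."),
  cursor: z
    .string()
    .optional()
    .describe("Opaque pagination cursor returned by a previous call."),
});

// Output schema (plain JSON Schema here): declaring it lets clients and
// LLMs validate what the tool returns.
const searchInvoicesOutput = {
  type: "object",
  properties: {
    invoices: {
      type: "array",
      items: {
        type: "object",
        properties: {
          id: { type: "string" },
          status: { type: "string" },
          total_cents: { type: "integer" },
        },
        required: ["id", "status", "total_cents"],
      },
    },
    next_cursor: { type: "string" },
  },
  required: ["invoices"],
} as const;
```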
| Signal | Indicates |
|---|---|
| Stack traces in error output | Internal details leak to FM — anti-pattern |
| Error messages < 20 characters | Too terse for FM to act on |
| No recovery action in errors | FM can't determine what to try next |
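One way to satisfy all three rows is to translate internal failures into errors the FM can act on. A sketch, with the helper name and wording purely illustrative (the `content`/`isError` shape follows the MCP tool-result format):

```typescript
// Convert an internal failure into an FM-facing error: no stack trace,
// more than a bare error code, and an explicit recovery suggestion.
function toToolError(err: unknown): { isError: true; content: { type: "text"; text: string }[] } {
  if (err instanceof RangeError) {
    // Example case: a pagination cursor that has expired.
    return {
      isError: true,
      content: [{
        type: "text",
        text: "The pagination cursor is no longer valid. Re-run the search without a cursor to get a fresh first page.",
      }],
    };
  }
  // Fallback: still actionable, still free of internal details.
  return {
    isError: true,
    content: [{
      type: "text",
      text: "The request could not be completed. Check that customer_id refers to an existing customer, or list customers first.",
    }],
  };
}
```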
| Signal | Indicates |
|---|---|
| Domain knowledge duplicated in prompt and tool descriptions | Capability overlap — FM may answer from prompt instead of calling tool |
| System prompt > 2000 tokens with > 10 tools | Context pressure risk — tool definitions may already consume 50K+ tokens |
| Signal | Indicates |
|---|---|
| No testing at conversation depths > 5 turns | Context pressure effects unknown |
| Tool results include full data dumps | Verbose results accelerate context pressure |
| Entity references use display names, not IDs | Coreference resolution will fail |
| Signal | Indicates |
|---|---|
| > 20 tools without dynamic discovery | Context overload — 85% token reduction possible with dynamic loading |
| Overlapping tools don't reference each other | FM will confuse similar tools |
| Multi-intent tools with mode parameters | Should be separate tools |
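A sketch of one possible shape for dynamic discovery, using a hypothetical `find_tools` meta-tool that returns short summaries instead of sending every full definition up front:

```typescript
// One possible shape for dynamic discovery: a small catalog of one-line
// summaries that a "find_tools" meta-tool can search, so full definitions
// are only loaded for the tools that match. Names are hypothetical.
interface ToolSummary {
  name: string;
  oneLine: string;
}

const catalog: ToolSummary[] = [
  { name: "search_invoices", oneLine: "Find invoices by customer, date, or status." },
  { name: "get_invoice", oneLine: "Fetch one invoice's full line items." },
  // ...dozens more
];

function findTools(query: string): ToolSummary[] {
  const q = query.toLowerCase();
  return catalog.filter(
    (t) => t.name.includes(q) || t.oneLine.toLowerCase().includes(q)
  );
}
```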
| Signal | Indicates |
|---|---|
| Different field names for same concept across tools | Inconsistent schemas degrade FM parsing |
| No pagination for large result sets | Context will fill up in multi-turn conversations |
| Results contain narrative instead of discrete facts | Can't run claim decomposition for verification |
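A sketch of a result shape that addresses these signals, with illustrative field names:

```typescript
// Discrete, ID-keyed facts rather than narrative prose, with pagination so
// multi-turn conversations don't fill the context window.
interface InvoiceSummary {
  id: string;           // stable ID the FM can pass back to other tools
  customer_id: string;  // same field name every tool uses for this concept
  status: "draft" | "open" | "paid" | "void";
  total_cents: number;
}

interface SearchInvoicesResult {
  invoices: InvoiceSummary[];
  next_cursor?: string; // present when more pages exist
}
```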
Map current testing against the five areas. For each, assign a status:
| Status | Meaning |
|---|---|
| Not tested | No tests exist for this area |
| Partially tested | Some tests exist but significant gaps remain |
| Adequately tested | Core scenarios covered; diminishing returns from more investment |
| Over-invested | More testing effort than impact justifies (e.g., LLM tests in CI) |
| Signal | Indicates |
|---|---|
| Descriptions < 3 sentences | Rubric score < 3 on Length; likely missing Usage Guidelines, Limitations, Parameters |
| No parameter .describe() annotations | Opaque Parameters smell (84.3% prevalence). FM defaults to overly broad argument values. |
| Tools with overlapping names don't reference each other | Missing inter-tool disambiguation. FM will confuse similar tools. |
| No structural checks in CI | Description changes ship without quality verification. |
| Limitations are vague or self-referential | Worse than no Limitations — empirically degrades SR by up to 10pp. |
| No FM-scored rubric evaluation | Semantic quality is unmeasured; structural checks are necessary but insufficient. |
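These checks are cheap to automate. A minimal sketch of a Tier 1 structural check over a captured `tools/list` response (the `tools.json` path and the exact thresholds are assumptions mirroring the signals above):

```typescript
import { readFileSync } from "node:fs";

// Tier 1 structural checks: cheap, deterministic, CI-safe. Runs against a
// captured tools/list response saved to tools.json.
interface ToolDef {
  name: string;
  description?: string;
  inputSchema?: { properties?: Record<string, { description?: string }> };
}

const tools: ToolDef[] = JSON.parse(readFileSync("tools.json", "utf8")).tools;
const failures: string[] = [];

for (const tool of tools) {
  // Rough sentence count on the description.
  const sentences = (tool.description ?? "").split(/[.!?]+\s/).filter(Boolean);
  if (sentences.length < 3) {
    failures.push(`${tool.name}: description has fewer than 3 sentences`);
  }
  // Every declared parameter should carry its own description.
  for (const [param, schema] of Object.entries(tool.inputSchema?.properties ?? {})) {
    if (!schema.description) {
      failures.push(`${tool.name}.${param}: parameter has no description`);
    }
  }
}

if (failures.length > 0) {
  console.error(failures.join("\n"));
  process.exit(1); // fail the CI job
}
```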
| Signal | Indicates |
|---|---|
| No test scenarios with expected tool calls | Tool selection quality is unmeasured. |
| Tests assert exact tool call sequences | Overly rigid — "agents regularly find valid approaches that eval designers didn't anticipate." Grade outcomes, not paths. |
| LLM tests run in CI | Too expensive, too slow, too flaky. Should be on-demand. |
| No multi-step workflow scenarios | Sequencing and context carry-forward are untested. |
| No recorded replay infrastructure | Missing the CI-safe middle pyramid layer. No regression detection for tool selection. |
| No multi-model testing | Tool descriptions may work for one FM but fail for another. Risk for production with diverse FM users. |
| Not tracking Average Steps (AS) | Step efficiency and cost impact are invisible. Steps increase 67.46% with augmentation. |
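A scenario can be written as an expected outcome to grade rather than an exact call path, then run several times to produce a Success Rate. A sketch with hypothetical tool names, seeded data, and runner:

```typescript
// Grade outcomes, not paths: the scenario states what must be true at the
// end, not the exact sequence of tool calls used to get there.
interface Scenario {
  prompt: string;
  succeeded: (finalAnswer: string, toolCallsMade: string[]) => boolean;
}

const scenario: Scenario = {
  prompt: "How many open invoices does Globex have?",
  succeeded: (answer, calls) =>
    // Seeded ground truth in this hypothetical fixture: 3 open invoices.
    calls.includes("search_invoices") && /\b3\b/.test(answer),
};

// Run on demand (not in CI), 5-10 trials, and report Success Rate (SR).
async function successRate(
  run: (prompt: string) => Promise<{ answer: string; calls: string[] }>,
  trials = 10
): Promise<number> {
  let successes = 0;
  for (let i = 0; i < trials; i++) {
    const { answer, calls } = await run(scenario.prompt);
    if (scenario.succeeded(answer, calls)) successes++;
  }
  return successes / trials;
}
```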
| Signal | Indicates |
|---|---|
| Tools don't declare outputSchema | MCP spec MUST/SHOULD violation. Clients and LLMs can't validate returned data. |
| No error-path tests | FM will receive raw exceptions or opaque errors it can't act on. |
| No golden-file / snapshot tests | Result format changes go undetected until they break downstream. |
| Error messages contain stack traces | Internal implementation details leak to the FM. Anti-pattern: /Error\s+at\s/ |
| No FM recovery testing | Error messages may be structurally present but useless to the FM. |
| Tests don't isolate state between runs | Shared state causes correlated failures. Each trial needs clean environment. |
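A sketch of this deterministic layer using Vitest; the `callTool` helper and tool names are hypothetical:

```typescript
import { describe, it, expect } from "vitest";
import { callTool } from "./testClient"; // hypothetical helper that invokes the server in-process

describe("search_invoices server correctness", () => {
  it("returns results in the expected format (golden file)", async () => {
    const result = await callTool("search_invoices", { customer_id: "cus_123", status: "open" });
    // Format changes surface as a snapshot diff instead of a downstream break.
    expect(result).toMatchSnapshot();
  });

  it("returns an actionable error, not a stack trace, for an unknown customer", async () => {
    const result = await callTool("search_invoices", { customer_id: "cus_missing", status: "open" });
    expect(result.isError).toBe(true);
    const text = result.content[0].text;
    expect(text).not.toMatch(/Error\s+at\s/); // no leaked stack frames
    expect(text.length).toBeGreaterThan(20);  // long enough for the FM to act on
  });
});
```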
| Signal | Indicates |
|---|---|
| No seeded test data | Can't verify answers against ground truth. |
| No end-to-end tests at all | The scoreboard is unmeasured — "a system can score well on the first three and still produce a hallucinated answer." |
| Tests check tool calls but not the final answer | Agent behavior is tested but response synthesis is not. |
| No faithfulness checking | Hallucinated claims go undetected. |
| No claim decomposition | Faithfulness checking without atomic claim verification is insufficient. |
| Golden answers are vague or qualitative only | Can't run Tier 1 code-graded correctness checks without specific extractable values. |
| No eligibility check | Response may be grounded but fail to answer the user's actual question. |
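Tier 1 code-graded correctness only becomes possible once known data is seeded and golden answers contain specific extractable values. A sketch, where the seeding helper and agent runner are hypothetical:

```typescript
import { it, expect } from "vitest"; // illustrative; these runs are typically on-demand, not CI

// Hypothetical helpers: seed ground-truth data, then run the full agent loop.
import { seedInvoices } from "./fixtures";
import { askAgent } from "./agentRunner";

it("reports the seeded count of open Globex invoices", async () => {
  await seedInvoices([
    { id: "inv_1", customer_id: "cus_globex", status: "open", total_cents: 1200 },
    { id: "inv_2", customer_id: "cus_globex", status: "open", total_cents: 5400 },
    { id: "inv_3", customer_id: "cus_globex", status: "paid", total_cents: 900 },
  ]);

  const answer = await askAgent("How many open invoices does Globex have?");

  // Code-graded exact match against ground truth: 2 open invoices.
  expect(answer).toMatch(/\b2\b/);
});
```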
| Signal | Indicates |
|---|---|
| No multi-turn test scenarios | Conversational failure modes invisible. "Better single-turn performance does not guarantee better multi-turn performance." |
| No coreference resolution tests | Indirect references ("that one") may silently corrupt tool arguments — hardest to detect. |
| No context pressure testing at multiple depths | Quality at turn 15 is unknown. Context length alone causes 13.9%–85% degradation. |
| No graceful degradation testing | Server failures, timeouts, rate limits unhandled. No failure injection for MCP transport layer. |
| No system prompt conflict testing | Tool descriptions may collide with system prompts or other MCP servers' tools. |
| No workflow interruption testing | Prior context may be lost when user interrupts and resumes a multi-step workflow. |
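A coreference scenario can be written as a short turn script plus the entity ID that the indirect reference must resolve to. A sketch with hypothetical names and seeded data:

```typescript
// Multi-turn coreference check: after "that one", the follow-up tool call
// must carry the ID of the invoice surfaced in turn 1, not a guessed value.
interface MultiTurnScenario {
  turns: string[];
  check: (toolCalls: { name: string; args: Record<string, unknown> }[]) => boolean;
}

const coreferenceScenario: MultiTurnScenario = {
  turns: [
    "Show me Globex's most recent open invoice.",
    "What are the line items on that one?",
  ],
  check: (calls) =>
    calls.some(
      // "inv_2" is the seeded most-recent open invoice in this hypothetical fixture.
      (c) => c.name === "get_invoice" && c.args["invoice_id"] === "inv_2"
    ),
};
```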
Ordered by impact / effort, interleaving design and testing recommendations:
1. Tool description design improvements — Highest impact, lowest cost. Well-designed descriptions are the single most effective lever for FM tool selection. Fix descriptions before testing anything else.
2. Error message design — Fix error messages so the FM can recover from failures. Actionable errors with recovery suggestions directly reduce retry loops and cascading failures.
3. Tool Description Quality structural checks (Tier 1) — Near-zero cost, CI-safe, catches the root cause of most downstream failures. 97.1% of descriptions have smells. Validates that description design improvements are maintained.
4. Server Correctness deterministic tests (Tier 1) — Schema validation, error structure, golden-file assertions, error-path coverage. Near-zero cost, CI-safe. Standard software engineering that's often missing in AI projects.
5. Parameter & schema design review — Ensure parameters have annotations, output schemas are declared, and tool parameter counts stay within model distribution. Design review, not code changes.
6. Response Accuracy Tier 1 (code-graded correctness) — Seed known data, run the loop, check counts/IDs/statuses with exact match. Moderate setup cost but catches the most damaging errors.
7. Remaining design areas — System prompt design (check for capability overlap and context pressure), multi-turn conversation design (result verbosity, entity references), tool set architecture (dynamic discovery, tool overlap), response format design (schema consistency, pagination).
8. Agent Behavior recorded replay — If live evaluation has been run at least once, set up recorded replay for CI. Deterministic regression detection for tool selection without ongoing LLM costs.
9. Agent Behavior live scenarios — Design 3–5 scenarios per tool. On-demand, not CI. Run 5–10 times, aggregate. Validates that description improvements actually helped.
10. Response Accuracy Tier 2 (LLM-graded faithfulness/completeness) — Claim decomposition and NLI. On-demand. Reserve for pre-release validation.
11. Chatbot Integration — Multi-turn scenarios. Most expensive. Reserve for production deployment readiness. Start with coreference resolution (highest-impact chatbot-specific test).
| Symptom | Prioritize |
|---|---|
| "The FM keeps picking the wrong tool" | Tool Description Design + Quality checks (check rubric) + Agent Behavior scenarios |
| "Answers are wrong even though the right tool was called" | Response Accuracy (both tiers — correctness then faithfulness) |
| "Quality is fine in single queries but degrades in conversation" | Multi-Turn Conversation Design + Chatbot Integration (coreference + context pressure first) |
| "Our server crashes on unexpected input" | Server Correctness (error-path coverage + error structure) |
| "Tests are flaky and expensive" | Move LLM tests out of CI; add deterministic base layer; consider recorded replay |
| "The FM can't recover from errors" | Error Message Design + Server Correctness (error actionability) + FM recovery testing |
| "Works with Claude but not GPT" (or vice versa) | Agent Behavior multi-model stability testing |
| "System prompt conflicts with tool behavior" | System Prompt Design + Chatbot Integration (system prompt conflict testing) |
| "Too many tools, FM gets confused" | Tool Set Architecture (dynamic discovery, tool deduplication) |
Present the assessment as a prioritized gap analysis: design status by area, testing status by area, and the ordered recommendations above.
After assessment:
- `agent-artifex:design` to learn the design principles, then `agent-artifex:implement` to apply them
- `agent-artifex:implement` to write tests or implement design fixes directly
- `agent-artifex:learn` (or `agent-artifex:foundations`) if the underlying concepts need review first