From agent-artifex
Use when the user wants to improve an existing MCP server, agent, chatbot, or tool-calling system. This includes: improving tool descriptions, fixing error messages, adding output schemas, writing tests, implementing quality checks, adding evals, setting up test harnesses, or any task where they say "help me improve", "fix my descriptions", "add tests", "write evals", "implement quality checks", "make my server better", "apply the design principles", or are ready to make code changes to improve quality. This skill covers both design application (making the code better) and test implementation (verifying the code is good). For scaffolding new projects, use claude-api:mcp-builder. For design principles without code changes, use agent-artifex:design.
Install: npx claudepluginhub flexion/claude-domestique --plugin agent-artifex

This skill uses the workspace's default tool permissions.
This is the hands-on improvement skill. It covers both applying design principles to make code better AND writing tests to verify code quality. Use it whenever the user is ready to make changes — whether that means rewriting tool descriptions, restructuring error messages, adding output schemas, writing test harnesses, or building eval pipelines.
Cross-references:
- claude-api:mcp-builder
- agent-artifex:design
- agent-artifex:assess

Start by understanding what the user needs:

- Design application (making the code better) -> the design references under agent-artifex/skills/design/references/
- Test implementation (verifying the code is good) -> the testing references under references/

Then read the relevant reference files before writing any code.
Read these when making code changes to improve quality. Each file contains principles, patterns, anti-patterns, and concrete guidance for one design area.
| Design Area | Reference File | What it contains |
|---|---|---|
| Tool Description Design | agent-artifex/skills/design/references/tool-descriptions.md | Six-component rubric, structural markers, augmentation patterns, domain-specific guidance |
| Parameter & Schema Design | agent-artifex/skills/design/references/parameter-schema.md | .describe() patterns, output schema design, argument count guidance, naming conventions |
| Error Message Design | agent-artifex/skills/design/references/error-messages.md | Problem/input/why/recovery structure, anti-patterns, isError usage, cross-references in recovery |
| System Prompt Design | agent-artifex/skills/design/references/system-prompts.md | Knowledge placement, ordering instructions, prompt sizing, collision avoidance |
| Multi-Turn Conversation Design | agent-artifex/skills/design/references/multi-turn.md | Result trimming, stable ID patterns, pagination, context pressure mitigation |
| Tool Set Architecture | agent-artifex/skills/design/references/tool-set-architecture.md | Dynamic discovery, cross-references, tool splitting, token footprint management |
| Response Format Design | agent-artifex/skills/design/references/response-format.md | Field naming, pagination patterns, fact vs. narrative, schema consistency |
Read these when writing test code, assertions, harness setup, or eval pipelines. Each file contains working code examples, prompt templates, regex patterns, and pass/fail criteria.
| Testing Area | Reference File | What it contains |
|---|---|---|
| Tool Description Quality | references/tool-descriptions.md | Tier 1 code examples (all 5 checks with regex), Tier 2 FM scoring prompt template, multi-model jury setup, pass/fail criteria |
| Server Correctness | references/server-correctness.md | Schema validation (Ajv/jsonschema), error anti-pattern regex, golden-file patterns, FM recovery 4-step procedure |
| Agent Behavior | references/agent-behavior.md | Scenario design with examples, recorded replay (TestProvider pattern), live evaluation 4-step procedure, grading guidance |
| Response Accuracy | references/response-accuracy.md | Closed-loop harness 5 steps, claim decomposition with LLM prompt templates, DeepMind FACTS two-phase evaluation |
| Chatbot Integration | references/chatbot-testing.md | 5 coreference categories, 5 workflow patterns, 6 scenario categories, 4 conflict types, 6 degradation failure modes |
The canonical source documents with full evidence and footnotes are in docs/ai-services/.
What to look for:
What to change:
- Add cross-references between confusable tools: "Use tool_x instead when [condition]" or "Often used after tool_y"

How to verify:
What to look for:
- Missing .describe() annotations on Zod schemas or missing description fields in JSON Schema
- No outputSchema declared (server returns unstructured text only)
- Generic parameter names like data, input, value, options without clarifying descriptions

What to change:
- Add outputSchema declarations so servers return structuredContent (see the sketch below)

How to verify:
- All inputSchema.properties entries have non-empty, non-trivial descriptions
- outputSchema is declared and structuredContent conforms to it
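A minimal sketch of these changes, assuming a TypeScript MCP server built with Zod and the MCP SDK's McpServer.registerTool API; the search_documents tool, its fields, and the runSearch helper are invented for illustration:

```typescript
import { z } from "zod";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

const server = new McpServer({ name: "example-server", version: "0.1.0" });

// Assumed search backend; declared so the sketch stands alone.
declare function runSearch(query: string, limit: number): Promise<{ id: string; title: string }[]>;

// Every parameter carries a .describe() annotation so the FM sees formats and
// constraints instead of opaque names like `data` or `options`.
server.registerTool(
  "search_documents",
  {
    description: "Searches indexed documents by keyword and returns matching summaries.",
    inputSchema: {
      query: z.string().describe("Full-text search query, e.g. 'quarterly revenue'. 1-200 characters."),
      limit: z.number().int().min(1).max(50).optional().describe("Maximum number of results to return (1-50). Defaults to 10."),
    },
    // Declaring outputSchema lets the server return structuredContent that
    // clients and tests can validate mechanically.
    outputSchema: {
      results: z.array(
        z.object({
          id: z.string().describe("Stable document ID, usable in follow-up tool calls."),
          title: z.string().describe("Document title."),
        })
      ),
      total: z.number().int().describe("Number of results returned; equals results.length in this sketch."),
    },
  },
  async ({ query, limit = 10 }) => {
    const results = await runSearch(query, limit);
    const payload = { results, total: results.length };
    return {
      content: [{ type: "text" as const, text: JSON.stringify(payload) }],
      structuredContent: payload,
    };
  }
);
```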
What to look for:

- Stack traces leaked in error output (Error at, at function_name ())
- Raw exception types surfaced to the FM (TypeError:, ReferenceError:)

What to change:
- Add recovery guidance to error messages: "Retry tool_x with [adjusted args]" (see the sketch below)
- Set isError: true on all error responses so the FM knows the call failed

How to verify:
- No error output matches /Error\s+at\s/, /at\s+\w+\s+\(/, or /^(TypeError|ReferenceError|Error):/
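A minimal sketch of an error response following the problem/input/why/recovery structure with isError: true; the get_document and list_documents tool names are hypothetical:

```typescript
// Hedged sketch of the problem/input/why/recovery structure described in
// references/error-messages.md; the tool and field names are illustrative.
function notFoundError(documentId: string) {
  const message = [
    `Document not found.`,                                            // problem
    `You requested documentId="${documentId}".`,                      // input received
    `The ID may have expired or may come from a different index.`,    // why it failed
    `Call list_documents to get current IDs, then retry get_document with one of them.`, // recovery
  ].join(" ");

  return {
    // isError: true tells the FM the call failed, so it can attempt recovery
    // instead of treating this text as a successful result.
    isError: true,
    content: [{ type: "text" as const, text: message }],
  };
}
```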
What to look for:

- query tool

What to change:
- Add ordering instructions to the system prompt: "Call tool_a before tool_b when [condition]"

How to verify:
What to look for:
What to change:
How to verify:
What to look for:
What to change:
- When splitting a tool, disambiguate in the descriptions: "Use tool_x for [case A], tool_y for [case B]"

How to verify:
What to look for:
- Inconsistent field naming across tools (name in one tool, title in another for the same concept)

What to change:
How to verify:
Before writing code, read: references/tool-descriptions.md
What to implement:
| Test | Tier | CI | What it catches |
|---|---|---|---|
| Description presence and length (>= 3 sentences) | 1 | Yes | Missing or minimal descriptions |
| Rubric component markers (regex) | 1 | Yes | Missing usage guidelines (89.3%), limitations (89.8%), examples |
| Parameter descriptions (non-empty, non-trivial) | 1 | Yes | Opaque parameters (84.3% prevalence) |
| Inter-tool disambiguation (cross-references) | 1 | Yes | Confusable tools without cross-references |
| Limitation quality guard (anti-pattern regex) | 1 | Yes | Vague limitations that hurt more than help (-10pp SR) |
| FM-scored rubric evaluation (multi-model jury) | 2 | No | Semantic quality below threshold (mean < 3 on any component) |
Retrieve tool definitions: Call tools/list or extract from registration code as static JSON.
Tier 1 pass criteria: All structural checks pass. A Tier 1 failure guarantees a rubric smell; passing does not guarantee the rubric score is >= 3.
Tier 2 pass criteria: All six rubric component means >= 3 across a 3-model jury. Smell detected if and only if (1/N) x Sum Score_i < 3.
When to add Tier 2: After Tier 1 passes but Agent Behavior tests regress, before releases, or periodically as audit.
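A hedged sketch of the Tier 1 structural checks, assuming Vitest and the TypeScript MCP SDK client; the server launch command and the marker regexes are placeholders — the real patterns live in references/tool-descriptions.md:

```typescript
import { describe, expect, it } from "vitest";
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Retrieve tool definitions via tools/list (a static JSON export works too).
const client = new Client({ name: "tier1-checks", version: "0.1.0" });
await client.connect(new StdioClientTransport({ command: "node", args: ["dist/server.js"] }));
const { tools } = await client.listTools();
await client.close();

describe("Tier 1: tool description structural checks", () => {
  for (const tool of tools) {
    const description = tool.description ?? "";

    it(`${tool.name}: description has at least 3 sentences`, () => {
      const sentences = description.split(/[.!?]+\s+/).filter((s) => s.trim().length > 0);
      expect(sentences.length).toBeGreaterThanOrEqual(3);
    });

    it(`${tool.name}: description carries usage-guidance and limitation markers`, () => {
      // Placeholder markers only; substitute the rubric regexes from the reference file.
      expect(description).toMatch(/\buse (this|when)\b|\binstead\b|\bprefer\b/i);
      expect(description).toMatch(/\blimitation\b|\bdoes not\b|\bcannot\b|\bonly\b/i);
    });

    it(`${tool.name}: every parameter has a non-trivial description`, () => {
      const props = (tool.inputSchema.properties ?? {}) as Record<string, { description?: string }>;
      for (const prop of Object.values(props)) {
        expect((prop.description ?? "").trim().length).toBeGreaterThanOrEqual(10);
      }
    });
  }
});
```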
Before writing code, read: references/server-correctness.md
What to implement:
| Test | Tier | CI | What it catches |
|---|---|---|---|
| Schema validation (outputSchema -> structuredContent) | 1 | Yes | Format violations. MCP spec: servers MUST conform. |
| Error structure (actionable, no stack traces) | 1 | Yes | Opaque errors -> FM can't recover. RFC 9457 principle. |
| Result fidelity (golden-file / snapshot) | 1 | Yes | Silent changes to result shapes. Contract testing. |
| Error-path coverage (invalid input, not-found) | 1 | Yes | Crashes or success responses for invalid input. |
| FM recovery rate (LLM in loop) | 2 | No | Error messages the FM can't act on. |
Minimum test cases per tool: 1 happy path + 1 invalid input + 1 not-found = 3 cases.
Key anti-pattern regex (error structure):
- /Error\s+at\s/ and /at\s+\w+\s+\(/
- /^(TypeError|ReferenceError|Error):/

Golden-file non-deterministic fields: Assert patterns, not values. UUID: /^[0-9a-f-]{36}$/. ISO date: /^\d{4}-\d{2}-\d{2}/. Assert relationships (count field = array length).
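A hedged sketch of the three minimum cases for one tool, assuming Vitest and Ajv; the callTool helper and the get_document tool are assumptions made for the example:

```typescript
import { describe, expect, it } from "vitest";
import Ajv from "ajv";

// Assumed helper, declared here only for the sketch: calls a tool over MCP and
// returns the tool result plus the tool's declared output schema as JSON Schema.
declare function callTool(
  name: string,
  args: Record<string, unknown>
): Promise<{
  result: { isError?: boolean; structuredContent?: unknown; content: { type: string; text?: string }[] };
  outputSchema: object;
}>;

const ajv = new Ajv({ allErrors: true });

describe("server correctness: get_document (hypothetical tool)", () => {
  it("happy path: structuredContent conforms to the declared outputSchema", async () => {
    const { result, outputSchema } = await callTool("get_document", { id: "doc-1" });
    const validate = ajv.compile(outputSchema);
    expect(validate(result.structuredContent)).toBe(true);
  });

  it("invalid input: actionable error, no stack trace or raw exception type", async () => {
    const { result } = await callTool("get_document", { id: 42 });
    expect(result.isError).toBe(true);
    const text = result.content.map((c) => c.text ?? "").join("\n");
    expect(text).not.toMatch(/Error\s+at\s/);
    expect(text).not.toMatch(/at\s+\w+\s+\(/);
    expect(text).not.toMatch(/^(TypeError|ReferenceError|Error):/);
  });

  it("not-found: flagged as an error rather than an empty success", async () => {
    const { result } = await callTool("get_document", { id: "missing" });
    expect(result.isError).toBe(true);
  });
});

// Golden-file assertions for non-deterministic fields compare patterns and
// relationships, not exact values, e.g.:
//   expect(doc.id).toMatch(/^[0-9a-f-]{36}$/);            // UUID shape
//   expect(doc.createdAt).toMatch(/^\d{4}-\d{2}-\d{2}/);   // ISO date prefix
//   expect(body.count).toBe(body.items.length);            // count matches array length
```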
Before writing code, read: references/agent-behavior.md
What to implement:
| Test | Tier | CI | What it catches |
|---|---|---|---|
| Recorded scenario replay (TestProvider pattern) | Middle | Yes (after recording) | Regressions in tool selection |
| Tool selection accuracy (live FM evaluation) | Upper | No | Wrong tool selected for a query |
| Argument quality (live) | Upper | No | Wrong or hallucinated parameters |
| Step efficiency (live) | Upper | No | Excessive tool call loops |
Three metrics (compute per scenario, aggregate across runs):
SR = (tasks where all evaluators pass) / (total tasks) x 100
AE = (1/N) x Sum (evaluators_passed_i / total_evaluators_i)
AS = (1/N) x Sum steps_i
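A small sketch of computing the three metrics from recorded scenario runs; the ScenarioRun shape is an assumption for illustration:

```typescript
// Assumed shape of one scenario run: how many evaluators passed and how many
// tool-calling steps the agent took.
interface ScenarioRun {
  evaluatorsPassed: number;
  evaluatorsTotal: number;
  steps: number;
}

// SR: percentage of runs in which every evaluator passed.
function computeSR(runs: ScenarioRun[]): number {
  const allPassed = runs.filter((r) => r.evaluatorsPassed === r.evaluatorsTotal).length;
  return (allPassed / runs.length) * 100;
}

// AE: mean fraction of evaluators passed per run.
function computeAE(runs: ScenarioRun[]): number {
  return runs.reduce((sum, r) => sum + r.evaluatorsPassed / r.evaluatorsTotal, 0) / runs.length;
}

// AS: mean number of steps per run.
function computeAS(runs: ScenarioRun[]): number {
  return runs.reduce((sum, r) => sum + r.steps, 0) / runs.length;
}
```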
Statistical guidance: Run each scenario 5-10 times. Report median and IQR. Compare configurations with McNemar's test (SR) or Wilcoxon signed-rank (AE, AS). Report effect sizes, not just p-values.
Five scenario categories: (1) single-tool, (2) multi-step workflows, (3) ambiguous queries (highest-value — exercise description disambiguation), (4) negative cases, (5) edge-case arguments.
Aim for: 3-5 scenarios per tool + 2-3 multi-tool workflows. A 5-tool server benefits from 20-30 scenarios.
Grade outcomes, not paths. Don't assert exact tool call sequences. Agents regularly find valid approaches that eval designers didn't anticipate.
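A hedged sketch of an outcome-graded scenario definition; the categories mirror the list above, and the query, tool names, and evaluator thresholds are invented for illustration:

```typescript
// Assumed shapes for this sketch: the agent run's outcome and a scenario whose
// evaluators grade that outcome rather than an exact tool-call sequence.
interface AgentOutcome {
  finalAnswer: string;
  toolCalls: { name: string; args: Record<string, unknown> }[];
}

interface Scenario {
  category: "single-tool" | "multi-step" | "ambiguous" | "negative" | "edge-case";
  query: string;
  evaluators: ((outcome: AgentOutcome) => boolean)[];
}

// Invented ambiguous-query scenario: any tool-call path is acceptable as long as
// the outcome is right and the agent stayed within a step budget.
const launchPlanScenario: Scenario = {
  category: "ambiguous",
  query: "Find the doc about last quarter's launch plan",
  evaluators: [
    (o) => /launch plan/i.test(o.finalAnswer),                      // answer mentions the right doc
    (o) => o.toolCalls.some((c) => c.name === "search_documents"),  // some search happened, order unconstrained
    (o) => o.toolCalls.length <= 6,                                 // step-efficiency guard (threshold is an assumption)
  ],
};
```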
Before writing code, read: references/response-accuracy.md
What to implement:
| Test | Tier | CI | What it catches |
|---|---|---|---|
| Correctness (counts, IDs, statuses — code-graded) | 1 | Yes | Wrong facts in the answer |
| Faithfulness (claim decomposition — LLM-graded) | 2 | No | Hallucinated claims not in tool results |
| Completeness (golden answer coverage — LLM-graded) | 2 | No | Missing important facts |
| Grounding traceability (attribution — LLM-graded) | 2 | No | Unattributable claims |
Closed-loop harness (5 steps): Seed data -> Define scenario (query + seed state + golden answer + grading mode) -> Execute full MCP loop -> Capture both layers (tool results + FM answer) -> Grade.
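A skeleton of those five steps in TypeScript; seedDatabase, runAgentLoop, and gradeFacts are assumed helpers, and the ticket scenario is invented for illustration:

```typescript
// Assumed helpers, declared only so the sketch stands alone.
declare function seedDatabase(records: object[]): Promise<void>;
declare function runAgentLoop(query: string): Promise<{
  toolResults: unknown[];   // layer 1: what the server returned
  finalAnswer: string;      // layer 2: what the FM said to the user
}>;
declare function gradeFacts(answer: string, golden: { openTickets: number }): boolean;

async function runClosedLoopScenario(): Promise<boolean> {
  // 1. Seed the data the tools will read.
  await seedDatabase([{ id: "T-1", status: "open" }, { id: "T-2", status: "closed" }]);

  // 2. Define the scenario: query + seed state + golden answer + grading mode.
  const query = "How many tickets are currently open?";
  const golden = { openTickets: 1 };

  // 3. Execute the full MCP loop (FM + tool calls), not just the server in isolation.
  const { toolResults, finalAnswer } = await runAgentLoop(query);

  // 4. Capture both layers so failures can be attributed to the server or the FM.
  console.log({ toolResults, finalAnswer });

  // 5. Grade: Tier 1 code-graded facts here; Tier 2 LLM-graded faithfulness elsewhere.
  return gradeFacts(finalAnswer, golden);
}
```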
Key formulas:
Faithfulness = (claims supported by tool results) / (total claims in response)
Completeness = (golden claims covered by response) / (total golden claims)
Faithfulness checks "did the response hallucinate?" against tool results. Completeness checks "did the response omit?" against the golden answer. They are independent dimensions — a response can be perfectly faithful but incomplete, or vice versa.
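A small sketch of the two scores computed from claim-level grader verdicts; the verdict labels other than SUPPORTED are assumptions:

```typescript
// Claim-level verdicts as an LLM grader might emit them; only "SUPPORTED" is
// taken from the pass criteria above, the rest of the label set is assumed.
type Verdict = "SUPPORTED" | "UNSUPPORTED";

// Faithfulness: fraction of response claims supported by the tool results.
function faithfulness(responseClaims: { verdict: Verdict }[]): number {
  const supported = responseClaims.filter((c) => c.verdict === "SUPPORTED").length;
  return supported / responseClaims.length;
}

// Completeness: fraction of golden-answer claims covered by the response.
function completeness(goldenClaims: { covered: boolean }[]): number {
  const covered = goldenClaims.filter((c) => c.covered).length;
  return covered / goldenClaims.length;
}

// A response can score 1.0 on faithfulness yet below 1.0 on completeness, or vice versa.
```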
Tier 1 pass criteria: All extracted facts (counts, IDs, statuses, entities) match golden answer exactly. Negation consistency: zero-match seed data -> answer must not fabricate results.
Tier 2 pass criteria: Faithfulness score = 1.0 (all claims SUPPORTED). Use DeepMind FACTS two-phase evaluation: eligibility (does it answer the query?) then grounding (is it factually grounded?). For high-stakes: multi-judge with 2-3 LLMs, majority verdict per claim.
Before writing code, read: references/chatbot-testing.md
What to implement:
| Test | Tier | CI | What it catches |
|---|---|---|---|
| Coreference resolution (5 reference types) | Code-graded | No | Indirect references -> wrong argument values |
| Workflow tool sequences (5 workflow patterns) | Code-graded | No | Multi-turn workflows broken |
| Context pressure (turn 1 vs turn 5/10/15) | Code-graded | No | Quality degradation at conversation depth |
| System prompt conflict (4 conflict types) | FM-graded | No | Tool description collisions with system prompt |
| Presentation quality (anti-pattern detection) | FM-graded | No | Raw JSON dumps, over-summarization |
| Graceful degradation (6 failure modes) | Hybrid | No | Hallucinated results after server failure |
Key metrics:
CRR = (correctly resolved references) / (total references) x 100
WCR = (completed workflows) / (total workflow scenarios) x 100
DASR(N) = SR at turn N; Degradation = SR(turn 1) - SR(turn N)
Reliability = (function_name_recall + function_argument_recall) / 2
Coreference categories to cover: result reference ("that one"), argument echo ("same but for Q4"), implicit context ("make sure it's well-written"), negation reference ("the other one"), plural accumulation ("all three").
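A hedged sketch of how such scenarios can be expressed as data; the queries, expected arguments, and field names are invented for illustration:

```typescript
// Illustrative multi-turn coreference scenarios, one per reference category above;
// the conversations and expected argument values are made up for this sketch.
interface CoreferenceScenario {
  category: "result-reference" | "argument-echo" | "implicit-context" | "negation-reference" | "plural-accumulation";
  turns: string[];                       // user messages, in order
  expectedArgs: Record<string, unknown>; // arguments the final tool call should resolve to
}

const coreferenceScenarios: CoreferenceScenario[] = [
  {
    category: "result-reference",
    turns: ["Search for Q3 planning docs", "Open that one"],
    expectedArgs: { id: "doc-q3-planning" }, // "that one" must resolve to the earlier result's ID
  },
  {
    category: "argument-echo",
    turns: ["Summarize revenue for Q3", "Same but for Q4"],
    expectedArgs: { metric: "revenue", period: "Q4" }, // echo earlier args, swap only the period
  },
];

// CRR = correctly resolved references / total references x 100, computed over the
// full scenario set once each conversation has been replayed.
```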
Context pressure evidence: Context length alone causes 13.9%-85% performance degradation. Tool definitions can consume 50K-134K tokens. Test the same scenario at turns 1, 5, 10, 15+.
Starting point: 2-3 coreference scenarios + 1 workflow scenario + 1 pressure scenario.
Diagnostic flow: If chatbot tests fail but single-turn tests pass -> conversational problem (context, coreference, orchestration). If both fail -> fix single-turn first.
| CI (every commit) | On-demand (before releases) |
|---|---|
| Tool description structural checks (Tier 1) | Agent behavior live FM evaluation |
| Server schema validation | Response accuracy Tier 2 (faithfulness/completeness) |
| Server error-path coverage | Chatbot integration scenarios |
| Server golden-file assertions | Multi-model stability testing |
| Response accuracy Tier 1 correctness | FM-scored rubric evaluation (Tier 2) |
Key principle: "Don't run live LLM tests in CI. Too expensive, too slow, too flaky." (Block Engineering)
Key principle: "Prefer deterministic graders where possible; use LLM graders where necessary." (Anthropic)
| Area | Minimum |
|---|---|
| Tool Description Quality | All tools (structural checks iterate automatically) |
| Server Correctness | 3 cases per tool (happy, invalid, not-found) |
| Agent Behavior | 3-5 scenarios per tool + 2-3 multi-tool workflows |
| Response Accuracy | 1 closed-loop scenario per critical user journey |
| Chatbot Integration | 2-3 coreference + 1 workflow + 1 pressure scenario |
When response accuracy is low, trace through the chain:
Tool Description Quality -> Agent Behavior -> Server Correctness -> Response Accuracy
Did the FM pick the right tool? (Agent Behavior) -> Were the arguments correct? -> Did the server return correct results? (Server Correctness) -> Did the FM synthesize faithfully? (Response Accuracy Tier 2)
After implementing improvements and tests:
- Run agent-artifex:assess periodically to re-evaluate coverage as the project evolves