Help us improve
Share bugs, ideas, or general feedback.
From agent-creator
Interactive builder for creating, testing, and iteratively improving custom Claude Code agents. Use this skill when the user asks to "create an agent", "build me an agent", "make an agent for X", "I want an agent that...", "create a domain expert", "make a conversational agent", "agent builder", "new agent", or describes any persona they want to converse with or delegate tasks to via `claude --agent`. Also use when the user wants to turn a workflow into a reusable agent, test an existing agent, improve an agent's behavior, or asks about creating agents for specific topics. Even if the user doesn't say "agent" explicitly — if they say something like "I want to be able to talk to Claude about security anytime" or "can I have a Go expert on demand" or "my agent isn't behaving right", this skill applies.
npx claudepluginhub lgbarn/agent-creator --plugin agent-creatorHow this skill is triggered — by the user, by Claude, or both
Slash command
/agent-creator:agent-creatorThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Build, test, and iteratively improve custom Claude Code agents through a guided workflow. Agents are markdown files with YAML frontmatter that define a specialized Claude persona — started with `claude --agent name` for a full session, or invoked mid-conversation with `@agent-name`.
Provides a checklist for code reviews covering functionality, security, performance, maintainability, tests, and quality. Use for pull requests, audits, team standards, and developer training.
Share bugs, ideas, or general feedback.
Build, test, and iteratively improve custom Claude Code agents through a guided workflow. Agents are markdown files with YAML frontmatter that define a specialized Claude persona — started with claude --agent name for a full session, or invoked mid-conversation with @agent-name.
The plugin root is the parent of this skill's directory. Scripts are at ../../scripts/ relative to this file, agents at ../../agents/, and references at ../../references/.
Before starting, understand which type the user needs:
Conversational Agents — Domain experts the user talks to. The system prompt defines persona, expertise boundaries, and conversation style. Tools are minimal (often just Read, Grep, Glob or none). The user starts a session and has an ongoing dialogue.
Task Agents — Automated workers that perform specific actions. The system prompt defines procedures, verification steps, and output formats. Tools include Bash, Write, Edit, etc.
Most users asking to "create an agent" want a conversational agent. If the request sounds like "I want an expert in X", default to conversational unless they describe specific automated actions.
The full workflow is 8 steps: design (1-6), test (7), iterate (8). Adapt based on how much the user already knows — skip steps when intent is clear.
Figure out what the agent should do. Ask:
If the request is already clear (e.g., "make me a Go expert agent"), acknowledge and move on.
| Location | Path | When to use |
|---|---|---|
| Personal | ~/.claude/agents/ | Available across all projects. Default for conversational domain experts. |
| Project | .claude/agents/ | Tied to current repo, version-controlled. Default for task agents specific to a codebase. |
Create the directory if it doesn't exist (mkdir -p). Default to personal for conversational, project for task agents.
Decide on frontmatter fields. Offer sensible defaults — only ask about a field when the choice is genuinely ambiguous. Read references/agent-frontmatter-reference.md for details on any field.
Smart defaults by archetype:
| Field | Conversational Default | Task Default |
|---|---|---|
name | Derive from domain: go-expert, security-advisor | Derive from action: code-reviewer, test-generator |
model | inherit (or opus if deep reasoning needed) | inherit (or sonnet for fast iteration) |
color | Match domain semantics (see below) | Match domain semantics |
tools | Read, Grep, Glob (read-only) | Select based on what it does |
maxTurns | Omit (unlimited conversation) | 15-30 based on task complexity |
memory | Consider user if the agent should remember across sessions | Omit unless stateful |
Color semantics: blue/cyan — analysis, research. green — creation, building. yellow — validation, teaching. red — security, critical. magenta — creative, writing, design.
The system prompt defines everything about the agent's behavior. Read references/system-prompt-patterns.md for detailed patterns.
For conversational agents, use this structure:
<role>
Who the agent is, their expertise, and personality traits.
Be specific: not "you know about security" but "you specialize in
application security, OWASP Top 10, threat modeling, and secure code
review in Python and Go."
</role>
<knowledge>
What the agent knows deeply, and where its boundaries are.
Include when it should say "I don't know" or recommend external resources.
</knowledge>
<style>
How the agent communicates:
- Socratic (asks questions to guide understanding)
- Direct (gives clear answers with rationale)
- Teaching (explains concepts, builds mental models)
- Advisory (weighs tradeoffs, recommends approaches)
</style>
<rules>
Behavioral boundaries:
- What the agent should NOT do
- When to escalate or defer
- Scope limits
</rules>
For task agents, use:
<role>
Expertise and function. What this agent does and why it exists.
</role>
<instructions>
Step-by-step procedure the agent follows.
Include verification steps and output format.
</instructions>
<rules>
Hard constraints: MUST / MUST NOT rules, quality standards, when to stop.
</rules>
Key principles:
Check references/agent-templates.md if the user's request matches a common pattern — adapt a template rather than writing from scratch.
For the description field: If the agent will only be used via claude --agent name, a simple one-line description is fine. If the user also wants it to auto-trigger when delegated by Claude in normal sessions, write a richer description with <example> blocks:
description: |
Use this agent when the user asks about application security,
threat modeling, or secure coding practices. Examples:
<example>
Context: User is reviewing code for security issues.
user: "Can you check this auth flow for vulnerabilities?"
assistant: "I'll use the security-advisor agent to review this."
<commentary>Security review is this agent's core expertise.</commentary>
</example>
After saving, tell the user:
claude --agent agent-name@agent-name followed by a task.md file directly anytimeThis step verifies the agent works as intended. Run the validation script first, then create test scenarios.
Validate the agent file:
python <plugin-root>/scripts/validate_agent.py <path-to-agent.md>
This checks frontmatter fields, system prompt structure, and flags issues. Fix any errors before proceeding.
Create test scenarios: Write 2-3 realistic test scenarios — the kind of thing a real user would actually say to this agent. Include:
Save scenarios as JSON following the format in references/schemas.md. See evals/example-scenario.json for a complete example.
Each scenario has:
turns: multi-turn conversation with per-turn assertionsglobal_assertions: behavioral checks across the full conversationcontains, not_contains, regex (programmatic), behavioral, tone (graded by behavior-grader agent)Run the tests:
python <plugin-root>/scripts/run_agent_test.py \
--agent <agent-name-or-path> \
--scenario <scenario.json> \
--output-dir <workspace>/iteration-1/ \
--runs 3
The --runs flag runs each scenario N times for statistical reliability (default 1, use 3+ for benchmarking). Results go into run-1/, run-2/, etc. subdirectories with an aggregate.json summary.
Review the transcript and assertion results. For behavioral assertions that need judgment-based grading, spawn the behavior-grader agent:
Read agents/behavior-grader.md for the grader's instructions. Spawn it with:
transcript_path: path to the generated transcript.mdscenario_path: path to the test scenario JSONoutput_path: where to write grading.jsonThe grader now also:
Review results interactively:
python <plugin-root>/eval-viewer/generate_review.py <workspace>/iteration-1/ \
--agent-name <name>
This launches a browser-based viewer showing each scenario's multi-turn transcript, assertions, grading results, and extracted claims. You can:
feedback.json for the improvement loopFor environments without a browser, use --static report.html to write a standalone HTML file.
If tests reveal issues, improve the agent and re-test.
Quick iteration: Edit the agent file directly based on test feedback, then re-run Step 7.
Automated improvement loop (for more thorough optimization):
python <plugin-root>/scripts/run_loop.py \
--agent <path-to-agent.md> \
--scenarios <scenarios.json> \
--output-dir <workspace>/ \
--max-iterations 5 \
--holdout 0.3 \
--runs 3 \
--model sonnet \
--verbose
This runs the test-improve cycle automatically: test → identify failures → use Claude to improve the prompt → re-test → repeat. It saves the best-performing version.
Key flags:
--holdout 0.3: Hold out 30% of scenarios as a test set. Prevents overfitting. Best version is selected by test score.--runs 3: Run each scenario 3 times per iteration for statistical reliability.--no-baseline: Skip running the original agent for comparison (on by default). Baseline runs give you a delta showing improvement.Feedback integration: If feedback.json exists in the output directory (from the eval viewer), the improvement loop automatically incorporates human feedback. Human judgment takes priority over automated assertions — this is the recommended workflow:
python <plugin-root>/eval-viewer/generate_review.py <workspace>/iteration-1/Generate HTML reports to visualize progress across iterations:
python <plugin-root>/scripts/generate_report.py <workspace>/loop_results.json -o report.html
For live monitoring during active loops, add --auto-refresh — the page reloads every 10 seconds.
Aggregate benchmark statistics across multiple runs:
python <plugin-root>/scripts/aggregate_benchmark.py <workspace>/ --agent-name <name>
Produces benchmark.json and benchmark.md with mean±stddev for pass rates, timing, and token usage. Feed this into the eval viewer's Benchmark tab.
For deeper analysis, use the evaluation agents:
agents/agent-comparator.md): Blind A/B comparison of two agent versions. Generates a task-specific evaluation rubric tailored to the agent's domain (not a generic rubric), then scores both versions.agents/agent-analyzer.md): Post-hoc analysis of why one version performed better, with prioritized improvement suggestionsFor trigger testing (agents using <example> blocks for auto-delegation):
python <plugin-root>/scripts/run_trigger_eval.py \
--agent <path-to-agent.md> \
--eval-set <trigger-queries.json> \
--runs-per-query 3 \
--verbose
For trigger eval review (reviewing/editing queries before running optimization):
<plugin-root>/assets/eval_review.html__AGENT_NAME_PLACEHOLDER__ with the agent name__AGENT_DESCRIPTION_PLACEHOLDER__ with the current description__EVAL_DATA_PLACEHOLDER__ with the JSON eval set arrayeval_set.jsonThe trigger eval set is a JSON array of queries with should_trigger flags:
[
{"query": "can you review this auth flow for security issues?", "should_trigger": true},
{"query": "write a fibonacci function in Python", "should_trigger": false}
]
This tests whether Claude correctly delegates to the agent for matching queries and avoids delegating for non-matching ones. Only relevant for agents that use description-based auto-triggering — agents invoked solely via claude --agent don't need this.
For description optimization: The improvement loop supports --mode description to iteratively optimize the description field instead of the system prompt.
Keep iterating until:
Write access means it might try to create files when the user just wants to talk about cookingmemory: userAll scripts are in <plugin-root>/scripts/:
| Script | Purpose | Usage |
|---|---|---|
validate_agent.py | Validate agent .md file structure and fields | python scripts/validate_agent.py <agent.md> |
run_agent_test.py | Run test scenarios and capture transcripts | python scripts/run_agent_test.py --agent <name> --scenario <file> --output-dir <dir> --runs 3 |
run_loop.py | Iterative test-improve cycle with train/test split, baseline, feedback | python scripts/run_loop.py --agent <path> --scenarios <file> --output-dir <dir> --holdout 0.3 --runs 3 |
run_trigger_eval.py | Test agent description triggering accuracy | python scripts/run_trigger_eval.py --agent <path> --eval-set <file> |
improve_prompt.py | Improve system prompt using Claude + extended thinking + feedback | python scripts/improve_prompt.py --agent <path> --grading <file> --model <model> |
aggregate_benchmark.py | Aggregate results into benchmark.json with mean±stddev | python scripts/aggregate_benchmark.py <workspace> --agent-name <name> |
generate_report.py | Generate visual HTML report of improvement loop progress | python scripts/generate_report.py <loop_results.json> -o report.html |
package_agent.py | Package agent into distributable .agent archive | python scripts/package_agent.py <agent.md> [output-dir] |
utils.py | Shared parsing utilities (imported by other scripts) | — |
Eval viewer (in <plugin-root>/eval-viewer/):
| Script | Purpose | Usage |
|---|---|---|
generate_review.py | Interactive browser-based transcript viewer with feedback collection | python eval-viewer/generate_review.py <workspace> --agent-name <name> |
viewer.html | HTML template for the eval viewer (embedded data, no external deps) | Used by generate_review.py |
Assets (in <plugin-root>/assets/):
| File | Purpose | Usage |
|---|---|---|
eval_review.html | Template for reviewing/editing trigger eval queries in browser | Replace placeholders, open in browser, export eval_set.json |
Located in <plugin-root>/agents/:
| Agent | Role | When to use |
|---|---|---|
behavior-grader.md | Grade transcripts against behavioral assertions, extract claims, critique evals | After running tests — evaluates persona, boundaries, style, catches hallucination |
agent-comparator.md | Blind A/B comparison with dynamic task-specific rubrics | When comparing before/after an improvement |
agent-analyzer.md | Post-hoc analysis with improvement suggestions | After comparison — extracts actionable changes |