From agents
Prompt engineering. Craft, analyze, harden, convert, design tool prompts, and build PromptOps/eval plans. Use for system, agent, tool, RAG prompts. NOT for running prompts or building agents.
npx claudepluginhub wyattowalsh/agents --plugin agentsThis skill uses the workspace's default tool permissions.
Comprehensive prompt and context engineering. Every non-obvious recommendation must be evidence-scoped.
Creates isolated Git worktrees for feature branches with prioritized directory selection, gitignore safety checks, auto project setup for Node/Python/Rust/Go, and baseline verification.
Executes implementation plans in current session by dispatching fresh subagents per independent task, with two-stage reviews: spec compliance then code quality.
Dispatches parallel agents to independently tackle 2+ tasks like separate test failures or subsystems without shared state or dependencies.
Comprehensive prompt and context engineering. Every non-obvious recommendation must be evidence-scoped.
Use these terms exactly throughout all modes:
| Term | Definition |
|---|---|
| system prompt | The top-level instruction block sent before user messages; sets model behavior |
| context window | The full token budget: system prompt + conversation history + tool results + retrieved docs |
| context engineering | Designing the entire context window, not just the prompt text — write, select, compress, isolate |
| template | A reusable prompt structure with variable slots ({{input}}, named placeholders, or runtime arguments) |
| rubric | A scoring framework with dimensions, levels (1-5), and concrete examples per level |
| few-shot example | An input/output pair included in the prompt to demonstrate desired behavior |
| chain-of-thought (CoT) | Explicit step-by-step reasoning scaffolding; beneficial for instruction-following models, harmful for reasoning models |
| model class | Either "instruction-following" or "reasoning" — determines which techniques apply |
| injection | Untrusted input that manipulates model behavior outside intended boundaries |
| anti-pattern | A prompt construction that reliably degrades output quality |
| over-specification | Adding constraints past the point where they help; too much specificity can degrade output quality and flexibility |
| scorecard | The 5-dimension diagnostic (Clarity, Completeness, Efficiency, Robustness, Model Fit) scored 1-5 |
| playbook | Model-family-specific guidance document in references/model-playbooks.md |
| prefix caching | Cost optimization by placing static content early so API providers cache the prefix |
| trust boundary | The separation between trusted instructions and untrusted user, tool, retrieved, or external content |
| evidence class | Source type behind a recommendation: official docs, peer-reviewed/preprint, community heuristic, local practice, or single-study |
| PromptOps | Versioning, evals, rollout checks, observability, and rollback discipline for prompts in production |
| $ARGUMENTS | Action |
|---|---|
craft <description> | → Mode A: Craft a new prompt from scratch |
analyze <prompt or path> | → Mode B: Analyze and improve an existing prompt |
audit <prompt or path> | → Mode B: Analyze, report only (no changes) |
convert <source-model> <target-model> <prompt or path> | → Mode C: Convert between model families |
evaluate <prompt or path> | → Mode D: Build evaluation framework |
harden <prompt or path> | → Mode E: Security and robustness hardening |
tool <tool definition or schema> | → Mode F: Design or review model-facing tool definitions |
promptops <prompt or path> | → Mode G: Versioning, rollout, observability, and rollback plan |
| Raw prompt text (XML tags, role definitions, multi-section structure) | → Auto-detect: Mode B (Analyze, report only) |
| Natural-language request describing desired behavior | → Auto-detect: Mode A (Craft) |
| Empty / no args | Show mode menu with examples |
If no explicit mode keyword is provided:
<system>, <instructions>), role definitions (You are..., Act as...), instruction markers (## Instructions, ### Rules), or multi-section structure → existing prompt → Analyze, report only (Mode B)/prompt-engineer craft a system prompt for a RAG customer support agent on Claude
/prompt-engineer analyze ./prompts/system.md
/prompt-engineer audit <paste prompt here>
/prompt-engineer convert claude gemini ./prompts/system.md
/prompt-engineer evaluate ./prompts/agent-system.md
/prompt-engineer harden ./prompts/rag-system.md
/prompt-engineer tool ./tools/search-docs.json
/prompt-engineer promptops ./prompts/support-agent.md
When $ARGUMENTS is empty, present the mode menu:
| Mode | Command | Purpose |
|---|---|---|
| Craft | craft <description> | Build a new prompt from scratch |
| Analyze | analyze <prompt> | Diagnose and improve an existing prompt |
| Analyze (report only) | audit <prompt> | Read-only review for anti-patterns and security |
| Convert | convert <src> <tgt> <prompt> | Port between model families |
| Evaluate | evaluate <prompt> | Build test suite and evaluation rubric |
| Harden | harden <prompt> | Stress-test trust boundaries, injection resistance, and safety |
| Tool | tool <schema> | Design or review model-facing tool definitions |
| PromptOps | promptops <prompt> | Plan versioning, rollout, monitoring, and rollback |
Paste a prompt, describe what you need, or pick a mode above.
Non-negotiable constraints governing all modes. Violations are bugs.
Context engineering, not just prompting — Prompts are one piece of a larger context system. Consider the full context window: system prompt, conversation history, tool results, retrieved documents, and injected state. Most production failures are context failures, not prompt failures. Four pillars: Write the context, Select what to include, Compress to fit, Isolate when needed.
Model-class awareness — Instruction-following models and reasoning models
respond differently to the same techniques. Techniques that help
instruction-followers can hurt reasoning models in some tested settings, while
provider guidance can change by model and reasoning mode. Always detect model
class first and verify model-specific advice against references/model-playbooks.md.
Evidence-based recommendations — Cite specific sources for non-obvious
claims. Do not present anecdotal patterns as established best practice.
Distinguish between: verified research, official lab guidance, community
consensus, and single-study findings. Read references/model-playbooks.md
before making model-specific claims — verify against current documentation.
Empirical iteration — Prompts are hypotheses, not solutions. Every prompt needs testing against edge cases. The first draft is never the final version. Recommend eval frameworks for any non-trivial prompt.
Avoid over-specification — The Over-Specification Paradox (UCL, Jan 2026) is a useful heuristic, not a law. Once intent is clear, extra constraint language can reduce output quality or flexibility. Keep the task legible, trim redundant rules, and use the references for the numeric heuristic instead of hard-coding it into the default contract.
Mandatory first step for all modes. Determine the target model class before any analysis or generation. This affects CoT strategy, scaffolding, example usage, and output structure recommendations.
Heuristic: If the model has a native reasoning/thinking mode → Reasoning. Otherwise → Instruction-following. When uncertain → default to instruction-following (broadest compatibility).
Reasoning: Claude with extended thinking, GPT-5.5/GPT-5.x reasoning modes, Gemini thinking modes, o-series models, Llama reasoning variants Instruction-following: Claude 3.5 Sonnet/Haiku, GPT-4o/4.1, Gemini 2 Flash, Llama 4 standard
| Dimension | Instruction-Following | Reasoning |
|---|---|---|
| Chain-of-thought | Use concise reasoning scaffolding when it improves reliability | Do not require hidden reasoning transcripts; use provider-supported reasoning effort or short planning/preambles when current docs recommend them |
| Few-shot examples | Highly beneficial — provide 3-5 diverse examples | Minimal benefit — 1 example for format only, or zero-shot |
| Scaffolding | More structure improves output | Excessive structure constrains reasoning — provide goals, not steps |
| Prompt length | Longer prompts with details generally help | Concise prompts with clear objectives outperform verbose ones |
| Temperature | Task-dependent (0.0-1.0) | Often fixed internally; external temp has less effect |
Use this preflight for every mode before entering the mode-specific workflow.
$ARGUMENTS text or file path. If a file path is provided, read the file.references/context-management.md and surface the relevant caching, compaction, selection, or isolation decisions explicitly in the result.Build a new prompt from scratch. For when the user has no existing prompt.
Run Shared Preflight — Complete the shared preflight above, including scope and trust-boundary checks.
Requirements gathering — Ask targeted questions:
Architecture selection — Based on deployment context, select from references/architecture-patterns.md:
Context-management check — If the prompt is multi-turn, long-context, RAG, agent-loop, or cost-sensitive, read references/context-management.md and decide what should be cached, compacted, selected, or isolated.
Draft prompt — Write the prompt using the selected architecture. Apply model-class-specific guidance from references/model-playbooks.md. Use XML tags as a default starting point for multi-section prompts unless the target model or output contract suggests a better structure. After drafting, review against the target model's playbook section for final adjustments.
Structure for cacheability — Arrange content for prompt caching efficiency:
references/context-management.md before quoting cache economicsHarden — Run through references/hardening-checklist.md:
Present — Format per references/output-formats.md Annotated Prompt. Recommend Mode D (Evaluate) to build a test suite.
Diagnose an existing prompt and optionally improve it. Dispatched as analyze
(with fixes) or audit (report only, no changes).
Run Shared Preflight — Complete the shared preflight above before scoring or analyzing the prompt.
Diagnostic scoring — Score the prompt on 5 dimensions using the Diagnostic Scorecard in references/output-formats.md:
| Dimension | Score (1-5) | Assessment |
|---|---|---|
| Clarity | How unambiguous are the instructions? | |
| Completeness | Are all necessary constraints and context provided? | |
| Efficiency | Is every token earning its keep? (Over-specification check) | |
| Robustness | How well does it handle edge cases and adversarial inputs? | |
| Model Fit | Is it optimized for the target model class? |
Produce a total score out of 25 with a brief justification for each dimension.
Four-lens analysis — Examine the prompt through each lens:
references/hardening-checklist.md. Assess input trust boundaries. Check for information leakage risks.Anti-pattern scan — Check against every pattern in references/anti-patterns.md. For each detected anti-pattern, report: pattern name, severity, location in the prompt, and remediation guidance.
Model-fit validation — Assess whether the prompt is well-suited to its target model and verify recommendations are current:
references/model-playbooks.md for the target model and note the "last verified" dateReport-only mode (audit): Present findings per the Audit Report in references/output-formats.md. Recommend full Analyze if fixes are needed, and recommend Mode D (Evaluate) if no eval exists. Stop here.
Full mode (analyze): Continue with steps 6-7.
Apply improvements — For each dimension scoring below 4:
references/technique-catalog.md or references/anti-patterns.md)Present — Format the diagnosis with Diagnostic Scorecard and the proposed changes with Changelog from references/output-formats.md. Recommend Mode D (Evaluate) if no eval exists.
Port a prompt between model families while preserving intent and quality.
Run Shared Preflight — Complete the shared preflight above before building the conversion plan.
Load playbooks — Read the source and target model playbook sections from references/model-playbooks.md. Note key differences:
Build conversion plan — Create a conversion checklist:
Execute conversion — Apply the plan. For each change:
Validate — Run Mode B (Analyze) report-only analysis on the converted prompt to catch issues introduced during conversion. Present per the Conversion Diff in references/output-formats.md. Recommend Mode D (Evaluate) using the same test cases on both models.
Build an evaluation framework for a prompt. Does not run the evaluations — produces the eval design.
Run Shared Preflight — Complete the shared preflight above before defining the evaluation plan.
Define success criteria — Work with the user to define what "working correctly" means:
Design test suite — Create categories of test cases from references/evaluation-frameworks.md:
Generate test cases — For each category, produce concrete test cases:
Build rubric — Create a scoring rubric per the Evaluation Framework in references/output-formats.md:
Present — Format per the Evaluation Framework in references/output-formats.md. Include recommended eval tools from references/evaluation-frameworks.md and CI/CD integration pattern.
Stress-test and improve a prompt that handles untrusted input, tool results, retrieved documents, user-facing output, or production actions.
references/hardening-checklist.md and the security sections of references/architecture-patterns.md.Harden Report template. Include residual risks and eval cases that should fail before deployment.Design or review model-facing tool definitions, function schemas, MCP tool descriptions, and tool-selection instructions.
references/architecture-patterns.md and references/hardening-checklist.md.Tool Definition Review template with rewritten tool docs or schema deltas.Plan prompt lifecycle controls. Does not deploy or run prompts.
references/evaluation-frameworks.md, references/output-formats.md, and relevant provider playbook sections.PromptOps Plan template.| File | Content | Read When |
|---|---|---|
references/technique-catalog.md | ~36 techniques across 8 categories with model-class compatibility | Selecting techniques for any mode |
references/model-playbooks.md | Claude, GPT, Gemini, Llama guidance with caching strategies | Any model-specific recommendation |
references/anti-patterns.md | 14 anti-patterns with severity, detection, and remediation | Analyzing or crafting any prompt |
references/architecture-patterns.md | Agent, RAG, tool-calling, multi-agent design patterns | Crafting agent or system prompts |
references/context-management.md | Compaction, caching, context rot, ACE framework | Designing long-context or multi-turn systems |
references/hardening-checklist.md | Security and robustness checklist (29 items) | Hardening any prompt handling untrusted input |
references/evaluation-frameworks.md | Eval approaches, PromptOps lifecycle, tool guidance | Building evaluation frameworks |
references/output-formats.md | Templates for all skill outputs (scorecards, reports, diffs) | Formatting any skill output |
scripts/validate-references.py | Deterministic checks for reference index, provider metadata, dispatch/eval coverage | After editing references, dispatch, or evals |
Read reference files as indicated by the "Read When" column above. Do not rely on memory or prior knowledge of their contents. Reference files are the source of truth. If a reference file does not exist, proceed without it but note the gap.
references/hardening-checklist.md)audit) in Analyze is read-only — never modify the prompt being auditedreferences/model-playbooks.md, include its last-verified date, and flag stale guidance older than 90 daysscripts/validate-references.py after reference, dispatch, or eval changes| Rule pressure | Likely dodge | Required counter |
|---|---|---|
| Provider-doc freshness | "I remember the current model behavior." | Check model-playbooks.md and report the last-verified date. |
| Security hardening | "The prompt is internal only." | State the trust boundary; if any external content exists, run Mode E. |
| Eval follow-up | "The prompt looks fine." | Include at least golden, edge, adversarial, and regression eval categories. |
| Scope boundary | "Running it would prove the prompt works." | Refuse execution and provide a prompt/eval plan instead. |
| Dispatch ambiguity | "The intent is obvious enough." | Ask for mode selection when craft-vs-analyze or prompt-vs-implementation is mixed. |