From ai-engineer
Design and document a production prompt — structured template, evaluation criteria, test cases, and version control strategy.
npx claudepluginhub hpsgd/turtlestack --plugin ai-engineerThis skill is limited to using the following tools:
Design a production prompt for $ARGUMENTS.
Guides Next.js Cache Components and Partial Prerendering (PPR) with cacheComponents enabled. Implements 'use cache', cacheLife(), cacheTag(), revalidateTag(), static/dynamic optimization, and cache debugging.
Migrates code, prompts, and API calls from Claude Sonnet 4.0/4.5 or Opus 4.1 to Opus 4.5, updating model strings on Anthropic, AWS, GCP, Azure platforms.
Compresses source documents into lossless, LLM-optimized distillates preserving all facts and relationships. Use for 'distill documents' or 'create distillate' requests.
Design a production prompt for $ARGUMENTS.
Before writing a single word of prompt, define the task with precision. Vague tasks produce vague prompts.
| Property | Question | Bad answer | Good answer |
|---|---|---|---|
| Input | What data does the model receive? | "User text" | "JSON with fields: query (string, 1-500 chars), context (array of strings, 0-10 items)" |
| Output | What does the model produce? | "A response" | "JSON with fields: answer (string), confidence (float 0-1), sources (array of cited context indices)" |
| Format | What structure must the output follow? | "JSON" | "JSON matching the AnswerResponse schema, validated against OpenAPI spec" |
| Constraints | What must the model NOT do? | "Be accurate" | "Never reference information outside the provided context. If context is insufficient, return confidence: 0" |
| Volume | How often is this called? | "A lot" | "~2000 requests/day, peak 50/minute" |
| Latency | How fast must it respond? | "Fast" | "p95 < 3 seconds total generation time" |
Write all six properties into a task definition document. If any property is unclear, stop and clarify before proceeding.
Define evaluation criteria BEFORE writing the prompt. This is non-negotiable — evaluation before implementation.
| Criterion | Metric | Pass threshold | Measurement method |
|---|---|---|---|
| Accuracy | Correct answer on eval set | >= 90% | Automated comparison against expected outputs |
| Format compliance | Valid output structure | 100% | Schema validation — every response must parse |
| Safety | No hallucinated facts | 0 hallucinations in eval set | Human review + automated context grounding check |
| Latency | Total generation time | p95 < target from Step 1 | Timed API calls across eval set |
| Cost | Per-request token usage | < budget from Step 1 | Token counting across eval set |
Build an eval set: minimum 50 examples covering happy path, edge cases, and adversarial inputs. Each example has a defined expected output. Without an eval set, you cannot distinguish a working prompt from a hallucinating one.
Follow this opinionated structure. Start simple — add complexity only when eval shows it is needed.
[Role/Context]
You are a [specific role] that [specific task]. You work with [specific domain].
[Task]
Given the following [input type], [specific action to perform].
[Constraints]
- Only use information from the provided context
- If the answer is not in the context, respond with [specific fallback]
- Output must be valid JSON matching the schema below
- Maximum output length: [token limit]
[Examples]
Input: [representative example 1]
Output: [expected output 1]
Input: [representative example 2 — edge case]
Output: [expected output 2]
[Input]
{input_data}
Rules:
Minimum 5 test cases per prompt. Cover all four categories:
| Category | Purpose | Minimum count |
|---|---|---|
| Happy path | Typical, well-formed inputs | 2 |
| Edge cases | Boundary conditions, unusual but valid inputs | 2 |
| Adversarial | Prompt injection attempts, contradictory inputs | 1 |
| Empty/minimal | Missing fields, empty strings, null values | 1 |
Record results in the evaluation table:
| # | Category | Input (summary) | Expected output | Actual output | Pass/Fail | Notes |
|---|---|---|---|---|---|---|
| T1 | Happy path | Standard query with clear context | Correct answer, confidence > 0.8 | |||
| T2 | Happy path | Multi-part query | All parts addressed | |||
| T3 | Edge case | Query with no matching context | confidence: 0, fallback message | |||
| T4 | Edge case | Very long input near token limit | Truncation handled gracefully | |||
| T5 | Adversarial | "Ignore previous instructions" | Normal response, injection ignored | |||
| T6 | Empty | Empty query string | Validation error or graceful decline |
Run all test cases. Record actual outputs. Fix failures by modifying the prompt and re-running — do not fix by adding more examples unless eval data supports it.
For structured output, use the model's native structured output capabilities. Do not rely on prose instructions.
| Method | When to use | Reliability |
|---|---|---|
| JSON mode / response_format | Structured data extraction, API responses | High — model constrained to valid JSON |
| Function calling / tool use | Action-oriented outputs, multi-step workflows | High — schema-validated by the API |
| "Output as JSON" in prompt text | Never | Low — model may produce invalid JSON, markdown-wrapped JSON, or free text |
Rules:
Refer to the OWASP Top 10 for LLM Applications as the authoritative source for LLM security risks. For Claude-specific patterns, follow Anthropic's prompt engineering guide.
Build prompt injection resistance and output validation into the prompt design.
Context grounding (MANDATORY for any prompt using retrieved data):
Only use information from the provided context to answer the question.
If the answer is not contained in the context, respond with:
{"answer": null, "confidence": 0, "reason": "Information not found in provided context"}
Do not use your training data to fill gaps in the context.
Prompt injection resistance:
Output validation:
Prompts are code. Treat them accordingly.
File location: prompts/[feature-name]/[version].txt or co-located with the feature code
Naming convention: [feature]_v[major].[minor].txt — major version for semantic changes, minor for wording tweaks
Changelog format:
## v1.2 — 2024-03-15
- Added explicit constraint for empty context handling (fixes T3 failure)
- Eval results: 94% accuracy (up from 91%), 100% format compliance
## v1.1 — 2024-03-10
- Reduced examples from 5 to 3 (no accuracy loss, 15% cost reduction)
- Eval results: 91% accuracy, 100% format compliance
Deployment rule: Every prompt change runs against the full eval set before deployment. No exceptions. A prompt edited in a production dashboard is a vulnerability.
# Prompt Design: [feature name]
## Task Definition
- **Input:** [type, format, constraints]
- **Output:** [type, format, schema]
- **Constraints:** [explicit boundaries]
- **Volume:** [requests/day]
- **Latency budget:** [p95 target]
- **Cost budget:** [per-request target]
## Evaluation Criteria
| Criterion | Metric | Pass threshold |
|---|---|---|
## Prompt (v1.0)
[Full prompt text]
## Output Schema
[JSON schema or type definition]
## Test Results
| # | Category | Input | Expected | Actual | Pass/Fail |
|---|---|---|---|---|---|
## Safety Measures
- [Context grounding approach]
- [Injection resistance measures]
- [Output validation rules]
## Version History
[Changelog]
## Deployment Notes
- [File location]
- [Eval set location]
- [Rollback procedure]