From ai-engineer
Bootstrap the AI/ML documentation structure for a project. Creates docs/ai/, generates initial templates, and writes domain CLAUDE.md. Idempotent — merges missing sections into existing files without overwriting.
npx claudepluginhub hpsgd/turtlestack --plugin ai-engineerThis skill is limited to using the following tools:
Bootstrap the AI/ML documentation structure for **$ARGUMENTS**.
Guides Next.js Cache Components and Partial Prerendering (PPR) with cacheComponents enabled. Implements 'use cache', cacheLife(), cacheTag(), revalidateTag(), static/dynamic optimization, and cache debugging.
Migrates code, prompts, and API calls from Claude Sonnet 4.0/4.5 or Opus 4.1 to Opus 4.5, updating model strings on Anthropic, AWS, GCP, Azure platforms.
Compresses source documents into lossless, LLM-optimized distillates preserving all facts and relationships. Use for 'distill documents' or 'create distillate' requests.
Bootstrap the AI/ML documentation structure for $ARGUMENTS.
mkdir -p docs/ai
For each file below, apply the safe merge pattern:
<!-- Merged from ai-engineer bootstrap v0.1.0 -->docs/ai/CLAUDE.mdCreate with this content (~100 lines):
# AI Domain
This directory contains AI/ML documentation: prompt engineering conventions, model evaluation frameworks, RAG pipeline design, and AI safety guidelines.
## What This Domain Covers
- **Prompt engineering** — design patterns, versioning, and testing
- **Model evaluation** — benchmarks, metrics, and regression testing
- **RAG pipelines** — retrieval-augmented generation architecture
- **AI safety** — guardrails, content filtering, and risk mitigation
- **Cost management** — token budgets, model selection, and caching
- **Experiment tracking** — reproducible AI experiments
## Prompt Engineering Conventions
Follow the [Anthropic prompt engineering guide](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview) as the baseline.
### Prompt structure
1. **System prompt** — role, constraints, output format
2. **Context** — relevant documents, data, or prior conversation
3. **Task** — specific instruction with clear success criteria
4. **Examples** — few-shot examples for complex tasks
### Prompt versioning
- Store prompts in `prompts/` directory, not inline in code
- Version prompts with semantic versioning (same as code)
- Every prompt change requires an eval run before deployment
- Track prompt → model → output triples for reproducibility
### Prompt testing
- Unit test: does the prompt parse correctly and include required sections?
- Eval test: does the output meet quality thresholds on the eval suite?
- Regression test: does the new prompt maintain quality on previous test cases?
## Model Evaluation
### RAG evaluation (RAGAS)
Use [RAGAS](https://docs.ragas.io/) metrics for RAG pipeline quality:
| Metric | Measures | Target |
|--------|----------|--------|
| Faithfulness | Are answers grounded in retrieved context? | > 0.8 |
| Answer relevancy | Does the answer address the question? | > 0.8 |
| Context precision | Is retrieved context relevant? | > 0.7 |
| Context recall | Is all needed context retrieved? | > 0.7 |
### General evaluation (HELM)
Use [HELM](https://crfm.stanford.edu/helm/) framework principles for general model evaluation:
- Accuracy, calibration, robustness, fairness, efficiency
- Evaluate on domain-specific test sets, not just public benchmarks
### Eval suite conventions
- Store eval datasets in `evals/` directory
- Minimum 50 test cases per eval suite
- Run evals in CI on prompt or model changes
- Track eval scores over time for regression detection
## AI Safety (OWASP LLM Top 10)
Address the [OWASP Top 10 for LLM Applications](https://owasp.org/www-project-top-10-for-large-language-model-applications/):
| Risk | Mitigation |
|------|------------|
| Prompt injection | Input validation, system prompt hardening |
| Insecure output handling | Output sanitisation, type checking |
| Training data poisoning | Data provenance tracking |
| Denial of service | Token limits, rate limiting, timeouts |
| Supply chain vulnerabilities | Model provenance, dependency audit |
| Sensitive information disclosure | PII filtering, output guardrails |
| Insecure plugin design | Least-privilege tool access |
| Excessive agency | Human-in-the-loop for high-risk actions |
| Overreliance | Confidence scoring, citation requirements |
| Model theft | Access controls, API key rotation |
## Cost Management
- Set per-request token budgets (input + output)
- Use cheaper models for simple tasks, capable models for complex tasks
- Cache identical or near-identical requests
- Monitor daily/weekly spend and alert on anomalies
- Log token usage per feature for cost attribution
## Tooling
| Tool | Purpose |
|------|---------|
| [OpenRouter](https://openrouter.ai) | Model access and routing |
| [GitHub Actions](https://docs.github.com/en/actions) | Eval suite in CI on prompt/model changes |
| [Perplexity](https://perplexity.ai) | AI research and literature review |
## Available Skills
| Skill | Purpose |
|-------|---------|
| `/ai-engineer:prompt-design` | Design and optimise prompts |
| `/ai-engineer:model-evaluation` | Create model evaluation suites |
| `/ai-engineer:rag-pipeline` | Design RAG pipeline architecture |
## Conventions
- Every prompt is version-controlled and eval-tested before deployment
- Model changes require eval regression testing
- AI safety review for any user-facing LLM feature
- Cost budgets are set per feature and monitored weekly
- RAG pipelines are evaluated with RAGAS metrics before release
- Experiment results are logged for reproducibility
docs/ai/prompt-library.mdCreate with this content:
# Prompt Library
> Central catalogue of production prompts. Each prompt is version-controlled and eval-tested.
## Prompt Registry
| Prompt ID | Version | Model | Purpose | Eval Score | Owner |
|-----------|---------|-------|---------|------------|-------|
| | | | | | |
## Prompt Template
### `{prompt_id}` — v{version}
**Model:** {model name and version}
**Temperature:** {0.0–1.0}
**Max tokens:** {limit}
#### System Prompt
[System prompt content here]
#### User Prompt Template
[User prompt with {variables} marked]
#### Variables
| Variable | Type | Description | Required |
|----------|------|-------------|----------|
| | | | |
#### Examples
**Input:**
[Example input]
**Expected output:**
[Example output]
#### Eval Results
| Eval Suite | Score | Date | Notes |
|------------|-------|------|-------|
| | | | |
> Add new prompts here before implementing in code. Every prompt must have at least one eval run.
docs/ai/eval-suite-template.mdCreate with this content:
# Eval Suite — [Feature/Prompt Name]
> Template for creating model evaluation suites. Copy for each new eval.
## Metadata
| Field | Value |
|-------|-------|
| Prompt ID | |
| Model | |
| Created | YYYY-MM-DD |
| Last run | YYYY-MM-DD |
| Owner | |
## Test Cases
| # | Input | Expected Output | Criteria | Pass/Fail |
|---|-------|----------------|----------|-----------|
| 1 | | | | |
| 2 | | | | |
## Metrics
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Accuracy | >= % | % | |
| Faithfulness (RAGAS) | >= 0.8 | | |
| Relevancy (RAGAS) | >= 0.8 | | |
| Latency (p95) | <= ms | ms | |
| Cost per request | <= $ | $ | |
## Failure Analysis
| Test # | Failure Mode | Root Cause | Action |
|--------|-------------|------------|--------|
| | | | |
## History
| Date | Version | Score | Model | Notes |
|------|---------|-------|-------|-------|
| | | | | |
> Minimum 50 test cases per eval suite. Run evals in CI on every prompt change.
After creating/merging all files, output a summary:
## AI Engineer Bootstrap Complete
### Files created
- `docs/ai/CLAUDE.md` — domain conventions and skill reference
- `docs/ai/prompt-library.md` — prompt catalogue template
- `docs/ai/eval-suite-template.md` — model evaluation suite template
### Files merged
- (list any existing files where sections were appended)
### Next steps
- Register production prompts in `prompt-library.md`
- Create eval suites for each prompt using `/ai-engineer:model-evaluation`
- Design RAG pipelines using `/ai-engineer:rag-pipeline`
- Review AI safety checklist for user-facing features