From builder-ai
Use when prompt cost is too high, latency is above threshold, or context window limits are being approached. Requires measurement before and after each reduction. Blocks "I shortened the prompt so it should be cheaper" completions.
npx claudepluginhub rbraga01/a-team --plugin builder-ai
How this skill is triggered: by the user, by Claude, or both
Slash command
/builder-ai:context-optimization
PROMPT COST IS NOT OPTIMISED BY GUESSING.
"I already have a short prompt" is a guess about token count.
"Reducing context will hurt quality" is a guess about the quality/cost curve.
Measure first. Apply the hierarchy. Measure again. THEN claim improvement.
Trigger when:
Apply in order. Stop when the target is met. Do not apply all steps preemptively.
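The "apply in order, stop when the target is met" rule can be sketched as a loop. This is only an illustration under stated assumptions: `apply_until_target`, the toy cost model, and the step functions are hypothetical names, and a real run would also re-check quality with the eval suite after every step.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]
Step = Tuple[str, Callable[[Config], Config]]


def apply_until_target(
    config: Config,
    steps: List[Step],
    measure_cost: Callable[[Config], float],
    target_cost: float,
) -> Tuple[Config, List[str]]:
    """Apply reduction steps in order, stopping once the cost target is met."""
    applied: List[str] = []
    for name, step in steps:
        if measure_cost(config) <= target_cost:
            break  # target met: do not apply later steps preemptively
        config = step(config)
        applied.append(name)
    return config, applied


# Hypothetical cost model: cost is simply the number of input tokens.
def toy_cost(cfg: Config) -> float:
    return float(cfg["system_tokens"] + cfg["top_k"] * cfg["chunk_tokens"])


# Hypothetical steps; the labels and effects are made up for the example.
steps: List[Step] = [
    ("L1 (trim)", lambda c: {**c, "system_tokens": 400}),
    ("L2 (top_k)", lambda c: {**c, "top_k": c["top_k"] - 1}),
    ("later step", lambda c: {**c, "chunk_tokens": 200}),
]

start = {"system_tokens": 900, "top_k": 5, "chunk_tokens": 300}
final, applied = apply_until_target(start, steps, toy_cost, target_cost=1700)
print(applied, final)
```

With these made-up numbers the loop stops after the second step, so the third reduction is never applied.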
Audit every sentence:
Target: system prompt under 500 tokens for most tasks. Measure token count precisely:
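import tiktoken

# cl100k_base is an OpenAI encoding; for other model families the count is only an approximation.
enc = tiktoken.get_encoding("cl100k_base")
system_prompt = "You are a helpful assistant."  # replace with the real system prompt
print(len(enc.encode(system_prompt)))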
If using RAG, before any other change:
Reduce top_k by 1 and run eval-before-ship; recall often holds at lower top_k
MAX_CONTEXT_TOKENS = context_window × 0.4
Before injecting:
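One way to enforce such a cap before injecting retrieved chunks is sketched below. It assumes the chunks arrive sorted most-relevant first and that the caller supplies a token counter. `fit_to_budget` and the whitespace-based `rough_count` are illustrative stand-ins, not a real tokenizer.

```python
from typing import Callable, List


def fit_to_budget(
    chunks: List[str],
    context_window: int,
    count_tokens: Callable[[str], int],
    fraction: float = 0.4,
) -> List[str]:
    """Keep the highest-ranked chunks whose total stays within the budget."""
    max_context_tokens = int(context_window * fraction)
    kept: List[str] = []
    used = 0
    for chunk in chunks:  # assumed to be sorted most-relevant first
        cost = count_tokens(chunk)
        if used + cost > max_context_tokens:
            break  # stop at the first chunk that would overflow the cap
        kept.append(chunk)
        used += cost
    return kept


def rough_count(text: str) -> int:
    # Stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
kept = fit_to_budget(chunks, context_window=20, count_tokens=rough_count)
print(kept)
```

Stopping at the first overflowing chunk, rather than skipping it and trying smaller ones, keeps the relevance order intact; that is a design choice of this sketch, not something the skill prescribes.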
Move expensive operations to a cheaper model:
Before: Frontier model does everything
After:  Fast/cheap model → classify, extract, summarise
        Frontier model → final answer generation only
This typically cuts costs 5–10× with minimal quality loss. The frontier model only sees pre-processed, high-signal input.
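The two-stage split can be sketched with plain callables standing in for the models. Everything here is hypothetical: `two_stage_answer`, the prompt wording, and the stub models are placeholders to be replaced with real client calls.

```python
from typing import Callable, List, Tuple


def two_stage_answer(
    raw_input: str,
    question: str,
    cheap_model: Callable[[str], str],
    frontier_model: Callable[[str], str],
) -> str:
    """Pre-process with the cheap model, then generate with the frontier model."""
    # Stage 1: the cheap model condenses the raw input.
    summary = cheap_model(f"Summarise the key facts:\n{raw_input}")
    # Stage 2: the frontier model sees only the condensed input.
    return frontier_model(f"Facts:\n{summary}\n\nQuestion: {question}")


# Stub models so the sketch runs without any API access.
calls: List[Tuple[str, int]] = []


def fake_cheap(prompt: str) -> str:
    calls.append(("cheap", len(prompt)))
    return "order #123 shipped late"


def fake_frontier(prompt: str) -> str:
    calls.append(("frontier", len(prompt)))
    return "Apologise and offer a refund."


long_ticket = "customer message " * 200
answer = two_stage_answer(long_ticket, "How should we reply?", fake_cheap, fake_frontier)
print(answer)
print(calls)  # the frontier prompt is far shorter than the cheap-model prompt
```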
If system prompt or retrieval context is identical across many calls:
Prompt caching saves 50–90% on the cached portion. It is the highest-leverage optimization when the system prompt is long and stable.
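How far a discount on the cached portion moves the total input bill depends on what share of the input is cacheable. The sketch below works through that arithmetic under simplifying assumptions: every call hits the cache and cache-write surcharges are ignored, which real providers do not guarantee. `effective_input_savings` is a hypothetical helper.

```python
def effective_input_savings(
    cached_tokens: int,
    total_input_tokens: int,
    cache_discount: float,
) -> float:
    """Fraction of total input cost saved when cached tokens are billed at a discount."""
    if total_input_tokens <= 0 or not 0 <= cached_tokens <= total_input_tokens:
        raise ValueError("cached_tokens must be between 0 and total_input_tokens")
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")
    cached_share = cached_tokens / total_input_tokens
    return cached_share * cache_discount


# Long, stable system prompt: most of the input is cacheable.
long_prompt = effective_input_savings(4000, 5000, cache_discount=0.9)
# Short system prompt: the same discount barely moves the total.
short_prompt = effective_input_savings(400, 5000, cache_discount=0.9)
print(f"long prompt: {long_prompt:.1%}, short prompt: {short_prompt:.1%}")
```

The contrast between the two calls is the point: the same discount matters much more when the stable portion dominates the input.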
Run this before applying any reduction, and after:
Token count: <input tokens>, <output tokens>, <total>
Latency: p50 = Xs, p95 = Ys
Cost/1k: $Z
Quality: <eval suite pass rate: A%>
A reduction that saves 30% cost but drops quality 10% past threshold is not an optimization — it is a regression.
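The template and the regression rule can be captured in a small record plus a gate. This is a sketch, not the skill's own tooling: `Measurement`, `is_optimization`, and the sample numbers are hypothetical, and the pass rate is assumed to come from the eval suite.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List


@dataclass
class Measurement:
    input_tokens: int
    output_tokens: int
    latencies_s: List[float]  # one entry per measured call, at least two
    cost_per_1k: float        # the "Cost/1k" figure from the template, in dollars
    pass_rate: float          # eval suite pass rate, 0.0 to 1.0

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    def latency_percentile(self, pct: int) -> float:
        # quantiles(n=100) returns the 99 cut points p1..p99; "inclusive"
        # keeps the result within the observed range of the sample.
        return quantiles(self.latencies_s, n=100, method="inclusive")[pct - 1]


def is_optimization(before: Measurement, after: Measurement, quality_threshold: float) -> bool:
    """True only if cost went down and quality stayed at or above the threshold."""
    cheaper = after.cost_per_1k < before.cost_per_1k
    return cheaper and after.pass_rate >= quality_threshold


latencies = [0.8, 0.9, 1.0, 1.1, 2.5]
before = Measurement(3200, 400, latencies, cost_per_1k=50.0, pass_rate=0.92)
too_lossy = Measurement(2100, 400, latencies, cost_per_1k=35.0, pass_rate=0.80)
acceptable = Measurement(2100, 400, latencies, cost_per_1k=35.0, pass_rate=0.91)

print(before.total_tokens, before.latency_percentile(50), before.latency_percentile(95))
print(is_optimization(before, too_lossy, quality_threshold=0.90))
print(is_optimization(before, acceptable, quality_threshold=0.90))
```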
These thoughts mean the prompt has not been measured — stop:
When context-optimization is satisfied, state it like this:
Context optimized.
Reductions applied: <list of levels: e.g., L1 (trim), L2 (top_k 10→5), L5 (caching)>
Before:
Input tokens: N (system: A, context: B, user: C)
Latency p95: Xs
Cost/1k: $Y
Quality: Z% — evals/<feature>/results-<date-before>.md
After:
Input tokens: N' (system: A', context: B', user: C')
Latency p95: X's
Cost/1k: $Y'
Quality: Z'% — evals/<feature>/results-<date-after>.md ✓ (above threshold)
Delta: -X% cost, -Ys latency, quality delta: Z'% - Z% = ±Xpp
Quality must be verified with eval-before-ship — not eyeballed.
LLM API costs scale with volume. A pipeline that costs $0.05/call at 1k calls/day costs $1,500/month (30 days). The same pipeline at 10k calls/day costs $15,000/month. Optimization at 1k is a 2-hour project. Optimization at 10k is a crisis sprint.
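The volume arithmetic is simple enough to keep as a helper. A sketch assuming a flat per-call cost and a 30-day month; `monthly_cost` is a hypothetical name.

```python
def monthly_cost(cost_per_call: float, calls_per_day: int, days_per_month: int = 30) -> float:
    """Projected monthly spend in dollars for a flat per-call cost."""
    return cost_per_call * calls_per_day * days_per_month


for calls_per_day in (1_000, 10_000):
    projected = monthly_cost(0.05, calls_per_day)
    print(f"{calls_per_day:>6} calls/day: ${projected:,.0f}/month")
```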