From magic-powers
Use when reducing AI API costs — prompt caching, token reduction, batch processing, cost accounting for multi-step workflows, and building a cost optimization strategy for LLM-powered applications.
npx claudepluginhub kienbui1995/magic-powers --plugin magic-powers

This skill uses the workspace's default tool permissions.
LLM API costs grow faster than usage — every inefficiency compounds with scale. Optimization is not a one-time fix but a set of layered practices: know where money goes, reduce the most expensive drivers first, and track continuously so cost growth is caught before the invoice arrives.
Understanding the cost structure is a prerequisite to optimization:
Cost = (input_tokens × input_price) + (output_tokens × output_price)
Claude Sonnet 4.6: $3/1M input, $15/1M output → output is 5x more expensive
GPT-4o: $2.5/1M input, $10/1M output → output is 4x more expensive
Key insight: the biggest savings come from generating less output, not from shortening prompts.
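The formula is simple enough to encode directly. A minimal sketch, using the per-million-token list prices quoted above (prices change over time, so treat the table as illustrative):

```python
# Per-million-token prices from the examples above (USD).
# Verify against current provider pricing before relying on these numbers.
PRICES = {
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost = (input_tokens x input_price) + (output_tokens x output_price)."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 10k input + 2k output on Sonnet: the 2k of output costs
# as much as the 10k of input, so output dominates.
cost = request_cost("claude-sonnet-4-6", 10_000, 2_000)
```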
Token cost drivers by category:
| Driver | Impact | Fix |
|---|---|---|
| Verbose system prompts | High input cost per request | Compress, cache prefix |
| Long output generation | Highest cost | Constrain with format/length instructions |
| Multi-step agent loops | Compound input+output × steps | Reduce unnecessary steps |
| RAG context | High input cost | Improve retrieval precision, trim irrelevant chunks |
| No caching on repeated prefixes | Redundant input tokens | Implement prompt caching |
Cache repeated prompt prefixes to pay only for the delta:
```python
# Anthropic prompt caching — cache system prompt + static content
response = client.messages.create(
    model="claude-sonnet-4-6",
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,  # 2000+ tokens
        "cache_control": {"type": "ephemeral"},  # cached for 5 minutes
    }],
    messages=[{"role": "user", "content": user_query}],
)
# First request: pay full price
# Subsequent requests: pay ~10% of the system-prompt cost (cached)

# OpenAI automatic caching (prefixes over 1024 tokens)
# Same request structure — caching is applied automatically for repeated prefixes
```
Cache hit conditions:
- The cached prefix (system prompt, tools, and messages up to the cache breakpoint) must match exactly
- The prefix must meet the provider's minimum cacheable length (on the order of 1,024 tokens for Anthropic models)
- The follow-up request must arrive within the cache TTL (5 minutes for ephemeral caching; the TTL refreshes on each hit)

Expected savings: a 42% median reduction in input cost for workloads with stable system prompts.
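To see where the "pay ~10% of the system-prompt cost" figure leads, here is a rough estimator. It assumes Anthropic's published multipliers for ephemeral caching (cache writes billed at about 1.25x the base input rate, cache reads at about 0.1x); verify current pricing before budgeting on these numbers:

```python
def uncached_prefix_cost(prefix_tokens: int, requests: int,
                         input_price_per_m: float = 3.00) -> float:
    """Input cost of resending the same prefix on every request."""
    return prefix_tokens * input_price_per_m / 1_000_000 * requests

def cached_prefix_cost(prefix_tokens: int, requests: int,
                       input_price_per_m: float = 3.00,
                       write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """One cache write, then cache reads for the remaining requests."""
    base = prefix_tokens * input_price_per_m / 1_000_000
    return base * write_mult + base * read_mult * (requests - 1)

# 2,000-token system prompt reused across 100 requests inside the cache window
saved = 1 - cached_prefix_cost(2000, 100) / uncached_prefix_cost(2000, 100)
# close to a 90% reduction on the prefix portion of input cost
```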
Output tokens cost several times more than input tokens (5x for Sonnet, 4x for GPT-4o above), making output the highest-ROI optimization target:
```python
# Verbose output — costs 3-5x more than needed
prompt = "Analyze this code and tell me everything you notice."
# Output: ~2000 tokens of verbose analysis

# Constrained output — same information, 60-80% fewer tokens
prompt = """Analyze this code. Respond in JSON:
{
  "issues": [{"severity": "high|medium|low", "description": "...", "line": N}],
  "summary": "one sentence"
}
Maximum 5 issues. No explanation beyond the fields."""
# Output: ~300 tokens of structured data
```
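Putting prices on those two prompts, at the $15/1M Sonnet output rate quoted earlier:

```python
OUTPUT_PRICE_PER_M = 15.00  # Sonnet output rate quoted above (USD per 1M tokens)

def output_cost(tokens: int) -> float:
    return tokens * OUTPUT_PRICE_PER_M / 1_000_000

verbose = output_cost(2000)       # free-form analysis
constrained = output_cost(300)    # JSON-constrained version
reduction = 1 - constrained / verbose  # 85% fewer output dollars per call
```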
Output reduction techniques:
- Require structured formats (JSON with a fixed schema) instead of free-form prose
- State explicit limits: "maximum 5 issues", "one sentence", "no preamble"
- Set `max_tokens` as a hard cap so a runaway generation cannot blow the budget
- Ask for deltas or references ("list the changed lines") rather than full restatements
```python
# On-demand (real-time): full price, immediate response
response = client.messages.create(model="claude-sonnet-4-6", ...)

# Batch API — Anthropic and OpenAI both offer this
batch = client.messages.batches.create(requests=[
    {"custom_id": f"item_{i}", "params": {"model": "claude-sonnet-4-6", ...}}
    for i in range(1000)
])
# ~50% cheaper, results available within 24 hours
```
When to batch:
- Offline evaluation runs, backfills, and bulk classification or extraction jobs
- Any workload whose results are consumed hours later (nightly reports, data pipelines)

When NOT to batch:
- User-facing requests where latency matters
- Workflows where each step depends on the previous step's output
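The routing decision above can be sketched as a small helper, assuming the 50% batch discount and ~24h turnaround described here (the threshold and per-request cost are illustrative):

```python
BATCH_DISCOUNT = 0.50  # both providers advertise roughly half price for batch

def choose_mode(latency_budget_s: float, per_request_cost: float, n: int):
    """Route to the Batch API when the caller can tolerate ~24h turnaround."""
    if latency_budget_s >= 24 * 3600:
        return "batch", per_request_cost * n * (1 - BATCH_DISCOUNT)
    return "realtime", per_request_cost * n

# Overnight eval run: 48h latency budget, 1,000 requests at ~$0.01 each
mode, cost = choose_mode(48 * 3600, 0.01, 1000)  # batch mode, about half the price
```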
Single-call cost tracking misses compound costs in agent workflows:
```python
class WorkflowCostTracker:
    def __init__(self, budget_usd: float):
        self.budget = budget_usd
        self.steps = []
        self.total_cost = 0.0

    def record_step(self, step_name: str, usage: TokenUsage, model: str):
        cost = calculate_cost(usage, model)
        self.total_cost += cost
        self.steps.append({
            "step": step_name,
            "input_tokens": usage.input_tokens,
            "output_tokens": usage.output_tokens,
            "cost_usd": cost,
            "cumulative_cost": self.total_cost,
        })
        if self.total_cost > self.budget * 0.8:
            # Early warning at 80% of budget
            log.warning(
                f"Workflow at {self.total_cost / self.budget:.0%} of budget "
                f"after {len(self.steps)} steps"
            )
        if self.total_cost > self.budget:
            raise BudgetExceededError(f"Workflow exceeded ${self.budget} budget")

    def report(self) -> CostReport:
        top_steps = sorted(self.steps, key=lambda s: s["cost_usd"], reverse=True)[:3]
        return CostReport(
            total=self.total_cost,
            step_count=len(self.steps),
            most_expensive_steps=top_steps,
            # Guard against division by zero on an empty workflow
            cost_per_step=self.total_cost / max(len(self.steps), 1),
        )
```
Beyond prompt caching — cache entire responses for identical inputs:
```python
import hashlib
from cachetools import TTLCache

class CachedLLM:
    def __init__(self, cache_ttl_seconds=3600):
        self.cache = TTLCache(maxsize=1000, ttl=cache_ttl_seconds)

    def generate(self, prompt: str, model: str, **kwargs) -> str:
        # Only cache deterministic prompts (temperature=0)
        if kwargs.get("temperature", 1.0) > 0:
            return self._raw_generate(prompt, model, **kwargs)
        cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
        if cache_key in self.cache:
            metrics.record("cache_hit", model=model)
            return self.cache[cache_key]
        result = self._raw_generate(prompt, model, **kwargs)
        self.cache[cache_key] = result
        metrics.record("cache_miss", model=model)
        return result
```
Cache-able patterns:
- Classification or extraction over inputs that repeat (support tickets, log lines)
- FAQ-style answers and canned explanations
- Deterministic transformations of fixed documents (format conversion, summarization)
Not cache-able: Personalized responses, time-sensitive queries, anything with temperature > 0.
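The payoff of response caching is easy to model: only misses reach the API. A back-of-the-envelope sketch (the per-call cost and hit rate below are illustrative, not measured):

```python
def effective_cost(per_call_cost: float, calls: int, hit_rate: float) -> float:
    """Only cache misses reach the API; hits are served locally for ~free."""
    return per_call_cost * calls * (1 - hit_rate)

# 10,000 repeat-heavy classification calls at ~$0.002 each, 40% hit rate
spend = effective_cost(0.002, 10_000, 0.40)  # about $12 instead of $20
```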
- model-routing for model-tier cost optimization
- agentic-reliability — retries increase cost; budget tracking catches retry storms
- llm-observability for production cost monitoring and trend alerts
- @ai-engineer uses this when designing AI system architecture