Measure and optimize the cost/quality curve — which model, prompt, and settings give the best quality per dollar. Covers Pareto analysis, break-even thresholds, and when to spend more vs less. Use this skill when optimizing LLM spend, picking a default model for a feature, or deciding whether a premium model is worth it. Activate when: cost vs quality, model selection, eval cost, Pareto frontier, cheaper model, premium model tradeoff.
`npx claudepluginhub latestaiagents/agent-skills --plugin skills-authoring`

This skill uses the workspace's default tool permissions.
Quality without cost context is half a decision. You need the Pareto frontier — for each quality bar, what's the cheapest config that hits it?
Plot each candidate config (model × prompt × settings) on quality (y-axis) vs cost per request (x-axis). The frontier is the set of configs where no other config is both cheaper AND better.
Any config NOT on the frontier is dominated — always strictly worse than another option. Drop it.
```
quality
  ↑
1 |                               *A (opus + thinking)
  |                        *B (opus)
  |              *D (sonnet + few-shot)
  |              *G
  |         *C (sonnet)
  |    *E (haiku)
0 |    *F
  +---------------------------------→ cost
```
Pareto: A, B, D, C, E. Dominated: F (worse than E at same cost), G (worse than D at same cost).
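As a sketch, the frontier falls out of a simple dominance filter over your measured points. The numbers below are placeholders, not real measurements (only the sonnet row mirrors the example table that follows):

```js
// A config is dominated if another config is no worse on both axes and strictly better on one.
function paretoFrontier(configs) {
  return configs.filter(a =>
    !configs.some(b =>
      b !== a &&
      b.cost <= a.cost &&
      b.quality >= a.quality &&
      (b.cost < a.cost || b.quality > a.quality)
    )
  );
}

// Placeholder measurements; replace with your own eval results.
const measured = [
  { name: "haiku",             cost: 0.001, quality: 0.74 },
  { name: "sonnet",            cost: 0.012, quality: 0.87 },
  { name: "sonnet + few-shot", cost: 0.016, quality: 0.90 },
  { name: "opus",              cost: 0.060, quality: 0.92 },
  { name: "opus + thinking",   cost: 0.110, quality: 0.95 },
];

paretoFrontier(measured); // anything this filter drops is dominated: don't ship it
```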
For each candidate, measure:
| Metric | Example |
|---|---|
| Input tokens / request | 2,500 |
| Output tokens / request | 400 |
| $ / request | $0.012 |
| Quality score | 0.87 |
| p95 latency | 1.8s |
Compute cost per request from the `usage` block returned with each API response (rates are USD per million tokens):

```js
const costPerRequest =
  (usage.input_tokens / 1e6) * inputRate +
  (usage.output_tokens / 1e6) * outputRate +
  (usage.cache_creation_input_tokens / 1e6) * cacheWriteRate +
  (usage.cache_read_input_tokens / 1e6) * cacheReadRate;
```
Always include cache costs — they dominate on cached workloads.
For any feature, try at least:

- Haiku
- Sonnet
- Sonnet with a few-shot prompt
- Opus
- Opus with extended thinking

One of these usually sits on the frontier for your workload. Don't assume — measure.
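A minimal measurement loop, as a sketch: it assumes the `@anthropic-ai/sdk` client, a candidate config shape of your own choosing, and two placeholder helpers (`requestCost` applies the rate formula above, `scoreQuality` is your own grader).

```js
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Run one candidate (model + prompt + settings) over the eval set and
// average quality and cost per request.
async function measureCandidate({ model, system, maxTokens }, evalSet, rates) {
  let cost = 0;
  let quality = 0;
  for (const item of evalSet) {
    const resp = await anthropic.messages.create({
      model,
      system,
      max_tokens: maxTokens,
      messages: [{ role: "user", content: item.prompt }],
    });
    cost += requestCost(resp.usage, rates) / evalSet.length; // rate formula from above
    quality += scoreQuality(resp, item) / evalSet.length;    // your grader, 0–1
  }
  return { model, cost, quality };
}
```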
Before jumping to a bigger model, try prompt levers first: a better prompt on Haiku can beat a mediocre prompt on Sonnet — and cost 10× less.
When considering an upgrade, compute when it pays off:
```
Cost increase per request: Δcost = new - old
Quality increase:          Δquality = new - old
Value per quality point:   V (estimated from business metrics)

Worth it if: Δquality × V > Δcost
```
Example: If every 1% quality gain increases user retention revenue by $0.003/request, and upgrading Haiku→Sonnet costs +$0.002/request for +5% quality: the gain is worth 5 × $0.003 = $0.015/request, well above the +$0.002 cost, so the upgrade pays for itself.
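The same check as a tiny function (a sketch; one quality point here means one percentage point, and the value estimate comes from your own business metrics):

```js
// The upgrade pays off when the quality gain, priced in dollars, beats the added cost.
function upgradeWorthIt({ deltaQualityPoints, valuePerPoint, deltaCostPerRequest }) {
  return deltaQualityPoints * valuePerPoint > deltaCostPerRequest;
}

upgradeWorthIt({ deltaQualityPoints: 5, valuePerPoint: 0.003, deltaCostPerRequest: 0.002 });
// true: 5 × $0.003 = $0.015 > $0.002
```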
You don't have to pick one. Route by difficulty:
```js
const difficulty = await classifyDifficulty(query);
const model = difficulty === "simple" ? "claude-haiku-4-5"
            : difficulty === "medium" ? "claude-sonnet-4-6"
            : "claude-opus-4-6";
```
Classification is a cheap Haiku call. Most queries are simple; you save money. Hard queries get the premium treatment.
Measure: does tiered routing actually improve your cost/quality position? Sometimes classification errors wipe out the gains.
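One way to answer that, sketched with hypothetical per-tier numbers: compute the blended cost and quality of the routed system (including the classifier call itself) and check whether that point beats your single-model options. This assumes quality averages linearly across traffic shares.

```js
// Hypothetical measurements per tier: traffic share, cost per request, quality score.
const tiers = [
  { share: 0.70, cost: 0.001, quality: 0.80 }, // simple → haiku
  { share: 0.25, cost: 0.012, quality: 0.88 }, // medium → sonnet
  { share: 0.05, cost: 0.060, quality: 0.93 }, // hard   → opus
];
const classifierCost = 0.0005; // the cheap routing call is not free

const blendedCost = classifierCost + tiers.reduce((sum, t) => sum + t.share * t.cost, 0);
const blendedQuality = tiers.reduce((sum, t) => sum + t.share * t.quality, 0);
// Plot (blendedQuality, blendedCost) against your single-model points: is it on the frontier?
```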
Cost-quality isn't enough; latency matters too, and for some workloads it dominates the decision.
Report 3-tuples: (quality, cost, p95 latency). Apply whichever axis has a hard constraint first (say, a latency budget); the frontier over what remains is smaller, and you choose on the other two axes.
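Continuing the frontier sketch from above, with a hypothetical latency budget as the hard constraint and a `p95LatencyMs` field added to each measured point:

```js
const latencyBudgetMs = 2000; // hypothetical SLO for this feature

// Drop anything that blows the latency budget, then pick on cost/quality as before.
const feasible = measured.filter(c => c.p95LatencyMs <= latencyBudgetMs);
const frontierUnderBudget = paretoFrontier(feasible);
```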
If you can cache 90% of your input, the effective input cost drops sharply, because cache reads are billed at a fraction of the normal input rate; configs with long prompts (few-shot examples, big system prompts) can jump back onto the frontier. Decisions made without caching factored in are usually wrong. Re-measure with cache.
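A sketch of re-measuring with cache, assuming a flat cache hit rate over the input and illustrative per-million-token rates (swap in your provider's real pricing; cache-write costs are left out for brevity):

```js
// Illustrative rates in USD per million tokens; not real pricing.
const rates = { input: 3.0, cacheRead: 0.3, output: 15.0 };

function effectiveCost({ inputTokens, outputTokens, cacheHitRate }) {
  const cachedIn = inputTokens * cacheHitRate; // billed at the cheap cache-read rate
  const freshIn = inputTokens - cachedIn;      // billed at the full input rate
  return (freshIn / 1e6) * rates.input +
         (cachedIn / 1e6) * rates.cacheRead +
         (outputTokens / 1e6) * rates.output;
}

effectiveCost({ inputTokens: 2500, outputTokens: 400, cacheHitRate: 0.0 }); // no cache
effectiveCost({ inputTokens: 2500, outputTokens: 400, cacheHitRate: 0.9 }); // 90% cached
```

With these placeholder rates, the 90%-cached request's input cost is more than 5× lower, which is enough to move points on the frontier.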
Don't eval each config on 10,000 items. Start small:

- Score every candidate on a small sample first.
- Drop the configs that are clearly dominated.
- Run the full eval set only on the survivors near the frontier (see the sketch below).

Saves 10-100× on eval cost.
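A sketch of that staged loop, assuming a hypothetical `evalOne(config, item)` scorer that returns `{ quality, cost }` for a single item:

```js
async function scoreOn(config, items) {
  let quality = 0;
  let cost = 0;
  for (const item of items) {
    const r = await evalOne(config, item); // hypothetical single-item scorer
    quality += r.quality / items.length;
    cost += r.cost / items.length;
  }
  return { config, quality, cost };
}

async function stagedEval(configs, items, sampleSize = 100) {
  // Cheap pass: every config, small sample.
  const sample = items.slice(0, sampleSize);
  const rough = await Promise.all(configs.map(c => scoreOn(c, sample)));

  // Prune configs that another config beats on both axes in the rough pass.
  const survivors = rough.filter(a =>
    !rough.some(b => b.quality > a.quality && b.cost < a.cost)
  );

  // Expensive pass: full eval set, survivors only.
  return Promise.all(survivors.map(({ config }) => scoreOn(config, items)));
}
```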