From clab
Optimizes Anthropic API costs and latency using prompt caching for repeated prefixes and the batch API for bulk async requests. Provides usage examples, pricing, and a decision matrix for chats and evals.
npx claudepluginhub butanium/claude-lab --plugin clab

This skill uses the workspace's default tool permissions.
Caches the KV representations of prompt prefixes so repeated requests with shared context skip re-processing.
How it works:
- Add a `cache_control` breakpoint where the prefix is identical across requests
- Place the breakpoint at the boundary between stable context and variable content
- Add `"ttl": "1h"` for a 1-hour duration (the value must be a string: `"5m"` or `"1h"`)

Pricing (relative to base input cost): cache writes cost 1.25x for the 5-minute TTL or 2x for the 1-hour TTL; cache reads cost 0.1x.
Two modes, shown below: `cache_control` set on a system content block, or passed as a top-level request parameter:
import anthropic

client = anthropic.Anthropic()

# Mode 1: cache_control on a system content block; everything up to and
# including this block becomes the cached prefix.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "<large stable context>",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "variable question"}],
)
# Mode 2: top-level cache_control with a plain-string system prompt.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system="<large stable context>",
    messages=[{"role": "user", "content": "variable question"}],
)
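A follow-up request that reuses the same system block should then be served from the cache. A minimal sketch (the context string is a placeholder and must match the cached prefix exactly):

```python
# Second call: identical system prefix, different user content.
followup = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "<large stable context>",  # must match the cached prefix exactly
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "a different question"}],
)
print(followup.usage)  # cache_creation_input_tokens / cache_read_input_tokens
```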
Verifying cache behavior — check response.usage:
- `cache_creation_input_tokens` > 0 → cache miss, wrote to cache
- `cache_read_input_tokens` > 0 → cache hit

The Batch API processes requests asynchronously at 50% of standard pricing. Most batches complete in under 1 hour, with a 24-hour maximum.
Limits: 100k requests or 256 MB per batch, whichever comes first.
When to use: any workload that doesn't need real-time responses — evals, bulk classification, data analysis, content generation.
With structured outputs (output_config.format.json_schema): all object types in the schema must have "additionalProperties": false. This is an API-level requirement (not batch-specific).
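To illustrate that requirement, a sketch of a schema that satisfies it (the field names are invented for the example; only the `additionalProperties` placement matters):

```python
# Hypothetical schema: every object type, nested ones included, must disable
# additional properties for the structured-outputs validator to accept it.
answer_schema = {
    "type": "object",
    "properties": {
        "verdict": {"type": "string"},
        "evidence": {
            "type": "object",
            "properties": {"quote": {"type": "string"}},
            "required": ["quote"],
            "additionalProperties": False,  # nested object
        },
    },
    "required": ["verdict", "evidence"],
    "additionalProperties": False,  # top-level object
}
```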
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request
# Submit one request per question; custom_id links each result back to its input.
batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id=f"request-{i}",
            params=MessageCreateParamsNonStreaming(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": question}],
            ),
        )
        for i, question in enumerate(questions)
    ]
)
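Results are retrieved once processing has ended, keyed by `custom_id`. A minimal sketch using the Python SDK's batches methods (polling interval and error handling are simplified):

```python
import time

# Wait for the batch to finish (processing_status becomes "ended").
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(60)

# Stream per-request results; each entry carries the custom_id set at creation.
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content[0].text)
    else:
        print(entry.custom_id, "->", entry.result.type)
```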
Decision matrix:

| Scenario | Real-time needed? | Shared prefix? | Strategy | Discount vs base |
|---|---|---|---|---|
| Chat / interactive | Yes | Yes (system prompt) | Prompt caching | ~90% on cached input |
| Chat / interactive | Yes | No | Standard API | None |
| Bulk eval / analysis | No | Yes | Batch + caching | 50% base + ~90% on cached input |
| Bulk eval / analysis | No | No | Batch only | 50% |
The discounts stack. Include identical cache_control blocks in every request within the batch.
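To see how the stacking works out, a toy calculation (the base price is illustrative; the ~0.1x cache-read rate follows from the ~90% discount above):

```python
base = 3.00                        # illustrative base input price, $/MTok (assumption)
batch_input = 0.50 * base          # batch API: 50% of base -> 1.50
batch_cached = 0.50 * 0.10 * base  # cached reads inside a batch -> 0.15
print(batch_input, batch_cached)
```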
Caveat: batch requests are processed concurrently and asynchronously, so cache hits are best-effort (typically 30-98% hit rate). To maximize hits:
{"type": "ephemeral", "ttl": "1h"}shared_system = [
{
"type": "text",
"text": "<large shared context>",
"cache_control": {"type": "ephemeral", "ttl": "1h"},
}
]
batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id=f"request-{i}",
            params=MessageCreateParamsNonStreaming(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system=shared_system,
                messages=[{"role": "user", "content": q}],
            ),
        )
        for i, q in enumerate(questions)
    ]
)
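To check what hit rate a batch actually achieved, the per-request usage fields can be aggregated once the batch has ended (a sketch reusing the result-fetching pattern above):

```python
# Sum cache statistics across all succeeded requests in the batch.
read_tokens = created_tokens = 0
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        usage = entry.result.message.usage
        read_tokens += usage.cache_read_input_tokens or 0
        created_tokens += usage.cache_creation_input_tokens or 0

total = read_tokens + created_tokens
print(f"cache hit rate: {read_tokens / total:.0%}" if total else "no cacheable tokens")
```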