From langchain-py-pack
Control LangChain 1.0 AI spend with accurate streaming token accounting, model tiering, provider-specific cache hit tuning, per-tenant budgets, and retry dedup. Use when AI spend grows faster than traffic, a cost regression lands, or you need per-tenant budget enforcement. Trigger with "langchain cost", "langchain token accounting", "langchain per-tenant budget", "langchain model tiering", "prompt cache savings".
npx claudepluginhub flight505/skill-forge --plugin langchain-py-pack
An engineer shipped a new research agent Tuesday. By Friday the Anthropic
bill had grown 6x while traffic grew 1.4x. The cost dashboard — wired to
on_llm_end — showed spend up maybe 2x. Reconciling against the provider
console on Monday surfaced two compounding bugs: (1) the agent's ChatOpenAI
fallback kept the default max_retries=6, so each logical call billed as up
to 7 requests (P30); (2) retry middleware was registered below token
accounting, so every retry fired on_llm_end twice — the aggregator summed
both emissions while LangSmith deduped them by generation ID, undercounting
the dashboard by ~50% against actual billed rate (P25).
The fix took an afternoon: cap retries at 2, tag retries with a stable
request_id, and migrate token accounting to AIMessage.usage_metadata read
from astream_events(version="v2"). Finding the bug took a week. This skill
is that week compressed into a runbook.
Cost tuning for a LangChain 1.0 production app has five levers, each with a sharp failure mode:
- Token accounting: on_llm_end lags streams by 5-30s (P01); retries double-count (P25); Anthropic cache savings aggregate per-call, never per-session (P04).
- Retries and rate limits: max_retries=6 default on ChatOpenAI (P30); the Anthropic 50 RPM tier throttles cached and uncached calls against the same budget (P31).
- Agent loops: create_react_agent defaults to recursion_limit=25; vague prompts burn a session's budget before GraphRecursionError surfaces (P10).
- Caching: InMemoryCache ignores bound tools in the cache key and returns wrong answers (P61); RedisSemanticCache ships with a 0.95 threshold that hits <5% of the time (P62).
- Model tiering: claude-opus-4-5 on intent classification is 30-60x more expensive than claude-haiku-4-5 for a task the cheaper model solves at equal quality.

Pin: langchain-core 1.0.x, langchain-anthropic 1.0.x, langchain-openai 1.0.x.
Pain-catalog anchors: P01, P04, P10, P23, P25, P30, P31, P61, P62.
Prerequisites:
- langchain-core >= 1.0, < 2.0
- pip install langchain-anthropic langchain-openai
- redis-py >= 5.0 for budget middleware (optional; an in-process dict works for dev)
- A way to reconcile usage_metadata against billed spend — you will need this to verify any instrumentation fix

**usage_metadata, never response_metadata["token_usage"]**

LangChain 1.0 standardizes all provider usage into AIMessage.usage_metadata.
response_metadata["token_usage"] still exists as a compatibility shim but its
shape is provider-specific (Anthropic nests under usage, OpenAI flat, Gemini
uses different keys). Code that reads it directly will break when you switch
providers or when a provider SDK upgrades.
from langchain_core.messages import AIMessage
def read_usage(msg: AIMessage) -> dict:
"""Canonical shape: input_tokens, output_tokens, input_token_details,
output_token_details. Safe across Anthropic, OpenAI, Gemini."""
meta = msg.usage_metadata or {}
details_in = meta.get("input_token_details", {}) or {}
details_out = meta.get("output_token_details", {}) or {}
return {
"input": meta.get("input_tokens", 0),
"output": meta.get("output_tokens", 0),
"cache_read": details_in.get("cache_read", 0), # Anthropic
"cache_creation": details_in.get("cache_creation", 0),
"reasoning": details_out.get("reasoning", 0), # OpenAI o1/o3
}
Include reasoning in your output-billable total for o1/o3. A call with
output_tokens=500 and reasoning=2000 actually bills 2500 output tokens.
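As a quick sanity check, the output-billable total can be computed from the read_usage helper above; `msg` is the AIMessage from the call, and the price constant is illustrative only.

```python
# Illustrative only; verify current output pricing before using in a meter.
OUTPUT_PRICE_PER_1M = 10.00

usage = read_usage(msg)                                  # helper defined above
billable_output = usage["output"] + usage["reasoning"]   # e.g. 500 + 2000 = 2500
output_cost = billable_output * OUTPUT_PRICE_PER_1M / 1_000_000
```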
**astream_events(version="v2")**

on_llm_end fires once after the stream closes, so dashboards lag by stream
duration (P01). Anthropic populates usage_metadata on the message_start and
message_delta events; OpenAI populates only the final chunk. Both show up as
on_chat_model_stream events in astream_events.
async def metered_invoke(chain, inputs, meter):
async for event in chain.astream_events(inputs, version="v2"):
if event["event"] == "on_chat_model_stream":
chunk = event["data"]["chunk"]
if getattr(chunk, "usage_metadata", None):
meter.record(
run_id=event["run_id"],
usage=chunk.usage_metadata,
)
See Token Accounting Pitfalls for the full streaming-delta behavior across providers and reconciliation against provider dashboards.
**Dedupe on run_id, not prompt hash**

Retry middleware runs the model twice on transient errors. Both emit usage
events. If the aggregator keys on prompt hash, it looks like one call cost
twice as much. Keying on run_id is closer (LangChain assigns one per
generation attempt), and you can additionally attach a stable request_id at
the chain level and dedupe on that (P25).
from uuid import uuid4
class RetryAwareMeter:
    def __init__(self):
        self._prior_by_key: dict[str, dict] = {}
        self.totals = {"input": 0, "output": 0, "cache_read": 0}

    def record(self, run_id: str, usage: dict, request_id: str | None = None):
        # Keep only the last emission per logical request.
        # On retry: same request_id, different run_id -> overwrite.
        key = request_id or run_id
        details = usage.get("input_token_details", {}) or {}
        normalized = {
            "input": usage.get("input_tokens", 0),
            "output": usage.get("output_tokens", 0),
            "cache_read": details.get("cache_read", 0),
        }
        prior = self._prior_by_key.get(key)
        if prior is not None:
            # Retry emission — subtract the prior attempt, then add the new one (last wins).
            for k in self.totals:
                self.totals[k] -= prior.get(k, 0)
        self._prior_by_key[key] = normalized
        for k in self.totals:
            self.totals[k] += normalized[k]
Inject request_id via config={"metadata": {"request_id": str(uuid4())}} on
each invoke. The meter reads event["metadata"]["request_id"] alongside
run_id.
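Putting the two together, a minimal sketch of the wiring (a variant of the metered_invoke loop above; `chain`, `inputs`, and `meter` are assumed to exist):

```python
from uuid import uuid4

async def metered_invoke_with_request_id(chain, inputs, meter):
    config = {"metadata": {"request_id": str(uuid4())}}
    async for event in chain.astream_events(inputs, version="v2", config=config):
        if event["event"] == "on_chat_model_stream":
            chunk = event["data"]["chunk"]
            if getattr(chunk, "usage_metadata", None):
                meter.record(
                    run_id=event["run_id"],
                    usage=chunk.usage_metadata,
                    request_id=event.get("metadata", {}).get("request_id"),
                )
```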
Alternative: place token accounting above retry middleware in the chain — retries happen inside, so only the successful attempt emits. This is simpler but makes retries invisible to observability, which you usually want to see.
Most chains have a structural split: a cheap "understand the request" call and an expensive "produce the final artifact" call. Running the expensive model on both roughly triples cost for no quality gain.
Per-1M pricing snapshot, 2026-04 (verify current prices before shipping at https://www.anthropic.com/pricing and https://openai.com/api/pricing/):
| Model | Input $/1M | Output $/1M | Cache read $/1M | Role |
|---|---|---|---|---|
| claude-haiku-4-5 | $1.00 | $5.00 | $0.10 | Draft, classify, route |
| claude-sonnet-4-6 | $3.00 | $15.00 | $0.30 | Finalize, reason, extract |
| claude-opus-4-5 | $15.00 | $75.00 | $1.50 | High-stakes, long-horizon |
| gpt-4o-mini | $0.15 | $0.60 | n/a (prefix cache only) | Draft, classify |
| gpt-4o | $2.50 | $10.00 | n/a | Finalize |
| gpt-o3-mini | $1.10 | $4.40 | n/a | Reasoning, planning |
Anthropic cache reads cost 10% of input. Cache creation costs 125% of input. Break-even is ~4 uses of a cached prefix. See Cache Economics.
Decision tree:
input
└── intent classification / routing
└── gpt-4o-mini OR claude-haiku-4-5 (~$0.15-$1 per 1M in)
└── generation / reasoning
├── single-pass, low-stakes
│ └── gpt-4o-mini (draft)
├── single-pass, high-stakes (extraction, contracts)
│ └── claude-sonnet-4-6 (finalize)
├── multi-step reasoning
│ └── gpt-o3-mini OR claude-sonnet-4-6 (plan)
└── mission-critical long-horizon
└── claude-opus-4-5 (expensive, used sparingly)
Tiering is wrong when quality degrades silently — high-stakes extraction on Haiku misses entities the Sonnet would catch. Always evaluate both tiers on a gold set before committing. See Model Tiering for the evaluation harness and a worked draft-then-finalize chain.
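A minimal draft-then-finalize sketch under those assumptions (model names follow the pricing table above; the prompts are placeholders, not the reference chain from Model Tiering):

```python
from langchain_anthropic import ChatAnthropic

draft_model = ChatAnthropic(model="claude-haiku-4-5", max_retries=2)
final_model = ChatAnthropic(model="claude-sonnet-4-6", max_retries=2)

async def extract(document: str) -> str:
    # Cheap pass: rough outline of the fields present in the document.
    outline = await draft_model.ainvoke(
        f"Outline the key fields in this document:\n\n{document}"
    )
    # Expensive pass: validate the outline and fill missing fields.
    final = await final_model.ainvoke(
        "Validate and complete this extraction.\n\n"
        f"Outline:\n{outline.content}\n\nDocument:\n{document}"
    )
    return final.content
```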
usage_metadata["input_token_details"]["cache_read"] reports per-call. To see
whether caching is paying for itself you need to aggregate per-session or
per-tenant and compare against cache-creation cost.
class CacheLedger:
    def __init__(self, tenant_id: str):
        self.tenant_id = tenant_id
        self.read = 0            # billed at 0.1x input rate
        self.creation = 0        # billed at 1.25x input rate
        self.uncached_input = 0  # billed at 1.0x input rate

    def ingest(self, usage: dict):
        details = usage.get("input_token_details", {}) or {}
        read = details.get("cache_read", 0)
        creation = details.get("cache_creation", 0)
        self.read += read
        self.creation += creation
        # Subtract this call's cached portion, not the running totals.
        self.uncached_input += usage.get("input_tokens", 0) - read - creation

    def savings_vs_no_cache(self, price_per_1m_input: float) -> float:
        # What we paid with cache vs. paying full price on all input.
        actual = (self.uncached_input * 1.00
                  + self.creation * 1.25
                  + self.read * 0.10) * price_per_1m_input / 1_000_000
        naive = (self.uncached_input + self.creation + self.read) * price_per_1m_input / 1_000_000
        return naive - actual
Persist CacheLedger to Redis or Postgres keyed by (tenant_id, day). If
savings is negative over a 24h window, caching is costing more than it saves —
your cached prefix is either too short or hit too rarely. See
Cache Economics.
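A minimal persistence sketch with redis-py, assuming the CacheLedger above; the key and field names are illustrative, not the schema from Cache Economics:

```python
from datetime import date
import redis.asyncio as redis

r = redis.Redis()

async def flush_ledger(ledger: CacheLedger) -> None:
    # One hash per (tenant, day). Assumes the ledger is flushed once at the
    # end of the window, since its fields are running totals.
    key = f"cache_ledger:{ledger.tenant_id}:{date.today().isoformat()}"
    await r.hincrby(key, "read", ledger.read)
    await r.hincrby(key, "creation", ledger.creation)
    await r.hincrby(key, "uncached_input", ledger.uncached_input)
    await r.expire(key, 60 * 60 * 24 * 35)  # keep roughly a month for trend review
```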
set_llm_cache(InMemoryCache()) hashes the prompt string only. A chain that
binds different tool sets will return wrong answers from the cache. This is
the most dangerous cache failure mode — it silently returns semantically
incorrect responses rather than missing.
Do not use InMemoryCache on any chain that calls bind_tools(). Use
SQLiteCache or RedisSemanticCache with a composite key:
import hashlib, json
def tool_aware_key(prompt: str, tools: list) -> str:
tools_fingerprint = hashlib.sha256(
json.dumps([t.args_schema.model_json_schema() for t in tools],
sort_keys=True).encode()
).hexdigest()[:16]
return f"{tools_fingerprint}:{hashlib.sha256(prompt.encode()).hexdigest()}"
Cross-reference: langchain-middleware-patterns covers cache-key layering in
middleware order (redact → cache → model) to avoid cross-tenant PII leaks.
RedisSemanticCache defaults to score_threshold=0.95. On real workloads this
hits under 5% of the time. Production-tuned values land between 0.85 and 0.90.
Ship with a calibrated threshold, not the default: collect a labeled set of query pairs (true duplicates and distinct queries), and for each threshold in [0.80, 0.82, 0.84, …, 0.95] compute the hit rate on the duplicates and the false-hit rate on the distinct pairs.
If the curve flattens above 0.92 on positives, your embeddings are too weak
for semantic caching — consider exact-match SQLiteCache instead. See
Cache Economics for the calibration worksheet.
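A calibration sketch under those assumptions; `pairs` is a labeled list of (query_a, query_b, is_duplicate) triples, and `embed` is whatever embedding function backs your cache (cosine similarity stands in for the cache's internal score):

```python
import numpy as np

def calibrate_threshold(pairs, embed):
    def cosine(a: str, b: str) -> float:
        va, vb = np.asarray(embed(a)), np.asarray(embed(b))
        return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

    scores = [(cosine(a, b), dup) for a, b, dup in pairs]
    for threshold in (0.80, 0.82, 0.84, 0.86, 0.88, 0.90, 0.92, 0.94, 0.95):
        hits = [s >= threshold for s, dup in scores if dup]
        false_hits = [s >= threshold for s, dup in scores if not dup]
        print(
            threshold,
            sum(hits) / max(len(hits), 1),              # hit rate on duplicates
            sum(false_hits) / max(len(false_hits), 1),  # false-hit rate
        )
```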
A single runaway tenant will consume the pack if left uncapped. LangChain 1.0 middleware slots cleanly in front of the model call; back it with a Redis counter keyed per tenant per day.
# Sketch — see references/per-tenant-budgets.md for the full middleware class.
from datetime import date

async def budget_check(tenant_id: str, estimated_tokens: int) -> str:
    # Assumes module-level: an async `redis` client, TENANT_SOFT_CAPS /
    # TENANT_HARD_CAPS dicts, a BudgetExceeded exception, and emit_alert().
    day_key = f"budget:{tenant_id}:{date.today().isoformat()}"
    used = int(await redis.get(day_key) or 0)
    soft = TENANT_SOFT_CAPS[tenant_id]  # alert only
    hard = TENANT_HARD_CAPS[tenant_id]  # refuse
    projected = used + estimated_tokens
    if projected > hard:
        raise BudgetExceeded(tenant_id, used, hard)
    if projected > soft:
        emit_alert(tenant_id, used, soft)
    return day_key  # caller increments on completion
Grace period: on hard-cap hit, allow in-flight calls (they already billed) but reject new ones. Reset counter on UTC day boundary. See Per-Tenant Budgets for full middleware class, Redis schema, alert wiring, and grace-period semantics.
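The completion-side half of that contract is a small increment against the same key; a sketch, assuming the same async redis client and that actual token counts come from the meter:

```python
async def budget_settle(day_key: str, actual_tokens: int) -> None:
    # Called after the response (or the final stream chunk) lands.
    await redis.incrby(day_key, actual_tokens)
    await redis.expire(day_key, 60 * 60 * 48)  # two UTC days, then let it expire
```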
create_react_agent defaults to recursion_limit=25. Agents given vague prompts
loop until the limit, then raise GraphRecursionError — but every loop is billed.
Cap at 5-10 for interactive, 10-15 for batch. If you use trim_messages on the
loop to control context, pass include_system=True and start_on="human" so
the trimmer does not drop the system prompt under pressure (P23).
Pair this with Step 8's budget middleware — the budget is the hard stop even
if the recursion cap is generous. Cross-reference: langchain-langgraph-agents
for routing patterns that terminate early on repeated tool calls.
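An illustrative configuration, assuming a LangGraph agent named `agent`, a `messages` history, and a chat model instance `model` to act as the token counter; the limits follow the guidance above:

```python
from langchain_core.messages import trim_messages

result = await agent.ainvoke(
    {"messages": messages},
    config={"recursion_limit": 8},   # interactive workload: 5-10
)

trimmed = trim_messages(
    messages,
    max_tokens=4_000,
    strategy="last",
    token_counter=model,
    include_system=True,     # never drop the system prompt under pressure (P23)
    start_on="human",        # keep turn structure valid after trimming
)
```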
Ship checklist:
- Token accounting reads usage_metadata, including reasoning and cache fields
- Streaming meter wired to astream_events(version="v2") on on_chat_model_stream
- Retries deduped on a stable request_id
- A CacheLedger that reports Anthropic cache savings per session/tenant
- recursion_limit set to match the workload, not the default 25

| Error / symptom | Cause | Fix |
|---|---|---|
| Dashboard lags provider console by minutes on streaming calls | on_llm_end fires at stream close (P01) | Migrate to astream_events(version="v2") + on_chat_model_stream (Step 2) |
| Dashboard shows ~50% of billed spend | Retry middleware double-emits on_llm_end, aggregator sums both (P25) | Dedupe on request_id (Step 3), or move accounting above retry middleware |
| Cache hits return wrong answer for tool-binding chain | InMemoryCache hashes prompt only, ignores tools (P61) | Switch to SQLiteCache / RedisSemanticCache with tool-aware key (Step 6) |
| RedisSemanticCache hit rate < 5% on similar queries | Default score_threshold=0.95 too strict (P62) | Calibrate to 0.85-0.90 against a gold pair set (Step 7) |
| Cost spike then GraphRecursionError: Recursion limit of 25 reached | Agent loops on vague prompt (P10) | Set recursion_limit=5-10; add budget middleware (Step 8) |
| Agent loses persona mid-conversation after many turns | trim_messages dropped system prompt (P23) | Pass include_system=True, start_on="human" |
| max_retries=6 billing 7 requests per logical call | ChatOpenAI default (P30) | Set max_retries=2; log every retry via callback to verify |
| 429 on cache reads while input-token budget shows headroom | Anthropic 50 RPM throttles cached and uncached together (P31) | Budget RPM at client level (semaphore); separate monitors for read vs uncached |
| "Cache savings" metric always zero | input_token_details.cache_read reset per call (P04) | Aggregate via CacheLedger keyed per session/tenant (Step 5) |
A team saw Anthropic spend 6x over a week with traffic up 1.4x. The dashboard
showed only 2x. Root cause: max_retries=6 on a flaky downstream API (P30)
plus retry middleware double-emit (P25) undercounting on the dashboard.
Fix sequence: (1) set max_retries=2, (2) attach request_id metadata at
chain entry, (3) migrate meter to astream_events with run_id dedup. After
the fix, dashboard matched billed spend within 1%.
See Token Accounting Pitfalls for the full reconciliation procedure against provider console CSVs.
A document-extraction chain runs claude-haiku-4-5 to extract a rough outline,
then claude-sonnet-4-6 to validate and fill missing fields. On a 10K-doc
batch, the draft step burned 80% of the input tokens on the cheaper model.
See Model Tiering for the full chain, the gold set, and the evaluation harness.
One tenant's prompt template had an accidental double-interpolation of the message history that grew context unboundedly each turn. Spend 400x'd overnight. The per-tenant budget middleware (Step 8) hit hard cap at 10x normal, alerted on soft cap at 5x, refused new requests, allowed in-flight calls to complete.
See Per-Tenant Budgets for the full middleware, alert wiring, and grace-period semantics.
Related skills:
- langchain-performance-tuning (latency, throughput, cache hit rate) — this skill focuses on spend, not speed; reference each other on cache tuning
- langchain-middleware-patterns (cache-key order, retry telemetry)
- langchain-langgraph-agents (recursion caps, early termination)
- langchain-rate-limits (RPM/ITPM budgeting, companion to Step 4)

References:
- usage_metadata
- astream_events v2
- docs/pain-catalog.md (P01, P04, P10, P23, P25, P30, P31, P61, P62)