Agent reliability anti-patterns — retrying non-retryable errors, fixed sleep vs exponential backoff with jitter, single timeout for all call stack levels, aggressive circuit breaker thresholds, using Opus for every call regardless of complexity.
From clarc
npx claudepluginhub marvinrichter/clarc --plugin clarc
This skill uses the workspace's default tool permissions.
This skill extends agent-reliability with common mistakes and how to fix them. Load agent-reliability first.
Wrong:
async function callAgent(fn: () => Promise<string>): Promise<string> {
for (let i = 0; i < 3; i++) {
try { return await fn() }
catch { await sleep(1000) } // retries authentication errors — wastes 3 seconds
}
throw new Error('failed')
}
Correct:
function isRetryable(err: Error): boolean {
if (err.message.includes('rate_limit_error')) return true
if (err.message.includes('overloaded_error')) return true
// authentication_error, invalid_request_error — do NOT retry
return false
}
async function callAgent(fn: () => Promise<string>): Promise<string> {
return withRetry(fn, { retryableErrors: isRetryable })
}
Why: Retrying authentication or validation errors wastes time, inflates cost, and can trigger account lockouts — only transient infrastructure errors warrant retry.
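The Correct snippet above calls withRetry without showing it. The following is a minimal sketch of what such a helper might look like, not the actual agent-reliability implementation. The option names maxAttempts and computeDelayMs, and their defaults, are assumptions made for illustration; only retryableErrors comes from the snippet above.

```typescript
// Hypothetical retry helper; the real withRetry in agent-reliability may differ.
interface RetryOptions {
  retryableErrors: (err: Error) => boolean    // predicate from the snippet above
  maxAttempts?: number                        // assumed default: 3
  computeDelayMs?: (attempt: number) => number // assumed hook for the backoff strategy
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms))

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const maxAttempts = opts.maxAttempts ?? 3
  // Assumed default: exponential backoff (1s base, 30s cap) with full jitter.
  const computeDelayMs =
    opts.computeDelayMs ??
    ((attempt: number) => Math.random() * Math.min(30_000, 1000 * Math.pow(2, attempt - 1)))

  for (let attempt = 1; ; attempt++) {
    try {
      return await fn()
    } catch (caught) {
      const err = caught instanceof Error ? caught : new Error(String(caught))
      // Fail fast on non-retryable errors, and stop once attempts are exhausted.
      if (!opts.retryableErrors(err) || attempt >= maxAttempts) throw err
      await sleep(computeDelayMs(attempt))
    }
  }
}
```

With this shape, an authentication error surfaces after exactly one call, while a rate-limit error is retried up to the attempt limit.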
Wrong:
for (let i = 0; i < 3; i++) {
try { return await agentCall() }
catch { await sleep(5000) } // fixed delay — thundering herd when many clients retry at once
}
Correct:
const exponential = initialDelayMs * Math.pow(backoffFactor, attempt - 1)
const capped = Math.min(exponential, maxDelayMs)
const delay = Math.random() * capped // full jitter — spreads retry load
await sleep(delay)
Why: Fixed retry intervals cause synchronized retry storms when many clients fail simultaneously; jitter spreads the load and reduces API overload cascades.
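The Correct snippet above can be wrapped into a small pure function so the delay is easy to test. This is a sketch under the same variable names as the snippet; the BackoffOptions type, the fullJitterDelay name, and the injectable random parameter are additions for illustration.

```typescript
interface BackoffOptions {
  initialDelayMs: number // delay ceiling for the first retry
  backoffFactor: number  // multiplier applied per attempt
  maxDelayMs: number     // upper bound on the ceiling
}

// attempt is 1-based. `random` is injectable so tests can be deterministic.
function fullJitterDelay(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  const exponential = opts.initialDelayMs * Math.pow(opts.backoffFactor, attempt - 1)
  const capped = Math.min(exponential, opts.maxDelayMs)
  // Full jitter: pick uniformly between 0 and the capped ceiling.
  return random() * capped
}
```

With initialDelayMs = 1000, backoffFactor = 2, and maxDelayMs = 30000, the ceilings for attempts 1 through 6 are 1000, 2000, 4000, 8000, 16000, and 30000 ms; each actual delay is drawn uniformly below its ceiling.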
Wrong:
const TIMEOUT_MS = 30000
const result = await Promise.race([
runWorkflow(goal), // whole workflow — 30s is far too short
sleep(TIMEOUT_MS).then(() => { throw new Error('timeout') }),
])
Correct:
// Nested timeouts — each level has its own proportional budget
const toolResult = await callToolWithTimeout(tool, 15_000) // tool: 15s
const agentResult = await runAgentWithTimeout(agent, 60_000) // agent: 60s
const workflowResult = await runAgentWithTimeout( // workflow: 10min
() => runWorkflow(goal), 10 * 60 * 1000
)
Why: A single shared timeout either aborts long workflows prematurely or lets runaway tool calls consume the entire budget — layered timeouts bound each level independently.
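The helpers callToolWithTimeout and runAgentWithTimeout are not defined in this section. One way they could be built is on top of a single generic wrapper, sketched below. The names withTimeout and TimeoutError are hypothetical, and the sketch only stops waiting: it does not cancel the underlying work, which would need something like an AbortSignal passed down to the call.

```typescript
// Hypothetical error type so callers can tell timeouts apart from other failures.
class TimeoutError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} timed out after ${ms}ms`)
    this.name = 'TimeoutError'
  }
}

// Races fn() against a timer and always clears the timer afterwards.
async function withTimeout<T>(fn: () => Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(label, ms)), ms)
  })
  try {
    return await Promise.race([fn(), timeout])
  } finally {
    // Clear the timer whichever side wins so it does not keep the process alive.
    clearTimeout(timer)
  }
}

// Usage sketch (runWorkflow, runAgent, callTool are placeholders):
// each level wraps the one below it with a larger budget.
//   withTimeout(() => runWorkflow(goal), 10 * 60 * 1000, 'workflow')
//     inside runWorkflow: withTimeout(() => runAgent(step), 60_000, 'agent')
//       inside runAgent:  withTimeout(() => callTool(input), 15_000, 'tool')
```

The label argument makes it visible in logs which level ran out of budget, which is the main diagnostic benefit of layering.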
Wrong:
const breaker = new CircuitBreaker(1, 60_000) // opens after a single failure
// One transient error now blocks all subsequent calls for 60 seconds
Correct:
const breaker = new CircuitBreaker(5, 60_000) // opens after 5 consecutive failures
// Transient errors are retried; the circuit opens only on sustained failure
Why: A threshold of 1 treats every transient error as a sustained outage, causing unnecessary downtime; calibrate the threshold to distinguish spikes from real failures.
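To make the threshold semantics concrete, here is a minimal sketch of a breaker that counts consecutive failures. It matches the CircuitBreaker(threshold, resetMs) constructor shape used above, but the call method, the isOpen getter, and the injectable clock are assumptions, and the real class in agent-reliability may behave differently.

```typescript
// Hypothetical consecutive-failure circuit breaker.
class CircuitBreaker {
  private consecutiveFailures = 0
  private openedAt: number | null = null
  private readonly failureThreshold: number
  private readonly resetTimeoutMs: number
  private readonly now: () => number

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold
    this.resetTimeoutMs = resetTimeoutMs
    this.now = now // injectable clock for tests
  }

  get isOpen(): boolean {
    return this.openedAt !== null && this.now() - this.openedAt < this.resetTimeoutMs
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    // While open, reject immediately without invoking fn.
    if (this.isOpen) throw new Error('circuit open')
    // Otherwise closed, or half-open after the reset window: let the call through.
    // (This sketch does not limit concurrent trial calls in the half-open state.)
    try {
      const result = await fn()
      this.consecutiveFailures = 0 // any success resets the count
      this.openedAt = null
      return result
    } catch (err) {
      this.consecutiveFailures++
      if (this.openedAt !== null || this.consecutiveFailures >= this.failureThreshold) {
        this.openedAt = this.now() // open, or re-open after a failed trial call
      }
      throw err
    }
  }
}
```

Because a success resets the counter, isolated transient errors never accumulate toward the threshold; only an unbroken run of failures opens the circuit.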
Wrong:
async function classifyTaskComplexity(task: string): Promise<TaskComplexity> {
const response = await client.messages.create({
model: 'claude-opus-latest', // top-tier pricing for a one-word answer
system: 'Reply with "simple", "medium", or "complex".',
messages: [{ role: 'user', content: task }],
max_tokens: 10,
})
return response.content[0].text.trim() as TaskComplexity
}
Correct:
async function classifyTaskComplexity(task: string): Promise<TaskComplexity> {
const response = await client.messages.create({
model: 'claude-haiku-latest', // lightweight model for lightweight classification
system: 'Reply with exactly "simple", "medium", or "complex".',
messages: [{ role: 'user', content: task }],
max_tokens: 10,
})
return response.content[0].text.trim() as TaskComplexity
}
Why: Model selection should match task complexity — using Opus for trivial routing wastes budget that should be reserved for tasks requiring deep reasoning.
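Once the cheap classifier has answered, its label can drive model choice for the real task. The sketch below validates the raw text rather than casting it, then maps it to a model tier. The TaskComplexity union is assumed from the system prompt above, the model ids are placeholders in the style of the snippets (the medium tier id in particular is invented for illustration), and the fallback to 'medium' is an arbitrary choice. Check your provider's model list for real identifiers.

```typescript
// Assumed shape of TaskComplexity, based on the three labels in the system prompt.
type TaskComplexity = 'simple' | 'medium' | 'complex'

const COMPLEXITIES: readonly TaskComplexity[] = ['simple', 'medium', 'complex']

// Returns the label if the model answered with exactly one known word, else null.
function parseComplexity(raw: string): TaskComplexity | null {
  const normalized = raw.trim().toLowerCase().replace(/[."']/g, '')
  return COMPLEXITIES.find((c) => c === normalized) ?? null
}

// Placeholder model ids; substitute the identifiers your account actually supports.
const MODEL_BY_COMPLEXITY: Record<TaskComplexity, string> = {
  simple: 'claude-haiku-latest',
  medium: 'claude-sonnet-latest',
  complex: 'claude-opus-latest',
}

// Picks a model for the task; unparseable classifier output uses the fallback tier.
function pickModel(rawLabel: string, fallback: TaskComplexity = 'medium'): string {
  return MODEL_BY_COMPLEXITY[parseComplexity(rawLabel) ?? fallback]
}
```

Validating the label guards against the classifier replying with a sentence instead of a single word, which a bare type cast would silently accept.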
agent-reliability: retry with exponential backoff, timeout hierarchies, fallback chains, circuit breaker, cost control, observability
multi-agent-patterns: orchestration, routing, parallelization, handoffs