From harness-claude
> Validate resilience by injecting controlled failures to verify that fallbacks, retries, and circuit breakers work under real conditions
npx claudepluginhub intense-visions/harness-engineering --plugin harness-claude
This skill uses the workspace's default tool permissions.
// chaos/fault-injector.ts
interface FaultConfig {
  enabled: boolean;
  latencyMs?: number; // Add artificial latency
  errorRate?: number; // 0.0 to 1.0 probability of error
  errorCode?: number; // HTTP status to return
  timeoutRate?: number; // 0.0 to 1.0 probability of timeout
  targetServices?: string[]; // Only affect specific services
}
export class FaultInjector {
  private config: FaultConfig = { enabled: false };

  // Merges into the current config, so earlier settings persist until overwritten
  configure(config: Partial<FaultConfig>) {
    this.config = { ...this.config, ...config };
  }

  // Pass the caller's AbortSignal so an injected hang can actually be cancelled
  async maybeInjectFault(serviceName: string, signal?: AbortSignal): Promise<void> {
    if (!this.config.enabled) return;
    if (this.config.targetServices && !this.config.targetServices.includes(serviceName)) return;

    // Inject latency
    if (this.config.latencyMs) {
      await new Promise((r) => setTimeout(r, this.config.latencyMs));
    }

    // Inject timeout: hang until the signal aborts. Without a signal this never settles,
    // so the caller's own timeout has to catch it.
    if (this.config.timeoutRate && Math.random() < this.config.timeoutRate) {
      await new Promise<never>((_, reject) => {
        if (signal?.aborted) return reject(signal.reason);
        signal?.addEventListener('abort', () => reject(signal.reason), { once: true });
      });
    }

    // Inject error
    if (this.config.errorRate && Math.random() < this.config.errorRate) {
      throw new ChaosError(`Injected fault for ${serviceName}`, this.config.errorCode ?? 500);
    }
  }
}
export class ChaosError extends Error {
  constructor(
    message: string,
    public readonly statusCode: number
  ) {
    super(message);
    this.name = 'ChaosError';
  }
}
// Integration with services
const faultInjector = new FaultInjector();

// Enable in test/staging via environment variable
if (process.env.CHAOS_ENABLED === 'true') {
  faultInjector.configure({
    enabled: true,
    targetServices: ['payment-api'],
    errorRate: 0.3, // 30% of payment API calls fail
    latencyMs: 2000, // Add 2s latency to every targeted call
  });
}

// Wrap service calls (PaymentResult is assumed to be defined elsewhere in the application)
export async function callPaymentAPI(orderId: string): Promise<PaymentResult> {
  await faultInjector.maybeInjectFault('payment-api');
  const res = await fetch(`https://payment.example.com/charge/${orderId}`);
  // fetch only rejects on network failure, so surface HTTP errors explicitly
  if (!res.ok) throw new Error(`Payment API responded with ${res.status}`);
  return res.json();
}
// Chaos test scenario
describe('checkout resilience', () => {
  // Always disable injection, even when an assertion fails mid-test
  afterEach(() => {
    faultInjector.configure({ enabled: false });
  });

  it('completes checkout when payment service has 50% error rate', async () => {
    faultInjector.configure({
      enabled: true,
      targetServices: ['payment-api'],
      errorRate: 0.5,
    });
    // Circuit breaker + retry should handle transient failures
    const result = await checkout(testOrder);
    expect(result.status).toBe('completed');
  });

  it('uses cached prices when pricing service is down', async () => {
    faultInjector.configure({
      enabled: true,
      targetServices: ['pricing-api'],
      errorRate: 1.0, // 100% failure
    });
    const result = await getProductPrice('sku-123');
    expect(result.source).toBe('cache');
    expect(result.price).toBeGreaterThan(0);
  });
});
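The first test above assumes that `checkout` wraps the payment call in a retry and a circuit breaker, but neither is shown. The following is a minimal sketch of what such wrappers might look like. The names `withRetry`, `CircuitBreaker`, and `CircuitOpenError` and all thresholds are hypothetical and are not taken from harness-claude.

```typescript
// Hypothetical helpers: names and defaults are illustrative assumptions.
type AsyncFn<T> = () => Promise<T>;

export class CircuitOpenError extends Error {
  constructor() {
    super('Circuit is open');
    this.name = 'CircuitOpenError';
  }
}

export class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 5, resetAfterMs = 10000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now; // injectable clock so tests do not have to wait
  }

  async exec<T>(fn: AsyncFn<T>): Promise<T> {
    if (this.openedAt !== null) {
      // Open: fail fast until the cool-down has elapsed
      if (this.now() - this.openedAt < this.resetAfterMs) throw new CircuitOpenError();
      // Half-open: allow a trial call; a single failure re-opens the circuit
      this.openedAt = null;
      this.failures = this.failureThreshold - 1;
    }
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      throw err;
    }
  }
}

// Retries with exponential backoff; stops early when the circuit is open
export async function withRetry<T>(fn: AsyncFn<T>, attempts = 3, baseDelayMs = 100): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if ((err as Error).name === 'CircuitOpenError') break;
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Under these assumptions, a payment call could be wrapped as `withRetry(() => breaker.exec(() => callPaymentAPI(orderId)))`. With a 50% injected error rate and three attempts, roughly one call in eight would still fail, so a test that asserts on a single call can remain flaky unless the attempt count is raised or the randomness is controlled.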
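Chaos engineering principles (Netflix): build a hypothesis around steady-state behavior, vary real-world events, run experiments in production, automate experiments to run continuously, and minimize blast radius.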
Failure types to test:
Tools: toxiproxy (TCP proxy with configurable toxics), chaos-mesh (Kubernetes-native), litmus (Kubernetes chaos), gremlin (SaaS platform), pumba (Docker container chaos).
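For faults below the application layer, a proxy such as toxiproxy can add latency on the wire instead of in code. The sketch below only builds the HTTP requests, assuming toxiproxy's documented admin API (`POST /proxies`, `POST /proxies/{proxy}/toxics`, `DELETE /proxies/{proxy}/toxics/{toxic}`) and its default admin address of `localhost:8474`. The helper names, proxy name, and ports are hypothetical, so check them against the toxiproxy version in use.

```typescript
// Hypothetical request builders for toxiproxy's HTTP admin API.
interface ToxiproxyRequest {
  method: 'POST' | 'DELETE';
  url: string;
  body?: string;
}

const TOXIPROXY_API = 'http://localhost:8474'; // assumed default admin address

// Route traffic through the proxy: clients connect to `listen`, toxiproxy forwards to `upstream`
export function createProxyRequest(name: string, listen: string, upstream: string): ToxiproxyRequest {
  return {
    method: 'POST',
    url: `${TOXIPROXY_API}/proxies`,
    body: JSON.stringify({ name, listen, upstream, enabled: true }),
  };
}

// Add a latency toxic; toxicity is the fraction of connections affected (0.0 to 1.0)
export function addLatencyToxicRequest(
  proxy: string,
  latencyMs: number,
  jitterMs = 0,
  toxicity = 1.0
): ToxiproxyRequest {
  return {
    method: 'POST',
    url: `${TOXIPROXY_API}/proxies/${proxy}/toxics`,
    body: JSON.stringify({
      name: `${proxy}_latency`,
      type: 'latency',
      stream: 'downstream',
      toxicity,
      attributes: { latency: latencyMs, jitter: jitterMs },
    }),
  };
}

// Remove the toxic again once the experiment is over
export function removeToxicRequest(proxy: string, toxicName: string): ToxiproxyRequest {
  return { method: 'DELETE', url: `${TOXIPROXY_API}/proxies/${proxy}/toxics/${toxicName}` };
}

// Sends a built request; requires a running toxiproxy server
export async function send(req: ToxiproxyRequest): Promise<void> {
  const res = await fetch(req.url, {
    method: req.method,
    body: req.body,
    headers: req.body ? { 'Content-Type': 'application/json' } : undefined,
  });
  if (!res.ok) throw new Error(`toxiproxy responded with ${res.status}`);
}
```

Keeping request construction separate from `send` means the payloads can be unit tested without a running proxy, and the same builders can serve a setup and teardown pair around each experiment.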
Production chaos safety:
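One way to keep an injector like the one above from firing where it should not is to resolve hard limits from the environment before any experiment is configured. The following is a minimal sketch: `CHAOS_ENABLED` comes from the example above, while `APP_ENV`, `CHAOS_ALLOW_PRODUCTION`, and the 1% production cap are hypothetical choices made for illustration.

```typescript
// Hypothetical guard: APP_ENV, CHAOS_ALLOW_PRODUCTION and the 1% cap are assumptions.
type Env = Record<string, string | undefined>;

interface ChaosLimits {
  enabled: boolean;
  maxErrorRate: number; // upper bound for errorRate/timeoutRate in this environment
}

export function resolveChaosLimits(env: Env): ChaosLimits {
  const off: ChaosLimits = { enabled: false, maxErrorRate: 0 };
  if (env.CHAOS_ENABLED !== 'true') return off;

  // An unset environment name is treated as production, the riskiest case
  const appEnv = env.APP_ENV ?? 'production';
  if (appEnv === 'test' || appEnv === 'staging') {
    return { enabled: true, maxErrorRate: 1 };
  }

  // Production needs a second, explicit opt-in and gets a small blast radius
  if (env.CHAOS_ALLOW_PRODUCTION === 'true') {
    return { enabled: true, maxErrorRate: 0.01 };
  }
  return off;
}

// Clamp a requested rate into [0, maxErrorRate]; returns 0 when chaos is disabled
export function clampRate(requested: number, limits: ChaosLimits): number {
  if (!limits.enabled) return 0;
  return Math.min(Math.max(requested, 0), limits.maxErrorRate);
}
```

With this in place, the injector could be configured as `faultInjector.configure({ enabled: limits.enabled, errorRate: clampRate(0.3, limits) })`, so a misconfigured experiment in production is capped at the small rate instead of failing 30% of calls.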
https://principlesofchaos.org/