Design and implement wide-event logging with tail sampling for context-rich, queryable observability
From sdlc-workflownpx claudepluginhub jayteealao/agent-skills --plugin sdlc-workflowThis skill uses the workspace's default tool permissions.
Designs and optimizes AI agent action spaces, tool definitions, observation formats, error recovery, and context for higher task completion rates.
Enables AI agents to execute x402 payments with per-task budgets, spending controls, and non-custodial wallets via MCP tools. Use when agents pay for APIs, services, or other agents.
Compares coding agents like Claude Code and Aider on custom YAML-defined codebase tasks using git worktrees, measuring pass rate, cost, time, and consistency.
You are an observability architect implementing wide events / canonical log lines with tail sampling to transform logging from "grep text files" to "query structured events with business context."
Traditional logging is broken because:
The Solution: Emit ONE comprehensive event per request per service containing:
interface WideEvent {
// Correlation & Identity
timestamp: string; // ISO 8601
request_id: string; // Correlation across services
trace_id?: string; // Distributed tracing
span_id?: string;
// Service Context
service: string; // "checkout-api"
version: string; // "2.1.0"
deployment_id: string; // "deploy_abc123"
region: string; // "us-east-1"
// Request Details
method: string; // "POST"
path: string; // "/api/checkout"
status_code: number; // 200
duration_ms: number; // 245
outcome: 'success' | 'error';
// Business Context (HIGH VALUE)
user: {
id: string;
subscription: 'free' | 'premium' | 'enterprise';
account_age_days: number;
lifetime_value_cents: number;
};
// Feature Flags (for rollout debugging)
feature_flags: {
new_checkout_flow?: boolean;
beta_payment_ui?: boolean;
};
// Domain-Specific Context
cart?: {
total_cents: number;
item_count: number;
currency: string;
};
payment?: {
provider: 'stripe' | 'paypal';
method: 'card' | 'bank';
latency_ms: number;
attempt: number;
};
// Error Details (when applicable)
error?: {
type: string; // "PaymentDeclinedError"
code: string; // "card_declined"
message: string;
retriable: boolean;
provider_code?: string; // Stripe/PayPal specific
};
}
Sampling decision happens AFTER request completes:
function shouldSample(event: WideEvent): boolean {
// ALWAYS keep errors (100%)
if (event.status_code >= 500) return true;
if (event.error) return true;
// ALWAYS keep slow requests (tune threshold to your p99)
if (event.duration_ms > 2000) return true;
// ALWAYS keep VIPs / important cohorts
if (event.user?.subscription === 'enterprise') return true;
if (event.user?.lifetime_value_cents > 10000_00) return true;
// ALWAYS keep feature-flagged traffic (for rollout debugging)
if (event.feature_flags?.new_checkout_flow) return true;
// Randomly sample the rest (1-5%)
return Math.random() < 0.05;
}
Why this works:
Replace "diary logs" with a request-scoped event builder that accumulates context during handling and emits once in finally.
❌ BAD: Scattered logs
app.post('/checkout', async (req, res) => {
logger.info('Checkout started');
logger.info(`User: ${req.user.id}`);
const cart = await getCart(req.user.id);
logger.info(`Cart total: ${cart.total}`);
try {
const payment = await processPayment(cart);
logger.info(`Payment successful: ${payment.id}`);
res.json({ ok: true });
} catch (err) {
logger.error(`Payment failed: ${err.message}`);
throw err;
}
});
✅ GOOD: Wide event
app.post('/checkout', async (req, res) => {
const event = req.wideEvent; // Request-scoped builder
const cart = await getCart(req.user.id);
event.cart = {
total_cents: cart.total,
item_count: cart.items.length,
currency: cart.currency
};
try {
const paymentStart = Date.now();
const payment = await processPayment(cart);
event.payment = {
provider: payment.provider,
latency_ms: Date.now() - paymentStart,
attempt: payment.attempt
};
res.json({ ok: true });
} catch (err: any) {
event.error = {
type: err.name,
code: err.code,
message: err.message,
retriable: err.retriable
};
throw err;
}
// Event emitted automatically in middleware's res.on('finish')
});
Do not log internal step-by-step narration unless absolutely required. The wide event is the authoritative record.
Exception: Infrastructure-level events (service startup, shutdown, health checks) can still be separate structured logs.
If using OTel tracing, enrich spans/events with business fields explicitly:
import { trace } from '@opentelemetry/api';
const span = trace.getActiveSpan();
if (span) {
span.setAttributes({
'user.subscription': user.subscription,
'user.account_age_days': user.accountAgeDays,
'feature_flags.new_checkout_flow': flags.newCheckoutFlow,
'cart.total_cents': cart.totalCents
});
}
OTel is a delivery mechanism, not a decision-maker. You must instrument business context deliberately.
Define a stable schema (even if flexible) and normalize keys:
❌ BAD: Inconsistent keys
{ userId: '123' } // One endpoint
{ user_id: '123' } // Another endpoint
{ id: '123' } // Yet another
✅ GOOD: Consistent schema
{
user: {
id: '123',
subscription: 'premium',
account_age_days: 730
}
}
Use a TypeScript interface or JSON Schema to enforce consistency.
Never log:
Prefer:
function redactSensitive(event: WideEvent): WideEvent {
const redacted = { ...event };
// Remove sensitive fields
delete redacted.password;
delete redacted.creditCard;
delete redacted.ssn;
// Hash PII if needed
if (redacted.email) {
redacted.email_hash = hashEmail(redacted.email);
delete redacted.email;
}
return redacted;
}
// observability/wideEvent.ts
import type { Request, Response, NextFunction } from 'express';
import crypto from 'crypto';
export interface WideEvent {
timestamp: string;
request_id: string;
trace_id?: string;
service: string;
version: string;
deployment_id: string;
region: string;
method: string;
path: string;
status_code?: number;
duration_ms?: number;
outcome?: 'success' | 'error';
user?: {
id: string;
subscription: string;
account_age_days: number;
lifetime_value_cents: number;
};
feature_flags?: Record<string, boolean>;
error?: {
type: string;
code: string;
message: string;
retriable: boolean;
};
[key: string]: any;
}
function getOrCreateRequestId(req: Request): string {
const existing = req.header('x-request-id');
return existing ?? crypto.randomUUID();
}
export function wideEventMiddleware(
logger: { info: (obj: any, msg?: string) => void }
) {
return function (req: Request, res: Response, next: NextFunction) {
const start = Date.now();
const request_id = getOrCreateRequestId(req);
// Build initial event
const event: WideEvent = {
timestamp: new Date().toISOString(),
request_id,
service: process.env.SERVICE_NAME || 'unknown',
version: process.env.SERVICE_VERSION || '0.0.0',
deployment_id: process.env.DEPLOYMENT_ID || 'local',
region: process.env.REGION || 'local',
method: req.method,
path: req.path,
};
// Attach to request for handlers to enrich
(req as any).wideEvent = event;
// Include request id in response headers
res.setHeader('x-request-id', request_id);
// Emit event when response finishes
res.on('finish', () => {
event.status_code = res.statusCode;
event.duration_ms = Date.now() - start;
event.outcome = res.statusCode >= 500 ? 'error' : 'success';
// Tail sampling decision
if (shouldSample(event)) {
logger.info(event, 'request_complete');
}
});
next();
};
}
function shouldSample(event: WideEvent): boolean {
// Always keep errors
if (event.status_code && event.status_code >= 500) return true;
if (event.error) return true;
// Always keep slow requests (tune to your p99)
if (event.duration_ms && event.duration_ms > 2000) return true;
// Always keep VIPs
if (event.user?.subscription === 'enterprise') return true;
if (event.user?.lifetime_value_cents && event.user.lifetime_value_cents > 10000_00) return true;
// Always keep feature-flagged traffic
if (event.feature_flags && Object.keys(event.feature_flags).length > 0) return true;
// Sample the rest (5%)
return Math.random() < 0.05;
}
import express from 'express';
const app = express();
app.use(wideEventMiddleware(logger));
app.post('/api/checkout', async (req, res) => {
const event = (req as any).wideEvent;
// Add user context
const user = req.user; // From auth middleware
event.user = {
id: user.id,
subscription: user.subscription,
account_age_days: daysSince(user.createdAt),
lifetime_value_cents: user.lifetimeValueCents,
};
// Add feature flags
event.feature_flags = req.featureFlags; // From feature flag middleware
// Add business context
const cart = await getCart(user.id);
event.cart = {
total_cents: cart.totalCents,
item_count: cart.items.length,
currency: cart.currency,
};
try {
const paymentStart = Date.now();
const payment = await processPayment(cart, user);
event.payment = {
provider: payment.provider,
method: payment.method,
latency_ms: Date.now() - paymentStart,
attempt: payment.attempt,
};
res.json({ ok: true, orderId: payment.orderId });
} catch (err: any) {
event.error = {
type: err.name,
code: err.code || 'unknown',
message: err.message,
retriable: err.retriable ?? false,
provider_code: err.providerCode,
};
res.status(err.statusCode || 500).json({
error: err.code,
message: err.message,
});
}
});
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label }),
},
redact: {
paths: [
'password',
'creditCard',
'ssn',
'authToken',
'apiKey',
'authorization',
'cookie',
],
remove: true,
},
// Send to stdout for CloudWatch/Datadog ingestion
transport: process.env.NODE_ENV === 'development'
? { target: 'pino-pretty' }
: undefined,
});
export default logger;
Use the same philosophy: emit fewer, better events with business context.
// observability/clientLogger.ts
interface ClientWideEvent {
timestamp: string;
session_id: string;
user_id?: string;
route: string;
build_version: string;
// Correlate with backend
request_id?: string;
// Device/Network context
device: {
type: 'mobile' | 'tablet' | 'desktop';
os: string;
browser: string;
};
network: {
effectiveType: string; // 4g, 3g, etc.
downlink?: number;
};
// Feature flags
feature_flags?: Record<string, boolean>;
// Error details
error?: {
type: string;
message: string;
stack?: string;
componentStack?: string;
};
// Business context
action?: string; // "checkout_submit", "payment_attempt"
outcome?: 'success' | 'error';
duration_ms?: number;
}
class ClientLogger {
private sessionId: string;
constructor() {
this.sessionId = this.getOrCreateSessionId();
this.setupErrorHandlers();
}
private getOrCreateSessionId(): string {
let sessionId = sessionStorage.getItem('session_id');
if (!sessionId) {
sessionId = crypto.randomUUID();
sessionStorage.setItem('session_id', sessionId);
}
return sessionId;
}
private setupErrorHandlers() {
// Unhandled errors
window.addEventListener('error', (event) => {
this.logError({
type: 'UnhandledError',
message: event.message,
stack: event.error?.stack,
});
});
// Unhandled promise rejections
window.addEventListener('unhandledrejection', (event) => {
this.logError({
type: 'UnhandledRejection',
message: String(event.reason),
});
});
}
logEvent(event: Partial<ClientWideEvent>) {
const fullEvent: ClientWideEvent = {
timestamp: new Date().toISOString(),
session_id: this.sessionId,
user_id: this.getUserId(),
route: window.location.pathname,
build_version: process.env.REACT_APP_VERSION || 'unknown',
device: this.getDeviceContext(),
network: this.getNetworkContext(),
feature_flags: this.getFeatureFlags(),
...event,
};
// Send to backend or observability service
this.send(fullEvent);
}
logError(error: { type: string; message: string; stack?: string }) {
this.logEvent({
error,
outcome: 'error',
});
}
private getUserId(): string | undefined {
// Get from auth context
return window.__USER_ID__;
}
private getDeviceContext() {
const ua = navigator.userAgent;
return {
type: this.detectDeviceType(ua),
os: this.detectOS(ua),
browser: this.detectBrowser(ua),
};
}
private getNetworkContext() {
const connection = (navigator as any).connection;
return {
effectiveType: connection?.effectiveType || 'unknown',
downlink: connection?.downlink,
};
}
private getFeatureFlags(): Record<string, boolean> {
return window.__FEATURE_FLAGS__ || {};
}
private send(event: ClientWideEvent) {
// Send to backend or observability service
if (navigator.sendBeacon) {
navigator.sendBeacon('/api/events', JSON.stringify(event));
} else {
fetch('/api/events', {
method: 'POST',
body: JSON.stringify(event),
headers: { 'Content-Type': 'application/json' },
}).catch(() => {
// Ignore errors in logging
});
}
}
}
export const clientLogger = new ClientLogger();
import React from 'react';
import { clientLogger } from './observability/clientLogger';
class ErrorBoundary extends React.Component<
{ children: React.ReactNode },
{ hasError: boolean }
> {
constructor(props: any) {
super(props);
this.state = { hasError: false };
}
static getDerivedStateFromError() {
return { hasError: true };
}
componentDidCatch(error: Error, errorInfo: React.ErrorInfo) {
clientLogger.logError({
type: 'ReactError',
message: error.message,
stack: error.stack,
componentStack: errorInfo.componentStack,
});
}
render() {
if (this.state.hasError) {
return <div>Something went wrong. Please refresh the page.</div>;
}
return this.props.children;
}
}
function CheckoutButton({ cart }: { cart: Cart }) {
const handleCheckout = async () => {
const start = Date.now();
try {
const response = await fetch('/api/checkout', {
method: 'POST',
body: JSON.stringify(cart),
});
const requestId = response.headers.get('x-request-id');
if (!response.ok) {
throw new Error('Checkout failed');
}
clientLogger.logEvent({
action: 'checkout_submit',
outcome: 'success',
duration_ms: Date.now() - start,
request_id: requestId,
cart: {
total_cents: cart.totalCents,
item_count: cart.items.length,
},
});
} catch (error: any) {
clientLogger.logEvent({
action: 'checkout_submit',
outcome: 'error',
duration_ms: Date.now() - start,
error: {
type: error.name,
message: error.message,
},
});
}
};
return <button onClick={handleCheckout}>Checkout</button>;
}
These demonstrate the shift from grep → analytics on structured events.
-- CloudWatch Insights / DataDog / Elastic
SELECT
error.code,
COUNT(*) as count,
AVG(duration_ms) as avg_duration
FROM events
WHERE
path = '/api/checkout'
AND outcome = 'error'
AND user.subscription = 'premium'
AND feature_flags.new_checkout_flow = true
AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY error.code
ORDER BY count DESC
Before wide events: Multiple searches across logs, manual correlation, no way to filter by feature flag.
After wide events: Single query with all context.
SELECT
payment.provider,
region,
PERCENTILE(payment.latency_ms, 95) as p95,
PERCENTILE(payment.latency_ms, 99) as p99
FROM events
WHERE
path = '/api/checkout'
AND payment.provider IS NOT NULL
AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY payment.provider, region
Insight: Identify if Stripe is slower in eu-west-1 than us-east-1.
SELECT
feature_flags.new_checkout_flow as has_flag,
AVG(duration_ms) as avg_duration,
SUM(CASE WHEN error IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) as error_rate
FROM events
WHERE
path = '/api/checkout'
AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY has_flag
Insight: Compare error rate and latency between flag enabled vs disabled cohorts.
SELECT *
FROM events
WHERE
request_id = 'req_abc123'
ORDER BY timestamp
Insight: See complete request journey across all services with one query using request_id.
Track these to measure improvement:
Response: Structured logging (JSON) is necessary but not sufficient. You must also:
Response: OTel is a delivery mechanism. It doesn't decide:
You must instrument business context explicitly.
Response: Context must be captured at the moment it's available:
Adding context retroactively is impossible.
Response: Tail sampling keeps 100% of:
You only sample the noise (successful fast requests from regular users).
Do not complete until: