Plan instrumentation strategy before implementation, covering what to instrument, naming conventions, cardinality management, and instrumentation budget
Plans instrumentation strategy before implementation, covering what to instrument, naming conventions, cardinality management, and instrumentation budget. Use when planning new service instrumentation or reviewing telemetry strategy.
/plugin marketplace add melodic-software/claude-code-plugins/plugin install systems-design@melodic-softwareThis skill is limited to using the following tools:
Strategic planning for application instrumentation before implementation.
Instrumentation Layers:
┌─────────────────────────────────────────────────────────────────┐
│ Layer 1: Automatic/Library Instrumentation │
│ - HTTP clients/servers (auto-captured) │
│ - Database clients (auto-captured) │
│ - Message queue clients (auto-captured) │
│ - Framework-provided metrics │
│ Effort: Low | Coverage: Broad | Customization: Limited │
├─────────────────────────────────────────────────────────────────┤
│ Layer 2: Business Transaction Instrumentation │
│ - Key user journeys │
│ - Business operations (checkout, signup, etc.) │
│ - Revenue-generating flows │
│ - SLA-bound operations │
│ Effort: Medium | Coverage: Targeted | Value: High │
├─────────────────────────────────────────────────────────────────┤
│ Layer 3: Debug/Diagnostic Instrumentation │
│ - Algorithmic hot paths │
│ - Cache behavior │
│ - Circuit breaker states │
│ - Retry/fallback paths │
│ Effort: Medium | Coverage: Deep | Use: Troubleshooting │
├─────────────────────────────────────────────────────────────────┤
│ Layer 4: Business Metrics │
│ - Domain-specific counters │
│ - Conversion rates │
│ - Feature usage │
│ - Customer behavior │
│ Effort: High | Coverage: Custom | Value: Business Insights │
└─────────────────────────────────────────────────────────────────┘
instrumentation_decisions:
always_instrument:
- "Inbound HTTP/gRPC requests"
- "Outbound HTTP/gRPC calls"
- "Database queries"
- "Message publish/consume"
- "Authentication/authorization"
- "External API calls"
- "Cache operations"
consider_instrumenting:
- "Complex business logic"
- "Feature flags evaluation"
- "Background jobs"
- "Scheduled tasks"
- "File I/O operations"
- "CPU-intensive operations"
avoid_instrumenting:
- "Every method call (too noisy)"
- "Tight loops (performance impact)"
- "Data transformation (low value)"
- "Validation helpers"
- "Utility functions"
decision_criteria:
business_value:
weight: 0.3
question: "Does this help understand business outcomes?"
debugging_value:
weight: 0.25
question: "Does this help diagnose production issues?"
slo_relevance:
weight: 0.25
question: "Does this contribute to SLI measurement?"
cost_impact:
weight: 0.2
question: "Is the cardinality/volume acceptable?"
metric_naming:
format: "[namespace]_[subsystem]_[name]_[unit]"
rules:
case: "snake_case"
unit_suffix: "Always include (_seconds, _bytes, _total)"
base_units: "Use base units (seconds not milliseconds)"
counter_suffix: "_total for counters"
examples:
good:
- "http_server_requests_total"
- "http_server_request_duration_seconds"
- "http_server_response_size_bytes"
- "db_connections_current"
- "order_processing_duration_seconds"
- "payment_transactions_total"
bad:
- "requests (no unit, no namespace)"
- "HttpRequestDuration (wrong case)"
- "order_latency_ms (use base units)"
- "totalOrders (camelCase, no unit)"
label_naming:
case: "snake_case"
avoid:
- "Embedded values in name (path=/users)"
- "High cardinality labels"
good_labels:
- "method, status_code, path"
- "service, version, environment"
bad_labels:
- "user_id (high cardinality)"
- "request_id (high cardinality)"
- "timestamp (not a dimension)"
span_naming:
format: "[operation] [resource]"
rules:
- "Use verb + noun pattern"
- "Keep names low cardinality"
- "Include operation type, not specific values"
- "Be consistent across services"
examples:
http:
pattern: "HTTP {METHOD} {route_template}"
good: "HTTP GET /users/{id}"
bad: "HTTP GET /users/12345"
database:
pattern: "{operation} {table}"
good: "SELECT orders"
bad: "SELECT * FROM orders WHERE id=123"
messaging:
pattern: "{operation} {queue/topic}"
good: "PUBLISH order-events"
bad: "publish message to order-events queue"
rpc:
pattern: "{service}/{method}"
good: "OrderService/CreateOrder"
bad: "grpc call to order service"
attributes:
required:
- "service.name"
- "service.version"
- "deployment.environment"
recommended:
http:
- "http.method"
- "http.route"
- "http.status_code"
- "http.target"
database:
- "db.system"
- "db.name"
- "db.operation"
- "db.statement (sanitized)"
messaging:
- "messaging.system"
- "messaging.destination"
- "messaging.operation"
log_naming:
format: "snake_case for all fields"
standard_fields:
timestamp: "ISO 8601 format"
level: "INFO, WARN, ERROR, etc."
message: "Human-readable description"
service: "Service name"
trace_id: "Correlation ID"
span_id: "Current span"
domain_fields:
pattern: "{domain}_{field}"
examples:
- "order_id"
- "customer_id"
- "payment_amount"
- "product_sku"
avoid:
- "Nested objects (flatten for indexing)"
- "Arrays of unknown length"
- "Large text blobs"
- "Sensitive data (PII, secrets)"
Cardinality = Number of unique time series
Example:
http_requests_total{method="GET", path="/api/users", status="200"}
Cardinality = methods × paths × statuses
= 5 × 100 × 10
= 5,000 time series
With user_id (1M users):
= 5 × 100 × 10 × 1,000,000
= 5,000,000,000 time series ← EXPLOSION!
cardinality_budget:
planning:
total_budget: 100000 # Target max time series per service
allocation:
automatic_instrumentation: 30% # 30,000
business_transactions: 40% # 40,000
custom_metrics: 20% # 20,000
buffer: 10% # 10,000
per_metric_limits:
low_cardinality:
max_series: 100
example: "status codes, methods"
medium_cardinality:
max_series: 1000
example: "endpoints, operations"
high_cardinality:
max_series: 10000
example: "aggregated by hour"
requires: "Justification and approval"
monitoring:
- "Alert on cardinality growth > 10% per day"
- "Weekly cardinality reviews"
- "Automatic label value limiting"
cardinality_reduction:
bucketing:
before: "path=/users/12345"
after: "path=/users/{id}"
technique: "Path template extraction"
sampling:
description: "Sample high-volume, low-value traces"
strategies:
head_sampling: "Decide at trace start"
tail_sampling: "Decide after seeing full trace"
adaptive: "Adjust rate based on volume"
aggregation:
description: "Pre-aggregate before export"
example: "Count by status, not per request"
value_limiting:
description: "Cap unique values per label"
example: "Max 100 unique paths, then 'other'"
dropping:
description: "Drop low-value dimensions"
candidates:
- "Instance ID (use service name)"
- "Request ID (not for metrics)"
- "Full URLs (use route templates)"
performance_budget:
cpu_overhead:
target: "< 1% CPU increase"
measurement: "Profile with/without instrumentation"
memory_overhead:
target: "< 50MB additional heap"
components:
- "Metric registries"
- "Span buffers"
- "Log buffers"
latency_overhead:
target: "< 1ms per request"
hot_paths: "< 100μs"
data_volume:
metrics:
target: "< 1GB/day per service"
calculation: "series × scrape_interval × 8 bytes"
traces:
target: "< 10GB/day per service (with sampling)"
sampling_rate: "1-10% for high-volume services"
logs:
target: "< 5GB/day per service"
strategies: "Sampling, level gating"
cost_planning:
estimation_formula:
metrics:
monthly_cost: "time_series × $0.003 (typical cloud pricing)"
example: "10,000 series × $0.003 = $30/month"
traces:
monthly_cost: "spans_per_month × $0.000005"
example: "100M spans × $0.000005 = $500/month"
logs:
monthly_cost: "GB_per_month × $0.50"
example: "500GB × $0.50 = $250/month"
optimization_strategies:
- "Increase scrape interval (15s → 60s)"
- "Reduce trace sampling rate"
- "Log level gating in production"
- "Shorter retention for debug data"
- "Downsample old metrics"
instrumentation_plan:
service: "{Service Name}"
version: "1.0"
date: "{Date}"
owner: "{Team}"
objectives:
- "Track SLIs for order processing"
- "Enable distributed tracing for debugging"
- "Monitor payment success rates"
automatic_instrumentation:
framework: "OpenTelemetry .NET"
enabled:
- "ASP.NET Core (HTTP server)"
- "HttpClient (HTTP client)"
- "Entity Framework Core (database)"
- "Azure.Messaging.ServiceBus"
configuration:
sampling_rate: 0.1 # 10% of traces
batch_export_interval: 5000 # ms
custom_spans:
- name: "ProcessOrder"
purpose: "Track order processing duration"
attributes:
- "order.id"
- "order.item_count"
- "order.total_amount"
events:
- "inventory.reserved"
- "payment.processed"
- name: "ValidatePayment"
purpose: "Track payment validation steps"
attributes:
- "payment.method"
- "payment.provider"
sensitive: false
custom_metrics:
counters:
- name: "orders_total"
labels: ["status", "payment_method"]
purpose: "Count orders by outcome"
cardinality_estimate: 20
- name: "payment_failures_total"
labels: ["reason", "provider"]
purpose: "Track payment failure reasons"
cardinality_estimate: 50
histograms:
- name: "order_processing_duration_seconds"
labels: ["order_type"]
purpose: "Track order processing latency"
buckets: [0.1, 0.5, 1, 2, 5, 10]
cardinality_estimate: 10
gauges:
- name: "pending_orders_current"
labels: []
purpose: "Current pending order count"
cardinality_estimate: 1
cardinality_summary:
estimated_total: 81
budget: 1000
status: "Within budget"
log_strategy:
production_level: "INFO"
structured_fields:
standard:
- "trace_id"
- "span_id"
- "service"
- "environment"
domain:
- "order_id"
- "customer_id (hashed)"
sampling:
debug_logs: "1% in production"
cost_estimate:
monthly:
metrics: "$30"
traces: "$200"
logs: "$150"
total: "$380"
review_schedule:
frequency: "Quarterly"
metrics_to_review:
- "Cardinality growth"
- "Data volume"
- "Cost vs budget"
observability-patterns - Three pillars overviewdistributed-tracing - Trace implementation detailsslo-sli-error-budget - What to measure for SLOsLast Updated: 2025-12-26
Creating algorithmic art using p5.js with seeded randomness and interactive parameter exploration. Use this when users request creating art using code, generative art, algorithmic art, flow fields, or particle systems. Create original algorithmic art rather than copying existing artists' work to avoid copyright violations.
Applies Anthropic's official brand colors and typography to any sort of artifact that may benefit from having Anthropic's look-and-feel. Use it when brand colors or style guidelines, visual formatting, or company design standards apply.
Create beautiful visual art in .png and .pdf documents using design philosophy. You should use this skill when the user asks to create a poster, piece of art, design, or other static piece. Create original visual designs, never copying existing artists' work to avoid copyright violations.