Cost Anomaly Detection
Purpose
This skill proactively identifies unusual cost patterns, unexpected spikes, irregular spending behaviors, and anomalies that may indicate problems, inefficiencies, or opportunities for optimization.
When to Use
- "Are there any cost anomalies?"
- "Check for unusual spending"
- "Find cost issues"
- "What looks wrong with my costs?"
- "Detect abnormal costs"
- Proactive cost monitoring
- Weekly/monthly cost reviews
- Security incident detection
- Waste identification
- Before presenting cost reports
- Keywords: anomaly, unusual, abnormal, irregular, unexpected, odd, suspicious, detect issues
Prerequisites
This skill builds on the understand-cloudzero-organization skill.
Before applying this procedure:
- If you haven't already in this session, load the understand-cloudzero-organization skill and follow its instructions
- Reference the cached organization context (don't reload unnecessarily)
- Organization context is critical for distinguishing legitimate changes from true anomalies
How This Skill Works
Step 1: Establish Baseline
Query historical data to establish normal patterns:
# Recent period
get_cost_data(
granularity="daily",
date_range="last 30 days",
cost_type="real_cost"
)
# Compare to baseline period
get_cost_data(
granularity="daily",
date_range="30 to 60 days ago",
cost_type="real_cost"
)
Calculate baseline statistics:
- Mean daily cost
- Standard deviation
- Normal range (e.g., mean ± 2 standard deviations)
- Typical day-of-week patterns
- Expected growth rate
Step 2: Total Cost Anomaly Detection
Identify days with unusual total spending:
Detect Outliers:
For each day in recent period:
If cost > (baseline_mean + 2 × baseline_stddev):
Flag as high anomaly
If cost < (baseline_mean - 2 × baseline_stddev):
Flag as low anomaly (potential data issue or optimization)
Look for:
- Single-day spikes (unusual one-time events)
- Sustained increases (new baseline)
- Gradual drift away from normal
- Weekend vs. weekday anomalies
- Unexpected patterns
Step 3: Service-Level Anomaly Detection
Check each service for unusual behavior:
# Get services with daily breakdown
get_cost_data(
group_by=["CZ:Service"],
granularity="daily",
limit=20
)
# Compare recent pattern to baseline for each service
For each major service:
- Calculate its typical daily cost
- Identify days with unusual spending
- Detect new services that appeared
- Detect services that disappeared
- Calculate variance from expected
Anomaly Types:
- Spike: Sudden increase then return to normal
- Step Change: Sudden increase that persists
- Gradual Drift: Slow increase over time
- Drop: Unexpected decrease
- New Appearance: Service that didn't exist before
- Disappearance: Service that stopped
Step 4: Account-Level Anomaly Detection
Identify accounts with unusual spending:
get_cost_data(
group_by=["CZ:Account"],
granularity="daily",
limit=20
)
For each account:
- Compare to its historical pattern
- Flag accounts with >50% increase from baseline
- Identify new accounts with unexpected high costs
- Detect accounts with no activity (potential issue)
Step 5: Resource-Level Anomaly Detection
Identify specific resources with unusual costs:
# Get top resources
get_cost_data(
group_by=["CZ:Resource"],
limit=50
)
# Compare to previous period
get_cost_data(
group_by=["CZ:Resource"],
date_range="previous period",
limit=50
)
Look for:
- New high-cost resources
- Resources with sudden cost increases
- Resources that appeared recently
- Expensive resources without proper tags
Step 6: Regional Anomaly Detection
Check for unusual regional spending patterns:
get_cost_data(
group_by=["CZ:Region"],
granularity="daily",
limit=20
)
Anomalies might indicate:
- Unauthorized resource creation in unexpected regions
- Data transfer anomalies
- Failover events
- Misconfigured deployments
Step 7: Usage Pattern Anomalies
Detect unusual usage patterns:
Hourly Pattern Analysis (if examining recent days):
get_cost_data(
granularity="hourly",
date_range="last 7 days"
)
Look for:
- 24/7 costs when should be business hours only
- Weekend activity when shouldn't exist
- Off-hours spikes (potential security issue)
- Missing expected peaks (potential outage)
Day-of-Week Patterns:
- Calculate average cost per day of week
- Compare recent weeks to baseline weeks
- Flag unusual weekday/weekend ratios
Step 8: Multi-Dimensional Anomaly Detection
Cross-reference anomalies across dimensions:
get_cost_data(
group_by=["CZ:Account", "CZ:Service", "CZ:Region"],
limit=100
)
Find:
- Specific service in specific account with anomaly
- Regional anomalies for specific services
- Account+Service combinations that are unusual
Step 9: Rate-of-Change Anomalies
Detect unusual growth rates:
Calculate for each dimension value:
recent_rate = (cost_this_week - cost_last_week) / cost_last_week
typical_rate = historical average growth rate
If recent_rate > (typical_rate + threshold):
Flag as accelerating growth anomaly
Step 10: Security and Waste Indicators
Look for specific patterns indicating issues:
Potential Security Issues:
- New EC2 instances in unusual regions
- Sudden spike in compute or network costs
- Resources created in accounts with no recent activity
- Large data transfer spikes
- Cryptocurrency mining patterns (sustained high compute)
Potential Waste:
- EBS volumes without attached instances
- Old snapshots accumulating
- Unused Reserved Instances
- Idle RDS databases (consistent low cost)
- Over-provisioned resources
Potential Misconfigurations:
- Public S3 buckets with high request costs
- NAT Gateway traffic spikes
- Logging to expensive destinations
- Unoptimized data transfer routes
Step 11: Tag-Based Anomaly Detection
Check for anomalies in tagged resources:
get_cost_data(
group_by=["CZ:Tag:Environment", "CZ:Service"],
granularity="daily",
limit=50
)
Anomalies might be:
- Non-prod environments at prod scale
- Test environments with sustained high costs
- Development resources left running 24/7
Output Format
Provide comprehensive anomaly report:
1. Executive Summary
- Anomaly Count: X anomalies detected
- Severity: [High: X, Medium: Y, Low: Z]
- Potential Cost Impact: $X,XXX/month if unaddressed
- Most Critical: [Brief description of #1 issue]
- Action Required: [Yes/No and urgency]
2. Anomaly Severity Classification
HIGH SEVERITY (Immediate Action Required):
- [Anomaly description]
- Detected: [Date/time]
- Impact: $X,XXX
- Potential cause: [Analysis]
- Recommended action: [Specific steps]
MEDIUM SEVERITY (Review Within 24-48 Hours):
- [Anomaly description]
LOW SEVERITY (Monitor or Investigate When Convenient):
- [Anomaly description]
3. Detailed Anomaly Analysis
For each significant anomaly:
Anomaly #1: [Descriptive Title]
Type: [Spike / Step Change / Drift / New Resource / etc.]
Severity: [High / Medium / Low]
Detected: [Date/time first observed]
Impact: $X,XXX (XX% above normal)
Details:
- What: [Specific description of the anomaly]
- Where: [Account / Service / Region / Resource]
- When: [Time period]
- Baseline: Normal cost is $X, observed cost is $Y
- Deviation: XX% above/below normal (Z standard deviations)
Pattern Analysis:
- First observed: [Date]
- Duration: [Ongoing / X days]
- Trend: [Growing / Stable / Declining]
- Time pattern: [Constant / Hourly / Daily pattern]
Potential Causes:
- [Most likely cause with reasoning]
- [Alternative explanation]
- [Other possibilities]
Related Anomalies:
- [Other anomalies that might be connected]
Recommendations:
- Immediate: [Action to take now]
- Investigation: [What to check]
- Remediation: [How to fix]
- Prevention: [How to avoid future occurrences]
Estimated Impact If Not Addressed:
- Daily: $XXX
- Monthly: $X,XXX
- Annual: $XX,XXX
4. Anomaly Dashboard
Cost Anomalies by Category:
| Category | Count | Total Impact | Avg Impact |
|---|
| Compute Spikes | X | $X,XXX | $XXX |
| Storage Growth | X | $X,XXX | $XXX |
| Data Transfer | X | $X,XXX | $XXX |
| New Resources | X | $X,XXX | $XXX |
| Security Concerns | X | $X,XXX | $XXX |
| Waste/Idle | X | $X,XXX | $XXX |
Anomalies by Dimension:
| Dimension | Anomaly Count | Most Affected Value | Impact |
|---|
| Service | X | [Service name] | $X,XXX |
| Account | X | [Account ID] | $X,XXX |
| Region | X | [Region] | $X,XXX |
5. Time-Series Anomaly Visualization
Cost Over Time with Anomalies Highlighted:
[Describe the pattern, indicating where anomalies occurred]
Days with anomalies:
- [Date]: $X,XXX (XX% above baseline) - [Service/Account]
- [Date]: $X,XXX (XX% above baseline) - [Service/Account]
- [Date]: $X,XXX (XX% above baseline) - [Service/Account]
Baseline range: $X,XXX - $X,XXX
Normal mean: $X,XXX
Current level: $X,XXX (within/outside normal range)
6. New or Changed Resources
New High-Cost Resources Detected:
| Resource | Service | Account | First Seen | Current Cost | Status |
|---|
| [Resource ID] | EC2 | [Account] | [Date] | $X,XXX/mo | ⚠️ Review |
| [Resource ID] | RDS | [Account] | [Date] | $X,XXX/mo | ⚠️ Review |
Recently Changed Resources:
| Resource | Service | Change Type | Date | Impact |
|---|
| [Resource ID] | EC2 | Size increase | [Date] | +$XXX/mo |
| [Resource ID] | RDS | Multi-AZ enabled | [Date] | +$XXX/mo |
7. Security and Compliance Concerns
Potential Security Issues:
- [Issue description]
- Indicators: [What suggests this is a security issue]
- Affected resources: [Details]
- Recommended action: [Contact security team, isolate resource, etc.]
Potential Compliance Issues:
- [Issue description]
- Compliance requirement: [Which policy/standard]
- Violation: [What's non-compliant]
- Remediation: [Steps to fix]
8. Waste and Optimization Opportunities
Identified Waste:
- [Type of waste] - $X,XXX/month
- Description: [Details]
- How to fix: [Steps]
- Savings potential: $X,XXX/month
Optimization Opportunities:
- [Opportunity] - Potential savings: $X,XXX/month
- Current state: [Details]
- Recommended change: [Action]
- Implementation effort: [Low/Medium/High]
9. Baseline Comparison
Current vs. Baseline:
| Metric | Baseline | Current | Variance | Status |
|---|
| Daily Cost | $X,XXX | $X,XXX | +XX% | ⚠️ |
| Weekday Avg | $X,XXX | $X,XXX | +XX% | ⚠️ |
| Weekend Avg | $X,XXX | $X,XXX | +XX% | ✅ |
| Top Service | $X,XXX | $X,XXX | +XX% | ⚠️ |
| Top Account | $X,XXX | $X,XXX | +XX% | ⚠️ |
Statistical Analysis:
- Mean: $X,XXX (baseline: $X,XXX)
- Std Dev: $XXX (baseline: $XXX)
- Current cost is X standard deviations from baseline
- Coefficient of variation: XX% (baseline: XX%)
10. Prioritized Action Plan
Immediate Actions (Within 24 Hours):
-
[Action] - Prevents $X,XXX/month
- Severity: High
- Effort: Low
- Owner: [Suggested owner]
-
[Action] - Prevents $X,XXX/month
Short-Term Actions (This Week):
- [Action] - Potential savings $X,XXX/month
Monitoring and Prevention:
- Set up alerts for [specific anomaly type]
- Review [dimension] daily for next week
- Investigate [specific pattern] further
- Implement [preventive measure]
11. False Positive Assessment
Likely Legitimate (Not True Anomalies):
- [Item]
- Reason: [Why this is expected based on org context]
- Recommendation: Update baseline expectations
Requires Validation:
- [Item]
- Could be legitimate or anomalous
- Recommendation: Verify with [team/person]
Skill-Specific Best Practices
- Establish proper baselines - Need sufficient historical data
- Use statistical methods - Not just absolute thresholds
- Consider day-of-week patterns - Compare apples to apples
- Cross-reference dimensions - Anomalies often span multiple dimensions
- Prioritize by impact - Focus on highest-cost anomalies first
- Check for false positives - Validate against known changes
- Provide context - Explain why something is anomalous
For general cost analysis best practices, see ${CLAUDE_PLUGIN_ROOT}/references/best-practices.md
Anomaly Detection Techniques
Statistical Anomaly Detection
For each data point:
z_score = (value - mean) / stddev
If abs(z_score) > 2:
Flag as anomaly
Percentage-Based Detection
If (current - baseline) / baseline > 0.5:
Flag as 50%+ increase anomaly
Rate-of-Change Detection
day_over_day_change = (today - yesterday) / yesterday
If day_over_day_change > threshold:
Flag as rapid change anomaly
Pattern Matching
- Compare recent pattern to historical patterns
- Detect when current pattern doesn't match any known pattern
- Use day-of-week, time-of-day templates
Clustering
- Group similar cost patterns
- Identify outliers that don't fit any cluster
- Flag new clusters that emerge
Common Anomaly Types
Type 1: Compute Spikes
Indicators:
- Sudden EC2/Lambda/ECS cost increase
- Unusual instance types or sizes
- New regions with compute resources
Causes:
- Auto-scaling event
- New deployment
- Performance testing
- Crypto mining (security issue)
Type 2: Storage Growth
Indicators:
- Gradual or sudden storage cost increase
- S3 bucket growth
- EBS volume increases
Causes:
- Data accumulation (expected or unexpected)
- Backup retention issues
- Log accumulation
- Snapshot proliferation
Type 3: Data Transfer Spikes
Indicators:
- Network/data transfer cost spike
- Cross-region transfer increase
- Internet egress increase
Causes:
- Architecture change
- Data migration
- Security incident (data exfiltration)
- Misconfigured application
Type 4: New Resource Creation
Indicators:
- Resources that didn't exist in baseline
- Costs in new accounts or regions
- New service usage
Causes:
- New project launch (legitimate)
- Developer experimentation
- Unauthorized resource creation
- Security breach
Type 5: Idle or Waste Resources
Indicators:
- Resources with consistent low but non-zero cost
- Detached volumes
- Unused Reserved Instances
Causes:
- Forgotten test resources
- Improper cleanup after projects
- Manual provisioning without automation
Advanced Techniques
Machine Learning Anomaly Detection
If sufficient data:
- Build time-series models (ARIMA, Prophet)
- Predict expected costs
- Flag actual costs that deviate from prediction
Seasonal Adjustment
Account for known seasonal patterns:
- End-of-quarter increased activity
- Holiday seasons
- Business cycle patterns
Multi-Variate Analysis
Look for combinations of factors:
- High cost + new resource + unusual region = high priority
- Low cost + expected service + known account = low priority
Anomaly Correlation
Find related anomalies:
- EC2 spike + data transfer spike might be same event
- Multiple services in same account might share root cause
Tips for Effective Anomaly Detection
- Run regularly - Daily or weekly, not just when problems noticed
- Know your baselines - Understand normal patterns first
- Tune thresholds - Adjust based on organization's tolerance
- Follow up - Track which anomalies were real issues vs. false positives
- Automate - Set up alerts for high-severity anomalies
- Document patterns - Build knowledge base of anomaly types
- Close the loop - Report back on resolution to improve detection
- Balance sensitivity - Too sensitive = alert fatigue, too loose = miss issues
See Also
- understand-cloudzero-organization skill - Load organization context first
${CLAUDE_PLUGIN_ROOT}/references/best-practices.md - Universal cost analysis best practices
${CLAUDE_PLUGIN_ROOT}/references/cloudzero-tools-reference.md - Complete tool documentation
${CLAUDE_PLUGIN_ROOT}/references/error-handling.md - Troubleshooting and common errors
${CLAUDE_PLUGIN_ROOT}/references/dimensions-reference.md - Dimension types and FQDIDs
${CLAUDE_PLUGIN_ROOT}/references/cost-types-reference.md - When to use each cost type