From devops-skills
Generates PromQL queries, alerting/recording rules, and Prometheus dashboards through an interactive workflow that clarifies goals, metrics, and use cases such as Grafana visualization or troubleshooting.
Install: npx claudepluginhub akin-ozer/cc-devops-skills --plugin devops-skills
This skill uses the workspace's default tool permissions.
Bundled files:
- examples/alerting_rules.yaml
- examples/common_queries.promql
- examples/kubernetes_patterns.promql
- examples/recording_rules.yaml
- examples/red_method.promql
- examples/slo_patterns.promql
- examples/use_method.promql
- references/best_practices.md
- references/metric_types.md
- references/promql_functions.md
- references/promql_patterns.md
- tests/test_service_down_portability.py
- tests/test_time_window_semantics.py
This skill provides a comprehensive, interactive workflow for generating production-ready PromQL queries with best practices built-in. Generate queries for monitoring dashboards, alerting rules, and ad-hoc analysis with an emphasis on user collaboration and planning before code generation.
Invoke this skill when:
- The user asks for a PromQL query for a dashboard, alert, or ad-hoc analysis
- The user needs Prometheus alerting or recording rules
- The user wants help troubleshooting or refining an existing PromQL query
CRITICAL: This skill emphasizes interactive planning before query generation. Always engage the user in a collaborative planning process to ensure the generated query matches their exact intentions.
Follow this workflow when generating PromQL queries:
Start by understanding what the user wants to monitor or measure. Ask clarifying questions to gather requirements:
- **Primary Goal**: What are you trying to monitor or measure?
- **Use Case**: What will this query be used for?
- **Context**: Any additional context?
Use the AskUserQuestion tool to gather this information if not provided.
When to Ask vs. Infer: If the user's initial request already clearly specifies the goal, use case, and context (e.g., "Create an alert for P95 latency > 500ms for payment-service"), you may acknowledge these details in your response instead of re-asking. Only ask clarifying questions for information that is missing or ambiguous.
Determine which metrics are available and relevant:
- **Metric Discovery**: What metrics are available?
  - `_total` suffix → Counter
  - `_bucket`, `_sum`, `_count` suffixes → Histogram
  - `_created` suffix → Counter creation timestamp
- **Metric Type Identification**: Confirm the metric type(s)
  - Counters (e.g., `http_requests_total`, `errors_total`, `bytes_sent_total`): query with `rate()`, `irate()`, `increase()`
  - Gauges (e.g., `memory_usage_bytes`, `cpu_temperature_celsius`, `queue_length`): query with `avg_over_time()`, `min_over_time()`, `max_over_time()`, or directly
  - Histograms (e.g., `http_request_duration_seconds_bucket`, `response_size_bytes_bucket`): query with `histogram_quantile()`, `rate()`
  - Summaries (e.g., `rpc_duration_seconds{quantile="0.95"}`): use `_sum` and `_count` for averages; don't average quantiles
- **Label Discovery**: What labels are available on these metrics?
  - Common labels: `job`, `instance`, `environment`, `service`, `endpoint`, `status_code`, `method`

Use the AskUserQuestion tool to confirm metric names, types, and available labels.
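If you have direct access to the Prometheus UI or HTTP API, a few lightweight discovery queries can confirm what exists before planning; the metric and label names below are illustrative:
# Confirm the metric exists and inspect its label set
http_requests_total{job="api-server"}
# List the values a label takes on this metric
count by (status_code) (http_requests_total{job="api-server"})
# Estimate series count (cardinality) before writing broader queries
count(http_requests_total)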
Gather specific requirements for the query.
IMPORTANT: When the user has already specified parameters in their initial request (e.g., "5-minute window", "500ms threshold", "> 5% error rate"), you MUST treat those values as the default option for confirmation rather than asking open-ended questions.
Example: If user says "alert when P95 latency exceeds 500ms", use:
AskUserQuestion:
- Question: "Confirm the alert threshold?"
- Options:
1. "500ms (as specified)" - Use the threshold from your request
2. "Different threshold" - Let me specify a different value
This respects the user's input and speeds up the workflow while still allowing modifications.
**Time Range**: What time window should the query cover?
- Range selectors like `[5m]`, `[1h]`, `[1d]`; typically `[1m]` to `[5m]` for real-time views, `[1h]` to `[1d]` for trends

**Label Filtering**: Which labels should filter the data?
- Exact match: `job="api-server"`, `status_code="200"`
- Negative match: `status_code!="200"`
- Regex match: `instance=~"prod-.*"`
- Multiple filters: `{job="api", environment="production"}`

**Aggregation**: Should the data be aggregated?
- Keep specific labels: `sum by (job, endpoint)`, `avg by (instance)`
- Drop specific labels: `sum without (instance, pod)`, `avg without (job)`
- Operators: `sum`, `avg`, `max`, `min`, `count`, `topk`, `bottomk`

**Thresholds or Conditions**: Are there specific conditions?
Use the AskUserQuestion tool to gather or confirm these parameters. When the user has already provided values (e.g., "5-minute window", "> 5%"), present them as the default option for confirmation.
BEFORE GENERATING ANY CODE, present a plain-English query plan and ask for user confirmation:
## PromQL Query Plan
Based on your requirements, here's what the query will do:
**Goal**: [Describe the monitoring goal in plain English]
**Query Structure**:
1. Start with metric: `[metric_name]`
2. Filter by labels: `{label1="value1", label2="value2"}`
3. Apply function: `[function_name]([metric][time_range])`
4. Aggregate: `[aggregation] by ([label_list])`
5. Additional operations: [any calculations, ratios, or transformations]
**Expected Output**:
- Data type: [instant vector/scalar]
- Labels in result: [list of labels]
- Value represents: [what the number means]
- Typical range: [expected value range]
**Example Interpretation**:
If the query returns `0.05`, it means: [plain English explanation]
**Does this match your intentions?**
- If yes, I'll generate the query and validate it
- If no, let me know what needs to change
Use the AskUserQuestion tool to confirm the plan, with options along the lines of "Approve - generate the query" and "Modify - change the plan".
When the user chooses to modify, revise the plan and re-confirm before generating anything.
Once the user confirms the plan, generate the actual PromQL query following best practices.
Before writing any query code, you MUST:
Identify the query category first (histogram, RED, USE, function-specific, optimization, etc.).
Read only the relevant reference section(s) using the Read tool:
- Histogram/percentile queries → references/metric_types.md (Histogram section)
- RED method queries → references/promql_patterns.md (RED method section)
- USE method queries → references/promql_patterns.md (USE method section)
- Optimization questions → references/best_practices.md
- Function-specific questions → references/promql_functions.md

If a needed reference cannot be read, state the issue and continue with best-effort generation using the most applicable documented pattern you already have.
Cite the applicable pattern or best practice in your response:
As documented in references/promql_patterns.md (Pattern 3: Latency Percentile):
# 95th percentile latency
histogram_quantile(0.95, sum by (le) (rate(...)))
Reference example files when generating similar queries:
Based on examples/red_method.promql (lines 64-82):
# P95 latency with proper histogram_quantile usage
This keeps generated queries aligned with documented patterns while avoiding unnecessary full-file rereads on iterative follow-ups.
Always Use Label Filters
# Good: Specific filtering reduces cardinality
rate(http_requests_total{job="api-server", environment="prod"}[5m])
# Bad: Matches all time series, high cardinality
rate(http_requests_total[5m])
Use Appropriate Functions for Metric Types
# Counter: Use rate() or increase()
rate(http_requests_total[5m])
# Gauge: Use directly or with *_over_time()
memory_usage_bytes
avg_over_time(memory_usage_bytes[5m])
# Histogram: Use histogram_quantile()
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
Apply Aggregations with by() or without()
# Aggregate by specific labels (keeps only these labels)
sum by (job, endpoint) (rate(http_requests_total[5m]))
# Aggregate without specific labels (removes these labels)
sum without (instance, pod) (rate(http_requests_total[5m]))
Use Exact Matches Over Regex When Possible
# Good: Faster exact match
http_requests_total{status_code="200"}
# Bad: Slower regex match when not needed
http_requests_total{status_code=~"200"}
Calculate Ratios Properly
# Error rate: errors / total requests
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
Use Recording Rules for Complex Queries
Follow the `level:metric:operations` naming convention for recording rules.

Format for Readability
# Good: Multi-line for complex queries
histogram_quantile(0.95,
sum by (le, job) (
rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
)
)
Pattern 1: Request Rate
# Requests per second
rate(http_requests_total{job="api-server"}[5m])
# Total requests per second across all instances
sum(rate(http_requests_total{job="api-server"}[5m]))
Pattern 2: Error Rate
# Error ratio (0 to 1)
sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api-server"}[5m]))
# Error percentage (0 to 100)
(
sum(rate(http_requests_total{job="api-server", status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api-server"}[5m]))
) * 100
Pattern 3: Latency Percentile (Histogram)
# 95th percentile latency
histogram_quantile(0.95,
sum by (le) (
rate(http_request_duration_seconds_bucket{job="api-server"}[5m])
)
)
Pattern 4: Resource Usage
# Current memory usage
process_resident_memory_bytes{job="api-server"}
# Average CPU usage (in cores) over 5 minutes - this is a counter, so use rate()
rate(process_cpu_seconds_total{job="api-server"}[5m])
Pattern 5: Availability
# Percentage of up instances
(
count(up{job="api-server"} == 1)
/
count(up{job="api-server"})
) * 100
Pattern 6: Saturation/Queue Depth
# Average queue length
avg_over_time(queue_depth{job="worker"}[5m])
# Maximum queue depth in the last hour
max_over_time(queue_depth{job="worker"}[1h])
ALWAYS attempt to validate the generated query first using the devops-skills:promql-validator skill:
After generating the query, automatically invoke:
Skill(devops-skills:promql-validator)
The devops-skills:promql-validator skill will:
1. Check syntax correctness
2. Validate semantic logic (correct functions for metric types)
3. Identify anti-patterns and inefficiencies
4. Suggest optimizations
5. Explain what the query does
6. Verify it matches user intent
Validation checklist:
- Syntax is correct
- Functions match the metric types involved
- No known anti-patterns or obvious inefficiencies
- Output matches the confirmed query plan and user intent

If validation fails, fix issues and re-validate until all checks pass.
If the validator skill is unavailable, fails to run, or cannot complete after two fix/re-validate cycles, fall back to a manual review against the checklist above and mark any checks you could not complete as ⚠️ UNVERIFIED in the results below.
IMPORTANT: Display Validation Results to User
After running validation, you MUST display the structured results to the user in this format:
## PromQL Validation Results
### Syntax Check
- Status: ✅ VALID / ⚠️ WARNING / ❌ ERROR / ⚠️ UNVERIFIED
- Issues: [list any syntax errors]
### Best Practices Check
- Status: ✅ OPTIMIZED / ⚠️ CAN BE IMPROVED / ❌ HAS ISSUES / ⚠️ UNVERIFIED
- Issues: [list any problems found]
- Suggestions: [list optimization opportunities]
### Validation Coverage
- Validator tool run: [successful / failed / unavailable]
- Checks completed: [syntax, semantics, anti-patterns, performance, intent-match]
- Checks skipped: [list any skipped checks, or "None"]
### Query Explanation
- **What it measures**: [plain English description]
- **Output labels**: [list labels in result, or "None (scalar)"]
- **Expected result structure**: [instant vector / scalar / etc.]
This transparency helps users understand the validation process and any recommendations.
After generation and validation (or manual fallback validation), provide the user with:
The Final Query:
[Generated and validated PromQL query]
Query Explanation: what the query measures and how to interpret the returned values
How to Use It: where to run it (dashboard panel, alert rule, console) and any setup required
Customization Notes: which labels, time windows, and thresholds can be adjusted
Related Queries: useful variations or companion queries worth considering
Native histograms are now stable in Prometheus 3.x (fully stable as of v3.8.0; see below). They offer significant advantages over classic histograms:
- A single time series per histogram instead of one series per bucket, so far lower cardinality
- Sparse, exponential bucket boundaries that adapt automatically, with no need to pre-configure buckets
- Higher resolution and generally more accurate quantile estimation
Important: Starting with Prometheus v3.8.0, native histograms are fully stable. However, scraping native histograms still requires explicit activation via the `scrape_native_histograms` configuration setting. Starting with v3.9, no feature flag is needed, but `scrape_native_histograms` must still be set explicitly.
# Classic histogram (requires _bucket suffix and le label)
histogram_quantile(0.95,
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Native histogram (simpler - no _bucket suffix, no le label needed)
histogram_quantile(0.95,
sum by (job) (rate(http_request_duration_seconds[5m]))
)
# Get observation count rate from native histogram
histogram_count(rate(http_request_duration_seconds[5m]))
# Get sum of observations from native histogram
histogram_sum(rate(http_request_duration_seconds[5m]))
# Calculate fraction of observations between two values
histogram_fraction(0, 0.1, rate(http_request_duration_seconds[5m]))
# Average request duration from native histogram
histogram_sum(rate(http_request_duration_seconds[5m]))
/
histogram_count(rate(http_request_duration_seconds[5m]))
Native histograms are identified by:
- No `_bucket` suffix on the metric name
- No `le` label in the time series

When querying, check if your Prometheus instance has native histograms enabled:
# prometheus.yml - Enable native histogram scraping
scrape_configs:
  - job_name: 'my-app'
    scrape_native_histograms: true  # Prometheus 3.x+
Prometheus 3.4+ supports custom bucket native histograms (schema -53), allowing classic histogram to native histogram conversion. This is a key migration path for users with existing classic histograms.
Benefits of NHCB:
- Migrate existing classic histograms without changing application instrumentation
- Preserve existing bucket boundaries while gaining the native histogram storage model
- Store one native-histogram series instead of one series per bucket, reducing cardinality
Configuration (Prometheus 3.4+):
# prometheus.yml - Convert classic histograms to NHCB on scrape
scrape_configs:
  - job_name: 'my-app'
    convert_classic_histograms_to_nhcb: true  # Prometheus 3.4+
Querying NHCB:
# Query NHCB metrics the same way as native histograms
histogram_quantile(0.95, sum by (job) (rate(http_request_duration_seconds[5m])))
# histogram_fraction also works with NHCB (Prometheus 3.4+)
histogram_fraction(0, 0.2, rate(http_request_duration_seconds[5m]))
Note: Schema -53 indicates custom bucket boundaries. These histograms with different custom bucket boundaries are generally not mergeable with each other.
Service Level Objectives (SLOs) are critical for modern SRE practices. These patterns help implement SLO-based monitoring and alerting.
# Error budget remaining (for 99.9% SLO over 30 days)
# Returns value between 0 and 1 (1 = full budget, 0 = exhausted)
1 - (
sum(rate(http_requests_total{job="api", status_code=~"5.."}[30d]))
/
sum(rate(http_requests_total{job="api"}[30d]))
) / 0.001 # 0.001 = 1 - 0.999 (allowed error rate)
# Simplified: Availability over 30 days
sum(rate(http_requests_total{job="api", status_code!~"5.."}[30d]))
/
sum(rate(http_requests_total{job="api"}[30d]))
Burn rate measures how fast you're consuming error budget. A burn rate of 1 means you'll exhaust the budget exactly at the end of the SLO window.
# Current burn rate (1 hour window, 99.9% SLO)
# Burn rate = (current error rate) / (allowed error rate)
(
sum(rate(http_requests_total{job="api", status_code=~"5.."}[1h]))
/
sum(rate(http_requests_total{job="api"}[1h]))
) / 0.001 # 0.001 = allowed error rate for 99.9% SLO
# Burn rate > 1 means consuming budget faster than allowed
# Burn rate of 14.4 consumes 2% of monthly budget in 1 hour
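The multipliers used below fall out of this definition. To consume a fraction F of the budget within an alert window W, given a 30-day (720-hour) SLO window, the burn rate must be:
# burn_rate = F * (slo_window / alert_window)
#           = 0.02 * (720h / 1h)   # page-level example
#           = 14.4
The ticket-level threshold works the same way: 0.05 * (720h / 6h) = 6.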
The recommended approach for SLO alerting uses multiple windows to balance detection speed and precision:
# Page-level alert: 2% budget in 1 hour (burn rate 14.4)
# Long window (1h) AND short window (5m) must both exceed threshold
(
(
sum(rate(http_requests_total{job="api", status_code=~"5.."}[1h]))
/
sum(rate(http_requests_total{job="api"}[1h]))
) > 14.4 * 0.001
)
and
(
(
sum(rate(http_requests_total{job="api", status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
) > 14.4 * 0.001
)
# Ticket-level alert: 5% budget in 6 hours (burn rate 6)
(
(
sum(rate(http_requests_total{job="api", status_code=~"5.."}[6h]))
/
sum(rate(http_requests_total{job="api"}[6h]))
) > 6 * 0.001
)
and
(
(
sum(rate(http_requests_total{job="api", status_code=~"5.."}[30m]))
/
sum(rate(http_requests_total{job="api"}[30m]))
) > 6 * 0.001
)
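As a sketch, the page-level expression above can be wrapped in an alerting rule like the following; the alert name, labels, and annotation text are illustrative:
alert: SLOHighBurnRatePage
expr: |
  (
    sum(rate(http_requests_total{job="api", status_code=~"5.."}[1h]))
    /
    sum(rate(http_requests_total{job="api"}[1h]))
  ) > 14.4 * 0.001
  and
  (
    sum(rate(http_requests_total{job="api", status_code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{job="api"}[5m]))
  ) > 14.4 * 0.001
labels:
  severity: page
annotations:
  summary: "High error budget burn rate for job api"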
Pre-compute SLO metrics for efficient alerting:
# Recording rules for SLO calculations
groups:
- name: slo_recording_rules
interval: 30s
rules:
# Error ratio over different windows
- record: job:slo_errors_per_request:ratio_rate1h
expr: |
sum by (job) (rate(http_requests_total{status_code=~"5.."}[1h]))
/
sum by (job) (rate(http_requests_total[1h]))
- record: job:slo_errors_per_request:ratio_rate5m
expr: |
sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum by (job) (rate(http_requests_total[5m]))
# Availability (success ratio)
- record: job:slo_availability:ratio_rate1h
expr: |
1 - job:slo_errors_per_request:ratio_rate1h
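With these rules in place, the multiwindow burn-rate alerts can reference the pre-computed ratios instead of re-evaluating the raw expressions; a sketch using the rule names defined above:
# Page-level multiwindow condition via recorded ratios (99.9% SLO)
job:slo_errors_per_request:ratio_rate1h{job="api"} > 14.4 * 0.001
and
job:slo_errors_per_request:ratio_rate5m{job="api"} > 14.4 * 0.001

Latency SLO patterns: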
# Percentage of requests faster than SLO target (200ms)
(
sum(rate(http_request_duration_seconds_bucket{le="0.2", job="api"}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="api"}[5m]))
) * 100
# Requests violating latency SLO (slower than 500ms)
(
sum(rate(http_request_duration_seconds_count{job="api"}[5m]))
-
sum(rate(http_request_duration_seconds_bucket{le="0.5", job="api"}[5m]))
)
/
sum(rate(http_request_duration_seconds_count{job="api"}[5m]))
| Burn Rate | Budget Consumed | Time to Exhaust 30-day Budget | Alert Severity |
|---|---|---|---|
| 1 | 100% over 30d | 30 days | None |
| 2 | 100% over 15d | 15 days | Low |
| 6 | 5% in 6h | 5 days | Ticket |
| 14.4 | 2% in 1h | ~2 days | Page |
| 36 | 5% in 1h | ~20 hours | Page (urgent) |
Subqueries enable complex time-based calculations:
# Maximum 5-minute rate over the past 30 minutes
max_over_time(
rate(http_requests_total[5m])[30m:1m]
)
Syntax: <query>[<range>:<resolution>]
- `<range>`: Time window to evaluate over
- `<resolution>`: Step size between evaluations

Compare current data with historical data:
# Compare current rate with rate from 1 week ago
rate(http_requests_total[5m])
-
rate(http_requests_total[5m] offset 1w)
Query metrics at specific timestamps:
# Rate at the end of the range query
rate(http_requests_total[5m] @ end())
# Rate at specific Unix timestamp
rate(http_requests_total[5m] @ 1609459200)
Combine metrics with operators and control label matching:
# One-to-one matching (default)
metric_a + metric_b
# Many-to-one with group_left
rate(http_requests_total[5m])
* on (job, instance) group_left (version)
app_version_info
# Ignoring specific labels
metric_a + ignoring(instance) metric_b
Filter time series based on conditions:
# Return series only where value > 100
http_requests_total > 100
# Return series present in both
metric_a and metric_b
# Return series in A but not in B
metric_a unless metric_b
If the user asks about specific Prometheus features, operators, or custom metrics:
Try context7 MCP first (preferred):
Use mcp__context7__resolve-library-id with "prometheus"
Then use mcp__context7__get-library-docs with:
- context7CompatibleLibraryID: /prometheus/docs
- topic: [specific feature, function, or operator]
- page: 1 (fetch additional pages if needed)
Fallback to WebSearch:
Search query pattern:
"Prometheus PromQL [function/operator/feature] documentation [version] examples"
Examples:
"Prometheus PromQL rate function documentation examples"
"Prometheus PromQL histogram_quantile documentation best practices"
"Prometheus PromQL aggregation operators documentation"
Rate: Request throughput
sum(rate(http_requests_total{job="api"}[5m])) by (endpoint)
Errors: Error rate
sum(rate(http_requests_total{job="api", status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
Duration: Latency percentiles
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
)
Utilization: Resource usage percentage
# Average CPU utilization percentage across all cores
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))
Saturation: Queue depth or resource contention
avg_over_time(node_load1[5m])
Errors: Error counters
rate(node_network_receive_errs_total[5m])
When generating queries for alerting:
Include the Threshold: Make the condition explicit
# Alert when error rate exceeds 5%
(
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.05
Use Boolean Operators: Return 1 (fire) or 0 (no alert)
# "bool" makes the comparison return 0/1 instead of filtering series
# Returns 1 when memory usage > 90% of total node memory
(process_resident_memory_bytes / node_memory_MemTotal_bytes) > bool 0.9
Consider for Duration: Alerts typically use for clause
alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.05
for: 10m # Only fire after 10 minutes of continuous violation
When generating queries for recording rules:
Follow Naming Convention: level:metric:operations
# level: aggregation level (job, instance, etc.)
# metric: base metric name
# operations: functions applied
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
Pre-aggregate Expensive Queries:
# Recording rule for frequently-used latency query
- record: job_endpoint:http_request_duration_seconds:p95
expr: |
histogram_quantile(0.95,
sum by (job, endpoint, le) (
rate(http_request_duration_seconds_bucket[5m])
)
)
Use Recorded Metrics in Dashboards:
# Instead of expensive query, use pre-recorded metric
job_endpoint:http_request_duration_seconds:p95{job="api-server"}
Empty Results: Check that the target is up and the metric is being scraped:
up{job="your-job"}

Too Many Series (High Cardinality): Add label filters or aggregate away high-cardinality labels (e.g., instance, pod).

Incorrect Values: Confirm the metric type and that the function matches it (e.g., rate() on counters, never on gauges).

Performance Issues: Narrow label filters, shorten range selectors, or move the expression into a recording rule.
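A few generic diagnostic queries help narrow these problems down; metric and label names are illustrative:
# Does the metric exist at all? absent() returns 1 if no series match
absent(http_requests_total{job="your-job"})
# How many series does the metric have?
count(http_requests_total)
# Which values of a label contribute the most series?
topk(10, count by (instance) (http_requests_total))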
When generating queries, follow the workflow above end to end: plan interactively, cite references, generate, then validate.
If a structured question tool is unavailable, continue with an explicit inline questionnaire in plain text, for example:
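A minimal inline questionnaire sketch (adapt the wording to the request):
1. Goal: What do you want to monitor or measure?
2. Metric: Which metric(s), and what type (counter, gauge, histogram)?
3. Window: What time range? ([1m]-[5m] for real-time, [1h]-[1d] for trends)
4. Filters: Which label filters apply (job, environment, ...)?
5. Threshold: Is there an alert condition or threshold?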
Use references deterministically, but avoid unnecessary reads for trivial requests.
Read references when ANY of the following is true:
- The query involves histograms, quantiles, or SLO patterns
- The user asks for RED/USE method queries or optimization advice
- You are unsure which function fits the metric type

Skip reference reads only when ALL of the following are true:
- The request is a single, simple query over one metric
- It uses only basic, well-known functions (`rate`, `increase`, `sum`, `avg`, `max`, `min`)

When skipping, explicitly state: Reference read skipped (trivial case) and keep validation mandatory.
After generating any PromQL query, automatically invoke the devops-skills:promql-validator skill to ensure quality:
Steps:
1. Generate the PromQL query based on user requirements
2. Invoke devops-skills:promql-validator skill with the generated query
3. Review validation results (syntax, semantics, performance)
4. Fix any issues identified by the validator
5. Re-validate until all checks pass
6. Provide the final validated query with usage instructions
7. Ask user if further refinements are needed
This ensures all generated queries follow best practices and are production-ready.
IMPORTANT: Explicit Reference Consultation
When generating queries, you SHOULD explicitly read the relevant reference files using the Read tool and cite applicable best practices. This ensures generated queries follow documented patterns and helps users understand why certain approaches are recommended.
References:
- references/promql_functions.md
- references/promql_patterns.md
- references/best_practices.md
- references/metric_types.md

Examples:
- examples/common_queries.promql
- examples/red_method.promql
- examples/use_method.promql
- examples/alerting_rules.yaml
- examples/recording_rules.yaml
- examples/slo_patterns.promql
- examples/kubernetes_patterns.promql
A successful query generation session should meet these measurable checkpoints:
- Requirements and parameters confirmed with the user (or acknowledged when already specified)
- Query plan presented and approved before any code was generated
- References consulted with file names OR skipped (trivial case) with reason
- Validation run and results displayed, with any skipped checks listed
- Final query delivered with explanation and usage notes

The goal is to collaboratively plan and generate PromQL queries that exactly match user intentions. Always prioritize clarity, correctness, and performance. The interactive planning phase is the most important part of this skill: never skip it!