This skill should be used when the user asks about "Temporal monitoring", "Prometheus Temporal", "Grafana Temporal", "Temporal metrics", "Temporal alerts", "monitor workflows", "Temporal observability", or needs guidance on setting up monitoring for Temporal clusters.
Guidance for configuring monitoring, alerting, and observability for Temporal clusters.
Recommended stack:
- Prometheus for metrics collection
- Grafana for dashboards
- Alertmanager for routing the alert rules below
Each Temporal service (frontend, history, matching, worker) exposes Prometheus metrics, on port 9090 by default in the Helm chart.
| Category | Prefix | Purpose |
|---|---|---|
| Service | temporal_* | Service health, latency |
| Persistence | temporal_persistence_* | Database operations |
| Workflow | temporal_workflow_* | Workflow execution |
| Activity | temporal_activity_* | Activity execution |
| Task Queue | temporal_task_* | Task dispatch |
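Before wiring up scraping, it can help to confirm an endpoint is live (a quick sketch; the deployment name depends on your Helm release):

```sh
kubectl port-forward deploy/temporal-frontend 9090:9090 &
curl -s localhost:9090/metrics | grep '^temporal_' | head
```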
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'temporal-frontend'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        regex: frontend
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "9090"
        action: keep

  - job_name: 'temporal-history'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        regex: history
        action: keep

  - job_name: 'temporal-matching'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        regex: matching
        action: keep

  - job_name: 'temporal-worker'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        regex: worker
        action: keep
```
Enable Prometheus in Helm values:
```yaml
prometheus:
  enabled: true
  nodeExporter:
    enabled: false
server:
  metrics:
    prometheus:
      timerType: histogram
```
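Then roll the change out (a sketch, assuming a local checkout of temporalio/helm-charts and release name `temporal`):

```sh
helm upgrade --install temporal ./charts/temporal \
  -f values.yaml --namespace temporal
```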
```promql
# Frontend request rate
sum(rate(temporal_frontend_requests_total[5m])) by (operation)

# Frontend errors
sum(rate(temporal_frontend_errors_total[5m])) by (operation, error_type)

# Frontend latency p99
histogram_quantile(0.99,
  sum(rate(temporal_frontend_request_latency_bucket[5m])) by (le, operation)
)

# Persistence request rate
sum(rate(temporal_persistence_requests_total[5m])) by (operation)

# Persistence latency p99
histogram_quantile(0.99,
  sum(rate(temporal_persistence_latency_bucket[5m])) by (le, operation)
)

# Persistence errors
sum(rate(temporal_persistence_errors_total[5m])) by (operation, error_type)

# Workflow start rate
sum(rate(temporal_workflow_started_total[5m])) by (namespace, workflow_type)

# Workflow completion rate
sum(rate(temporal_workflow_completed_total[5m])) by (namespace, workflow_type)

# Workflow failure rate
sum(rate(temporal_workflow_failed_total[5m])) by (namespace, workflow_type)

# Workflow execution latency
histogram_quantile(0.99,
  sum(rate(temporal_workflow_endtoend_latency_bucket[5m])) by (le, workflow_type)
)

# Schedule-to-start latency (task wait time)
histogram_quantile(0.99,
  sum(rate(temporal_schedule_to_start_latency_bucket[5m])) by (le, task_queue)
)

# Task dispatch rate
sum(rate(temporal_task_dispatch_total[5m])) by (task_queue, task_type)

# Task backlog
temporal_task_queue_depth
```
Temporal provides official Grafana dashboards:
- Server Overview (Grafana dashboard ID 10270)
- SDK Metrics (Grafana dashboard ID 10271)
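To load dashboards automatically instead of importing by hand, a Grafana file provisioning config can point at exported JSON (a sketch; the folder name and path are assumptions):

```yaml
# /etc/grafana/provisioning/dashboards/temporal.yaml
apiVersion: 1
providers:
  - name: temporal
    type: file
    folder: Temporal
    options:
      path: /var/lib/grafana/dashboards/temporal
```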
Service Health Panel:
```json
{
  "title": "Frontend Request Rate",
  "type": "graph",
  "targets": [{
    "expr": "sum(rate(temporal_frontend_requests_total[5m])) by (operation)",
    "legendFormat": "{{operation}}"
  }]
}
```
Task Queue Latency Panel:
```json
{
  "title": "Schedule-to-Start Latency (p99)",
  "type": "graph",
  "targets": [{
    "expr": "histogram_quantile(0.99, sum(rate(temporal_schedule_to_start_latency_bucket[5m])) by (le, task_queue))",
    "legendFormat": "{{task_queue}}"
  }]
}
```
```yaml
# prometheus-rules.yaml
groups:
  - name: temporal-critical
    rules:
      - alert: TemporalServiceDown
        expr: up{job=~"temporal-.*"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Temporal service {{ $labels.job }} is down"

      - alert: TemporalPersistenceHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(temporal_persistence_latency_bucket[5m])) by (le, operation)
          ) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database latency p99 > 500ms for {{ $labels.operation }}"

      - alert: TemporalPersistenceErrors
        expr: |
          sum(rate(temporal_persistence_errors_total[5m])) by (operation) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Persistence errors detected for {{ $labels.operation }}"

  - name: temporal-warning
    rules:
      - alert: TemporalHighScheduleToStartLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(temporal_schedule_to_start_latency_bucket[5m])) by (le, task_queue)
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Task queue {{ $labels.task_queue }} has high latency"
          description: "Tasks are waiting >10s to start. Consider scaling workers."

      - alert: TemporalHighFrontendLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(temporal_frontend_request_latency_bucket[5m])) by (le, operation)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frontend latency high for {{ $labels.operation }}"

      - alert: TemporalWorkflowFailureRate
        expr: |
          sum(rate(temporal_workflow_failed_total[5m])) /
          sum(rate(temporal_workflow_completed_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Workflow failure rate > 5%"

      - alert: TemporalTaskQueueBacklog
        expr: temporal_task_queue_depth > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Task queue {{ $labels.task_queue }} backlog > 1000"
```
Configure SDK metrics in your workers:
```go
package main

import (
	"log"
	"time"

	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Create Prometheus reporter and a tally scope that feeds it
	reporter := prometheus.NewReporter(prometheus.Options{})
	scope, closer := tally.NewRootScope(tally.ScopeOptions{
		Tags:           map[string]string{"service": "my-worker"},
		CachedReporter: reporter,
		Separator:      prometheus.DefaultSeparator,
	}, time.Second)
	defer closer.Close()

	// Create client with metrics
	c, err := client.Dial(client.Options{
		MetricsHandler: sdktally.NewMetricsHandler(scope),
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Create worker; register workflows/activities, then run
	w := worker.New(c, "task-queue", worker.Options{})
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker stopped:", err)
	}
}
```
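The reporter only aggregates metrics in-process; to let Prometheus scrape the worker, serve its HTTP handler before starting the worker (a minimal sketch; add `net/http` to the imports, and the listen port is an assumption that should match your scrape config):

```go
// Serve the tally Prometheus reporter's registry on /metrics
http.Handle("/metrics", reporter.HTTPHandler())
go func() {
	log.Println(http.ListenAndServe(":9090", nil))
}()
```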
Key SDK metrics to watch:

```promql
# Activity execution latency
temporal_activity_execution_latency_bucket

# Workflow task latency
temporal_workflow_task_execution_latency_bucket

# Worker task slots
temporal_worker_task_slots_available
```
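A worker that runs out of executor slots stops making progress even though it looks healthy; a query like this can surface the condition (a sketch; the worker_type label value is an assumption):

```promql
# Zero available activity slots per task queue (sketch)
min by (task_queue) (temporal_worker_task_slots_available{worker_type="ActivityWorker"}) == 0
```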
Configure structured logging for correlation:
```yaml
# Temporal server config
log:
  stdout: true
  level: info
  outputFile: ""
```
Include workflow/run IDs in worker logs:
```go
import (
	"context"

	"go.temporal.io/sdk/activity"
)

func YourActivity(ctx context.Context, input Input) error {
	logger := activity.GetLogger(ctx)
	info := activity.GetInfo(ctx)
	logger.Info("Processing",
		"workflowID", info.WorkflowExecution.ID,
		"runID", info.WorkflowExecution.RunID,
		"activityID", info.ActivityID,
	)
	// ...
	return nil
}
```
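On the workflow side, the workflow logger already tags each entry with the workflow type, workflow ID, and run ID, so no explicit IDs are needed (a sketch; `YourWorkflow` and `Input` are placeholders):

```go
import "go.temporal.io/sdk/workflow"

func YourWorkflow(ctx workflow.Context, input Input) error {
	// Entries are automatically tagged with WorkflowType, WorkflowID, RunID
	logger := workflow.GetLogger(ctx)
	logger.Info("Workflow started")
	// ...
	return nil
}
```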
Configure Kubernetes probes for Temporal services. The frontend serves gRPC (not HTTP) on port 7233, so use native gRPC probes rather than httpGet:

```yaml
livenessProbe:
  grpc:
    port: 7233
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  grpc:
    port: 7233
  initialDelaySeconds: 10
  periodSeconds: 5
```
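On clusters older than Kubernetes 1.24 (no native gRPC probes), an exec probe with grpc_health_probe performs the same check, assuming the binary is present in the image:

```yaml
livenessProbe:
  exec:
    command: ["grpc_health_probe", "-addr=localhost:7233"]
  initialDelaySeconds: 30
  periodSeconds: 10
```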
| Symptom | Metrics to Check | Likely Cause |
|---|---|---|
| Slow workflows | schedule_to_start_latency | Need more workers |
| Workflow failures | workflow_failed_total | Check activity errors |
| API errors | frontend_errors_total | Auth, rate limits |
| DB issues | persistence_latency | Database performance |
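For the first two symptoms, the Temporal CLI gives a live view of a queue's pollers and backlog (assuming the modern `temporal` CLI; `my-task-queue` is a placeholder):

```sh
temporal task-queue describe --task-queue my-task-queue
```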
For complete metric reference, consult:
- `references/metrics-reference.md` - Complete metric documentation
- `references/dashboard-json.md` - Grafana dashboard definitions