Configures centralized log aggregation with Loki/Promtail or ELK stack, including parsing, label extraction, retention policies, and metrics correlation for multi-service troubleshooting.
Install: npx claudepluginhub pjt222/agent-almanac
Implement centralized log collection, parsing, and querying with Loki/Promtail or ELK stack for operational visibility.
See Extended Examples for complete configuration files and templates.
Select between Loki (Prometheus-style) and ELK (Elasticsearch-based) based on your requirements.
Loki advantages:
- Label-only indexing keeps storage and operating costs low
- Prometheus-style labels and first-class Grafana integration
- Lightweight, container-native deployment
ELK advantages:
- Full-text indexing of all log content
- Mature parsing, enrichment, and analytics tooling
For this guide, we'll focus on Loki + Promtail (recommended for most modern setups).
Decision criteria:
Use Loki if:
- You want label-based queries similar to Prometheus
- Storage costs are a concern (Loki indexes only labels)
- You already use Grafana for metrics
- Kubernetes/container-native deployment
Use ELK if:
- You need full-text search across all log content
- You have complex log parsing and enrichment requirements
- You require advanced analytics and aggregations
- Legacy systems with existing Logstash pipelines
Expected: Clear choice made based on requirements, team downloads appropriate installation artifacts.
On failure: revisit the decision criteria above with the team before committing to a stack.
Install and configure Loki with appropriate storage backend.
Docker Compose deployment (docker-compose.yml):
version: '3.8'

services:
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/config.yml
    restart: unless-stopped
    depends_on:
      - loki

volumes:
  loki-data:
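With both configuration files (written in the next steps) in place, the stack can be started and smoke-tested; a minimal check, assuming Docker Compose v2 and the default ports above:

# Start the stack in the background
docker compose up -d
# Loki reports readiness once its ring and ingester are up
curl http://localhost:3100/ready
# Promtail serves its own metrics on port 9080
curl -s http://localhost:9080/metrics | head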
Loki configuration (loki-config.yml):
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

# ... (see EXAMPLES.md for complete configuration)
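For local testing, the elided sections typically cover storage, the ring, and the index schema. A minimal single-binary, filesystem-backed sketch (paths and the schema date are illustrative, not the contents of EXAMPLES.md):

common:
  path_prefix: /loki
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
schema_config:
  configs:
    - from: 2020-10-24   # any date before the first ingested log
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h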
For production with S3 storage:
storage_config:
  aws:
    s3: s3://us-east-1/my-loki-bucket
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
    shared_store: s3
Expected: Loki starts successfully, health check passes at http://localhost:3100/ready, logs stored according to retention policy.
On failure:
docker logs loki
docker run grafana/loki:2.9.0 -config.file=/etc/loki/local-config.yaml -verify-config

Set up Promtail to scrape logs and forward to Loki with label extraction.
Promtail configuration (promtail-config.yml):
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

# ... (see EXAMPLES.md for complete configuration)
Key Promtail concepts:
- clients: where to push logs (the Loki endpoint)
- positions file: tracks how far each file has been read, so restarts don't re-ship logs
- scrape_configs: which files or targets to tail, and the static labels to attach
- pipeline_stages: per-line processing such as parsing, label extraction, and timestamp handling
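A minimal scrape job tying these together (the job name, file glob, and JSON field are placeholders):

clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          __path__: /var/log/app/*.log   # files to tail
    pipeline_stages:
      - json:
          expressions:
            level: level                 # extract "level" from each JSON line
      - labels:
          level:                         # promote it to a Loki label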
Expected: Promtail scrapes configured log files, labels applied correctly, logs visible in Loki via LogQL queries.
On failure:
docker logs promtail
docker exec promtail ls /var/log
curl http://localhost:9080/metrics | grep promtail
cat /tmp/positions.yaml

Learn LogQL syntax for filtering and aggregating logs.
Basic queries:
# All logs from a job
{job="app"}
# Logs with specific label values
{job="app", level="error"}
# Regex filter on log line content
{job="app"} |~ "authentication failed"
# Case-insensitive regex
{job="app"} |~ "(?i)error"
# Line filter (doesn't parse, just includes/excludes)
{job="app"} |= "user" # Contains "user"
{job="app"} != "debug" # Doesn't contain "debug"
Parsing and filtering:
# JSON parsing
{job="app"} | json | level="error"
# Regex parsing with named groups
{job="app"} | regexp "user_id=(?P<user_id>\\d+)" | user_id="12345"
# Logfmt parsing (key=value format)
{job="app"} | logfmt | level="error", service="auth"
# Pattern parsing
{job="nginx"} | pattern `<ip> - <user> [<timestamp>] "<method> <path> <protocol>" <status> <size>` | status >= 500
Aggregations (metrics from logs):
# Count log lines per level
sum by (level) (count_over_time({job="app"}[5m]))
# Rate of error logs
rate({job="app", level="error"}[5m])
# Bytes processed per service
sum by (service) (bytes_over_time({job="app"}[1h]))
# Average request duration from logs
avg_over_time({job="app"} | json | unwrap duration [5m])
# Top 10 error messages (message extracted via the json parser)
topk(10, sum by (message) (count_over_time({job="app"} | json | level="error" [1h])))
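These metric-style queries can also drive alerting through the Loki ruler. A sketch of a rule file, assuming the ruler is enabled and loading rules from its configured directory (names and thresholds are illustrative):

groups:
  - name: log-alerts
    rules:
      - alert: HighErrorLogRate
        # fire when the app logs more than 10 error lines/sec for 5 minutes
        expr: sum(rate({job="app", level="error"}[5m])) > 10
        for: 5m
        labels:
          severity: warning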
Filtering by extracted fields:
# Find specific trace in logs
{job="app"} | json | trace_id="abc123def456"
# HTTP 5xx errors from nginx
{job="nginx"} | pattern `<_> "<_> <_> <_>" <status> <_>` | status >= 500
# Failed authentication attempts
{job="app"} | json | message=~"authentication failed" | user_id != ""
Create Grafana explore queries or dashboard panels using these patterns.
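The same queries can be run outside Grafana against Loki's HTTP API, which is useful for scripting; for example (query_range defaults to roughly the last hour when no range is given):

# Fetch recent error logs as JSON
curl -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="app", level="error"}' \
  --data-urlencode 'limit=100' | jq '.data.result'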
Expected: Queries return expected log lines, filtering works correctly, aggregations produce metrics from logs.
On failure:
curl http://localhost:3100/loki/api/v1/labels
curl http://localhost:3100/loki/api/v1/label/{label_name}/values

Correlate logs with Prometheus metrics and distributed traces for unified observability.
Add trace IDs to logs (application instrumentation):
# Python with OpenTelemetry
import logging

from opentelemetry import trace

logger = logging.getLogger(__name__)

def handle_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id
    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, "032x")},  # zero-padded hex trace ID
    )
// Go with OpenTelemetry
import (
    "context"

    "go.opentelemetry.io/otel/trace"
    "go.uber.org/zap"
)

// logger is assumed to be a preconfigured *zap.Logger
func handleRequest(ctx context.Context) {
    span := trace.SpanFromContext(ctx)
    traceID := span.SpanContext().TraceID().String()
    logger.Info("Processing request",
        zap.String("trace_id", traceID),
    )
}
Configure Grafana data links from metrics to logs:
In Prometheus panel field config:
{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "title": "View Logs",
          "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"refId\":\"A\",\"expr\":\"{job=\\\"app\\\",instance=\\\"${__field.labels.instance}\\\"} |= `${__field.labels.trace_id}`\"}],\"range\":{\"from\":\"${__from}\",\"to\":\"${__to}\"}}",
          "targetBlank": false
        }
      ]
    }
  }
}
Configure Grafana data links from logs to traces:
In Loki datasource config:
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo   # UID of the Tempo datasource, not its display name
          matcherRegex: "trace_id=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"
Correlate logs in Grafana Explore: use split view to place a Loki query beside the related Prometheus or Tempo query over the same time range, so a spike in a metric can be compared line by line with the logs it produced.
Expected: Clicking metrics opens related logs, trace IDs in logs link to trace viewer, single pane for metrics/logs/traces navigation.
On failure: confirm trace IDs actually appear in log lines, test the matcherRegex against real output, and check that the derived field's datasource UID matches the Tempo datasource.
Configure retention policies and compaction to manage storage costs.
Global and per-tenant retention (in Loki config):
limits_config:
  retention_period: 720h  # Global default: 30 days
  # Per-tenant retention (requires multi-tenancy enabled)
  per_tenant_override_config: /etc/loki/overrides.yaml

# overrides.yaml
overrides:
  production:
    retention_period: 2160h  # 90 days for production
  staging:
    retention_period: 360h   # 15 days for staging
  development:
    retention_period: 168h   # 7 days for dev
Retention by stream labels (requires compactor):
compactor:
  working_directory: /loki/compactor
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h

# ... (see EXAMPLES.md for complete configuration)
Priority determines which rule applies when multiple match (lower number = higher priority).
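Stream-scoped retention rules live under limits_config; a sketch of how such rules might look (selectors and periods are illustrative):

limits_config:
  retention_period: 720h             # fallback when no stream rule matches
  retention_stream:
    - selector: '{namespace="prod"}'
      priority: 1                    # applies first if both rules match
      period: 2160h
    - selector: '{level="debug"}'
      priority: 2
      period: 24h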
Chunk cache settings:
chunk_store_config:
  chunk_cache_config:
    enable_fifocache: true
    fifocache:
      max_size_bytes: 1GB
      ttl: 24h

# ... (see EXAMPLES.md for complete configuration)
Monitor retention:
# Check chunk stats
curl http://localhost:3100/loki/api/v1/status/chunks | jq
# Check compactor metrics
curl http://localhost:3100/metrics | grep loki_compactor
# Verify deleted chunks
curl http://localhost:3100/metrics | grep loki_boltdb_shipper_retention_deleted
Expected: Old logs automatically deleted per retention policy, storage usage stabilizes, compaction reduces index size.
On failure:
docker logs loki | grep compactor
du -sh /loki/
curl http://localhost:3100/ready

Confirm retention_enabled: true and retention_deletes_enabled: true are set. Check ingestion_rate_mb and ingestion_burst_size_mb if ingestion is being throttled.

Related skills:
- correlate-observability-signals - Unified debugging across metrics, logs, and traces using trace IDs
- build-grafana-dashboards - Visualize log-derived metrics and create log panels in dashboards
- setup-prometheus-monitoring - Metrics provide context for when to query logs during incidents
- instrument-distributed-tracing - Add trace IDs to logs for correlation with distributed traces