troubleshooting-astro-deployments
Troubleshoot Astronomer production deployments with Astro CLI. Use when investigating deployment issues, viewing production logs, analyzing failures, or managing deployment environment variables.
Astro Deployment Troubleshooting
This skill helps you diagnose and troubleshoot production Astronomer deployments using the Astro CLI.
For deployment management, see the managing-astro-deployments skill. For local development, see the managing-astro-local-env skill.
Quick Health Check
Start with these commands to get an overview:
# 1. List deployments to find target
astro deployment list
# 2. Get deployment overview
astro deployment inspect <DEPLOYMENT_ID>
# 3. Check for errors
astro deployment logs <DEPLOYMENT_ID> --error -c 50
Viewing Deployment Logs
Use -c to set how many log lines are returned (default: 500). Component and level flags cannot be combined: use exactly one of them per command. The --keyword flag can accompany either.
Component-Specific Logs
View logs from specific Airflow components:
# Scheduler logs (DAG processing, task scheduling)
astro deployment logs <DEPLOYMENT_ID> --scheduler -c 50
# Worker logs (task execution)
astro deployment logs <DEPLOYMENT_ID> --workers -c 30
# Webserver logs (UI access, health checks)
astro deployment logs <DEPLOYMENT_ID> --webserver -c 30
# Triggerer logs (deferrable operators)
astro deployment logs <DEPLOYMENT_ID> --triggerer -c 30
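Because component flags cannot be combined, reviewing every component takes one command per flag. The loop below is a minimal sketch of that, assuming a POSIX shell; `logs_all_components` is a hypothetical helper name, not part of the Astro CLI.

```shell
# Fetch recent logs from each Airflow component in turn.
# One `astro deployment logs` call per component, since component flags
# cannot be combined. logs_all_components is a hypothetical helper name.
logs_all_components() (
  deployment_id="${1:?usage: logs_all_components <DEPLOYMENT_ID> [COUNT]}"
  count="${2:-30}"
  for component in scheduler workers webserver triggerer; do
    echo "===== $component (last $count lines) ====="
    astro deployment logs "$deployment_id" "--$component" -c "$count"
  done
)
```

Calling `logs_all_components <DEPLOYMENT_ID> 50` would print four labeled sections, one per component.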
Log Level Filtering
Filter by severity:
# Error logs only (most useful for troubleshooting)
astro deployment logs <DEPLOYMENT_ID> --error -c 30
# Warning logs
astro deployment logs <DEPLOYMENT_ID> --warn -c 50
# Info-level logs
astro deployment logs <DEPLOYMENT_ID> --info -c 50
Search Logs
Search for specific keywords:
# Search for specific error
astro deployment logs <DEPLOYMENT_ID> --keyword "ConnectionError"
# Search for specific DAG
astro deployment logs <DEPLOYMENT_ID> --keyword "my_dag_name" -c 100
# Find import errors
astro deployment logs <DEPLOYMENT_ID> --error --keyword "ImportError"
# Find task failures
astro deployment logs <DEPLOYMENT_ID> --error --keyword "Task failed"
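When several failure types are suspected, counting matches per keyword can show which one dominates. This is a rough sketch that assumes the CLI prints one log entry per line and that the keyword appears literally in each match; `keyword_counts` is a hypothetical helper name.

```shell
# Print how many error-log lines match each keyword, highest count first.
# keyword_counts is a hypothetical helper name, not an Astro CLI command.
keyword_counts() (
  deployment_id="${1:?usage: keyword_counts <DEPLOYMENT_ID> <KEYWORD>...}"
  shift
  for keyword in "$@"; do
    # grep -c counts matching lines; -F treats the keyword as a literal string.
    # grep exits non-zero when nothing matches, so `|| true` keeps the 0 count
    # without aborting a script that runs under `set -e`.
    hits=$(astro deployment logs "$deployment_id" --error --keyword "$keyword" -c 500 |
      grep -c -F -- "$keyword" || true)
    printf '%s\t%s\n' "$hits" "$keyword"
  done | sort -rn
)
```

For example, `keyword_counts <DEPLOYMENT_ID> ImportError ConnectionError "Task failed"` prints one tab-separated count per keyword. Counts are capped by the `-c 500` limit used in the sketch.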
Complete Investigation Workflow
Step 1: Identify the Problem
# List deployments with status
astro deployment list
# Get deployment details
astro deployment inspect <DEPLOYMENT_ID>
Look for:
- Status: HEALTHY vs UNHEALTHY
- Runtime version compatibility
- Resource limits (CPU, memory)
- Recent deployment timestamp
Step 2: Check Error Logs
# Start with errors
astro deployment logs <DEPLOYMENT_ID> --error -c 50
Look for:
- Recurring error patterns
- Specific DAGs failing repeatedly
- Import errors or syntax errors
- Connection or credential errors
Step 3: Review Scheduler Logs
# Check DAG processing
astro deployment logs <DEPLOYMENT_ID> --scheduler -c 30
Look for:
- DAG parse errors
- Scheduling delays
- Task queueing issues
Step 4: Check Worker Logs
# Check task execution
astro deployment logs <DEPLOYMENT_ID> --workers -c 30
Look for:
- Task execution failures
- Resource exhaustion
- Timeout errors
Step 5: Verify Configuration
# Check environment variables
astro deployment variable list --deployment-id <DEPLOYMENT_ID>
# Verify deployment settings
astro deployment inspect <DEPLOYMENT_ID>
Look for:
- Missing or incorrect environment variables
- Secrets configuration (AIRFLOW__SECRETS__BACKEND)
- Connection configuration
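The five steps above can be captured in one pass so the output is available for side-by-side review. The following is a sketch under stated assumptions: `collect_diagnostics`, the default directory name, and the file names are all hypothetical choices, and only the commands already shown in this workflow are used.

```shell
# Save the output of each workflow step to its own file for later review.
# collect_diagnostics and the file names are hypothetical choices.
collect_diagnostics() (
  deployment_id="${1:?usage: collect_diagnostics <DEPLOYMENT_ID> [OUT_DIR]}"
  out_dir="${2:-astro-diagnostics-$deployment_id}"
  mkdir -p "$out_dir" || exit 1

  # Steps 1-4: deployment details, then error, scheduler, and worker logs.
  astro deployment inspect "$deployment_id" > "$out_dir/1-inspect.txt" 2>&1
  astro deployment logs "$deployment_id" --error -c 50 > "$out_dir/2-errors.txt" 2>&1
  astro deployment logs "$deployment_id" --scheduler -c 30 > "$out_dir/3-scheduler.txt" 2>&1
  astro deployment logs "$deployment_id" --workers -c 30 > "$out_dir/4-workers.txt" 2>&1

  # Step 5: saved as-is, so the file may contain variable values.
  # Review it before sharing.
  astro deployment variable list --deployment-id "$deployment_id" \
    > "$out_dir/5-variables.txt" 2>&1

  echo "Saved diagnostics to $out_dir"
)
```

Each file maps to one step, so the "Look for" lists above can be worked through file by file.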
Common Investigation Patterns
Recurring DAG Failures
Follow the complete investigation workflow above, then narrow to the specific DAG:
astro deployment logs <DEPLOYMENT_ID> --keyword "my_dag_name" -c 100
Resource Issues
# 1. Check deployment resource allocation
astro deployment inspect <DEPLOYMENT_ID>
# Look for: resource_quota_cpu, resource_quota_memory
# Worker queue: max_worker_count, worker_type
# 2. Check for worker scaling issues
astro deployment logs <DEPLOYMENT_ID> --workers -c 50
# 3. Look for out-of-memory errors
astro deployment logs <DEPLOYMENT_ID> --error --keyword "memory"
Configuration Problems
# 1. Review environment variables
astro deployment variable list --deployment-id <DEPLOYMENT_ID>
# 2. Check for secrets backend configuration
# Look for: AIRFLOW__SECRETS__BACKEND, AIRFLOW__SECRETS__BACKEND_KWARGS
# 3. Verify deployment settings
astro deployment inspect <DEPLOYMENT_ID>
# 4. Check webserver logs for auth issues
astro deployment logs <DEPLOYMENT_ID> --webserver -c 30
Import Errors
# 1. Find import errors
astro deployment logs <DEPLOYMENT_ID> --error --keyword "ImportError"
# 2. Check scheduler for parse failures
astro deployment logs <DEPLOYMENT_ID> --scheduler --keyword "Failed to import" -c 50
# 3. Verify dependencies were deployed
astro deployment inspect <DEPLOYMENT_ID>
# Check: current_tag, last deployment timestamp
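A missing package usually surfaces in Python as `ModuleNotFoundError: No module named '...'`, which does not contain the literal text "ImportError", so searching for the message itself can catch cases the keyword above misses. The filter below is a small sketch that reads log lines on stdin and summarizes the missing modules; `missing_modules` is a hypothetical helper name, and it assumes the log lines carry Python's standard message unmodified.

```shell
# Read log lines on stdin and list each missing module once, with a count.
# Relies on Python's standard message: No module named 'package'.
# missing_modules is a hypothetical helper name.
#
# Usage:
#   astro deployment logs <DEPLOYMENT_ID> --keyword "No module named" | missing_modules
missing_modules() {
  sed -n "s/.*No module named '\([^']*\)'.*/\1/p" | sort | uniq -c | sort -rn
}
```

Any module that shows up here is a candidate to check against the project's dependency files before the next deploy.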
Environment Variables Management
List Variables
# List all variables for deployment
astro deployment variable list --deployment-id <DEPLOYMENT_ID>
# Find specific variable
astro deployment variable list --deployment-id <DEPLOYMENT_ID> --key AWS_REGION
# Export variables to file
astro deployment variable list --deployment-id <DEPLOYMENT_ID> --save --env .env.backup
Create Variables
# Create regular variable
astro deployment variable create --deployment-id <DEPLOYMENT_ID> \
--key API_ENDPOINT \
--value https://api.example.com
# Create secret (masked in UI and logs)
astro deployment variable create --deployment-id <DEPLOYMENT_ID> \
--key API_KEY \
--value secret123 \
--secret
Update Variables
# Update existing variable
astro deployment variable update --deployment-id <DEPLOYMENT_ID> \
--key API_KEY \
--value newsecret
Delete Variables
# Delete variable
astro deployment variable delete --deployment-id <DEPLOYMENT_ID> --key OLD_KEY
Note: Variables are available to DAGs as environment variables. Changes require no redeployment.
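After exporting variables with `--save --env`, the backup file can be checked for the keys a DAG depends on. This is a hedged sketch: it assumes the exported file holds one `KEY=VALUE` entry per line, and `missing_vars` is a hypothetical helper name.

```shell
# Report which required keys are absent from an exported .env file.
# Assumes the exported file has one KEY=VALUE entry per line.
# missing_vars is a hypothetical helper name.
#
# Usage:
#   astro deployment variable list --deployment-id <DEPLOYMENT_ID> --save --env .env.backup
#   missing_vars .env.backup AWS_REGION API_ENDPOINT
missing_vars() (
  env_file="${1:?usage: missing_vars <ENV_FILE> <KEY>...}"
  shift
  status=0
  for key in "$@"; do
    if ! grep -q "^${key}=" "$env_file"; then
      echo "missing: $key"
      status=1
    fi
  done
  # Non-zero exit status when at least one key is absent.
  exit "$status"
)
```

The non-zero exit status makes the check usable as a gate in a script, for example before running `astro deployment variable update`.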
Key Metrics from deployment inspect
Focus on these fields when troubleshooting:
- status: HEALTHY vs UNHEALTHY
- runtime_version: Airflow version compatibility
- scheduler_size/scheduler_count: Scheduler capacity
- executor: CELERY, KUBERNETES, or LOCAL
- worker_queues: Worker scaling limits and types
  - min_worker_count, max_worker_count
  - worker_concurrency
  - worker_type (resource class)
- resource_quota_cpu/memory: Overall resource limits
- dag_deploy_enabled: Whether DAG-only deploys work
- current_tag: Last deployment version
- is_high_availability: Redundancy enabled
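To skim only the fields listed above, the inspect output can be filtered by field name. This is a minimal sketch that assumes those names appear literally in the output, one per line; `inspect_summary` is a hypothetical helper name, and the pattern may also match other lines that happen to contain the same words.

```shell
# Show only the inspect lines that mention the fields listed above.
# Assumes the field names appear literally in the inspect output.
# inspect_summary is a hypothetical helper name.
inspect_summary() (
  deployment_id="${1:?usage: inspect_summary <DEPLOYMENT_ID>}"
  fields='status|runtime_version|scheduler_size|scheduler_count|executor'
  fields="$fields|min_worker_count|max_worker_count|worker_concurrency|worker_type"
  fields="$fields|resource_quota|dag_deploy_enabled|current_tag|is_high_availability"
  astro deployment inspect "$deployment_id" | grep -E "$fields"
)
```

Run it as `inspect_summary <DEPLOYMENT_ID>` and fall back to the full `inspect` output if a field is missing from the summary.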
Investigation Best Practices
- Always start with error logs - Most obvious failures appear here
- Check error logs for patterns - Same DAG failing repeatedly? Timing patterns?
- Component-specific troubleshooting:
  - Worker logs → task execution details
  - Scheduler logs → DAG processing and scheduling
  - Webserver logs → UI issues and health checks
  - Triggerer logs → deferrable operator issues
- Use --keyword for targeted searches - More efficient than reading all logs
- The inspect command is your health dashboard - Check it first
- Environment variables in inspect output - May reveal configuration issues
- Log count default is 500 - Adjust with -c based on needs
- Don't forget to check deployment time - A recent deploy might have introduced the issue
Troubleshooting Quick Reference
| Symptom | Command |
|---|---|
| Deployment shows UNHEALTHY | astro deployment inspect <ID> + --error logs |
| DAG not appearing | --error logs for import errors, check --scheduler logs |
| Tasks failing | --workers logs + search for DAG with --keyword |
| Slow scheduling | --scheduler logs + check inspect for scheduler resources |
| UI not responding | --webserver logs |
| Connection issues | Check variables, search logs for connection name |
| Import errors | --error --keyword "ImportError" + --scheduler logs |
| Out of memory | inspect for resources + --workers --keyword "memory" |
Related Skills
- managing-astro-deployments: Create, update, delete deployments, deploy code
- managing-astro-local-env: Manage local Airflow development environment
- setting-up-astro-project: Initialize and configure Astro projects