Execute disaster recovery drill - destroy and recreate cluster from specs
Executes full cluster destruction and recreation from declarative specs for disaster recovery drills.
/plugin marketplace add basher83/lunar-claude
/plugin install omni-scale@lunar-claude
WARNING: This destroys the entire cluster. Only run for DR drills or actual recovery.
NOTE: Paths below reference user-specific locations. Consider setting $OMNI_SCALE_ROOT if needed.
Requires Infisical credentials ($INFISICAL_CLIENT_ID, $INFISICAL_CLIENT_SECRET). Verify the infrastructure provider is registered:
omnictl get infraproviders
Use AskUserQuestion to confirm:
header: "DR Drill"
question: "This will DESTROY talos-prod-01 and all workloads. Continue?"
options:
- "Yes, execute DR drill"
- "No, abort"
If user selects abort, stop immediately.
mcp__plugin_omni-scale_kubernetes__kubectl_get(resourceType: "nodes")
mcp__plugin_omni-scale_kubernetes__kubectl_get(resourceType: "pv")
omnictl get machines --cluster talos-prod-01
WARNING: No rollback after this phase. Cluster will be destroyed.
omnictl cluster template delete -f ~/dev/infra-as-code/Omni-Scale/clusters/talos-prod-01.yaml --destroy-disconnected-machines
Poll loop:
```bash
# Check status
ssh foxtrot "pvesh get /cluster/resources --type vm --output-format json" | jq -r '.[] | select(.name | startswith("talos")) | .name'
# REQUIRED - do not skip or reduce
sleep 30
```
Repeat until no talos VMs remain. Max wait: 10 min.
After 3 attempts at 30s intervals, INCREASE to 60s. Do not decrease below 30s.
If timeout exceeded, check failure scenarios below.
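The destruction poll above can be sketched as a shell helper. The backoff schedule (30s for the first three attempts, then 60s), the 10-minute cap, the host name (foxtrot), and the talos VM prefix all come from this runbook; the function names are illustrative:

```bash
#!/usr/bin/env bash
# Backoff schedule from the runbook: 30s for the first 3 attempts, then 60s.
next_interval() { if (( $1 <= 3 )); then echo 30; else echo 60; fi; }

# Poll until no talos-prefixed VMs remain on the Proxmox host, or until the
# 10-minute deadline passes. Illustrative sketch, not the command's own code.
poll_until_destroyed() {
  local attempt=0 deadline=$((SECONDS + 600))
  while (( SECONDS < deadline )); do
    remaining=$(ssh foxtrot \
      "pvesh get /cluster/resources --type vm --output-format json" |
      jq -r '.[] | select(.name | startswith("talos")) | .name')
    [[ -z "$remaining" ]] && return 0          # destruction complete
    attempt=$((attempt + 1))
    sleep "$(next_interval "$attempt")"        # REQUIRED - never below 30s
  done
  return 1                                     # timeout: see failure scenarios
}
```

The same pattern applies to the provisioning and API-readiness polls below, with the deadline adjusted to each phase's max wait.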
Use AskUserQuestion to confirm destruction complete before proceeding.
Apply cluster template:
omnictl cluster template sync -f ~/dev/infra-as-code/Omni-Scale/clusters/talos-prod-01.yaml
Poll loop:
```bash
# Check status
omnictl get machines --cluster talos-prod-01
# REQUIRED - do not skip or reduce
sleep 30
```
Repeat until all nodes reach running state. Max wait: 20 min.
After 3 attempts at 30s intervals, INCREASE to 60s. Do not decrease below 30s.
If timeout exceeded, check failure scenarios below.
Poll loop:
```bash
# Check status
omnictl kubeconfig talos-prod-01 --merge
kubectl get nodes
# REQUIRED - do not skip or reduce
sleep 30
```
Repeat until the API responds. Max wait: 10 min.
After 3 attempts at 30s intervals, INCREASE to 60s. Do not decrease below 30s.
If timeout exceeded, check failure scenarios below.
Use AskUserQuestion to confirm cluster is ready before GitOps bootstrap.
mcp__plugin_omni-scale_kubernetes__kubectl_get(resourceType: "nodes")
All nodes should show Ready.
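The Ready check can be scripted as a small filter over `kubectl get nodes --no-headers` output (the STATUS column is field 2); the function name is illustrative:

```bash
# Print the names of nodes whose STATUS column is anything other than Ready.
# Feed it `kubectl get nodes --no-headers`; an empty result means all Ready.
not_ready_nodes() { awk '$2 != "Ready" { print $1 }'; }

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
```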
Follow /omni-scale:bootstrap-gitops procedure:
Poll loop:
```
mcp__plugin_omni-scale_kubernetes__kubectl_get(resourceType: "applications", namespace: "argocd")
# REQUIRED - do not skip or reduce
sleep 30
```
Repeat until all apps show Synced/Healthy. Max wait: 20 min.
After 3 attempts at 30s intervals, INCREASE to 60s. Do not decrease below 30s.
If timeout exceeded, check failure scenarios below.
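The Synced/Healthy condition can also be checked non-interactively with `kubectl` and `jq` over the standard ArgoCD Application status fields (`.status.sync.status`, `.status.health.status`); the function name is illustrative:

```bash
# List ArgoCD Applications that are not both Synced and Healthy.
# Reads `kubectl get applications -n argocd -o json` on stdin.
unsynced_apps() {
  jq -r '.items[]
    | select(.status.sync.status != "Synced"
             or .status.health.status != "Healthy")
    | .metadata.name'
}

# Usage:
#   kubectl get applications -n argocd -o json | unsynced_apps
```

An empty result means every app is Synced and Healthy and the bootstrap phase is done.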
Manually sync ArgoCD HA after Longhorn is healthy.
Run /omni-scale:status to confirm full recovery.
Compare against pre-DR state if captured.
| Phase | Duration |
|---|---|
| Cluster destruction | 2-5 min |
| VM provisioning | 10-15 min |
| Kubernetes ready | 5-10 min |
| GitOps bootstrap | 10-15 min |
| Total | 30-45 min |
VMs not destroying:
VMs not provisioning:
scripts/provider-ctl.py --logs 50
Provider wrong image tag:
Check the provider image uses the :local-fix tag, not :latest.
Nodes not joining:
GitOps apps failing:
Longhorn volumes not attaching: