From oraclecloud-pack
Runs OCI incident runbook with independent health probes, automated instance recovery, and cross-AD/region failover for outages.
Install:

npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin oraclecloud-pack
Self-service runbook for when OCI instances go down and the status page stays green. OCI's status page has a history of not acknowledging outages in real time (London Jan 2026 — 502s and instances disappearing for 10 minutes with no status update). OCI Support response times average 4+ hours for Sev-1 tickets. This runbook gives you health probes, automated instance recovery, cross-AD failover, and cross-region failover — all executable without waiting on Oracle.
Purpose: Detect OCI service degradation independently, recover instances automatically, and fail over to alternate availability domains or regions when the primary is impacted.
Prerequisites:

- ~/.oci/config validated (see oraclecloud-install-auth)
- pip install oci
- IAM permissions to manage instances, manage volumes, and inspect work-requests in the target compartment

Do not trust the OCI status page alone. Run your own health checks against the OCI API:
import oci
import time

config = oci.config.from_file("~/.oci/config")

def probe_oci_health(config):
    """Probe OCI API endpoints independently of the status page."""
    results = {}

    # Probe 1: Identity service (lightest call)
    try:
        start = time.time()
        identity = oci.identity.IdentityClient(config)
        identity.list_regions()
        results["identity"] = {"status": "healthy", "latency_ms": int((time.time() - start) * 1000)}
    except oci.exceptions.ServiceError as e:
        results["identity"] = {"status": "degraded", "error": str(e.status)}
    except Exception as e:  # timeout / connection failure — endpoint unreachable
        results["identity"] = {"status": "unreachable", "error": type(e).__name__}

    # Probe 2: Compute service
    try:
        start = time.time()
        compute = oci.core.ComputeClient(config)
        compute.list_instances(compartment_id=config["tenancy"], limit=1)
        results["compute"] = {"status": "healthy", "latency_ms": int((time.time() - start) * 1000)}
    except oci.exceptions.ServiceError as e:
        results["compute"] = {"status": "degraded", "error": str(e.status)}
    except Exception as e:
        results["compute"] = {"status": "unreachable", "error": type(e).__name__}

    # Probe 3: Networking service
    try:
        start = time.time()
        network = oci.core.VirtualNetworkClient(config)
        network.list_vcns(compartment_id=config["tenancy"], limit=1)
        results["networking"] = {"status": "healthy", "latency_ms": int((time.time() - start) * 1000)}
    except oci.exceptions.ServiceError as e:
        results["networking"] = {"status": "degraded", "error": str(e.status)}
    except Exception as e:
        results["networking"] = {"status": "unreachable", "error": type(e).__name__}

    return results

health = probe_oci_health(config)
for service, status in health.items():
    print(f"  {service}: {status['status']} ({status.get('latency_ms', 'N/A')}ms)")
| Level | Condition | Detection | Response Time |
|---|---|---|---|
| P1 — Total Outage | All API probes fail, instances unreachable | All 3 probes return degraded | Immediate — trigger cross-region failover |
| P2 — Partial Degradation | Some APIs slow (>2s latency), instance agent unresponsive | Latency >2000ms on any probe | 15 min — attempt instance recovery, prepare failover |
| P3 — Intermittent | Sporadic 429/500 errors, some requests succeed | Occasional probe failures | 1 hour — enable retries, monitor trend |
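The escalation matrix above can be applied mechanically to one round of probe results. A minimal sketch — the `classify_severity` name and its return values are our own, not part of the pack; the thresholds mirror the table:

```python
def classify_severity(health):
    """Map one round of probe_oci_health results to an incident level per the matrix.

    P3 (intermittent) requires trend data across repeated runs, so a single
    clean snapshot returns "OK"; track consecutive results to spot sporadic failures.
    """
    statuses = [r["status"] for r in health.values()]
    slow = any(r.get("latency_ms", 0) > 2000 for r in health.values())
    if statuses and all(s != "healthy" for s in statuses):
        return "P1"  # all probes failing — trigger cross-region failover
    if any(s != "healthy" for s in statuses) or slow:
        return "P2"  # some APIs failing or >2000 ms latency — attempt recovery
    return "OK"
```

Run it in a loop alongside the probes; two or more non-"OK" results in an hour is a reasonable proxy for P3.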
# Check current instance state
oci compute instance get --instance-id "$INSTANCE_OCID" \
  --query 'data."lifecycle-state"' --raw-output

# Action: Reset (hard reboot) — fastest recovery for hung instances
oci compute instance action --instance-id "$INSTANCE_OCID" \
  --action RESET

# Action: Reboot (graceful) — for instances still responding to ACPI
oci compute instance action --instance-id "$INSTANCE_OCID" \
  --action SOFTRESET

# Action: Stop then Start — forces reallocation to new host hardware
oci compute instance action --instance-id "$INSTANCE_OCID" \
  --action STOP
# Wait for STOPPED state
oci compute instance action --instance-id "$INSTANCE_OCID" \
  --action START
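The "wait for STOPPED" step is worth scripting rather than eyeballing. A generic polling helper — `poll_until` is our own sketch, not an SDK function; in practice `get_state` would be a lambda wrapping `ComputeClient.get_instance`:

```python
import time

def poll_until(get_state, target, interval=5, timeout=300):
    """Call get_state() every `interval` seconds until it returns `target`.

    Raises TimeoutError if `timeout` seconds elapse first. Typical use:
    poll_until(lambda: compute.get_instance(ocid).data.lifecycle_state, "STOPPED")
    """
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state == target:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"state stuck at {state!r}, wanted {target!r}")
        time.sleep(interval)
```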
When an entire availability domain is impacted, launch a replacement instance in a different AD using the latest boot volume backup:
import oci

config = oci.config.from_file("~/.oci/config")
compute = oci.core.ComputeClient(config)
blockstorage = oci.core.BlockstorageClient(config)

# Find latest boot volume backup
backups = blockstorage.list_boot_volume_backups(
    compartment_id="COMPARTMENT_OCID",
    boot_volume_id="BOOT_VOLUME_OCID",
    sort_by="TIMECREATED",
    sort_order="DESC",
    limit=1,
).data
if not backups:
    raise RuntimeError("No boot volume backups found — cannot failover")
latest_backup = backups[0]
print(f"Using backup: {latest_backup.id} from {latest_backup.time_created}")

# Restore the backup to a new boot volume in the alternate AD.
# (launch_instance needs a boot volume OCID — a backup OCID cannot be used directly.)
new_boot_volume = blockstorage.create_boot_volume(
    oci.core.models.CreateBootVolumeDetails(
        compartment_id="COMPARTMENT_OCID",
        availability_domain="AD-2",  # Different from the impacted AD
        source_details=oci.core.models.BootVolumeSourceFromBootVolumeBackupDetails(
            id=latest_backup.id,
        ),
    )
).data
oci.wait_until(blockstorage, blockstorage.get_boot_volume(new_boot_volume.id),
               "lifecycle_state", "AVAILABLE")

# Launch replacement instance in the alternate AD from the restored volume
launch_details = oci.core.models.LaunchInstanceDetails(
    availability_domain="AD-2",
    compartment_id="COMPARTMENT_OCID",
    shape="VM.Standard.E4.Flex",
    shape_config=oci.core.models.LaunchInstanceShapeConfigDetails(ocpus=2, memory_in_gbs=16),
    display_name="failover-instance",
    source_details=oci.core.models.InstanceSourceViaBootVolumeDetails(
        boot_volume_id=new_boot_volume.id,
    ),
    create_vnic_details=oci.core.models.CreateVnicDetails(
        subnet_id="SUBNET_OCID_IN_AD2",
    ),
)
response = compute.launch_instance(launch_details)
print(f"Failover instance launching: {response.data.id}")
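Rather than hard-coding "AD-2", the alternate AD can be picked from the tenancy's AD list (the `.name` fields returned by `IdentityClient.list_availability_domains`). A sketch of just the selection logic, with the OCI call stubbed out as an input; the helper name is our own:

```python
def pick_alternate_ad(ad_names, impacted_ad):
    """Return the first availability domain name that is not the impacted one.

    `ad_names` would come from IdentityClient.list_availability_domains(),
    e.g. ["Uocm:PHX-AD-1", "Uocm:PHX-AD-2", "Uocm:PHX-AD-3"].
    """
    for name in ad_names:
        if name != impacted_ad:
            return name
    raise RuntimeError("No alternate AD in this region — escalate to cross-region failover")
```

Single-AD regions will hit the RuntimeError branch, which is exactly when the cross-region path below applies.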
For region-wide outages, use a boot volume backup that was copied to the DR region ahead of time:

# Find the newest boot volume backup copy in the DR region
oci bv boot-volume-backup list \
  --compartment-id "$COMPARTMENT_OCID" \
  --sort-by TIMECREATED --sort-order DESC \
  --query 'data[0].id' --raw-output \
  --region us-phoenix-1

# Restore that backup to a boot volume in the DR region first
# (--source-boot-volume-id expects a boot volume OCID, not a backup OCID),
# then launch the instance from the restored volume:
oci compute instance launch \
  --compartment-id "$COMPARTMENT_OCID" \
  --availability-domain "AD-1" \
  --shape "VM.Standard.E4.Flex" \
  --shape-config '{"ocpus":2,"memoryInGBs":16}' \
  --display-name "dr-failover-instance" \
  --source-boot-volume-id "$DR_BOOT_VOLUME_OCID" \
  --subnet-id "$DR_SUBNET_OCID" \
  --region us-phoenix-1
Run this in a loop during an incident to detect recovery:
#!/bin/bash
while true; do
  STATUS=$(oci compute instance get --instance-id "$INSTANCE_OCID" \
    --query 'data."lifecycle-state"' --raw-output 2>&1)
  TIMESTAMP=$(date -u +%H:%M:%S)
  echo "$TIMESTAMP | Instance: $STATUS"
  if [ "$STATUS" = "RUNNING" ]; then
    echo "$TIMESTAMP | Instance recovered."
    break
  fi
  sleep 30
done
Successful execution produces output like:

14:02:17 | Instance: STOPPED
14:02:47 | Instance: STARTING
14:03:17 | Instance: RUNNING
14:03:17 | Instance recovered.
| Error | Code | Cause | Solution |
|---|---|---|---|
| NotAuthenticated | 401 | API key expired during incident | Use oci session authenticate for token-based short-term auth |
| NotAuthorizedOrNotFound | 404 | IAM policy missing for recovery actions | Pre-create policy: allow group sre to manage instances in compartment prod |
| TooManyRequests | 429 | Rate limited during mass recovery | OCI has no Retry-After header — back off 30 seconds between calls |
| InternalError | 500 | OCI service itself is degraded | This confirms the outage — proceed to cross-region failover |
| No boot volume backups | — | Backups not configured | Cannot failover without backups — enable in Block Storage > Boot Volumes > Backup Policies |
| ServiceError status -1 | — | API timeout (region unreachable) | Switch to DR region immediately: --region us-phoenix-1 |
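Because OCI's 429 responses carry no Retry-After header, the fixed 30-second backoff from the table has to be applied client-side. A minimal sketch — `call_with_backoff` is our own helper, not an SDK utility; in real use the caught exception would be `oci.exceptions.ServiceError`, which exposes the HTTP code as `.status`:

```python
import time

def call_with_backoff(fn, retries=5, delay=30, sleep=time.sleep):
    """Retry fn() on HTTP 429, waiting `delay` seconds between attempts.

    Treats any exception with a .status attribute equal to 429 as throttling;
    everything else (and the final failed attempt) re-raises.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception as e:
            if getattr(e, "status", None) != 429 or attempt == retries - 1:
                raise
            sleep(delay)  # no Retry-After from OCI — fixed backoff
```

Wrap the mass-recovery calls (instance_action, launch_instance) in this during an incident; the `sleep` parameter exists so tests can stub the wait out.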
Quick health check (CLI one-liner):
# Test if OCI API is responsive
oci iam region list --output table && echo "OCI API: OK" || echo "OCI API: UNREACHABLE"
Automated recovery wrapper:
#!/bin/bash
set -euo pipefail
INSTANCE_OCID="${1:?Usage: recover.sh <instance-ocid>}"
STATE=$(oci compute instance get --instance-id "$INSTANCE_OCID" \
  --query 'data."lifecycle-state"' --raw-output)
case "$STATE" in
  RUNNING) echo "Instance is healthy" ;;
  STOPPED) oci compute instance action --instance-id "$INSTANCE_OCID" --action START ;;
  *)       oci compute instance action --instance-id "$INSTANCE_OCID" --action RESET ;;
esac
After stabilizing the incident, run oraclecloud-debug-bundle to collect forensic data for the postmortem, or review oraclecloud-prod-checklist to harden your environment against future outages.