From the `azure` plugin
Sets up AI Runway on AKS clusters from bare setup to running model, covering cluster verification, controller install, GPU assessment, provider setup, and first deployment.
`npx claudepluginhub anthropics/claude-plugins-official --plugin azure`

This skill uses the workspace's default tool permissions.
This skill walks users from a bare Kubernetes cluster to a running AI model deployment. Follow each step in sequence unless the user provides `skip-to-step N` to resume from a specific phase.
References:

- references/gpu-profiles.md
- references/model-sizing.md
- references/powershell-notes.md
- references/steps/step-1-verify.md
- references/steps/step-2-controller.md
- references/steps/step-3-gpu.md
- references/steps/step-4-provider.md
- references/steps/step-5-deploy.md
- references/steps/step-6-summary.md
- references/troubleshooting.md
Cost awareness: GPU node pools incur significant compute charges (A100-80GB can cost $3–5+/hr). Confirm the user understands cost implications before provisioning GPU resources.
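Before provisioning anything, it can help to see what node pools (and VM sizes) the cluster already runs, since GPU SKUs dominate the bill. A minimal sketch, assuming the Azure CLI is installed and `<resource-group>`/`<cluster-name>` are placeholders for your cluster:

```shell
# Show each node pool's VM size and node count so GPU cost exposure is visible
# before adding more capacity. Replace the placeholders with real values.
az aks nodepool list \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --query "[].{name:name, vmSize:vmSize, count:count, mode:mode}" \
  -o table
```

Any pool whose `vmSize` starts with `Standard_NC` or `Standard_ND` is GPU-backed and worth confirming with the user before scaling up.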
This skill assumes an AKS cluster already exists. If the user does not have a cluster, hand off to the azure-kubernetes skill first to provision one (with a GPU node pool unless CPU-only inference is acceptable), then return here.
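A quick way to check whether a usable cluster connection already exists before handing off; a sketch assuming the Azure CLI and kubectl are installed, with `<resource-group>` and `<cluster-name>` as placeholders:

```shell
# Pull kubeconfig credentials for the existing AKS cluster, then confirm
# the API server is reachable. If either step fails, hand off to the
# azure-kubernetes skill to provision a cluster first.
az aks get-credentials --resource-group <resource-group> --name <cluster-name>
kubectl cluster-info
kubectl get nodes
```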
| Property | Value |
|---|---|
| Best for | End-to-end AI Runway onboarding on AKS |
| CLI tools | kubectl, make, curl |
| MCP tools | None |
| Related skills | azure-kubernetes (cluster setup), azure-diagnostics (troubleshooting) |
Use this skill when the user wants to:

- Set up AI Runway on an AKS cluster end to end
- Install or verify the AI Runway controller and CRDs
- Assess GPU capacity and choose an inference provider
- Deploy and smoke-test a first model
This skill uses no MCP tools. All cluster operations are performed directly via kubectl and make.
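Since everything runs through kubectl, the cluster-verification and GPU-assessment phases reduce to plain read-only commands. A sketch of the kind of inventory check involved (the `nvidia.com/gpu` resource name is the standard one exposed by the NVIDIA device plugin; clusters without it will show `<none>`):

```shell
# Confirm the active context, then list each node with its allocatable
# GPU count. Nodes showing <none> have no schedulable GPUs.
kubectl config current-context
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```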
| Argument | Behavior |
|---|---|
| `skip-to-step N` | Start at step N; assume prior steps are complete |

| # | Step | Reference |
|---|---|---|
| 1 | Cluster Verification — context check, node inventory, GPU detection | step-1-verify.md |
| 2 | Controller Installation — CRD + controller deployment | step-2-controller.md |
| 3 | GPU Assessment — detect GPU models, flag dtype/attention constraints | step-3-gpu.md |
| 4 | Provider Setup — recommend and install inference provider | step-4-provider.md |
| 5 | First Deployment — pick a model, deploy, verify Ready | step-5-deploy.md |
| 6 | Summary — recap, smoke test, next steps | step-6-summary.md |
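Once steps 1–5 have run, verifying the deployment typically looks like the sketch below. This assumes the ModelDeployment CRD exposes a `Ready` condition (as step 5 implies); the resource and namespace names are placeholders:

```shell
# Confirm the CRD is installed, list deployments across namespaces, then
# block until the named deployment reports Ready (or the timeout expires).
kubectl get crd | grep -i modeldeployment
kubectl get modeldeployment -A
kubectl wait --for=condition=Ready modeldeployment/<name> -n <namespace> --timeout=10m
```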
| Error / Symptom | Likely Cause | Remediation |
|---|---|---|
| No kubeconfig context | Not connected to a cluster | Run `az aks get-credentials` or equivalent |
| Controller in CrashLoopBackOff | Config or RBAC issue | `kubectl logs -n airunway-system -l control-plane=controller-manager --previous` |
| Provider not ready | Image pull or RBAC issue | `kubectl logs <pod-name> -n <namespace>` for the provider pod |
| ModelDeployment stuck in Pending | GPU scheduling failure or provider not ready | `kubectl describe modeldeployment <name> -n <namespace>` and inspect the Events section |
| bfloat16 errors at inference | T4 or V100 lacks bfloat16 support | Add `--dtype float16` to the serving args |
For full error handling and rollback procedures, see `troubleshooting.md`.
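The remediation column above can be collapsed into one triage pass; a sketch, with `<name>` and `<namespace>` as placeholders for the stuck deployment:

```shell
# Walk the failure modes in order: context, controller health, controller
# logs from the last crash, then the deployment's own status and events.
kubectl config current-context || echo "no kubeconfig context; run az aks get-credentials"
kubectl get pods -n airunway-system
kubectl logs -n airunway-system -l control-plane=controller-manager --previous --tail=50
kubectl describe modeldeployment <name> -n <namespace>
```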