Launch GPU/TPU clusters, training jobs, and inference servers across 25+ clouds using SkyPilot. Deploy to Kubernetes pods and Slurm jobs; debug YAML configs and optimize costs in your AI workflow.
SkyPilot is a system to run, manage, and scale AI workloads on any AI infrastructure.
SkyPilot gives AI teams a simple interface to run jobs on any infra.
Infra teams get a unified control plane to manage any AI compute — with advanced scheduling, scaling, and orchestration.
:fire: News :fire:
[Mar 2026] Scaling Karpathy's Autoresearch: Autoresearch runs 1 experiment at a time. We gave it 16 GPUs and let it run in parallel: blog, HackerNews
[Mar 2026] SkyPilot Agent Skills: GPU access and job management for AI agents: docs
[Jan 2026] Shopify case study: Shopify runs all AI training workloads on SkyPilot: case study
[Dec 2025] SkyPilot v0.11 released: Multi-Cloud Pools, Fast Managed Jobs, Enterprise-Readiness at Large Scale, Programmability: release notes
[Dec 2025] Train an agent to use Google Search as a tool with RL on your Kubernetes or clouds: blog, example
[Oct 2025] Run RL training for LLMs with SkyRL on your Kubernetes or clouds: example
Overview
SkyPilot is easy to use for AI teams:
Quickly spin up compute on your own infra
Environment and job as code — simple and portable
Easy job management: queue, run, and auto-recover many jobs
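To make "environment and job as code" concrete, here is a minimal sketch that describes one job as plain data and prints it in the shape of a task YAML file. The field names (`resources`, `accelerators`, `num_nodes`, `workdir`, `setup`, `run`) follow SkyPilot's task YAML. The helper functions `make_task` and `to_yaml`, the job name, and the commands are hypothetical and only for illustration.

```python
# Sketch of "environment and job as code": a task described as plain data.
# Field names mirror SkyPilot's task YAML; make_task/to_yaml and all values
# below are hypothetical, not part of SkyPilot itself.

def make_task(name, accelerators, setup, run, num_nodes=1, workdir="."):
    """Return a dict laid out like a SkyPilot task YAML file."""
    return {
        "name": name,
        "resources": {"accelerators": accelerators},  # e.g. "A100:8" = 8x A100
        "num_nodes": num_nodes,
        "workdir": workdir,  # local directory synced to the cluster
        "setup": setup,      # commands that prepare the environment
        "run": run,          # commands that make up the job
    }


def to_yaml(value, indent=0):
    """Tiny YAML emitter for nested dicts of simple strings and ints.

    Illustration only: it does not quote or escape special characters.
    """
    pad = " " * indent
    lines = []
    for key, item in value.items():
        if isinstance(item, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(item, indent + 2))
        elif isinstance(item, str) and "\n" in item:
            # Multi-line commands become a YAML literal block.
            lines.append(f"{pad}{key}: |")
            lines.extend(f"{pad}  {row}" for row in item.splitlines())
        else:
            lines.append(f"{pad}{key}: {item}")
    return "\n".join(lines)


task = make_task(
    name="finetune",
    accelerators="A100:8",
    setup="pip install -r requirements.txt",
    run="python train.py --epochs 3",
)
print(to_yaml(task))
```

Saved to a file such as `task.yaml`, a spec like this is what `sky launch task.yaml` consumes. Because the environment and the job live in one file, the same spec can be checked into version control and reused across infra.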
SkyPilot makes Kubernetes easy for AI & Infra teams:
Slurm-like ease of use, cloud-native robustness
Local dev experience on K8s: SSH into pods, sync code, or connect your IDE
Turbocharge your clusters: gang scheduling, multi-cluster, and scaling
SkyPilot unifies multiple clusters, clouds, and hardware:
One interface to use reserved GPUs, Kubernetes clusters, Slurm clusters, or 20+ clouds