Packages and builds custom AI models with Cog for Replicate deployment. Creates cog.yaml and predict.py, builds Docker images, handles GPU/CUDA setup, and ports Hugging Face models.
npx claudepluginhub replicate/skills --plugin prompt-videos
This skill uses the workspace's default tool permissions.
- Cog reference (single file): <https://cog.run/llms.txt>
Publishes custom AI models to Replicate using cog push and cog-safe-push, validates with schema checks, tests, fuzzing, and sets up GitHub Actions CI/CD for safe releases.
Manages Hugging Face Hub via CLI: download/upload models/datasets/spaces/repos, handle auth/cache/buckets/jobs/webhooks/inference endpoints. For HF ecosystem/AI/ML tasks.
Deploys trained ML models to production via REST APIs, Docker containers, and Kubernetes clusters, with data validation, error handling, and performance monitoring.
- cog.yaml reference: <https://cog.run/yaml>
Most of the work happens in cog.yaml, predict.py, or train.py. For publishing and running models, see the publish-models and run-models skills. Install Cog with brew install replicate/tap/cog or sh <(curl -fsSL https://cog.run/install.sh), then run cog init to scaffold cog.yaml and predict.py.
The canonical Replicate model layout:
cog.yaml
predict.py
weights.py # optional download helpers
requirements.txt
cog-safe-push-configs/
default.yaml # see publish-models skill
.github/workflows/
ci.yaml
script/ # github.com/github/scripts-to-rule-them-all
lint
test
push
A modern config for a GPU model:
build:
gpu: true
cuda: "12.8"
python_version: "3.12"
python_requirements: requirements.txt
system_packages:
- libgl1
- libglib2.0-0
predict: predict.py:Predictor
Notes:
- Pin exact versions in requirements.txt. Floating versions break cold boots (a pinned sketch follows the concurrency snippet below).
- Prefer python_requirements over inline python_packages once the list grows.
- cuda follows your torch wheel (e.g. 12.8 paired with torch==2.7.1+cu128).
- Add train: train.py:train if your model is fine-tunable.
- Set image: r8.im/owner/name to enable bare cog push.
For async predictors with continuous batching:
concurrency:
max: 32
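A fully pinned requirements.txt for a config like the one above might look like this; the package versions here are illustrative, not recommendations:
# requirements.txt -- pin everything so builds and cold boots stay reproducible
torch==2.7.1
torchvision==0.22.1
diffusers==0.33.1
transformers==4.51.3
numpy==1.26.4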
from cog import BasePredictor, Input, Path
class Predictor(BasePredictor):
def setup(self) -> None:
"""One-time loads. Heavy work goes here, not in predict()."""
self.model = load_model("weights/")
def predict(
self,
prompt: str = Input(description="Text prompt for generation"),
seed: int = Input(description="Random seed; leave blank for random", default=None),
num_steps: int = Input(description="Number of denoising steps", ge=1, le=50, default=20),
output_format: str = Input(description="Output image format", choices=["webp", "jpg", "png"], default="webp"),
) -> Path:
"""Run a single prediction."""
if not prompt.strip():
raise ValueError("prompt cannot be empty")
out = self.model.generate(prompt, seed=seed, steps=num_steps)
return Path(out)
Input rules:
- Every input needs a description. The description shows up in the model schema and on Replicate's web UI.
- Use ge/le for numeric bounds, choices=[...] for enums, regex= for strings.
- Use cog.Path for file inputs and outputs, never raw bytes.
- Use cog.Secret for any token-like input (HF tokens, API keys), never plain str.
- Prefer choices for categorical inputs.
- Validate inputs at the top of predict() and raise ValueError.
Streaming text output (for LLMs):
from cog import BasePredictor, Input, ConcatenateIterator
class Predictor(BasePredictor):
def predict(self, prompt: str = Input(description="Prompt")) -> ConcatenateIterator[str]:
for token in self.model.stream(prompt):
yield token
Async predictor with continuous batching (paired with concurrency.max in cog.yaml):
from cog import BasePredictor, Input, AsyncConcatenateIterator
class Predictor(BasePredictor):
async def setup(self) -> None:
self.engine = await load_async_engine()
async def predict(
self,
prompt: str = Input(description="Prompt"),
) -> AsyncConcatenateIterator[str]:
async for token in self.engine.generate(prompt):
yield token
Dynamic choices from on-disk assets (e.g. a voices/ directory of audio samples):
from pathlib import Path as _P
AVAILABLE_VOICES = sorted(p.stem for p in _P("voices").glob("*.wav"))
class Predictor(BasePredictor):
def predict(
self,
speaker: str = Input(description="Voice", choices=AVAILABLE_VOICES, default=AVAILABLE_VOICES[0]),
) -> Path: ...
Cold boot dominates user-perceived latency. Three patterns, ranked by simplicity:
Bake weights into the image. Best for small or medium weights (< 5GB) that you want zero-cold-boot for.
For torchvision:
import os
os.environ["TORCH_HOME"] = "." # set before importing torch
import torch
from torchvision import models
For HuggingFace:
import os
os.environ["HF_HUB_CACHE"] = "./.cache"
os.environ["HF_XET_HIGH_PERFORMANCE"] = "1"
Then download once during cog build (e.g. in a run: step or by running a small fetcher script as part of the build). The weights become part of the image layer.
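A minimal fetcher along those lines, runnable from a run: step; the repo id and cache path are placeholders, not part of the original:
# fetch_weights.py -- run once during `cog build` so the weights land in an image layer
import os

os.environ["HF_HUB_CACHE"] = "./.cache"  # must be set before importing huggingface_hub

from huggingface_hub import snapshot_download

snapshot_download("your-org/your-model")  # placeholder repo id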
weights.replicate.delivery with pget
Best for large weights, or when you want to share weights across multiple models. pget is Replicate's parallel HTTP fetcher.
In cog.yaml:
build:
run:
- curl -o /usr/local/bin/pget -L "https://github.com/replicate/pget/releases/download/v0.8.2/pget_linux_x86_64"
- chmod +x /usr/local/bin/pget
In setup():
import subprocess
from pathlib import Path
WEIGHTS_URL = "https://weights.replicate.delivery/default/my-model/weights.tar"
WEIGHTS_DIR = Path("weights")
class Predictor(BasePredictor):
def setup(self) -> None:
if not WEIGHTS_DIR.exists():
# -x extracts tar in-memory; default concurrency is 4 * NumCPU
subprocess.check_call(["pget", "-x", WEIGHTS_URL, str(WEIGHTS_DIR)])
self.model = load_from(WEIGHTS_DIR)
For multiple files in one shot:
base = "https://weights.replicate.delivery/default/my-model"  # base URL for the individual files
manifest = "\n".join([
f"{base}/unet.safetensors weights/unet.safetensors",
f"{base}/vae.safetensors weights/vae.safetensors",
f"{base}/text_encoder.safetensors weights/text_encoder.safetensors",
])
subprocess.run(["pget", "multifile", "-"], input=manifest, text=True, check=True)
Download from the Hugging Face Hub: set HF_HUB_ENABLE_HF_TRANSFER=1 and use huggingface_hub.snapshot_download or from_pretrained. Faster than vanilla HF downloads. Use a cog.Secret input for gated models.
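A sketch of accepting a token as cog.Secret and using it for a gated download; the input name and repo id are illustrative:
from cog import BasePredictor, Input, Secret
from huggingface_hub import snapshot_download

class Predictor(BasePredictor):
    def predict(
        self,
        hf_token: Secret = Input(description="Hugging Face token for gated repos"),
    ) -> str:
        # Unwrap the secret only at the point of use.
        weights_dir = snapshot_download(
            "your-org/gated-model",  # placeholder repo id
            token=hf_token.get_secret_value(),
        )
        return weights_dir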
For LoRAs or any weights URL the user passes at predict time, use a sha256-keyed disk cache with LRU eviction:
import hashlib, shutil, subprocess
from pathlib import Path
class WeightsDownloadCache:
def __init__(self, cache_dir: str = "/tmp/weights-cache", min_disk_free_gb: int = 10):
self.cache_dir = Path(cache_dir)
self.cache_dir.mkdir(parents=True, exist_ok=True)
self.min_disk_free = min_disk_free_gb * 1024**3
def ensure(self, url: str) -> Path:
key = hashlib.sha256(url.encode()).hexdigest()
target = self.cache_dir / key
if target.exists():
target.touch() # bump LRU mtime
return target
self._evict_until_room()
subprocess.check_call(["pget", url, str(target)])
return target
def _evict_until_room(self) -> None:
while shutil.disk_usage(self.cache_dir).free < self.min_disk_free:
entries = sorted(self.cache_dir.iterdir(), key=lambda p: p.stat().st_mtime)
if not entries:
return
entries[0].unlink()
See replicate/cog-flux/weights.py for a production version that handles HF, CivitAI, Replicate, and arbitrary .safetensors URLs.
Reload only when the URL changes; compose two LoRAs with separate scales:
class Predictor(BasePredictor):
    def setup(self) -> None:
        self.pipe = load_base_pipeline()
        self.cache = WeightsDownloadCache()  # sha256-keyed cache from the section above
        self.loaded = {"main": None, "extra": None}
def _ensure_lora(self, slot: str, url: str | None) -> None:
if url == self.loaded[slot]:
return
if self.loaded[slot] is not None:
self.pipe.unload_lora_weights(adapter_name=slot)
if url:
path = self.cache.ensure(url)
self.pipe.load_lora_weights(str(path), adapter_name=slot)
self.loaded[slot] = url
def predict(
self,
prompt: str = Input(description="Prompt"),
lora_url: str = Input(description="Primary LoRA URL", default=None),
lora_scale: float = Input(description="Primary LoRA scale", ge=0.0, le=2.0, default=1.0),
extra_lora_url: str = Input(description="Optional second LoRA URL", default=None),
extra_lora_scale: float = Input(description="Second LoRA scale", ge=0.0, le=2.0, default=1.0),
) -> Path:
self._ensure_lora("main", lora_url)
self._ensure_lora("extra", extra_lora_url)
adapters = [s for s, u in self.loaded.items() if u]
scales = [lora_scale if s == "main" else extra_lora_scale for s in adapters]
if adapters:
self.pipe.set_adapters(adapters, adapter_weights=scales)
        image = self.pipe(prompt).images[0]
        image.save("/tmp/out.png")  # PIL's save() returns None, so save first, then wrap the path
        return Path("/tmp/out.png")
From production diffusion models like replicate/cog-flux and replicate/cog-flux-kontext:
In setup():
import torch
torch.set_float32_matmul_precision("high")
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
self.model = torch.compile(self.model, dynamic=True)
_ = self.predict(prompt="warmup", num_steps=1) # absorbs compile cost in setup
Load weights onto a meta-device skeleton with assign=True to avoid double-allocating:
with torch.device("meta"):
model = build_model_skeleton()
state = torch.load("weights.pt", map_location="cpu")
model.load_state_dict(state, assign=True)
cog init # scaffold cog.yaml + predict.py
cog predict -i prompt="hello" # build + run a single prediction
cog predict -i image=@input.jpg -o out.png # file inputs and outputs
cog serve -p 8393 # HTTP server matching production
cog exec python # interactive shell inside the build env
cog build -t my-model
cog build --separate-weights -t my-model # weights in their own image layer
cog build --secret id=hf,src=$HOME/.hf_token -t my-model
Tips:
- Use --separate-weights for any model with weights > ~1GB. It speeds up cold boots and registry pushes.
- Use --mount=type=cache,target=/root/.cache/pip in run: steps to cache pip across builds.
- Use --secret instead of ARG to keep tokens out of image history.
- The Cog base image (--use-cog-base-image=true) is faster than rolling your own.
If your model supports fine-tuning, add train: train.py:train to cog.yaml and write a train() function that returns TrainingOutput(weights=Path("model.tar")). The predictor then accepts the URL via setup(self, weights) or the COG_WEIGHTS env var. See https://cog.run/training and replicate/flux-fine-tuner for a full example.
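A minimal train.py sketch following that shape; the input names and the run_fine_tune helper are placeholders, not from the original:
# train.py -- entry point referenced by `train: train.py:train` in cog.yaml
from cog import BaseModel, Input, Path


class TrainingOutput(BaseModel):
    weights: Path


def train(
    input_images: Path = Input(description="Zip or tar of training images"),
    steps: int = Input(description="Training steps", ge=1, le=10000, default=1000),
) -> TrainingOutput:
    # run_fine_tune is a placeholder for your actual training loop;
    # it should write the tuned weights to model.tar.
    run_fine_tune(input_images, steps, out_path="model.tar")
    return TrainingOutput(weights=Path("model.tar"))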
Checklist:
- Use setup() for one-time loads; keep predict() fast and deterministic in shape.
- Pin numpy<2 if your torch is older.
- Use cog.Path for files and cog.Secret for tokens.
- Pin pget to a specific release (v0.8.2) for reproducibility.
- Set HF_HUB_ENABLE_HF_TRANSFER=1 whenever you call HuggingFace Hub.
- Set TRANSFORMERS_OFFLINE=1 after weights are loaded to prevent runtime HF lookups.
- Run cog predict before pushing. If it doesn't work locally, it won't work in production.
- Prefer choices for categorical inputs; keep cog.yaml minimal.