Use when "Modal", "serverless GPU", "cloud GPU", "deploy ML model", or asking about "serverless containers", "GPU compute", "batch processing", "scheduled jobs", "autoscaling ML"
Deploys Python functions to serverless cloud with GPUs, autoscaling, and scheduled execution.
/plugin marketplace add eyadsibai/ltk
/plugin install ltk@ltk-marketplace

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Serverless Python execution with GPUs, autoscaling, and pay-per-use compute.
```bash
# Install
pip install modal

# Authenticate
modal token new
```
```python
import modal

app = modal.App("my-app")

@app.function()
def hello():
    return "Hello from Modal!"

# Run with: modal run script.py
```
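To call the function from your own machine, add a local entrypoint and use `.remote()`; a minimal sketch:

```python
@app.local_entrypoint()
def main():
    print(hello.remote())  # runs in a Modal container
    print(hello.local())   # runs in the local process, handy for quick testing
```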
```python
# Build an image with dependencies
image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("torch", "transformers", "numpy")
)
app = modal.App("ml-app", image=image)
```
```python
@app.function(gpu="H100")
def train_model():
    import torch
    assert torch.cuda.is_available()
    # GPU code here

# Available GPUs: T4, L4, A10, A100, L40S, H100, H200, B200
# Multi-GPU: gpu="H100:8"
```
```python
@app.function()
@modal.web_endpoint(method="POST")
def predict(data: dict):
    # `model` stands in for whatever you load at import or container-enter time
    result = model.predict(data["input"])
    return {"prediction": result}

# Deploy: modal deploy script.py
```
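After deployment, Modal prints the endpoint's URL (hostnames are generated under `*.modal.run`). A sketch of a client call; the URL below is a placeholder, copy the real one from the deploy output:

```python
import requests

resp = requests.post(
    "https://your-workspace--ml-app-predict.modal.run",  # placeholder URL
    json={"input": "some text"},
)
print(resp.json())  # e.g. {"prediction": ...}
```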
```python
@app.function(schedule=modal.Cron("0 2 * * *"))  # Daily at 2 AM UTC
def daily_backup():
    pass

@app.function(schedule=modal.Period(hours=4))  # Every 4 hours
def refresh_cache():
    pass
```
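Schedules only take effect for deployed apps (`modal deploy script.py`); `modal run` executes once and exits. A scheduled function can still be triggered on demand, a sketch:

```python
@app.local_entrypoint()
def main():
    daily_backup.remote()  # one-off manual trigger of the scheduled function
```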
```python
@app.function()
def process_item(item_id: int):
    return analyze(item_id)  # `analyze` is a placeholder for your own logic

@app.local_entrypoint()
def main():
    items = range(1000)
    # Automatically parallelized across containers
    results = list(process_item.map(items))
```
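By default `.map()` surfaces the first exception it hits; for large batches it is often better to collect failures and continue, which `.map()` supports via `return_exceptions` (assuming a reasonably recent Modal version):

```python
@app.local_entrypoint()
def main():
    items = range(1000)
    # Failed inputs come back as exception objects instead of aborting the run
    results = list(process_item.map(items, return_exceptions=True))
    failures = [r for r in results if isinstance(r, Exception)]
    print(f"{len(failures)} of {len(results)} items failed")
```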
```python
volume = modal.Volume.from_name("my-data", create_if_missing=True)

@app.function(volumes={"/data": volume})
def save_results(data):
    with open("/data/results.txt", "w") as f:
        f.write(data)
    volume.commit()  # Persist changes
```
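Writes become visible to other containers only after `commit()`, and readers pick them up with `reload()`; a minimal read-side sketch mirroring the write above:

```python
@app.function(volumes={"/data": volume})
def load_results():
    volume.reload()  # fetch changes committed by other containers
    with open("/data/results.txt") as f:
        return f.read()
```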
```python
@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model():
    import os
    token = os.environ["HF_TOKEN"]  # injected from the "huggingface" secret
```
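The referenced secret has to exist in the workspace first; it can be created in the Modal dashboard or via the CLI (the token value below is a placeholder):

```bash
modal secret create huggingface HF_TOKEN=hf_xxx
```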
```python
@app.cls(gpu="L40S")
class Model:
    @modal.enter()
    def load_model(self):
        # Runs once per container start, so the pipeline is loaded only once
        from transformers import pipeline
        self.pipe = pipeline("text-classification", device="cuda")

    @modal.method()
    def predict(self, text: str):
        return self.pipe(text)

@app.local_entrypoint()
def main():
    model = Model()
    result = model.predict.remote("Modal is great!")
```
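Because `@modal.enter()` runs once per container rather than once per call, the model load is amortized across requests; methods also support `.map()` for batched inference (a sketch):

```python
@app.local_entrypoint()
def main():
    model = Model()
    texts = ["Modal is great!", "The queue is too long."]
    for result in model.predict.map(texts):
        print(result)
```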
```python
@app.function(
    cpu=8.0,               # 8 CPU cores
    memory=32768,          # 32 GiB RAM
    ephemeral_disk=10240,  # 10 GiB disk
    timeout=3600,          # 1 hour timeout
)
def memory_intensive_task():
    pass
```
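Long or flaky jobs usually also want a retry policy; `@app.function` accepts an integer or a `modal.Retries` object (a sketch, parameter values are illustrative):

```python
@app.function(
    retries=modal.Retries(max_retries=3, initial_delay=1.0, backoff_coefficient=2.0),
    timeout=600,  # each attempt gets 10 minutes
)
def flaky_task():
    pass
```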
Prefer `.map()` for parallel processing.

| Platform | Best For |
|---|---|
| Modal | Serverless GPUs, autoscaling, Python-native |
| RunPod | GPU rental, long-running jobs |
| AWS Lambda | CPU workloads, AWS ecosystem |
| Replicate | Model hosting, simple deployments |