Hawk
Run Inspect AI evaluations at scale on Kubernetes.
Define your tasks, agents, and models in a YAML config. Hawk runs every combination on isolated Kubernetes pods, streams logs to your terminal, imports results into a PostgreSQL warehouse, and gives you a web UI to explore everything.
Why Hawk
- 📋 One YAML, full grid. Define tasks, agents, and models. Hawk runs the Cartesian product.
- ☸️ Kubernetes-native. Each eval gets its own pod and fresh virtualenv. Sandboxes run in separate pods with Cilium network policies for multi-tenant isolation.
- 🔑 Built-in LLM proxy. Managed proxy for OpenAI, Anthropic, and Google Vertex with automatic token refresh. No API keys to juggle (or bring your own).
- 📡 Live monitoring. hawk logs -f streams logs in real time, hawk status gives you a structured JSON report, and every job gets a Datadog dashboard URL on submission.
- 🖥️ Web UI. Browse eval sets, filter samples by score range and full-text search, compare across eval sets, export to CSV. Filter state lives in the URL for sharing.
- 🔍 Scout scanning. Run scanners over transcripts from previous evals. Filter transcripts by status, score, model, metadata with a rich query DSL.
- 🗄️ Data warehouse. Results land in PostgreSQL with trigram search, covering indexes, and computed status columns.
- 🔒 Access control. Model group permissions gate who can run models, view logs, and scan eval sets. S3 Object Lambda enforces permissions per-object.
- ✏️ Sample editing. Batch edit scores, invalidate or un-invalidate samples. Full audit trail.
- 💻 Local mode. hawk local eval-set runs the same config on your machine; --direct skips the venv so you can attach a debugger.
- 🔄 Resumable scans. Configs save to S3, and hawk scan resume picks up where you left off.
Get Started
uv pip install "hawk[cli] @ git+https://github.com/METR/inspect-action"
hawk login
hawk eval-set examples/simple.eval-set.yaml
hawk logs -f # watch it run
hawk web # open results in browser
Prerequisites
Before using Hawk, ensure you have:
- Python 3.11 or later
- uv for dependency management
- Access to a Hawk deployment, for which you'll need:
  - The Hawk API server URL
  - Authentication credentials (OAuth2)
- For deploying Hawk itself: a Kubernetes cluster, an AWS account, and Terraform 1.10+
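To sanity-check the local tooling, run (substitute python3 for python if that's how Python is installed on your system):
python --version   # should report 3.11 or later
uv --version       # confirms uv is on your PATH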
Installation
Install the Hawk CLI:
uv pip install "hawk[cli] @ git+https://github.com/METR/inspect-action"
Or install from source:
git clone https://github.com/METR/inspect-action.git
cd inspect-action
uv pip install -e ".[cli]"
Quick Start
1. Authenticate
First, log in to your Hawk server:
hawk login
This will open a browser for OAuth2 authentication.
2. Run Your First Evaluation
Create a simple eval config file or use an example:
hawk eval-set examples/simple.eval-set.yaml
3. View Results
Open the evaluation in your browser:
hawk web
Or view logs and results in the configured log viewer.
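While an eval set is running, you can also follow it from the terminal with the monitoring commands shown above:
hawk status   # structured JSON report on the submitted eval set
hawk logs -f  # stream logs until the run finishes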
Configuration
Required Environment Variables
Set these before using the Hawk CLI:
| Variable | Required | Description | Example |
|---|---|---|---|
| HAWK_API_URL | Yes | URL of your Hawk API server | https://hawk.example.com |
| INSPECT_LOG_ROOT_DIR | Yes | S3 bucket for eval logs | s3://my-bucket/evals |
| LOG_VIEWER_BASE_URL | No | URL for web log viewer | https://viewer.example.com |
You can set these in a .env file in your project directory or export them in your shell:
export HAWK_API_URL=https://hawk.example.com
export INSPECT_LOG_ROOT_DIR=s3://my-bucket/evals
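Or put the same values in a .env file (dotenv syntax, no export keyword):
HAWK_API_URL=https://hawk.example.com
INSPECT_LOG_ROOT_DIR=s3://my-bucket/evals
# LOG_VIEWER_BASE_URL is optional
LOG_VIEWER_BASE_URL=https://viewer.example.com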
Authentication Variables
For API server and CLI authentication:
- INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_AUDIENCE
- INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_ISSUER
- INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_JWKS_PATH
For log viewer authentication (these can differ from the API server settings):
- VITE_API_BASE_URL (should match HAWK_API_URL)
- VITE_OIDC_ISSUER
- VITE_OIDC_CLIENT_ID
- VITE_OIDC_TOKEN_PATH
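As a purely illustrative sketch in the same .env style (every value below is a placeholder; your identity provider and deployment define the real issuer, audience, client ID, and paths):
# API server / CLI token validation (placeholder values)
INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_ISSUER=https://auth.example.com/
INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_AUDIENCE=https://hawk.example.com
INSPECT_ACTION_API_MODEL_ACCESS_TOKEN_JWKS_PATH=.well-known/jwks.json

# Log viewer (placeholder values); VITE_API_BASE_URL should match HAWK_API_URL
VITE_API_BASE_URL=https://hawk.example.com
VITE_OIDC_ISSUER=https://auth.example.com/
VITE_OIDC_CLIENT_ID=hawk-log-viewer
VITE_OIDC_TOKEN_PATH=/oauth/token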
Running Eval Sets
hawk eval-set examples/simple.eval-set.yaml
The Eval Set Config File
The eval set config file is a YAML file that defines a grid of tasks, solvers/agents, and models to evaluate.
See examples/simple.eval-set.yaml for a minimal working example.
Required Fields
tasks:
- package: git+https://github.com/UKGovernmentBEIS/inspect_evals
name: inspect_evals
items:
- name: mbpp
sample_ids: [1, 2, 3] # Optional: test specific samples
models:
- package: openai
name: openai
items:
- name: gpt-4o-mini
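Listing more than one item under tasks or models expands the grid: Hawk runs every task-model combination. As a sketch (same schema as above; the task and model names are only examples), this config would launch four evals:
tasks:
  - package: git+https://github.com/UKGovernmentBEIS/inspect_evals
    name: inspect_evals
    items:
      - name: mbpp
      - name: gsm8k
models:
  - package: openai
    name: openai
    items:
      - name: gpt-4o-mini
      - name: gpt-4o
# 2 tasks x 2 models = 4 evals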
Optional Fields