cautilus

Cautilus is the framework for discovering, evaluating, and improving agent behavior. It lets you pin down the behavior that matters, prove it survives every change to your prompts, skills, and models, and improve it within explicit budgets, whether you're protecting an AGENTS.md, a single skill, a prompt, or a full agent loop.

The three jobs connect: discover the declared behavior claims worth proving from selected source docs, verify the curated claims through bounded evaluation packets, and improve behavior once the proof surface is honest.

Cautilus ships as a standalone binary plus Cautilus Agent, which a host repo can install without copying another scaffold first. Agents are first-class users of the product surface: commands should emit durable packets with enough state for the next agent to resume, not only terminal prose for a human operator.

Cautilus installs as a machine-level binary, but its agent-facing surface is intentionally repo-local. The binary is shared across repos. The Cautilus Agent surface, adapter wiring, prompts, and instruction-routing surface are not. They stay checked into each host repo so evaluation behavior remains reproducible, reviewable, and owned by the repo that declares it.

What's Ready Today

Cautilus proves its own promises with honest badges (the apex spec): all seven apex promises currently carry proven badges in the surface audit. That does not erase narrower proof debt:

Behavior evaluation is proven on the dev coding-agent surfaces, while the app-ship surfaces still name live/product-runner proof debt.
Bounded improvement is proven on the dev/skill surface.
Reviewable artifacts and a testable-agent readiness surface are proven deterministically.
Host ownership is proven through a human-auditable fresh-consumer onboarding capture.

For cross-repo adoption, the bounded evaluation loop is the most ready slice: host repos can use cautilus evaluate fixture, cautilus evaluate observation, and post-run cautilus evaluate skill-experiment with checked-in fixtures, host-owned adapters, preserved task packets, and the current evaluation and skill-experiment report packets. cautilus evaluate skill-experiment compares host-preserved baseline and variant outputs; it does not clone, install, or execute skills.

Claim discovery and bounded improvement ship today and are opt-in for host repos that adopt them.

Who It Is For

teams maintaining agent runtimes or chatbot loops whose prompts, skills, and models change frequently
maintainers shipping repo-owned skills who want protected validation, not trigger-only smoke checks
operators who want review-ready outputs and explicit comparison evidence before accepting workflow changes

Day-1 trigger: your repo already has behavior that matters, but prompt tweaks and ad hoc evals no longer explain whether a candidate actually got better.

Not for: repos that only need deterministic lint, unit, or type checks and do not have an evaluator-dependent behavior surface.

Quick Start

Prerequisites:

native macOS or native Linux
a target host repo you can edit locally
git available on PATH
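
The prerequisites above can be checked before installing. The following is a minimal sketch and not part of Cautilus: the function names os_supported, have_cmd, and preflight are hypothetical, and checking for curl is an assumption based on the install one-liner rather than a documented prerequisite. It also cannot distinguish native Linux from a compatibility layer that reports Linux through uname.

```shell
#!/bin/sh
# Preflight sketch for the prerequisites listed above (hypothetical helper, not shipped by Cautilus).

# os_supported NAME: succeed when NAME (as printed by `uname -s`) is macOS or Linux.
os_supported() {
  case "$1" in
    Darwin|Linux) return 0 ;;
    *) return 1 ;;
  esac
}

# have_cmd NAME: succeed when the shell can resolve NAME as a command.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# preflight OS_NAME: report each unmet prerequisite on stderr;
# return non-zero if any prerequisite is unmet.
preflight() {
  status=0
  if ! os_supported "$1"; then
    echo "unsupported OS: $1 (expected macOS or Linux)" >&2
    status=1
  fi
  # git is a listed prerequisite; curl is assumed because the installer is fetched with it.
  for cmd in git curl; do
    if ! have_cmd "$cmd"; then
      echo "missing on PATH: $cmd" >&2
      status=1
    fi
  done
  return "$status"
}

preflight "$(uname -s)" || echo "resolve the items above before installing" >&2
```

If the script prints nothing, the checked prerequisites are in place and the install commands can be run as written.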

curl -fsSL \
  https://raw.githubusercontent.com/corca-ai/cautilus/main/install.sh \
  | sh
cd /path/to/host-repo
cautilus init

You can also hand setup to an agent instead of running these steps yourself.

Quick links:

What Cautilus promises: docs/specs/user/index.spec.md
Maintainer claim map: docs/specs/contracts/index.spec.md
Start here — Cautilus, proven on itself: docs/specs/index.spec.md
Full command catalog: docs/guides/cli.md
Fresh consumer bootstrap after the binary is on PATH: docs/guides/consumer-adoption.md
Public executable spec report: https://corca-ai.github.io/cautilus/

docs/specs/index.spec.md is the top-level "proven on itself" apex and the specdown entry; the user and maintainer spec indexes it links to remain the curated claim source of truth. Raw discover claims packets remain the high-recall, source-ref-backed proof-planning input, not the primary document a user should review. The Cautilus Agent curates that packet against the repo: it reduces false positives, raises likely missing public promises, and separates in-scope discovery bugs from out-of-scope narrative gaps.

The public website report is generated from the claim spec tree, but host repos do not need that renderer before Cautilus can inspect readiness, claims, evals, or improvement work. Each claim page pairs a bounded product promise with executable evidence or an explicit evidence gap. Read the user spec index to understand what Cautilus promises, then use the maintainer index to inspect proof routes, adapters, fixtures, and known gaps.
