Cautilus
Cautilus is a repo-agnostic intentful behavior evaluation product.
It is meant to help a host repo evaluate
agent runtimes, skills, and operator-facing command surfaces with bounded
loops, explicit baselines, held-out validation, comparison summaries, and
independent review variants.
It now also includes a product-owned helper that prepares clean baseline and
candidate git worktrees for explicit A/B evaluation runs.
The product target is a standalone binary plus a bundled skill that a host repo
can adopt without inheriting another repo's private runtime surfaces.
This repo now also carries minimal Codex and Claude plugin manifests plus
repo-local marketplace wiring so that the same checked-in skill can be
installed without copying it into another local scaffold first.
The intended product shape is:
- a small CLI or runtime entrypoint for adapter-driven intentful behavior
evaluation
- repo-local adapters that define how a host repo runs iterate, held-out,
comparison, and full-gate checks
- optional executor-variant runners for structured
codex exec,
claude -p, or other bounded review passes
- a contract that separates training surfaces from held-out surfaces
- first-class use-case helpers for
chatbot, cli, and skill
evaluation packets
- a path to propose new scenarios from runtime logs instead of hand-authoring
every benchmark case forever
The longer-term direction is closer to the workflow philosophy behind DSPy:
intent and evaluation contracts matter more than preserving one prompt
verbatim, and prompts should be allowed to improve as long as the behavior
survives evaluation.
Current Status
This repo is still early.
It already owns the generic workflow and adapter contracts plus bootstrap
scripts.
It now also includes a minimal CLI, a bundled cautilus skill surface,
executor-variant runners, local Codex and Claude plugin packaging metadata,
and local tests for the bounded evaluation surface.
It also now includes checked-in GitHub workflows for verify and tagged
release artifacts, so the release/install surface is not only prose.
Current Product Surface
Current core validated surface:
- adapter-driven CLI entrypoints for
resolve, init, and doctor
- explicit A/B workspace preparation through
workspace prepare-compare
- artifact-root retention through
workspace prune-artifacts
- bounded runtime execution through
mode evaluate
- scenario-history-aware profile selection and history updates for
profile-backed mode runs
- comparison-mode baseline-cache seed materialization for profile-backed runs
- report assembly, review packet assembly, and review-variant fanout
- bounded CLI behavior evaluation through
cli evaluate
- standalone install surface, local gates, and checked-in release helpers
- repo-local Codex and Claude plugin package and marketplace wiring for local
skill install
Current product-owned helper surface:
chatbot, cli, and skill normalization helpers
- scenario proposal packet assembly and proposal generation
- scenario telemetry summaries
- normalized evidence bundling
- bounded optimization input and proposal helpers
Dogfood and migration evidence is tracked separately in
consumer-readiness.md and
consumer-migration.md.
Repo Layout