tex
The eval engine behind tex-mex and (eventually) other audience-specific
eval plugins. This repo is both:
- The tex-eval npm package: a Node CLI that spawns Claude Code against a
  corpus, measures four behavioral metrics per task, runs an
  LLM-as-judge for completion, and writes comparable reports.
- A Claude Code plugin marketplace at theDakshJaitly/tex. Today it ships
  one plugin (tex-mex); more land here as audiences are wedged.
If you're trying to use tex to evaluate something, you almost
certainly want a plugin, not the engine directly. Skip to "For users".
If you're trying to build a new audience-specific plugin or wire
tex into CI, read on after that.
What tex caught in the wild
A scaffold whose pitch was "less context via curated routing" went
through 5 design iterations over 5 days. All held 10/10 completion. Most
improved navigation precision 2–2.5×.
But every variant increased tokens_loaded. The best one was 177% above
baseline: 5,622 tokens vs 2,030. The scaffold's whole pitch was less
context. Without the harness, the candidate would have shipped silently.
Decision: shelved, not shipped. Full report:
eval-reports/0.3.0-alpha.md.
This is what tex is for. It doesn't decide for you. It surfaces the
contradiction loudly enough that "ship it" stops being the default.
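The 177% figure is a plain relative change against the baseline. A minimal sketch of that arithmetic, assuming nothing about how tex computes it internally (`percentChange` is a hypothetical helper, not part of tex-eval):

```typescript
// Relative change of a candidate metric against its baseline, in percent.
// Hypothetical helper for illustration; not a tex-eval API.
function percentChange(baseline: number, candidate: number): number {
  if (baseline === 0) {
    throw new Error("baseline must be non-zero");
  }
  return ((candidate - baseline) / baseline) * 100;
}

// Figures quoted above: 2,030 baseline tokens vs 5,622 for the best variant.
const tokensDelta = percentChange(2030, 5622);
console.log(`${tokensDelta >= 0 ? "+" : ""}${Math.round(tokensDelta)}%`); // +177%
```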
For users
If you're iterating on a mex-shaped scaffold
You want the tex-mex plugin. Inside Claude Code:
/plugin marketplace add theDakshJaitly/tex
/plugin install tex-mex@tex
Then run /tex-mex to start an eval. It asks two questions, then does the work.
Full pitch and details: plugins/tex-mex/README.md.
If you're iterating on an agent-facing CLI (e.g., a cli-printing-press output)
A tex-cli plugin is on the roadmap. Until it lands, the engine
supports the CLI workflow directly — see For plugin authors / CI
below and use tex run --subject cli:<path>.
If you're iterating on an MCP server or prompt layer
The engine's subject loaders for mcp and prompt are stubbed for
v1.1. The manual workaround today: run with --subject none as a
control, run again against a fixture that hand-wires your MCP into
.claude/settings.json, then diff the two reports.
For plugin authors / CI
The engine ships as the tex-eval npm package. Install:
npm install -g tex-eval
tex --version
tex --help
You also need to be logged into Claude Code (claude /login once,
cached in your keychain). No API key required by default.
Quickstart
# Verify the pipeline
tex smoke
# Scaffold a starter corpus (cli or scaffold templates)
tex init --kind scaffold --var scaffold_name=foo --var scaffold_purpose="..."
# Validate
tex validate corpus
# Run
tex run --label baseline --subject scaffold
# Compare
tex diff results/baseline/report.json results/candidate/report.json
CLI reference
| Command | What it does |
|---|---|
| `tex init --kind <cli\|scaffold> --var ...` | Scaffold a starter corpus + fixture. mcp and prompt are stubbed for v1.1. |
| `tex validate [<dir>]` | Load corpus, report errors. Exit non-zero on failure. |
| `tex run --label <name> --subject <arg> [--auth oauth\|key] [--task <id>]` | Run the corpus; write results/<label>/report.{json,md} |
| `tex diff <baseline.json> <candidate.json>` | Markdown diff to stdout + eval-reports/ |
| `tex smoke [--auth oauth\|key]` | One no-op task against the bundled fixture |
| `tex detect [<path>]` | Classify a directory as mcp / cli / scaffold / etc. |
--subject accepts a shorthand (none, scaffold, cli:<path>,
mcp:<config>, prompt:<path>) or a path to a JSON SubjectConfig.
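To make the shorthand rules concrete, here is a rough sketch of how such an argument could be classified. It is an illustration only: `ParsedSubject` and `parseSubject` are hypothetical names, the real SubjectConfig schema is not shown in this README, and the engine's actual parsing may differ.

```typescript
// Hypothetical result shape for illustration; not the engine's SubjectConfig.
type ParsedSubject =
  | { kind: "none" }
  | { kind: "scaffold" }
  | { kind: "cli" | "mcp" | "prompt"; target: string }
  | { kind: "config-file"; path: string };

function parseSubject(arg: string): ParsedSubject {
  // Bare shorthands carry no argument.
  if (arg === "none") return { kind: "none" };
  if (arg === "scaffold") return { kind: "scaffold" };

  // Prefixed shorthands: cli:<path>, mcp:<config>, prompt:<path>.
  const sep = arg.indexOf(":");
  if (sep > 0) {
    const prefix = arg.slice(0, sep);
    const target = arg.slice(sep + 1);
    const known = prefix === "cli" || prefix === "mcp" || prefix === "prompt";
    if (known && target.length > 0) {
      return { kind: prefix, target };
    }
  }

  // Assumption: anything else is treated as a path to a JSON SubjectConfig.
  return { kind: "config-file", path: arg };
}

// "./bin/tool" and "subject.json" are made-up example paths.
console.log(JSON.stringify(parseSubject("cli:./bin/tool")));
console.log(JSON.stringify(parseSubject("subject.json")));
```

Falling through to the config-file case keeps unrecognized prefixes (for example a Windows drive letter) from being misread as a shorthand.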
The four metrics
| Metric | What it measures |
|---|---|
| tokens_loaded | Approximate token count of files the agent actually read (chars/4). |
| navigation.precision | (read-files ∩ expected-files) / read-files, by file count |
| navigation.recall | (read-files ∩ expected-files) / expected-files, by file count |
| time_to_first_output_ms | Wall-clock from spawn to first agent text |
| completion.overall_score | LLM-judged pass rate across binary rubric criteria, scaled 0–10 |
The metrics are deliberately decoupled. A change can improve one and
regress another: the per-task table tells you which, the aggregate
table tells you how much, the rubric breakdown tells you why.
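The definitions in the table can be sketched in a few lines. This is an approximation under stated assumptions, not the engine's code: the function names are hypothetical, rounding tokens up is a guess, empty denominators are mapped to 0 by convention, and the 0–10 scaling is assumed to be pass rate times 10.

```typescript
// tokens_loaded: the chars/4 heuristic from the table. Rounding up is an assumption.
function approxTokens(fileContents: string[]): number {
  const chars = fileContents.reduce((sum, text) => sum + text.length, 0);
  return Math.ceil(chars / 4);
}

// navigation.precision / navigation.recall over de-duplicated file paths.
// Assumption: an empty denominator yields 0 rather than NaN.
function navigation(
  readFiles: string[],
  expectedFiles: string[]
): { precision: number; recall: number } {
  const read = new Set(readFiles);
  const expected = new Set(expectedFiles);
  let hits = 0;
  read.forEach((file) => {
    if (expected.has(file)) hits += 1;
  });
  return {
    precision: read.size === 0 ? 0 : hits / read.size,
    recall: expected.size === 0 ? 0 : hits / expected.size,
  };
}

// completion.overall_score: pass rate over binary criteria, scaled to 0-10 (assumed linear).
function completionScore(criteria: boolean[]): number {
  if (criteria.length === 0) return 0;
  const passed = criteria.filter((ok) => ok).length;
  return (passed / criteria.length) * 10;
}

// Made-up task: the agent read four files, two of which were expected.
const navExample = navigation(["a.md", "b.md", "c.md", "d.md"], ["a.md", "b.md"]);
console.log(navExample.precision, navExample.recall); // 0.5 1
console.log(completionScore([true, true, true, false])); // 7.5
```

The example shows the decoupling directly: recall is perfect while precision is only 0.5, because the agent read two files it did not need.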
Auth modes
--auth oauth (default) uses your Claude Code subscription. Spawned
sessions inherit your CLAUDE.md / hooks / plugin context. That's
~18.8k tokens of "pollution" per session, constant across compared
runs, so deltas remain meaningful but absolute scores aren't portable.
--auth key (opt-in) requires ANTHROPIC_API_KEY and prepends --bare
to every spawn, stripping all of that inherited context. Real money.
Portable scores. Recommended for CI and published benchmarks.
tex diff warns when comparing reports with mismatched auth modes.
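A check of this kind could look roughly like the following. The report fields (`label`, `auth`) and the warning text are assumptions made for illustration; the actual report.json schema and the message tex diff prints are not shown in this README.

```typescript
type AuthMode = "oauth" | "key";

// Hypothetical subset of a report; the real report.json layout may differ.
interface ReportMeta {
  label: string;
  auth: AuthMode;
}

// Returns a warning string when the two runs used different auth modes, else null.
function authMismatchWarning(baseline: ReportMeta, candidate: ReportMeta): string | null {
  if (baseline.auth === candidate.auth) {
    return null;
  }
  return (
    `warning: ${baseline.label} ran with --auth ${baseline.auth} but ` +
    `${candidate.label} ran with --auth ${candidate.auth}; ` +
    `inherited session context differs, so deltas may not be comparable`
  );
}

const mismatch = authMismatchWarning(
  { label: "baseline", auth: "oauth" },
  { label: "candidate", auth: "key" }
);
if (mismatch !== null) {
  console.warn(mismatch);
}
```

The reason to flag this at all: the oauth overhead only cancels out of a delta when both runs carry it.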