tex

The eval engine behind tex-mex and (eventually) other audience-specific eval plugins. This repo is both:

The tex-eval npm package — a Node CLI that spawns Claude Code against a corpus, measures four behavioral metrics per task, runs an LLM-as-judge for completion, and writes comparable reports.
A Claude Code plugin marketplace at theDakshJaitly/tex. Today it ships one plugin (tex-mex); more land here as audiences are wedged.

If you're trying to use tex to evaluate something, you almost certainly want a plugin, not the engine directly. Skip to For users.

If you're trying to build a new audience-specific plugin or wire tex into CI, read on after that.

What tex caught in the wild

A scaffold whose pitch was "less context via curated routing" shipped 5 design iterations over 5 days. All held 10/10 completion. Most improved navigation precision 2–2.5×.

But every variant increased tokens_loaded. The best one was 177% above baseline: 5,622 tokens vs 2,030. The scaffold's whole pitch was less context. Without the harness, the candidate would have shipped silently.
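
For readers who want to check the arithmetic, here is a minimal sketch of the relative-change calculation behind the 177% figure. The function name `percent_change` is ours, not part of tex; the two token counts are the ones quoted above.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


# Figures quoted above: baseline 2,030 tokens, best variant 5,622 tokens.
delta = percent_change(2030, 5622)
print(f"tokens_loaded: {delta:+.0f}%")  # prints "tokens_loaded: +177%"
```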

Decision: shelved, not shipped. Full report: eval-reports/0.3.0-alpha.md.

This is what tex is for. It doesn't decide for you. It surfaces the contradiction loudly enough that "ship it" stops being the default.

For users

If you're iterating on a mex-shaped scaffold

You want the tex-mex plugin. Inside Claude Code:

/plugin marketplace add theDakshJaitly/tex
/plugin install tex-mex@tex

Then run /tex-mex to start an eval. It asks two questions, then does the work.

Full pitch and details: plugins/tex-mex/README.md.

If you're iterating on an agent-facing CLI (e.g., a cli-printing-press output)

A tex-cli plugin is on the roadmap. Until it lands, the engine supports the CLI workflow directly — see For plugin authors / CI below and use tex run --subject cli:<path>.

If you're iterating on an MCP server or prompt layer

The engine's subject loaders for mcp and prompt are stubbed for v1.1. The manual workaround today: run with --subject none as a control, run again against a fixture that hand-wires your MCP into .claude/settings.json, and diff the two reports.

For plugin authors / CI

The engine ships as the tex-eval npm package. Install:

npm install -g tex-eval
tex --version
tex --help

You also need to be logged into Claude Code (run claude /login once; the login is cached in your keychain). No API key is required by default.

Quickstart

# Verify the pipeline
tex smoke

# Scaffold a starter corpus (cli or scaffold templates)
tex init --kind scaffold --var scaffold_name=foo --var scaffold_purpose="..."

# Validate
tex validate corpus

# Run
tex run --label baseline --subject scaffold

# Compare
tex diff results/baseline/report.json results/candidate/report.json
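
If you drive these steps from a script (for example in CI), one possible shape is sketched below in Python. It only assembles and runs the commands shown in the quickstart; the helper names (`run_command`, `diff_command`, `run`) are hypothetical, and it assumes you switch your working tree from the baseline revision to the candidate revision between the two `tex run` calls, which this sketch does not do for you.

```python
import shlex
import subprocess


def report_path(label: str) -> str:
    # Report location as documented for `tex run`: results/<label>/report.json
    return f"results/{label}/report.json"


def run_command(label: str, subject: str) -> list[str]:
    """argv for one corpus run under the given label."""
    return ["tex", "run", "--label", label, "--subject", subject]


def diff_command(baseline: str, candidate: str) -> list[str]:
    """argv for comparing two labelled runs."""
    return ["tex", "diff", report_path(baseline), report_path(candidate)]


def run(argv: list[str]) -> None:
    """Echo a command, run it, and raise if it exits non-zero."""
    print("$", shlex.join(argv))
    subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(["tex", "validate", "corpus"])
    run(run_command("baseline", "scaffold"))
    # Assumption: check out or apply the candidate change here.
    run(run_command("candidate", "scaffold"))
    run(diff_command("baseline", "candidate"))
```

Using `check=True` means a failed `tex validate` stops the script before any paid or slow run starts.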

CLI reference

| Command | What it does |
| --- | --- |
| `tex init --kind <cli\|scaffold> --var ...` | Scaffold a starter corpus + fixture. `mcp` and `prompt` are stubbed for v1.1. |
| `tex validate [<dir>]` | Load corpus, report errors. Exit non-zero on failure. |
| `tex run --label <name> --subject <arg> [--auth oauth\|key] [--task <id>]` | Run the corpus; write `results/<label>/report.{json,md}`. |
| `tex diff <baseline.json> <candidate.json>` | Markdown diff to stdout + `eval-reports/`. |
| `tex smoke [--auth oauth\|key]` | One no-op task against the bundled fixture. |
| `tex detect [<path>]` | Classify a directory as mcp / cli / scaffold / etc. |

--subject accepts a shorthand (none, scaffold, cli:<path>, mcp:<config>, prompt:<path>) or a path to a JSON SubjectConfig.
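
To make the shorthand grammar concrete, here is a rough Python sketch of how such an argument could be split into a kind and a value. This is not the engine's parser: `SubjectArg`, `parse_subject`, and the `"config-file"` label are hypothetical names, and treating anything that is not a known shorthand as a JSON config path is an assumption based on the sentence above.

```python
from dataclasses import dataclass
from typing import Optional

BARE_KINDS = {"none", "scaffold"}           # shorthands with no argument
PREFIXED_KINDS = {"cli", "mcp", "prompt"}   # shorthands of the form kind:<value>


@dataclass(frozen=True)
class SubjectArg:
    kind: str             # a shorthand kind, or "config-file" (hypothetical label)
    value: Optional[str]  # text after the colon, or the JSON path


def parse_subject(arg: str) -> SubjectArg:
    if arg in BARE_KINDS:
        return SubjectArg(arg, None)
    prefix, sep, rest = arg.partition(":")
    if sep and prefix in PREFIXED_KINDS:
        if not rest:
            raise ValueError(f"--subject {prefix}: needs a value after the colon")
        return SubjectArg(prefix, rest)
    # Assumption: everything else is a path to a JSON SubjectConfig.
    return SubjectArg("config-file", arg)


print(parse_subject("cli:./bin/tool"))  # SubjectArg(kind='cli', value='./bin/tool')
print(parse_subject("subject.json"))    # SubjectArg(kind='config-file', value='subject.json')
```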

The four metrics

| Metric | What it measures |
| --- | --- |
| `tokens_loaded` | Approximate token count of files the agent actually read (chars/4). |
| `navigation.precision` | Share of files read that were expected: (read ∩ expected) / read, by file count. |
| `navigation.recall` | Share of expected files that were read: (read ∩ expected) / expected, by file count. |
| `time_to_first_output_ms` | Wall-clock time from spawn to first agent text, in milliseconds. |
| `completion.overall_score` | LLM-judged pass rate across binary rubric criteria, scaled 0–10. |
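
The definitions in the table can be sketched in a few lines of Python. This is an illustration of the formulas, not the engine's code: the function names are ours, the rounding of chars/4 (floor of the total) is an assumption, and so is returning 0.0 when a denominator is empty.

```python
from typing import Iterable, Sequence


def tokens_loaded(file_contents: Iterable[str]) -> int:
    """chars/4 approximation over every file the agent read (floor is assumed)."""
    return sum(len(text) for text in file_contents) // 4


def navigation(read_files: Iterable[str], expected_files: Iterable[str]) -> tuple[float, float]:
    """Return (precision, recall) over sets of file paths."""
    read, expected = set(read_files), set(expected_files)
    hits = len(read & expected)
    precision = hits / len(read) if read else 0.0          # assumed 0.0 when nothing was read
    recall = hits / len(expected) if expected else 0.0     # assumed 0.0 when nothing was expected
    return precision, recall


def completion_score(criteria_passed: Sequence[bool]) -> float:
    """Pass rate across binary rubric criteria, scaled to 0-10."""
    if not criteria_passed:
        return 0.0
    return 10.0 * sum(criteria_passed) / len(criteria_passed)


print(tokens_loaded(["x" * 400, "y" * 401]))  # 200
p, r = navigation(
    ["README.md", "src/a.ts", "src/b.ts", "docs/x.md"],  # files read
    ["src/a.ts", "src/b.ts", "src/c.ts"],                # files expected
)
print(f"{p:.2f} {r:.2f}")                                 # 0.50 0.67
print(completion_score([True, True, True, False]))       # 7.5
```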

The metrics are deliberately decoupled. A change can improve one and regress another: the per-task table tells you which, the aggregate table tells you how much, and the rubric breakdown tells you why.
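
As a sketch of what "improve one and regress another" looks like in numbers, the Python below labels each metric independently. Several things here are assumptions: the flat dictionary is not the real report.json schema, the direction of "better" per metric is our reading (tex itself does not decide for you), and apart from the two tokens_loaded figures and the 10/10 completion taken from the case study above, the sample values are invented for illustration.

```python
# Assumed direction of "better" for each metric (our reading, not tex's).
HIGHER_IS_BETTER = {
    "tokens_loaded": False,
    "navigation.precision": True,
    "navigation.recall": True,
    "time_to_first_output_ms": False,
    "completion.overall_score": True,
}


def classify(baseline: dict, candidate: dict) -> dict:
    """Label each metric as improved, regressed, or unchanged, independently."""
    verdicts = {}
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        delta = candidate[name] - baseline[name]
        if delta == 0:
            verdicts[name] = "unchanged"
        elif (delta > 0) == higher_is_better:
            verdicts[name] = "improved"
        else:
            verdicts[name] = "regressed"
    return verdicts


baseline = {"tokens_loaded": 2030, "navigation.precision": 0.2, "navigation.recall": 0.5,
            "time_to_first_output_ms": 4000, "completion.overall_score": 10.0}
candidate = {"tokens_loaded": 5622, "navigation.precision": 0.5, "navigation.recall": 0.5,
             "time_to_first_output_ms": 4000, "completion.overall_score": 10.0}

for metric, verdict in classify(baseline, candidate).items():
    print(f"{metric}: {verdict}")
```

With these inputs, navigation.precision comes out improved and tokens_loaded regressed while completion is unchanged, which is the shape of the contradiction described in the case study.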

Auth modes

--auth oauth (default) uses your Claude Code subscription. Spawned sessions inherit your CLAUDE.md / hooks / plugin context — that's ~18.8k tokens of "pollution" per session, constant across compared runs, so deltas remain meaningful but absolute scores aren't portable.

--auth key (opt-in) requires ANTHROPIC_API_KEY and prepends --bare to every spawn, stripping all of that inherited context. It costs real money but yields portable scores. Recommended for CI and published benchmarks.

tex diff warns when comparing reports with mismatched auth modes.
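
In CI it can help to fail fast when key mode is requested without a key. The Python below is a small preflight sketch under that assumption; `auth_args` is a hypothetical helper, and the only facts it relies on are the two `--auth` values and the ANTHROPIC_API_KEY requirement described above.

```python
import os
from typing import Mapping


def auth_args(mode: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the --auth flags for tex, refusing key mode when no key is set."""
    if mode not in ("oauth", "key"):
        raise ValueError(f"unknown auth mode: {mode!r}")
    if mode == "key" and not env.get("ANTHROPIC_API_KEY"):
        raise RuntimeError("--auth key requires ANTHROPIC_API_KEY to be set")
    return ["--auth", mode]


# Example: build a smoke-test command for the default mode.
print(["tex", "smoke", *auth_args("oauth", {})])  # ['tex', 'smoke', '--auth', 'oauth']
```

Using the same mode for both the baseline and the candidate run keeps the resulting reports comparable.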
