Skill

tex-mex

Eval harness for mex-shaped scaffolds. Invoke when the user wants to measure whether a change to a mex scaffold (ROUTER.md, patterns/, context/, decisions.md) actually helps the agent. Asks two questions and runs a real eval.

npx claudepluginhub thedakshjaitly/tex --plugin tex-mex

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/tex-mex:tex-mex

User invocable

Model invocable

Inline context

Default effort


Stats

Language: TypeScript

Parent stars: 0

Maintenance: Good

Last Commit: May 14, 2026

/tex-mex — did your scaffold change actually help?

You are running as the /tex-mex skill. The user is iterating on a mex-shaped scaffold (or any project with ROUTER.md / patterns/ / context/ files) and wants measured evidence about whether their change moved the needle. Not vibes. A real A/B run.

Your job is simple. Ask two questions, then do the work.

The two questions

Q1: What did you change?

Free-form. The user might say:

"I rewrote ROUTER.md to be flatter"
"I added a new pattern at patterns/idempotency.md and updated AGENTS.md to mention it"
"I split context/architecture.md into three smaller files"
"I'm experimenting with a different format for decisions.md"

Listen for which files changed and what the user thinks the change should improve. You'll need both to scaffold useful tasks.

Q2: Compare against what?

Offer these four options and let them pick:

Option	Means
(a) Before my change	Stash the working tree → baseline run → un-stash → candidate run. Cleanest A/B; takes 2× the budget.
(b) Nothing at all (no scaffold)	Baseline with `--subject none` (bare agent, no scaffold loaded) → candidate with the scaffold present. Answers "does the scaffold help at all?"
(c) A specific other version	User points at a git commit or branch. Stash + check out → baseline → check back out → un-stash → candidate.
(d) Just baseline	Single run, no diff. Useful for first-time setup; tells you what the agent's behavior looks like today.

Default to (a) unless the user says otherwise.
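The four options reduce to a lookup from the user's answer to an ordered list of steps. A minimal sketch in Python, for illustration only: the step names are hypothetical labels for the actions in the table, not `tex` subcommands, and `plan_for` / `run_count` are made-up helper names.

```python
from typing import List, Optional

# Hypothetical step labels; each stands for an action described in the table.
PLANS = {
    "a": ["stash", "baseline_run", "unstash", "candidate_run"],
    "b": ["baseline_run_subject_none", "candidate_run"],
    "c": ["stash", "checkout_ref", "baseline_run", "checkout_back",
          "unstash", "candidate_run"],
    "d": ["baseline_run"],
}

DEFAULT_OPTION = "a"  # the skill defaults to (a) unless the user says otherwise


def plan_for(option: Optional[str]) -> List[str]:
    """Return the ordered steps for a Q2 answer, defaulting to option (a)."""
    key = (option or DEFAULT_OPTION).strip().lower().strip("()")
    if key not in PLANS:
        raise ValueError(f"unknown option {option!r}; expected a, b, c, or d")
    return list(PLANS[key])


def run_count(option: Optional[str]) -> int:
    """Number of eval runs the plan spends (every *_run* step is one run)."""
    return sum(1 for step in plan_for(option) if "run" in step)
```

`run_count` is what feeds the cost estimate: options (a), (b), and (c) spend two runs, option (d) spends one.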

Run the work

Once you have answers to Q1 and Q2:

Step 1 — Confirm `tex` is available

The plugin ships a shim at bin/tex (added to PATH automatically when the plugin is enabled). The shim forwards to a copy of tex-eval installed into ${CLAUDE_PLUGIN_DATA} by the plugin's SessionStart hook.

tex --version

If this fails with "tex-eval is not yet installed", the SessionStart hook didn't run yet — wait a moment and retry, or run the fallback the error message prints.

Step 2 — Set up the corpus (once per project)

If the user has no corpus/ yet:

tex init --kind scaffold \
  --var scaffold_name="<inferred from project>" \
  --var scaffold_purpose="<inferred from README or user's answer>" \
  --dir .

Then interview the user about each task in corpus/01-*.yaml, 02-*.yaml, 03-*.yaml. For each:

Read the rendered template (it has placeholder prompts).
Read the user's actual changed files (the ones they named in Q1).
Propose a concrete prompt that exercises what their change should improve — not what the template says. Show it as a diff.
Ask: "Does this task make sense for what you changed? (yes / edit / skip)"
For each rubric criterion, apply the describe-a-fail gate: "What would FAIL this criterion in one sentence?" If they can't answer, the criterion is too vague. Don't keep it as written: tighten it or drop it.
Save the edited YAML.
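The describe-a-fail gate in the steps above amounts to a filter over rubric criteria. A sketch in Python, assuming each criterion is held as a dict with hypothetical `name` and `fail_example` keys; the real corpus YAML schema is not shown here and may differ.

```python
def apply_fail_gate(criteria):
    """Split criteria into (kept, too_vague).

    A criterion is kept only if the user supplied a non-empty description
    of what would FAIL it. `fail_example` is a hypothetical field name.
    """
    kept, too_vague = [], []
    for criterion in criteria:
        fail_example = (criterion.get("fail_example") or "").strip()
        if fail_example:
            kept.append(criterion)
        else:
            too_vague.append(criterion)
    return kept, too_vague
```

Anything that lands in `too_vague` goes back to the user to be tightened or dropped before the YAML is saved.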

Then validate:

tex validate corpus

If the user already has a corpus/, skip the init/interview unless they ask for new tasks.

Step 3 — Run the baseline

For option (a) "before my change":

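# 1. Confirm there is something to stash. On a clean tree, `git stash push`
#    saves nothing and the later `git stash pop` would restore an older, unrelated stash.
git status

# 2. Stash the user's changes. Untracked files are left in place by default,
#    so `git add` any new scaffold files first or the baseline will still see them.
git stash push -m "tex-mex-baseline-stash"

# 3. Run the baseline
tex run --label baseline-pre-change --subject scaffold --force

# 4. Restore
git stash pop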

For option (b) "nothing at all":

tex run --label baseline-no-scaffold --subject none --force

For option (c) "specific other version":

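# Set the user's changes aside, then switch to the ref they named
git stash push -m "tex-mex-cand-stash"
git checkout <ref-the-user-named>
tex run --label baseline-<ref> --subject scaffold --force
# Return to where the user was and restore their changes
git checkout -
git stash pop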

For option (d) "just baseline": same as (a) but skip steps 2 and 4. No candidate run; jump to the summary.
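Options (a) and (c) both leave the user's changes in a stash while the baseline runs, so a failed run must not strand them there. One way to guard that, sketched in Python: `baseline_with_stash` and the injectable `run` parameter are illustrative names, not part of `tex`; only the `git stash` subcommands and the `tex run` flags shown above are taken as given.

```python
import subprocess


def _run(cmd):
    """Run a command, raising CalledProcessError on failure; return stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def baseline_with_stash(baseline_cmd, run=_run, message="tex-mex-baseline-stash"):
    """Stash, run the baseline, and always restore the user's changes.

    `run` is injectable so the flow can be exercised without touching a repo.
    """
    before = len(run(["git", "stash", "list"]).splitlines())
    run(["git", "stash", "push", "-m", message])
    after = len(run(["git", "stash", "list"]).splitlines())
    if after <= before:
        # A clean tree creates no stash entry; popping now would restore
        # some older, unrelated stash.
        raise RuntimeError("nothing to stash: no changes to compare against")
    try:
        run(baseline_cmd)
    finally:
        # Restore even if the baseline run fails.
        run(["git", "stash", "pop"])
```

For option (a) the call would be `baseline_with_stash(["tex", "run", "--label", "baseline-pre-change", "--subject", "scaffold", "--force"])`.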

Step 4 — Run the candidate

tex run --label candidate-current --subject scaffold --force

(Skip if option (d).)

Step 5 — Diff and summarize

tex diff results/baseline-*/report.json results/candidate-*/report.json

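If results/ holds more than one baseline or candidate run, the globs above expand to every match, so pass the two report.json paths explicitly. Then summarize in plain English; don't dump the table at the user. Pick the headline: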

Clean win: completion up, nav up, tokens down, ttfo down. Rare. Say "your change improved <metric> by <delta>; ship it."
Mixed: one improved, another regressed. Most real changes. Lead with the trade-off in the user's vocabulary: "you gained 0.2 on completion but tokens_loaded jumped 47%. Your scaffold's pitch is less context, so the trade-off matters."
Loss: candidate worse on the metric that matters. Say so directly and recommend hold.
Inconclusive: deltas within noise. Say so; suggest tightening the rubric or adding more tasks before re-running.

End with the path to the saved report: "Full report at results/candidate-current/report.md if you want the per-task table."
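The four headlines can be read as a decision rule over per-metric deltas. A rough sketch in Python, under stated assumptions: the metric names follow the summary text above and may not match the real report.json keys, the `noise` threshold is a made-up default, and the choice of `primary` stands in for "the metric that matters".

```python
# +1 means higher is better, -1 means lower is better.
# Names follow the summary text; actual report.json keys may differ.
BETTER_DIRECTION = {"completion": +1, "nav": +1, "tokens": -1, "ttfo": -1}


def headline(baseline, candidate, primary="completion", noise=0.05):
    """Classify a run pair as 'clean win', 'mixed', 'loss', or 'inconclusive'.

    `noise` is a hypothetical relative threshold: changes no larger than this
    fraction of the baseline value count as no change.
    """
    verdicts = {}
    for metric, direction in BETTER_DIRECTION.items():
        base, cand = baseline[metric], candidate[metric]
        scale = abs(base) or 1.0  # avoid dividing by zero
        change = direction * (cand - base) / scale
        verdicts[metric] = 0 if abs(change) <= noise else (1 if change > 0 else -1)

    values = set(verdicts.values())
    if values == {0}:
        return "inconclusive"
    if verdicts[primary] == -1:
        return "loss"  # a regression on the primary metric outranks other gains
    if -1 not in values:
        # Looser than "everything improved": nothing regressed, something improved.
        return "clean win"
    if 1 not in values:
        return "loss"  # only regressions, even though the primary held steady
    return "mixed"
```

The label only picks the opening of the summary; the wording still has to name the trade-off in the user's own terms.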

Hard rules

Never read the streaming JSON events from a spawned agent. Read only the final report.json after a run completes.
Always apply the describe-a-fail gate when authoring or editing rubric criteria. If the user can't articulate what failure looks like, the criterion is too vague and the eval will be noise.
Show your work in one sentence before running. E.g., "I'm going to stash your changes, run a baseline against HEAD, un-stash, and run a candidate. Each run is ~3 tasks × Sonnet at ~$0.15/task. Continue?" Get the user's "yes" before spending quota.
Never propose --auth key. The user has a subscription; that's the right path. If they want BYOK, they'll ask.
Cost discipline. If the corpus has more than 5 tasks or any task uses Opus, tell the user the estimated total cost before running and ask before exceeding $2.
The fixture lives in the user's working tree. This skill is for iterating on scaffolds in their own project. Don't suggest copying the user's codebase into fixtures/sample-target/ — the corpus templates already work against . directly.
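The cost discipline rule above is simple arithmetic plus two thresholds. A minimal sketch in Python: `cost_gate` is a hypothetical helper, the per-task price is whatever estimate is on hand (the ~$0.15/task figure is only the example quoted earlier), and asking whenever the total exceeds $2 is a conservative reading of the rule.

```python
def cost_gate(n_tasks, cost_per_task, uses_opus=False, runs=2, ask_above=2.00):
    """Return (estimated_total, must_tell_user, must_ask_first).

    `runs` is 2 for an A/B comparison and 1 for a baseline-only run.
    """
    estimated_total = n_tasks * cost_per_task * runs
    # Rule: more than 5 tasks, or any Opus task, means state the estimate up front.
    must_tell_user = n_tasks > 5 or uses_opus
    # Conservative reading: ask whenever the total would exceed the cap.
    must_ask_first = estimated_total > ask_above
    return estimated_total, must_tell_user, must_ask_first
```

With 3 tasks at $0.15 and two runs the estimate is about $0.90, under both thresholds; 8 tasks at the same price comes to about $2.40 and trips both. The one-sentence "Continue?" confirmation from the earlier rule applies either way.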

What this skill does not do

It doesn't evaluate non-scaffold things (MCPs, CLIs, prompt layers). The engine (tex-eval) supports those; this skill is mex-shaped on purpose. If the user has a CLI they want evaluated, point them at the tex-cli plugin (when it exists) or to tex init --kind cli directly.
It doesn't compare across auth modes. All runs in one session should use the same auth.
It doesn't auto-commit anything to git. Stashing and restoring modify the user's working tree; ask before touching it.

If the user wants something this skill doesn't do, drop to the underlying tex CLI — tex --help shows the full command surface.
