Trial runner and strategy comparison CLI for the Plaited framework. Use when running evals against agent adapters, computing pass@k metrics, or comparing two eval runs.
CLI-first evaluation tooling for running prompt trials against agent adapters and comparing strategy performance.
List all available commands:

```sh
bunx plaited --schema
```

Discover input/output schemas for a specific command:

```sh
bunx plaited eval --schema input
bunx plaited eval --schema output
bunx plaited compare-trials --schema input
bunx plaited compare-trials --schema output
```
Runs prompts against an adapter k times, optionally grades results, and computes pass@k/pass^k metrics.
```sh
# Run eval with prompts from stdin
echo '{"id":"test-1","input":"Hello"}' | bunx plaited eval '{
  "adapterPath": "./my-adapter.ts",
  "k": 1
}'

# Run eval with prompts from a file
bunx plaited eval '{
  "adapterPath": "./my-adapter.ts",
  "promptsPath": "./prompts.jsonl",
  "outputPath": "./results.jsonl",
  "k": 3,
  "concurrency": 2
}'
```
```sh
# Run eval with grading enabled
bunx plaited eval '{
  "adapterPath": "./my-adapter.ts",
  "promptsPath": "./prompts.jsonl",
  "graderPath": "./my-grader.ts",
  "k": 5,
  "outputPath": "./results.jsonl"
}'
```
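The adapter contract itself is covered by plaited-eval-adapters (see the end of this document). Purely as a hedged sketch of the shape involved — the export style and signature here are assumptions, not the documented contract — an adapter maps a prompt input to the agent response that eval records:

```ts
// my-adapter.ts — hypothetical sketch; see the plaited-eval-adapters
// skill for the actual adapter contract.
// Assumption: a default-exported async function receives the prompt
// input (a string, or string[] for multi-turn cases) and returns the
// text recorded as the trial's `output`.
export default async function adapter(
  input: string | string[],
): Promise<string> {
  const turns = Array.isArray(input) ? input : [input];
  // Stand-in for a real model/agent call.
  return `echo: ${turns.join(" / ")}`;
}
```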
| Field | Required | Description |
|---|---|---|
| `adapterPath` | Yes | Path to adapter script (`.ts`/`.js` module or executable) |
| `promptsPath` | No | Path to prompts JSONL file (default: read from stdin) |
| `outputPath` | No | Output file path (default: stdout) |
| `k` | No | Trials per prompt (default: 1) |
| `graderPath` | No | Path to grader script |
| `cwd` | No | Working directory for the adapter |
| `timeout` | No | Timeout per prompt in ms (default: 60000) |
| `concurrency` | No | Concurrent workers (default: 1) |
| `workspaceDir` | No | Base directory for per-prompt workspace isolation |
| `progress` | No | Report progress to stderr (default: false) |
| `append` | No | Append to output file (default: false) |
| `debug` | No | Enable debug mode (default: false) |
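Grader scripts referenced by `graderPath` are likewise documented in plaited-eval-adapters. As a rough sketch only — the input shape and export here are assumptions — a grader plausibly receives the case plus the agent's output and produces the `pass`/`score`/`reasoning` fields that appear on each trial in the TrialResult output below:

```ts
// my-grader.ts — hypothetical sketch; the real contract lives in the
// plaited-eval-adapters skill. Assumption: a default export that takes
// the prompt case plus the agent's output and returns the grading
// fields seen on each trial (`pass`, `score`, `reasoning`).
interface GradeInput {
  input: string | string[];
  output: string;      // the adapter's response for this trial
  hint?: string;       // optional grader context hint from the prompt case
  reference?: string;  // optional reference solution from the prompt case
}

export default async function grade({ output, reference }: GradeInput) {
  const pass = reference ? output.includes(reference) : output.length > 0;
  return {
    pass,
    score: pass ? 1 : 0,
    reasoning: pass ? "Matched reference" : "No match",
  };
}
```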
Compares two TrialResult JSONL runs and computes aggregate metrics plus per-prompt deltas with bootstrap confidence intervals.
```sh
# Basic comparison
bunx plaited compare-trials '{
  "baselinePath": "./baseline-results.jsonl",
  "challengerPath": "./challenger-results.jsonl"
}'

# Comparison with labels and bootstrap settings
bunx plaited compare-trials '{
  "baselinePath": "./baseline-results.jsonl",
  "challengerPath": "./challenger-results.jsonl",
  "baselineLabel": "gpt-4",
  "challengerLabel": "gpt-4o",
  "confidence": 0.95,
  "resamples": 1000
}'
```
| Field | Required | Default | Description |
|---|---|---|---|
| `baselinePath` | Yes | - | Path to baseline TrialResult JSONL |
| `challengerPath` | Yes | - | Path to challenger TrialResult JSONL |
| `baselineLabel` | No | `"baseline"` | Label for the baseline run |
| `challengerLabel` | No | `"challenger"` | Label for the challenger run |
| `confidence` | No | 0.95 | Confidence level for bootstrap CIs |
| `resamples` | No | 1000 | Number of bootstrap resamples |
The comparison returns aggregate metrics for each run and per-prompt deltas, with bootstrap confidence intervals on the differences.
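The bootstrap here is the standard percentile bootstrap; the sketch below illustrates that general technique under the default settings (confidence 0.95, 1000 resamples) and is not Plaited's implementation:

```ts
// Percentile bootstrap CI over per-prompt deltas — an illustration of
// the general technique, not Plaited's actual code.
function bootstrapCI(
  deltas: number[],   // e.g. challenger passRate - baseline passRate per prompt
  confidence = 0.95,
  resamples = 1000,
): { lo: number; hi: number } {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    // Resample prompts with replacement and record the mean delta.
    const sample = deltas.map(
      () => deltas[Math.floor(Math.random() * deltas.length)],
    );
    means.push(mean(sample));
  }
  means.sort((a, b) => a - b);
  const alpha = (1 - confidence) / 2;
  return {
    lo: means[Math.floor(alpha * resamples)],
    hi: means[Math.ceil((1 - alpha) * resamples) - 1],
  };
}
```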
Eval runs produce TrialResult objects as JSONL:
```json
{
  "id": "prompt-case-id",
  "input": "The prompt text",
  "k": 3,
  "passRate": 0.67,
  "passAtK": 0.96,
  "passExpK": 0.30,
  "trials": [
    {
      "trialNum": 1,
      "output": "Agent response...",
      "duration": 1234,
      "pass": true,
      "score": 0.9,
      "reasoning": "Correct and complete"
    }
  ],
  "metadata": {}
}
```
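Read back programmatically, the sample above implies a shape roughly like the following — a sketch inferred from that sample, not a published type, and the field optionality is an assumption:

```ts
// Sketch of the TrialResult shape inferred from the sample above;
// not an official exported type.
interface Trial {
  trialNum: number;
  output: string;
  duration: number;   // ms
  pass?: boolean;     // grading fields present when a grader ran
  score?: number;
  reasoning?: string;
}

interface TrialResult {
  id: string;
  input: string | string[];  // string[] for multi-turn cases
  k: number;
  passRate: number;          // passes / k
  passAtK: number;
  passExpK: number;
  trials: Trial[];
  metadata: Record<string, unknown>;
}
```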
| Metric | Description |
|---|---|
| `passRate` | Simple ratio: passes / k |
| `pass@k` | Probability of at least one pass in k samples |
| `pass^k` | Probability of all k samples passing |
| `avgFlakiness` | pass@k - pass^k (indicates non-determinism) |
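To make the relationships concrete, here is the arithmetic behind the sample TrialResult above, treating passRate as the per-trial pass probability — a simplifying assumption, but one consistent with that sample's numbers:

```ts
// Metric arithmetic matching the sample TrialResult (k = 3, 2 passes).
// Assumes passRate estimates the per-trial pass probability p.
const k = 3;
const p = 2 / 3;                      // passRate ≈ 0.67

const passAtK = 1 - (1 - p) ** k;     // 1 - (1/3)^3 ≈ 0.96
const passExpK = p ** k;              // (2/3)^3 ≈ 0.30
const flakiness = passAtK - passExpK; // ≈ 0.67: highly non-deterministic
```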
Use compare-trials to evaluate strategy changes:
```sh
# 1. Run the baseline evaluation
bunx plaited eval '{
  "adapterPath": "./adapters/baseline.ts",
  "promptsPath": "./prompts.jsonl",
  "k": 5,
  "outputPath": "./baseline.jsonl"
}'

# 2. Run the challenger evaluation
bunx plaited eval '{
  "adapterPath": "./adapters/challenger.ts",
  "promptsPath": "./prompts.jsonl",
  "k": 5,
  "outputPath": "./challenger.jsonl"
}'

# 3. Compare the results
bunx plaited compare-trials '{
  "baselinePath": "./baseline.jsonl",
  "challengerPath": "./challenger.jsonl",
  "baselineLabel": "strategy-a",
  "challengerLabel": "strategy-b"
}'
```
Prompts are provided as JSONL (newline-delimited JSON):
{"id": "case-1", "input": "What is 2+2?"}
{"id": "case-2", "input": "Write a hello world in Python"}
{"id": "case-3", "input": ["First turn", "Follow up question"], "hint": "Multi-turn"}
| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique test case identifier |
| `input` | Yes | Prompt text, or an array of strings for multi-turn |
| `hint` | No | Context hint passed to the grader |
| `reference` | No | Reference solution |
| `metadata` | No | Categorization metadata |
| `timeout` | No | Per-case timeout override (ms) |
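For larger suites it can help to generate the file rather than write it by hand. A minimal sketch, assuming only the fields in the table above (`PromptCase` is our own name for the row shape, not an exported type):

```ts
// Generate prompts.jsonl from in-memory cases. "PromptCase" is a name
// chosen here for the row shape in the table above, not an exported type.
import { writeFileSync } from "node:fs";

interface PromptCase {
  id: string;
  input: string | string[];
  hint?: string;
  reference?: string;
  metadata?: Record<string, unknown>;
  timeout?: number; // ms, overrides the run-level timeout
}

const cases: PromptCase[] = [
  { id: "case-1", input: "What is 2+2?", reference: "4" },
  { id: "case-2", input: ["First turn", "Follow up"], hint: "Multi-turn" },
];

writeFileSync(
  "./prompts.jsonl",
  cases.map((c) => JSON.stringify(c)).join("\n") + "\n",
);
```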
See plaited-eval-adapters for writing eval-compatible adapter scripts.