Skill

model-evaluation

Evaluate and compare models for a use case — quality, latency, cost, reliability, and safety benchmarks on your own data.

Install

npx claudepluginhub hpsgd/turtlestack --plugin ai-engineer

Tool Access

This skill is limited to using the following tools:

Read, Write, Edit, Bash, Glob, Grep

Preview

Evaluate and select a model for $ARGUMENTS.

SKILL.md

Stats

Parent Repo Stars: 0

Parent Repo Forks: 0

Last Commit: Apr 2, 2026

Requirement	Question	Example
Quality threshold	What is the minimum acceptable accuracy on your eval set?	>= 90% accuracy on classification, >= 85% on generation
Latency budget	What is the maximum acceptable response time?	p95 < 2 seconds TTFT, p95 < 5 seconds total
Cost budget	What can you spend per request and per month?	< $0.02 per request, < $3,000/month at projected volume
Context window	How much input data must fit in a single request?	Typical: 2K tokens, maximum: 15K tokens
Reliability	What error and timeout rates are acceptable?	< 0.5% error rate, < 1% timeout rate
Safety	What refusal rate on valid inputs is tolerable?	< 2% false refusal rate on legitimate requests
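
The example cost budget pairs a per-request ceiling with a monthly ceiling, and the two only agree up to a certain volume. The sketch below works through that arithmetic with the example figures from the table; the function names are illustrative and not part of the skill.

```python
from decimal import Decimal


def max_monthly_requests(per_request_budget: str, monthly_budget: str) -> int:
    """Largest monthly request count that stays within the monthly budget
    if every request costs the full per-request ceiling."""
    per_request = Decimal(per_request_budget)
    monthly = Decimal(monthly_budget)
    if per_request <= 0:
        raise ValueError("per-request budget must be positive")
    return int(monthly // per_request)


def budgets_consistent(projected_requests: int,
                       per_request_budget: str,
                       monthly_budget: str) -> bool:
    """True if the projected volume can fit under both ceilings at once."""
    return projected_requests <= max_monthly_requests(per_request_budget, monthly_budget)


if __name__ == "__main__":
    # Example budget from the table: < $0.02 per request, < $3,000/month.
    print(max_monthly_requests("0.02", "3000"))
    print(budgets_consistent(200_000, "0.02", "3000"))
```

If the projected volume exceeds that ceiling, the monthly budget is the binding constraint and the effective per-request budget is lower than the stated one.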

Tier	Model class	When to consider	Cost profile
Fast	Small/cheap (Haiku-class)	Classification, extraction, formatting, routing, mechanical tasks	Lowest
Standard	Mid-range (Sonnet-class)	Most features — summarisation, analysis, general generation	Moderate
Capable	Large (Opus-class)	Complex reasoning, creative generation, critical decisions	Highest

Model	Tier	Provider	Rationale for inclusion

Dimension	Metric	How to measure	Weight
Quality	Eval set accuracy	Run all 50+ examples, score against expected outputs. Use automated scoring where possible, human review for subjective quality	Primary
Latency	TTFT + total generation time	Time each request. Report p50, p95, p99. Test at expected concurrency, not just sequential	High
Cost	Per-request and monthly cost	Per request: (input tokens x input price) + (output tokens x output price). Per month: per-request cost x projected volume	High
Reliability	Error rate + timeout rate	Run eval set 3 times. Record failures, timeouts (>30s), inconsistent outputs across runs	Medium
Context window	Maximum input capacity	Test with largest expected input. Verify quality does not degrade near the window limit	Pass/fail
Safety	Refusal rate on valid inputs	Count how many legitimate eval examples the model refuses. Also test adversarial inputs from the eval set	Medium
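
As a rough illustration of the latency and cost rows, the sketch below computes nearest-rank percentiles from recorded timings and reads the cost formula as a per-request cost that is then multiplied by projected volume for the monthly figure. The function names are hypothetical, and quoting prices per million tokens is an assumption; adjust to however your provider quotes them.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples recorded")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_mtok: float,
                 output_price_per_mtok: float) -> float:
    """Cost of one request, with prices given per million tokens (assumed unit)."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


def monthly_cost(per_request: float, projected_volume: int) -> float:
    """Projected monthly spend: per-request cost times requests per month."""
    return per_request * projected_volume


if __name__ == "__main__":
    timings_ms = [420.0, 510.0, 480.0, 1900.0, 455.0, 530.0, 610.0, 470.0, 495.0, 505.0]
    for p in (50, 95, 99):
        print(f"p{p}: {percentile(timings_ms, p)} ms")
    per_request = request_cost(2_000, 500, 1.0, 5.0)  # made-up prices
    print(per_request, monthly_cost(per_request, 100_000))
```

With only a handful of samples, p95 and p99 collapse onto the slowest request, which is one reason to time the full eval set at expected concurrency rather than a few sequential calls.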

### [Model Name] — Results

| Dimension | Result | Meets requirement? |
|---|---|---|
| Quality | [X]% accuracy (N/50 correct) | YES / NO |
| Latency (p95) | TTFT: [X]ms, Total: [X]ms | YES / NO |
| Cost | $[X] per request, $[X]/month projected | YES / NO |
| Reliability | [X]% error rate, [X]% timeout rate | YES / NO |
| Context window | Tested at [X] tokens, quality maintained | YES / NO |
| Safety | [X]% refusal rate on valid inputs | YES / NO |

**Failure analysis:**
- [List specific eval examples that failed and why]
- [Common failure patterns]

Dimension	Requirement	C1: [model]	C2: [model]	C3: [model]
Quality	>= [X]%	[score]	[score]	[score]
Latency (p95)	< [X]ms	[value]	[value]	[value]
Cost/request	< $[X]	[value]	[value]	[value]
Cost/month	< $[X]	[value]	[value]	[value]
Reliability	< [X]% errors	[value]	[value]	[value]
Context window	[X] tokens	[pass/fail]	[pass/fail]	[pass/fail]
Safety	< [X]% refusals	[value]	[value]	[value]
All requirements met?		YES / NO	YES / NO	YES / NO
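
The "All requirements met?" row can be derived mechanically from the rows above it. The following is a minimal sketch under assumed names: the `Requirements` and `CandidateResult` classes and all numbers in the usage section are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    min_accuracy_pct: float
    max_p95_latency_ms: float
    max_cost_per_request: float
    max_cost_per_month: float
    max_error_rate_pct: float
    max_refusal_rate_pct: float


@dataclass(frozen=True)
class CandidateResult:
    name: str
    accuracy_pct: float
    p95_latency_ms: float
    cost_per_request: float
    cost_per_month: float
    error_rate_pct: float
    context_window_pass: bool
    refusal_rate_pct: float


def check(result: CandidateResult, req: Requirements) -> dict[str, bool]:
    """Pass/fail per dimension, using the same comparisons as the table."""
    return {
        "Quality": result.accuracy_pct >= req.min_accuracy_pct,
        "Latency (p95)": result.p95_latency_ms < req.max_p95_latency_ms,
        "Cost/request": result.cost_per_request < req.max_cost_per_request,
        "Cost/month": result.cost_per_month < req.max_cost_per_month,
        "Reliability": result.error_rate_pct < req.max_error_rate_pct,
        "Context window": result.context_window_pass,
        "Safety": result.refusal_rate_pct < req.max_refusal_rate_pct,
    }


def all_requirements_met(result: CandidateResult, req: Requirements) -> bool:
    return all(check(result, req).values())


if __name__ == "__main__":
    req = Requirements(90.0, 5000.0, 0.02, 3000.0, 0.5, 2.0)
    # Placeholder candidates; the figures are invented for illustration.
    candidates = [
        CandidateResult("candidate-1", 92.0, 1800.0, 0.004, 400.0, 0.2, True, 1.0),
        CandidateResult("candidate-2", 88.0, 900.0, 0.001, 100.0, 0.1, True, 0.5),
    ]
    for c in candidates:
        print(c.name, "YES" if all_requirements_met(c, req) else "NO")
```

Keeping the per-dimension dictionary around, rather than only the final boolean, makes it easy to see which requirement a rejected candidate missed.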

### Selected model: [model name]

**Rationale:**
- Meets all [N] requirements
- Quality: [X]% (above [threshold]% threshold)
- Latency: [X]ms p95 (within [budget]ms budget)
- Cost: $[X]/request, $[X]/month (within $[budget]/month budget)
- [Specific advantage over other candidates that met requirements]

**Trade-offs accepted:**
- [What you sacrifice by choosing this model over alternatives]
- [Any requirement that is met with thin margin — risk area]

Scenario	Fallback action
Primary model unavailable	Route to [fallback model] — tested against eval set, meets requirements with [trade-off]
Sustained latency degradation	Switch to [faster model] with [quality trade-off described]
Cost spike	Rate limit to [N] requests/minute, alert team
Quality degradation detected	Roll back to previous model version, trigger re-evaluation
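
For the first scenario in the table, routing to a fallback can be as small as a wrapper around the two model calls. This is a sketch only: `ModelUnavailable`, `call_with_fallback`, and the stub callables are hypothetical names, and a real client would raise its own exception types.

```python
from typing import Callable


class ModelUnavailable(Exception):
    """Hypothetical error a model client raises when the model cannot be reached."""


def call_with_fallback(prompt: str,
                       primary: Callable[[str], str],
                       fallback: Callable[[str], str]) -> tuple[str, str]:
    """Try the primary model; route to the fallback only if it is unavailable.

    Returns (route, response) so callers can log how often the fallback is used.
    """
    try:
        return "primary", primary(prompt)
    except ModelUnavailable:
        return "fallback", fallback(prompt)


if __name__ == "__main__":
    def down(prompt: str) -> str:
        raise ModelUnavailable("primary model is down")

    def backup(prompt: str) -> str:
        return f"backup answer to: {prompt}"

    print(call_with_fallback("classify this ticket", down, backup))
```

Returning the route alongside the response keeps fallback usage visible, which matters because the fallback was accepted with a known trade-off.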

# Model Evaluation: [use case]

## Requirements

| Requirement | Threshold |
|---|---|

## Candidates

| # | Model | Tier | Provider | Rationale |
|---|---|---|---|---|

## Evaluation Dataset

- Total examples: [N]
- Happy path: [N] | Edge cases: [N] | Adversarial: [N]
- Location: [path to eval set]

## Results

### [Model 1]
[Per-dimension results table + failure analysis]

### [Model 2]
[Per-dimension results table + failure analysis]

## Comparison

[Side-by-side table with pass/fail per requirement]

## Decision

- **Selected:** [model]
- **Rationale:** [tied to requirements]
- **Trade-offs:** [what you sacrifice]

## Fallback Plan

| Scenario | Action |
|---|---|

## Re-evaluation Schedule

- Next evaluation: [date]
- Trigger conditions: [what forces an early re-evaluation]

Process (sequential — do not skip steps)

Step 1: Requirements Definition

Step 2: Candidate Selection

Step 3: Evaluation Dataset

Step 4: Evaluation Dimensions

Step 5: Run Evaluation

Step 6: Comparison Analysis

Step 7: Decision and Fallback

Anti-Patterns (NEVER do these)

Output Format