From sensei
Evaluate what the tests actually prove and what they miss. Use when reviewing a PR's test coverage, when a developer says "I wrote tests", or when you want to challenge the developer to reason about test quality, not just test quantity.
npx claudepluginhub onehorizonai/sensei --plugin sensei
This skill uses the workspace's default tool permissions.
Evaluate what the tests prove and what they fail to prove.
A passing test is not the same as a test that verifies behavior.
A suite that checks only the happy path, mocks every dependency, and never exercises error paths can report 80% coverage while proving almost nothing real.
The question is not "did the tests pass?" but "what would have to be true of the code for these tests to catch a regression?"
Ask the developer to answer that question before reviewing the tests yourself.
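A minimal sketch of the gap described above, using hypothetical names (charge, PaymentGateway-style fakes are illustrative, not from any real library): a happy-path test with a fully mocked dependency covers most lines yet proves nothing about the error paths, which is exactly what the questions below are meant to surface.

```python
class GatewayError(Exception):
    pass

def charge(gateway, amount):
    """Submit a charge; reject bad amounts, swallow gateway failures."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    try:
        return gateway.submit(amount)
    except GatewayError:
        return None  # silently swallowed -- is this the intended behavior?

class FakeGateway:
    # A mock that always succeeds: the only path it can ever exercise.
    def submit(self, amount):
        return {"status": "ok", "amount": amount}

def test_charge_happy_path():
    # Covers most lines of charge(), proves nothing about either error path.
    assert charge(FakeGateway(), 10)["status"] == "ok"

# The tests a regression would actually need: pin the error paths too.
def test_charge_rejects_nonpositive_amount():
    try:
        charge(FakeGateway(), 0)
        assert False, "expected ValueError"
    except ValueError:
        pass

class FailingGateway:
    # A mock that fails, so the except branch is exercised for real.
    def submit(self, amount):
        raise GatewayError("declined")

def test_charge_returns_none_on_gateway_failure():
    assert charge(FailingGateway(), 10) is None
```

Run only the first test and a regression in either error branch passes review unnoticed; run all three and each branch of charge() is pinned.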
What does each test actually verify?
What is not tested?
Are security-sensitive behaviors tested?
Are characterization tests present for legacy or unfamiliar code?
Are the mocks meaningful?
Is this the right test level?
What is the failure mode?
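The "are the mocks meaningful?" question can be made concrete with a hypothetical sketch (all names are illustrative): a mock that embeds a wrong assumption about the dependency's return shape lets the test pass while the code would fail against the real dependency.

```python
def top_user_name(client):
    # Production code assumes list_users() returns a bare list of dicts.
    users = client.list_users()
    return users[0]["name"]

class OptimisticMock:
    # Embeds the same assumption as the code under test. If the real API
    # actually wraps results as {"items": [...]}, this mock hides the bug.
    def list_users(self):
        return [{"name": "ada"}]

def test_top_user_name_with_optimistic_mock():
    # Passes -- but only because mock and code share the same wrong guess.
    assert top_user_name(OptimisticMock()) == "ada"

class RealisticMock:
    # Mirrors the (assumed) real wire format instead; now the test fails
    # with a KeyError, exposing the mismatch before production does.
    def list_users(self):
        return {"items": [{"name": "ada"}]}
```

When reviewing mocks, ask what the real dependency returns and whether anyone has checked the mock against it; a mock copied from the code's own assumptions can only confirm those assumptions.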
Plain-English takeaway:
[Whether a non-technical owner should feel confident, cautious, or blocked]
What the tests prove:
[Concrete list — not "tests happy path" but "proves that X returns Y when Z"]
What the tests do not prove:
[Specific uncovered scenarios]
Characterization coverage:
[If legacy or unfamiliar code changed: what current behavior was pinned before the change, "missing", or "not applicable"]
Security coverage:
[What sign-in, permission, input, secret, privacy, or customer-boundary behavior is proven, missing, or not applicable]
Riskiest uncovered case:
[The one missing test most likely to correspond to a real bug]
Mock quality:
[Are mocks realistic? What assumptions do they embed?]
Evidence to add next:
[The smallest test or check that would reduce the biggest risk]
Question for you:
[A specific question about what the developer was trying to prove]