From trailmark
Triages mutation testing survivors and necessist results with Trailmark graphs to flag false positives, test gaps, and fuzz targets. Use after running mutators like circomvent.
npx claudepluginhub trailofbits/skills --plugin trailmark
This skill uses the workspace's default tool permissions.
Combines mutation testing and necessist (test statement removal) with code graph analysis to triage findings into actionable categories: false positives, missing unit tests, and fuzzing targets.
If uv run trailmark fails, run:
uv pip install trailmark
DO NOT fall back to "manual verification" or "manual analysis" as a substitute for running trailmark. Install it first. If installation fails, report the error instead of switching to manual analysis.
Install necessist with cargo install necessist. See references/mutation-frameworks.md for details.
Run ulimit -n 1024 before any mull-runner invocation. macOS Tahoe (26+) sets unlimited file descriptors by default, which crashes Mull's subprocess spawning. See references/mutation-frameworks.md for details.
| Rationalization | Why It's Wrong | Required Action |
|---|---|---|
| "All survived mutants need tests" | Many are harmless or equivalent | Triage before writing tests |
| "Mutation testing is too noisy" | Noise means you're not triaging | Use graph data to filter |
| "Unit tests cover everything" | Complex data flows need fuzzing | Check entrypoint reachability |
| "Dead code mutants don't matter" | Dead code should be removed | Flag for cleanup |
| "Low complexity = low risk" | Boundary bugs hide in simple code | Check mutant location |
| "Tool isn't installed, I'll do it manually" | Manual analysis misses what tooling catches | Install the tool first |
| "Necessist isn't mutation testing, skip it" | Necessist finds what mutation testing misses: weak tests | Run both when the language supports it |
# 1. Build the code graph
uv run trailmark analyze --summary {targetDir}
# 2. Run mutation testing (language-dependent)
# Python:
uv run mutmut run --paths-to-mutate {targetDir}/src
uv run mutmut results
# 2b. Run necessist (if language supported)
necessist
# 3. Analyze results with this skill's workflow (Phase 3)
Phase 1: Graph Build → Parse codebase with trailmark
↓
Phase 2: Mutation Run → Execute mutation testing framework
Phase 2b: Necessist Run → Remove test statements (optional, parallel)
↓
Phase 3: Triage → Classify findings using graph data
↓
Output: Categorized Report
├── Corroborated (both tools flag same function — highest value)
├── False Positives (harmless, skip)
├── Missing Tests (write unit tests)
└── Fuzzing Targets (set up fuzz harnesses)
├─ Need to set up mutation testing for a language?
│ └─ Read: references/mutation-frameworks.md
│
├─ Need to set up necessist or find weak test statements?
│ └─ Read: references/mutation-frameworks.md (Necessist section)
│
├─ Need to understand the triage criteria in depth?
│ └─ Read: references/triage-methodology.md
│
├─ Need to understand how graph data informs triage?
│ └─ Read: references/graph-analysis.md
│
└─ Already have results + graph? Use Phase 3 below.
Parse the target codebase with trailmark and run pre-analysis before mutation testing. Pre-analysis computes blast radius, entry points, privilege boundaries, and taint propagation, which Phase 3 uses for triage.
uv run trailmark analyze --summary {targetDir}
Use the QueryEngine API to build the graph and run pre-analysis:
QueryEngine.from_directory("{targetDir}", language="{lang}")
engine.preanalysis(): mandatory before triage
engine.to_json() for cross-referencing with mutation results
See references/graph-analysis.md for the full API: node mapping, reachability queries, blast radius, and pre-analysis subgraph lookups.
Select and run the appropriate framework. See references/mutation-frameworks.md for language-specific setup.
Capture survived mutants. Each framework reports differently, but extract these fields per mutant:
| Field | Description |
|---|---|
| File path | Source file containing the mutant |
| Line number | Line where mutation was applied |
| Mutation type | What was changed (operator, value, etc.) |
| Status | survived, killed, timeout, error |
Filter to survived mutants only for Phase 3.
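The four fields above can be held in a small normalized record so that every framework's output is filtered the same way. The following is a minimal sketch, not part of trailmark or of any mutation framework: the MutantRecord class, its field names, and survived_only are hypothetical names chosen for illustration, and the sample paths are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutantRecord:
    """Hypothetical normalized record; one per mutant, any framework."""
    file_path: str      # source file containing the mutant
    line: int           # line where the mutation was applied
    mutation_type: str  # what was changed (operator, value, etc.)
    status: str         # "survived", "killed", "timeout", or "error"


def survived_only(records):
    """Keep only the mutants that Phase 3 needs to triage."""
    return [r for r in records if r.status == "survived"]


# Illustrative data only; real records come from the framework's report.
records = [
    MutantRecord("src/parser.py", 42, "operator", "survived"),
    MutantRecord("src/parser.py", 57, "value", "killed"),
    MutantRecord("src/util.py", 10, "operator", "timeout"),
]
survivors = survived_only(records)
```

Timeouts and errors are dropped here along with killed mutants, which matches the instruction to pass only survivors to Phase 3.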
If the target language is supported (Go, Rust, Solidity/Foundry, TypeScript/Hardhat, TypeScript/Vitest, Rust/Anchor), run necessist to find unnecessary test statements. This runs independently of Phase 2 and can execute in parallel.
# Auto-detect framework
necessist
# Or target specific test files
necessist tests/test_parser.rs
# Export results
necessist --dump
Filter to findings where the test passed after removal. See references/mutation-frameworks.md for framework-specific configuration and the normalized record format.
Map each removal to a production function using the algorithm in references/graph-analysis.md.
For each survived mutant and each necessist removal, determine its triage bucket using graph data. Necessist removals must first be mapped to a production function (see references/graph-analysis.md).
| Signal | Bucket | Reasoning |
|---|---|---|
| No callers in graph | False Positive | Dead code, mutant is unreachable |
| Only test callers | False Positive | Test infrastructure, not production |
| Logging/display string | False Positive | Cosmetic, no behavioral impact |
| Equivalent mutant | False Positive | Behavior unchanged despite mutation |
| Simple function, low CC, no entrypoint path | Missing Tests | Unit test is straightforward |
| Error handling path | Missing Tests | Should have negative test cases |
| Boundary condition (off-by-one) | Missing Tests | Property-based test candidate |
| Pure function, deterministic | Missing Tests | Easy to test, high value |
| High CC (>10), entrypoint reachable | Fuzzing Target | Complex + exposed = fuzz it |
| Parser/validator/deserializer | Fuzzing Target | Structured input handling |
| Many callers (>10) + moderate CC | Fuzzing Target | High blast radius |
| Binary/wire protocol handling | Fuzzing Target | Fuzzers excel at format testing |
| Signal | Bucket | Reasoning |
|---|---|---|
| Redundant setup or debug call | False Positive | Statement genuinely unnecessary |
| Cannot map to production function | False Positive | No graph context for triage |
| Call removed, no assertion checks its effect | Missing Tests | Test has weak assertions |
| Assertion removed, test still passes | Missing Tests | Redundant or insufficient coverage |
| Maps to high-CC entrypoint-reachable function | Fuzzing Target | Complex + exposed + weak test |
When both mutation testing and necessist flag the same production function, mark as corroborated — highest confidence finding.
For detailed criteria, see references/triage-methodology.md.
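Corroboration amounts to a set intersection over production function names. The sketch below assumes each finding has already been mapped to a function name; source_labels is a hypothetical helper, and its return values are the three Source values used in the report (mutation, necessist, corroborated).

```python
def source_labels(mutant_functions, necessist_functions):
    """Label each production function by which tool(s) flagged it."""
    mutants = set(mutant_functions)
    removals = set(necessist_functions)
    labels = {}
    for fn in mutants | removals:
        if fn in mutants and fn in removals:
            labels[fn] = "corroborated"  # both tools agree: highest confidence
        elif fn in mutants:
            labels[fn] = "mutation"
        else:
            labels[fn] = "necessist"
    return labels


# Illustrative function names only.
labels = source_labels(
    ["parse_header", "checksum"],
    ["parse_header", "format_log"],
)
```

Functions labeled corroborated would be listed first in the report, ahead of single-tool findings.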
For each mutant, map it to its containing graph node and use pre-analysis subgraphs (tainted, high_blast_radius, privilege_boundary) from Phase 1 to classify it. The classification logic checks: no callers → false positive, privilege boundary → fuzzing, high CC + tainted → fuzzing, high blast radius → fuzzing, otherwise → missing tests.
See references/graph-analysis.md for the batch_triage implementation and node mapping functions.
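The order of checks described above can be sketched as a single function. This is an illustration of the stated rule order only, not the batch_triage implementation from the references: NodeFacts, its fields, and classify are hypothetical names, and the CC threshold of 10 is taken from the classification table in Phase 3.

```python
from dataclasses import dataclass

HIGH_CC = 10  # "High CC (>10)" threshold from the classification table


@dataclass
class NodeFacts:
    """Hypothetical per-function facts gathered from the Phase 1 graph."""
    caller_count: int
    cyclomatic_complexity: int
    tainted: bool = False             # in the tainted subgraph
    high_blast_radius: bool = False   # in the high_blast_radius subgraph
    privilege_boundary: bool = False  # in the privilege_boundary subgraph


def classify(node: NodeFacts) -> str:
    """Apply the checks in the order the text gives them."""
    if node.caller_count == 0:
        return "false_positive"   # no callers: unreachable
    if node.privilege_boundary:
        return "fuzzing_target"
    if node.cyclomatic_complexity > HIGH_CC and node.tainted:
        return "fuzzing_target"
    if node.high_blast_radius:
        return "fuzzing_target"
    return "missing_tests"        # default bucket
```

Because the no-callers check comes first, a complex but unreachable function is still reported as a false positive rather than a fuzzing target.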
Generate a markdown report:
# Genotoxic Triage Report
## Summary
- Total survived mutants: N
- Total necessist removals: N
- Corroborated findings: N
- False positives: N (N%)
- Missing test coverage: N (N%)
- Fuzzing targets: N (N%)
## Corroborated Findings
| File | Line | Function | Mutation Signal | Necessist Signal | Action |
|------|------|----------|----------------|------------------|--------|
## False Positives
| File | Line | Mutation | Reason | Source |
|------|------|----------|--------|--------|
## Missing Test Coverage
| File | Line | Function | CC | Callers | Suggested Test | Source |
|------|------|----------|----|---------|----------------|--------|
## Fuzzing Targets
| File | Line | Function | CC | Entrypoint Path | Blast Radius | Source |
|------|------|----------|----|-----------------|--------------|--------|
The Source column is mutation, necessist, or corroborated.
Write the report to GENOTOXIC_REPORT.md in the working directory.
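The bucket lines in the Summary section can be derived from the list of per-finding buckets. The sketch below assumes percentages are taken over all triaged findings, which the template does not state; summary_lines and the bucket keys are hypothetical names.

```python
from collections import Counter

# (bucket key, label as it appears in the report template)
BUCKETS = [
    ("false_positive", "False positives"),
    ("missing_tests", "Missing test coverage"),
    ("fuzzing_target", "Fuzzing targets"),
]


def summary_lines(buckets):
    """Render the three 'N (N%)' bullet lines for the Summary section."""
    counts = Counter(buckets)
    total = len(buckets)
    lines = []
    for key, label in BUCKETS:
        n = counts[key]
        # Assumption: percentage of all triaged findings; 0 when empty.
        pct = round(100 * n / total) if total else 0
        lines.append(f"- {label}: {n} ({pct}%)")
    return lines


lines = summary_lines(
    ["false_positive", "missing_tests", "missing_tests", "fuzzing_target"]
)
```

The remaining Summary lines (total survived mutants, total necessist removals, corroborated findings) are plain counts and need no percentage.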
Before delivering:
GENOTOXIC_REPORT.md
trailmark skill:
property-based-testing skill:
testing-handbook-skills (fuzzing): harness-writing, cargo-fuzz, atheris
First-time users: Start with Phase 1 (graph build), then run mutations, then use the Quick Classification table in Phase 3.
Experienced users: Jump to Phase 3 and use the Decision Tree to load specific reference material.