Skill

refactor-verify

Verifies behavior-preserving refactors (rename, split, merge, extract, inline, dead code delete) via dependency tree planning, symbol-set/AST diffs, full test suite, and call-site reference closure.

Git

Python

npx claudepluginhub subinium/vibesubin --plugin vibesubin

Tool Access

This skill is limited to using the following tools:

Read Edit Grep Glob Bash(git diff *) Bash(git status *) Bash(git log *) Bash(git worktree *) Bash(git switch *) Bash(git stash *) Bash(git rev-parse *) Bash(git show *) Bash(git branch *) Bash(git reset *) Bash(git grep *) Bash(npm *) Bash(pnpm *) Bash(yarn *) Bash(pytest *) Bash(cargo *) Bash(go test *) Bash(tsc *) Bash(ruff *) Bash(eslint *) Bash(mypy *) Bash(python *) Bash(node *)

Preview

The operator asked for a change that's supposed to preserve behavior — a refactor, a rename, a split, an extract, a dead-code deletion. Your job is to *prove* that behavior was preserved, not just produce a diff that looks right.

Supporting Assets

references/language-smoke-tests.mdreferences/llm-failure-modes.mdreferences/verification-procedure.mdscripts/callsite-count.shscripts/smoke-test.shscripts/symbol-diff.sh

SKILL.md

Similar Skills

crefactor

Orchestrates safe refactoring ensuring tests pass before/after changes. Requires explicit approval for test modifications and generates characterization tests for low-coverage code.

9 tools

correctless

refactor

Runs autonomous iterative refactoring to improve tactical code quality (DRY, dead code, naming, complexity) via SME agents, QA verification, atomic git commits, looping until no improvements remain.

1 file

claude-swe-workflows

Code Refactor Workflow

Guides refactoring code structure and modules to improve maintainability, readability, and performance while preserving external behavior and ensuring all tests pass.

hatch3r

Stats

Parent Repo Stars11

Parent Repo Forks2

Last CommitApr 14, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

refactor-verify

The operator asked for a change that's supposed to preserve behavior — a refactor, a rename, a split, an extract, a dead-code deletion. Your job is to prove that behavior was preserved, not just produce a diff that looks right.

Behavior-preserving changes are the single biggest source of silent regressions when an LLM touches code. The classic failure is: the AI moves a function, updates the definition, and misses one of several call sites. The tests still pass because coverage was never complete. No one notices until a user hits the broken path.

This skill exists to stop that from happening. It covers two change families:

Structural refactors — move, rename, split, merge, extract, inline. The behavior is supposed to be identical afterward.
Safe deletions — removing code the operator has confirmed is dead (usually via fight-repo-rot). The behavior is supposed to be identical because the code wasn't running.

Both families use the same four verification checks.

The invariant

A change is not done until all four of these pass:

Symbol-set diff — every public/exported name that existed before the refactor still exists after it (or was deliberately removed). No silent drops.
AST body diff — every moved function/class body is structurally equivalent to its original, normalizing whitespace and comments.
Behavioral verification — typecheck, lint, smoke test (can the code even load?), test suite. All green.
Call-site closure — every reference to the moved symbol has been updated. The count of references before and after must match.

If any of the four fails, fix it and re-run all four. Do not partially claim success.

Procedure — 6 steps, run in order

Each step produces an artifact or an assertion. Never skip.

Step 1 — Snapshot and isolate before touching anything

Record the starting state. You will diff against this at the end, and you will need it to roll back cleanly if something goes wrong.

Isolation — pick one of two options:

Preferred: git worktree. Create a separate working directory for the change. Nothing in the operator's main checkout is touched while you work. Rollback = git worktree remove.
```
SNAP=$(git rev-parse HEAD)
git worktree add ../verify-<topic> "$SNAP"
cd ../verify-<topic>
git switch -c verify/<topic>
```
Acceptable: temporary work branch on the main checkout. Only if worktree isn't available (e.g., the operator's filesystem doesn't support multiple worktrees, or the repo has submodules that don't co-exist).
```
SNAP=$(git rev-parse HEAD)
git switch -c verify/<topic>
```

Do not use git stash as an isolation mechanism. Stashes are interruption-unsafe — a crash, a power loss, or another git stash from a second session can lose the stash ref, and recovery is manual. A branch is cheap, always recoverable, and always preferable. This rule is non-negotiable.

Never work directly on the operator's main branch. Even a three-line change gets a branch.

Capture the baseline — language-agnostic:

SNAP=$(git rev-parse HEAD) — the commit you'll diff against and roll back to
Exported symbol set per file (helper: scripts/symbol-diff.sh)
Full test run result (pass/fail, counts) — save to /tmp/verify-baseline-tests.txt
Build/typecheck result — save to /tmp/verify-baseline-build.txt
Lint result — save to /tmp/verify-baseline-lint.txt

Bootstrap the isolated workspace before behavioral checks. Fresh worktrees often do not contain project-local state like .venv/, node_modules/, vendor/, generated clients, or compiled assets. A "module not found" failure in a brand-new worktree is an environment problem until proven otherwise.

If the baseline is already red, stop. You cannot verify a change on top of broken code — every failure afterwards becomes ambiguous (did I cause it or was it already broken?). Ask the operator whether to fix the existing failures first. Do not proceed.

Rollback plan (write it down before starting):

# If anything goes wrong, this restores the operator's world exactly:
git switch main                           # leave the verify branch
git worktree remove ../verify-<topic>     # if you used a worktree
# or: git branch -D verify/<topic>        # if you used a branch
git reset --hard "$SNAP"                  # only if you somehow touched main

Keep $SNAP visible throughout the session. It is the single most important value.

Step 2 — Plan as a dependency tree

Do not start editing. Write the change down as a tree where each node is a small step, and its children are the prerequisite steps that must happen first. Execute leaves before parents. The principle is old — Mikado Method is one well-known formalization — but the shape is what matters: commit the dependency graph to a scratch file before touching any code, so when something fails you can tell whether the plan was wrong or the execution was wrong.

Example tree for "split big_module into big_module_core and big_module_helpers":

Split big_module
├── Create big_module_helpers.* with helper functions
│   └── Identify which functions are pure helpers (no cross-refs into big_module)
├── Create big_module_core.* with the rest
│   └── Update imports inside the moved bodies
├── Replace `import big_module` with `import big_module_core` (plus helpers) everywhere
│   └── Inventory all current import sites
└── Delete the original big_module
    └── Confirm no remaining references

Example tree for "delete confirmed-dead legacy_reports module":

Delete legacy_reports
├── Verify zero external references (grep + LSP find-references)
├── Delete legacy_reports/ directory
├── Remove any import of legacy_reports from __init__.py / index.ts
└── Confirm symbol-set diff shows only the expected removals

Each node in the tree has three fields:

Files touched (concrete paths — never src/*)
Verification command (how you'll prove this node succeeded)
Invariant preserved (what must still be true after this node commits)

Write the tree as a scratch file (e.g., .verify-plan.md, gitignored). Show it to the operator before proceeding if the change touches more than 3 files. Let them approve or adjust.

Detailed pattern, including when to linearize vs branch the tree and how to handle cross-cutting constraints: references/verification-procedure.md.

Step 3 — Execute from the leaves up — structural commits only

Follow Tidy First: a commit is either structural (move, rename, extract, inline) or behavioral (logic change). Never mixed. This makes every commit independently verifiable.

For each leaf:

Make the change on disk
Run the leaf's verification command
If it passes, commit with a clear message (see write-for-ai skill for the format)
Move up the tree

If a leaf fails, do not move up. Fix the leaf or revise the tree. A failed leaf that you skip becomes a load-bearing bug later.

Step 4 — Per-node AST diff (structural preservation)

For every function, class, or constant that got moved in this step, prove its body survived intact.

Extract the symbol from the old location (using the snapshot from step 1 — or git show HEAD~1:old_path). Extract it from the new location. Normalize both (strip whitespace, strip comments, strip trailing commas). Compare.

If the language has AST tooling:

ast-grep --lang <lang> --pattern '<pattern>' <path> — find the symbol
tree-sitter parse <path> — get the AST

If it doesn't, fall back to a normalized text diff:

# pseudocode
old_body=$(extract_symbol_from_git_show HEAD~1:old_path symbol_name | normalize)
new_body=$(extract_symbol_from_file new_path symbol_name | normalize)
diff <(echo "$old_body") <(echo "$new_body")

The goal is a zero-byte diff on the normalized form. If there's a diff, you didn't just move the symbol — you also mutated it. That's a behavioral change hiding in a structural commit, which violates Tidy First.

Detail: see the "Step 4 — Per-node AST (or byte) diff" section of references/verification-procedure.md.

Step 5 — Behavioral verification (can it still run)

Run the full smoke-test chain for the language. Never skip because "it's just a rename."

Order matters — fail fast on the cheapest check first:

Syntax/parse (python -c 'import x', node -e "require('./')", cargo check, go build ./...)
Typecheck (mypy, tsc --noEmit, cargo check, go vet)
Lint (ruff, eslint, clippy, golangci-lint)
Import smoke test — attempt to load every module touched, not just the obvious ones
Unit tests (pytest, jest, cargo test, go test ./...)
Integration tests (if any)
Local run / dev server (if the project has one, hit one known endpoint and assert 200)

If any step fails, return to step 4 — something you thought was structural is actually behavioral.

Per-language details: references/language-smoke-tests.md

Step 6 — Call-site closure

This is the step that catches the #1 LLM failure mode in refactors and deletions: a rename that updated the definition but missed half the callers, or a "dead code" deletion where one caller turns out to be live.

Before the change (from step 1 snapshot), count every reference to each affected symbol. Prefer an LSP find-references query — gopls, rust-analyzer, pyright, typescript-language-server — because grep is a lower bound, not a ground truth.

# LSP preferred (accurate for cross-file symbols, handles dynamic dispatch)
# — run your editor's find-references or the CLI equivalent

# grep fallback — works for every language but is a lower bound
git grep -n 'old_symbol_name' | wc -l   # from the snapshot state

After the change, count references to the new name — plus any remaining references to the old name that shouldn't exist:

git grep -n 'new_symbol_name' | wc -l   # should equal the old count (renames) or zero (deletions of confirmed-dead code)
git grep -n 'old_symbol_name' | wc -l   # should equal 0 (or the number of deliberately-kept aliases)

If the counts don't match, you missed a callsite or your "dead" symbol wasn't actually dead. Go find the reference. Do not report done.

Treat renames and deletions as two-assertion checks, not one:

Operation	Assertion 1	Assertion 2
Rename	old-count before == new-count after	old-count after == `0`
Delete (dead code)	old-count before == `0` (confirmed before deleting)	old-count after == `0`
Move	old-path-count after == `0`	new-path-count after == old-path-count before

The helper supports this directly:

scripts/callsite-count.sh "$SNAP" HEAD old_name new_name

Grep is a lower bound. A match count of zero in grep does not prove the symbol is unreferenced — dynamic dispatch, reflection, string-built symbol names, and cross-language boundaries all evade grep. When deleting code based on a grep-zero result, label the confidence as MEDIUM and require a second signal (LSP find-references, import graph analysis, or an explicit operator confirmation).

When to stop and report

Report the change complete only when:

All six steps passed for every node in the plan
The behavioral verification is green on the post-change commit
The symbol-set diff, AST diff, smoke tests, and call-site closure all pass
You can produce the verification commands the operator can re-run

Report template (use in the PR description or session output):

## refactor-verify report

Change type: <refactor | rename | split | move | extract | inline | delete-dead>
Plan: <1-line summary of the dependency tree>

### Symbol-set diff
- Before: <N> exported symbols across <M> files
- After:  <N> exported symbols across <M'> files
- Diff:   <explicit delta, or "none">

### AST body preservation
- <N> moved symbols, all normalized-equivalent to source

### Behavioral
- Typecheck: ✅
- Lint:      ✅
- Tests:     ✅ (X/Y passing, unchanged from baseline)
- Smoke:     ✅ (imports + local run)

### Call-site closure
- Old references: <N> (from snapshot)
- New references: <N> (equal, verified)
- Orphans:        0

Verification commands (re-run anytime):
  <concrete command 1>
  <concrete command 2>

LLM failure modes to guard against

Documented failure patterns. If any of these trip during execution, the verification system should catch them — but it's faster to avoid them upfront.

Missed call sites — The #1 issue. Guard: step 6.
Hallucinated imports — You added from foo.bar import baz but foo.bar doesn't have baz. Guard: step 5's import smoke test.
Silent field omission — When copying a dict, class, or config block, one key disappears. Guard: symbol-set diff (step 1 vs post-change) and AST body diff (step 4).
Scope promotion — A local variable accidentally became a module-level variable during extraction. Guard: step 4 normalized diff.
Partial edits — Three of four callers were updated. Guard: step 6 count-match.
Comment/docstring drift — The change updated the code but not the docstring describing the old shape. Guard: treat comments inside moved symbols as part of the AST diff.
False-dead deletion — A "dead" symbol was actually reachable via dynamic dispatch / reflection / a string-built name. Guard: step 6 LSP-first references with explicit HIGH/MEDIUM/LOW confidence before deletion.

Full catalog with examples and detection notes: references/llm-failure-modes.md.

Information preservation during doc rewrites

If the change includes a README or architecture doc rewrite, apply the same discipline:

Before: extract the full list of concrete facts from the old doc (file paths, function names, environment variables, URLs, numbers, personal names, proper nouns).
After rewrite: grep each of those terms in the new doc. Any term that was in the old doc and is missing from the new doc is either a deliberate simplification (document the decision) or a silent loss (fix).
If the new doc changed section headers, keep a "moved from" note at least briefly so search still works.

See "Information preservation during doc rewrites" and the general preservation pattern in references/verification-procedure.md.

When not to use this skill

Typo fix in a single comment → just edit it
Adding a new feature (non-refactor) → this skill is for changes that preserve behavior, not for new behavior
Deleting code without a dead-code confirmation → run fight-repo-rot first to confirm the target is dead with explicit confidence, then come back here for the delete + verify
A single-file rename the operator says is trivial → still run step 5 and 6; step 2's tree is overkill but the verification isn't

What to do if the baseline is red

You cannot verify a change on top of broken tests. If step 1 discovers failing tests or typecheck errors:

Tell the operator which tests/checks are failing
Ask whether they are known failures (to be excluded from the baseline) or unknown failures (to be fixed first)
If unknown, offer to investigate the failures before the change
Do not proceed into step 2 with a red baseline

A change on top of red is indistinguishable from a change that caused red. You lose the ability to prove causation.

Sweep mode — read-only audit

When invoked from /vibesubin (the umbrella skill's parallel sweep), this skill runs in read-only audit mode. Do not snapshot, do not plan a dependency tree, do not execute any change, do not run the 6-step procedure.

Instead, produce a findings-only report:

Current baseline state: do tests pass right now, does typecheck pass, does lint pass, does the project even build? (Capture the baseline status, do not run recovery.)
Recent refactor risk: any commits in the last 30 days that restructured code without evidence of verification (no test results in PR bodies, no symbol-diff output, etc.).
Obvious hidden regression risk: exported symbols that appear in the repo but never show a call-site count, files that match the god-file pattern AND have a recent high-churn count, pending renames visible in branch names but not yet merged.
Stoplight verdict: 🟢 repo is in a green baseline and can accept a refactor / 🟡 baseline has known failures that block verification / 🔴 baseline is red and no refactor can be verified on top of it.
A one-line "fix with" pointer indicating /refactor-verify will plan and execute any specific change when invoked directly.

The operator reviews the sweep report and, if they want a specific refactor applied, invokes /refactor-verify directly — which then runs the full 6-step procedure.

How to tell: the task context from the umbrella will include a sweep=read-only marker or an explicit "produce findings only, do not edit" instruction. Obey it. If the operator invokes this skill by name, the full procedure applies and editing is expected.

Harsh mode — no hedging

When the task context contains the tone=harsh marker (usually set by the /vibesubin harsh umbrella invocation, but can also come from direct requests like "don't sugarcoat" / "brutal review" / "매운 맛" / "厳しめ"), switch output rules on both the sweep audit and the direct-invocation report:

Lead with the baseline verdict. First line of the report is whether the project is in a green baseline (refactor-safe) or red (cannot verify anything on top). No preamble.
Pass/fail verdicts are binary. Drop "mostly passing", "almost green", "a couple of flakes". A refactor is done or not done. If three of four verification passes succeeded and the fourth didn't, the change is not done — say so in one sentence.
Direct attribution on every failure. "src/api/user.ts::getUser is referenced 12 times but only 11 were updated. The one missed call site is tests/api/user.test.ts:47." Not "some call sites may still reference the old name."
No "verification is strongly recommended" hedging. The pack's entire premise is that verification is mandatory. Harsh mode restates this as a refusal: "Refusing to report done — the test suite did not run because the worktree was not bootstrapped." Not "you may want to run the tests first."
Dead-code deletions are direct. Balanced mode says "candidate for removal, confirm the LSP reference count first". Harsh mode says "LSP returns zero references, grep returns zero references — delete it."
Call-site closure mismatches are escalations. If the old-count and new-count don't match, harsh mode frames it as a bug-in-the-refactor, not a note-for-later. "The rename is incomplete. Do not merge this PR."

Harsh mode does not invent failures, skip verification steps, or become rude. Every harsh statement must cite the same symbol count, grep output, or test result the balanced version would cite. The change is framing, not substance.

Integration with other vibesubin skills

Input from fight-repo-rot — dead-code findings (with confidence) come in here as delete-dead jobs. The handoff includes the list of symbols and the confidence level; LOW-confidence deletions require an explicit operator OK.
Output to write-for-ai — hand off the commit and PR writing; write-for-ai knows how to document verification results.
Config-touching changes — coordinate with manage-secrets-env (if the change touches secrets, .env, or .gitignore) or project-conventions (if the change is about branch strategy, directory layout, dep pinning, or path portability) on where new values should live before restructuring.
Security-sensitive changes — auth, crypto, input handling; run audit-security on the result.

Language-agnostic core, language-specific tooling

The 6-step procedure above is language-agnostic. The specific commands for each language (how to typecheck, how to run tests, how to find references) live in references:

references/language-smoke-tests.md — per-language command chains
references/verification-procedure.md — deep-dive into each of the six steps, including dependency-tree planning, AST diffing, and information preservation for doc rewrites
references/llm-failure-modes.md — the catalog of mistakes to guard against, with the specific step that catches each one

Scripts (invoked, not read):

scripts/symbol-diff.sh — print symbol-set diff between two git refs
scripts/smoke-test.sh — auto-detect language, prefer project-local toolchains when present, and warn when an isolated worktree is not bootstrapped enough to trust failures
scripts/callsite-count.sh — count references before/after, with explicit rename support and stale old-name detection