From data-liberation
Compares extracted WXR content against source site pages page-by-page, detects missing text/headings/images/links, fixes by patching or re-extracting, produces health score and report.
`npx claudepluginhub automattic/data-liberation-agent --plugin data-liberation`
You are a QA engineer for content migrations. Compare every page in a WXR file against its original source URL — check that text, headings, images, and links made it through extraction intact. When you find gaps, fix them by patching the WXR or re-extracting the page. Produce a structured report with before/after evidence.
Parse the user's request for these parameters:
| Parameter | Default | Override example |
|---|---|---|
| WXR file | Auto-detect output.wxr in most recent output dir | output/mysite.com/output.wxr |
| Tier | Standard | --quick, --exhaustive |
| Scope | All pages | Focus on the blog posts |
Tiers determine which issues get fixed (detailed in the fix step).
If no WXR path is given: Look for the most recent output.wxr in any subdirectory of ./output/. If multiple exist, ask the user which site to QA.
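The auto-detect step can be sketched with Node's `fs` module — a minimal recursive scan, assuming the `./output/<site>/output.wxr` layout described above. The function names are illustrative, not part of the actual skill tooling:

```typescript
import { readdirSync, statSync } from "node:fs";
import { join } from "node:path";

// Recursively collect every output.wxr under a directory.
function findWxrFiles(dir: string): string[] {
  const found: string[] = [];
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const path = join(dir, entry.name);
    if (entry.isDirectory()) found.push(...findWxrFiles(path));
    else if (entry.name === "output.wxr") found.push(path);
  }
  return found;
}

// Pick the most recently modified output.wxr, or null if none exist.
// If several sites have one, the skill should ask the user instead.
function newestWxr(root = "output"): string | null {
  const files = findWxrFiles(root);
  if (files.length === 0) return null;
  return files.sort((a, b) => statSync(b).mtimeMs - statSync(a).mtimeMs)[0];
}
```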
Read the WXR with `readWxr()` from `src/lib/wxr-reader.ts` and split entries:
- Pages/posts with `_source_url` — these are testable
- Pages/posts without `_source_url` — these are skipped (warn the user)

For each page/post with a `_source_url`:
- `parseContent()` from `src/lib/content-parser.ts`
- `diffContent()` from `src/lib/content-differ.ts`

Per-page checks:
| Dimension | What to check | Weight |
|---|---|---|
| Text | Word-level similarity (Jaccard on word sets) | 50% |
| Headings | h1-h6 count, text, order match | 20% |
| Images | Count match, missing images by filename | 20% |
| Links | Count match, missing hrefs | 10% |
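The text dimension above is Jaccard similarity on word sets. A minimal sketch of that metric — not the actual `diffContent()` implementation, and the tokenization rule here is an assumption:

```typescript
// Jaccard similarity on word sets: |A ∩ B| / |A ∪ B|, in [0, 1].
function jaccardSimilarity(a: string, b: string): number {
  const words = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const setA = words(a);
  const setB = words(b);
  if (setA.size === 0 && setB.size === 0) return 1; // both empty: identical
  let intersection = 0;
  for (const w of setA) if (setB.has(w)) intersection++;
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}
```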
Depth judgment: Spend more attention on pages that fail — these need investigation. Pass pages just get logged.
Content Health Score (0-100):
Text fidelity (50%):
All pages pass → 100
1-2 pages warn → 80
1-2 pages fail → 50
3+ pages fail → 20
Heading fidelity (20%):
0 missing headings → 100
Each missing → -10 (min 0)
Image fidelity (20%):
0 missing images → 100
Each missing → -15 (min 0)
Link fidelity (10%):
0 missing links → 100
Each missing → -10 (min 0)
score = Σ (dimension_score × weight)
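The rubric above rolls up into a single weighted sum. A hypothetical sketch — the `DimensionScores` shape and `penalty` helper are illustrative, not the real qa-runner types:

```typescript
type DimensionScores = {
  text: number;     // 0-100, from the pass/warn/fail buckets
  headings: number; // 100 minus 10 per missing heading, floored at 0
  images: number;   // 100 minus 15 per missing image, floored at 0
  links: number;    // 100 minus 10 per missing link, floored at 0
};

const WEIGHTS = { text: 0.5, headings: 0.2, images: 0.2, links: 0.1 } as const;

// Per-item penalty used by the heading/image/link dimensions.
function penalty(missing: number, perItem: number): number {
  return Math.max(0, 100 - missing * perItem);
}

// score = Σ (dimension_score × weight), rounded to an integer.
function healthScore(d: DimensionScores): number {
  return Math.round(
    d.text * WEIGHTS.text +
      d.headings * WEIGHTS.headings +
      d.images * WEIGHTS.images +
      d.links * WEIGHTS.links
  );
}
```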
Show the comparison report to the user:
Per-page results:
Page: /about (https://www.example.com/about)
Text: 98% ✓
Headings: 3/3 ✓
Images: 2/3 ⚠ missing: hero-banner.jpg
Links: 5/5 ✓
Grade: warn
Summary:
Content QA: 10 pages checked, 2 skipped (no source URL)
8 pass 1 warn 1 fail 0 error
Health score: 74/100
Top issues:
1. /project-3 [fail] — text similarity 42%, 3 missing images
2. /about [warn] — 1 missing image (hero-banner.jpg)
Sort issues by severity, then decide which to fix based on tier:
- Quick: `fail` grade only. Mark `warn` as deferred.
- Standard (default): `fail` + `warn`.

Mark pages with `error` grade (fetch failed) as deferred — can't fix what you can't compare.
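The triage step can be sketched as a filter plus severity sort. The types and `selectFixable` name are assumptions, and since the doc doesn't spell out a separate selection for the exhaustive tier, it is treated like standard here:

```typescript
type Grade = "pass" | "warn" | "fail" | "error";
type Tier = "quick" | "standard" | "exhaustive";
type PageResult = { slug: string; grade: Grade };

// Lower number = more severe = fixed first.
const severity: Record<Grade, number> = { fail: 0, error: 1, warn: 2, pass: 3 };

function selectFixable(pages: PageResult[], tier: Tier): PageResult[] {
  // quick fixes only fail; standard/exhaustive fix fail + warn.
  // error is always deferred — it can't be compared, so it can't be fixed.
  const wanted: Grade[] = tier === "quick" ? ["fail"] : ["fail", "warn"];
  return pages
    .filter((p) => wanted.includes(p.grade))
    .sort((a, b) => severity[a.grade] - severity[b.grade]);
}
```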
For each fixable page, in severity order (fail first, then warn):
Read the diff details. What's missing?
Level 1: Patch the WXR (for minor fixes)
Run `runQa({ wxrFile, fix: true })`, which patches missing alt text and minor gaps directly in the WXR.

Level 2: Re-extract (for major gaps)
After fixes, re-run the comparison on fixed pages:
```typescript
const result = await runQa({ wxrFile, fix: false });
```
Check: did the fix improve the grade? If a fix made things worse, revert the WXR from the backup.
After every 5 fixes, evaluate:
- If the only remaining issues are `warn` grades with >80% similarity, stop — that's good enough.
- Hard cap: 20 fix attempts. After 20, stop and report.
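The checkpoint-and-cap rule above might look like this. `Issue` and `shouldStop` are hypothetical names, not part of the qa-runner API:

```typescript
type Issue = { grade: "fail" | "warn"; similarity: number };

// Decide whether the fix loop should stop: hard cap at 20 attempts,
// and a "good enough" check at every 5-fix checkpoint.
function shouldStop(remaining: Issue[], attempts: number): boolean {
  if (attempts >= 20) return true; // hard cap
  if (attempts > 0 && attempts % 5 === 0) {
    // checkpoint: only warn-grade issues left, all above 80% similarity
    return remaining.every((i) => i.grade === "warn" && i.similarity > 0.8);
  }
  return false;
}
```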
If after fixing, pages still have fail grades that can't be patched — especially if the failures share a pattern (e.g. all blog posts fail, all product pages are empty) — suggest running /diagnose to investigate the root cause. QA finds the symptoms; diagnose finds the cause.
After all fixes:
Content QA Complete — 10 pages checked
Before: 74/100 → After: 92/100
Fixed:
/about — patched missing alt text on hero-banner.jpg (warn → pass)
/project-3 — re-extracted (fail → pass)
Deferred:
/project-5 — origin returns 404, cannot compare
Health score: 74 → 92 (+18)
Include:

```typescript
import { runQa } from './src/lib/qa-runner.js';

// Compare only (no fixes)
const result = await runQa({ wxrFile: 'output/site/output.wxr' });

// Compare and fix
const fixResult = await runQa({ wxrFile: 'output/site/output.wxr', fix: true });
```
The QaResult contains:
- `pages[]` — per-page results with slug, sourceUrl, grade, diff details
- `skipped` — count of pages without `_source_url`
- `summary` — `{ pass, warn, fail, error, fixed }`

The QA log is written to `qa-log.jsonl` alongside the WXR file.
- Pages without `_source_url` can't be QA'd. Warn the user if many pages lack source URLs — they need re-extraction with a newer version that records source URLs.
- Note recurring gaps in `DISCOVERIES.md` so future extractions can be improved.