Debugs failed or low-quality web extractions by triaging logs with liberate_verify, grepping errors, classifying issues (high failures, low quality, crashes), and probing source sites.
```bash
npx claudepluginhub automattic/data-liberation-agent --plugin data-liberation
```
Systematically investigate why an extraction failed or produced poor results. Identify root causes and fix them.
Ask for or detect:
| Parameter | How to find it |
|---|---|
| Output directory | Look for most recent output/*/ subdirectory |
| WXR file | output.wxr in the output directory |
| Extraction log | extraction-log.jsonl in the output directory |
| Source URL | From the WXR's <link> element or ask the user |
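If the user doesn't point you at a specific run, a quick way to detect these is to take the newest output subdirectory (a sketch, assuming the layout above and that output/ is relative to the working directory):

```bash
# Pick the most recently modified output/*/ directory and derive the standard paths.
SITE_DIR=$(ls -td output/*/ | head -1)
WXR="${SITE_DIR}output.wxr"
LOG="${SITE_DIR}extraction-log.jsonl"
# Source URL: first <link> element in the WXR (the channel-level one).
grep -o '<link>[^<]*</link>' "$WXR" | head -1
echo "dir=$SITE_DIR  wxr=$WXR  log=$LOG"
```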
Start with liberate_verify: it gives you a structured overview in one call and replaces manual log grepping for the initial assessment. If you need more detail, dig into the raw log:
```bash
# Count successes vs failures
grep -c '"type":"processed"' output/<site>/extraction-log.jsonl
grep -c '"type":"failed"' output/<site>/extraction-log.jsonl
grep -c '"type":"media_failed"' output/<site>/extraction-log.jsonl
```
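To turn those counts into the failure rate that the classification below keys on, a small sketch (integer math; guards against an empty log):

```bash
# Failure rate as a percentage of processed + failed entries.
OK=$(grep -c '"type":"processed"' output/<site>/extraction-log.jsonl)
BAD=$(grep -c '"type":"failed"' output/<site>/extraction-log.jsonl)
TOTAL=$((OK + BAD))
[ "$TOTAL" -gt 0 ] && echo "failure rate: $((100 * BAD / TOTAL))%"
```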
A. High failure rate (>30% failed). Something systematic is wrong: the site is blocking requests, the adapter can't parse the platform, or there's an auth issue.

B. Low failure rate (<30%) with specific pages failing. Individual page issues: timeouts, unusual page structures, dynamic content.

C. No failures but low-quality content. The adapter extracted something, but it's the wrong content: nav bars, footers, cookie banners instead of the actual page body.

D. Crash / incomplete extraction. The process died mid-way. Check for the lock file, partial WXR, and the last log entry (see the sketch after this list).

E. Missing or incorrect products. Products were expected but products.csv is missing, empty, or has wrong data.
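For case D, a quick triage sketch (the lock file and completion-marker paths are the ones used elsewhere in this doc; the truncation check assumes a well-formed WXR ends with a closing </rss>):

```bash
# Crash triage: leftover lock file, truncated WXR, and the last thing logged.
ls output/<site>/.liberation-lock 2>/dev/null && echo "lock file present: unclean shutdown"
tail -c 200 output/<site>/output.wxr | grep -q '</rss>' || echo "WXR truncated (no closing </rss>)"
tail -1 output/<site>/extraction-log.jsonl
```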
Read the error messages from failed entries:
```bash
grep '"type":"failed"' extraction-log.jsonl | head -5
```
Common causes and fixes:
| Error pattern | Cause | Fix |
|---|---|---|
| `timeout` / `AbortError` | Site is slow or blocking | Increase `--delay`, try with browser via `--cdp-port` |
| `403 Forbidden` | Rate limiting or bot detection | Increase delay, use CDP with a real browser session |
| `404 Not Found` | Stale sitemap, pages moved | Re-run discovery, check if site restructured |
| `TypeError: fetch failed` | Network issue, wrong protocol | Check if site uses http vs https, check DNS |
| `Navigation failed` | Playwright can't load the page | Check if site requires JavaScript, cookies, or auth |
Probe a failed URL manually:
```bash
curl -sI <failed-url> | head -20
```
Check: status code, redirects, Content-Type, security headers.
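A slightly richer probe that surfaces just those fields (a sketch; -L follows redirects so the whole chain is visible):

```bash
# Status lines, redirect targets, content type, and common bot-protection headers.
curl -sIL <failed-url> | grep -iE '^(HTTP/|location:|content-type:|server:|cf-ray:|x-frame-options:|strict-transport)'
```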
Deep browser probe (if the user has Chrome with CDP running):

Call liberate_probe with the CDP port and site URL. This connects to the browser and reports, among other things, the platform globals it finds (_BLOG_DATA; Shopify: Shopify.*; Squarespace: __NEXT_DATA__; Wix: __WIX_DATA__).
Check if the platform is detected correctly:
```bash
npx tsx src/cli.ts inspect <site-url>
```
If detection is wrong, the wrong adapter is running.
Group failures by error type:
```bash
grep '"type":"failed"' extraction-log.jsonl | jq -r .error | sort | uniq -c | sort -rn
```
Spot-check the worst offenders — fetch the URL manually and compare against what the adapter tried to do.
Check for a pattern: are all failures the same URL type (e.g. all blog posts fail but pages succeed)? That points to a type-specific extraction bug.
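One way to check, assuming failed log entries carry a url field like processed ones do (an assumption; adjust the jq filter to the actual schema):

```bash
# Bucket failed URLs by first path segment, e.g. /blog/... vs /products/...
grep '"type":"failed"' extraction-log.jsonl \
  | jq -r '.url' \
  | awk -F/ '{print $4}' | sort | uniq -c | sort -rn
```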
Run /qa to compare WXR content against the origin site. This gives per-page quality grades.
Read a few low-scoring pages from the WXR:
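A throwaway way to do that, assuming the usual <item> / <content:encoded> layout (regex splitting is fine for eyeballing, not for real XML parsing):

```bash
python3 - <<'EOF'
import re
wxr = open("output/<site>/output.wxr").read()
# Crude split on <item> blocks; print each title and the first 300 chars of body.
for item in re.findall(r"<item>(.*?)</item>", wxr, re.S)[:3]:
    title = re.search(r"<title>(.*?)</title>", item, re.S)
    body = re.search(r"<content:encoded><!\[CDATA\[(.*?)\]\]>", item, re.S)
    print("TITLE:", title.group(1).strip() if title else "?")
    print(body.group(1)[:300] if body else "(no content)")
    print("-" * 60)
EOF
```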
Check the adapter's content selector. Each adapter targets specific HTML containers:
- Squarespace: the ?format=json API or admin API via CDP
- Webflow: .w-richtext containers
- Shopify: article or .rte containers
- Wix: window._BLOG_DATA, converting Draft.js post.fullContent to HTML; pages strip HEADER_SECTION / FOOTER_* / section-title / hero-image widgets from the DOM

If the site uses a non-standard template, the selector may miss the content.
Fetch the origin page and inspect its structure:
```bash
curl -s <page-url> | grep -o '<main\|<article\|class="content\|class="post-body\|class="entry-content' | head -10
```
For a crash (case D): a .liberation-lock file in the output directory means the process didn't clean up; a partial WXR will be truncated before its closing tags (</channel></rss>); resume the run with --resume.

Check if products.csv and products.jsonl exist:
```bash
ls -la output/<site>/products.csv output/<site>/products.jsonl
```
If both are missing — no products were detected during extraction. Investigate:
Does the site mark up products with JSON-LD @type: Product? Fetch a product page and check:
```bash
curl -s <product-url> | grep -o 'application/ld+json' | head -3
curl -s <product-url> | grep -o '"@type":"Product"'
```
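If the markup is there but the fields look off, a throwaway sketch to dump the actual shape of the product JSON-LD (assumes python3 is available; the offers and image fields vary by platform, as noted below):

```bash
curl -s <product-url> | python3 -c '
import json, re, sys
html = sys.stdin.read()
for block in re.findall(r"<script[^>]*ld\+json[^>]*>(.*?)</script>", html, re.S):
    try:
        data = json.loads(block)
    except json.JSONDecodeError:
        continue
    for item in (data if isinstance(data, list) else [data]):
        if isinstance(item, dict) and item.get("@type") == "Product":
            print("offers:", json.dumps(item.get("offers"))[:200])
            print("image:", json.dumps(item.get("image"))[:200])
'
```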
Does the platform's adapter implement an extractProduct function? Are product URLs classified as the product type? Check classifyUrl in src/lib/extraction/sitemap.ts for the URL patterns it recognizes.

If products.jsonl exists but products.csv is missing or empty, the JSONL→CSV conversion failed. Read products.jsonl to check data quality:
```bash
head -3 output/<site>/products.jsonl | jq .
```
Check: do products have names? Prices? Are fields malformed?
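A sketch for counting gaps across the whole file (the name and price field names are assumptions; match them to whatever head -3 actually shows):

```bash
# Products missing a name, and products missing a price.
jq -s '[.[] | select(.name == null or .name == "")] | length' output/<site>/products.jsonl
jq -s '[.[] | select(.price == null or .price == "")] | length' output/<site>/products.jsonl
```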
If products.csv exists but data is wrong:
- The JSON-LD offers array may be structured differently than expected. Fetch a product page and inspect the JSON-LD.
- Sites emit ld.image as strings, as objects with a .url, or in a different field entirely.
- For a custom adapter (/adapt), its extractProduct is passed to runExtractionLoop alongside the generic fallback.

Check product count vs expectations:
```bash
wc -l output/<site>/products.jsonl
grep -c '"type":"product"' output/<site>/extraction-log.jsonl || echo "no product type in log"
```
Based on the diagnosis:
If the content selector is wrong for this site's template:
- Fix the adapter's extractPage function
- Re-run with --resume

If the issue is rate limiting, timeouts, or auth:
- Increase the --delay value
- Use --cdp-port with an authenticated browser session
- Use --token if the platform supports API keys

If the WXR has issues but re-extraction isn't needed:
- Run /qa to identify and patch specific content gaps

After applying fixes:
- Re-run with --resume (it only re-processes failed URLs)
- Run /qa to check content quality

If you discovered a platform-specific issue or workaround:
- Add a DISCOVERIES.md entry

```bash
# Overview of extraction results
wc -l output/<site>/extraction-log.jsonl
grep -c '"processed"' output/<site>/extraction-log.jsonl
grep -c '"failed"' output/<site>/extraction-log.jsonl

# Most common errors
grep '"failed"' output/<site>/extraction-log.jsonl | grep -o '"error":"[^"]*"' | sort | uniq -c | sort -rn

# Slowest pages
grep '"processed"' output/<site>/extraction-log.jsonl | grep -o '"durationMs":[0-9]*' | sort -t: -k2 -rn | head -10

# Check WXR size and item count
wc -c output/<site>/output.wxr
grep -c '<item>' output/<site>/output.wxr

# Check media downloads
ls output/<site>/media/ | wc -l
grep -c '"media_failed"' output/<site>/extraction-log.jsonl

# Check if extraction is complete
test -f output/<site>/.discovery-complete && echo "Complete" || echo "Incomplete"

# Product diagnostics
wc -l output/<site>/products.jsonl 2>/dev/null || echo "No products.jsonl"
wc -l output/<site>/products.csv 2>/dev/null || echo "No products.csv"
head -3 output/<site>/products.jsonl 2>/dev/null | python3 -m json.tool 2>/dev/null || true
```