From data-liberation
Guides building content extraction adapters for unsupported platforms (Blogger, Ghost, Weebly) via site reconnaissance, detection signals, API mapping, HTML parsing, and browser rendering.
npx claudepluginhub automattic/data-liberation-agent --plugin data-liberationThis skill is limited to using the following tools:
Guide the process of adding extraction support for a new platform. The result is a working adapter that plugs into the existing extraction pipeline.
Extracts content from closed platforms (GoDaddy, Hostinger, HubSpot, Shopify, Squarespace, Webflow, Wix) into WordPress WXR files. Exports products to WooCommerce CSV; supports discovery, verification, resume, and import.
Validates adapter files against live platform documentation to detect stale mappings, missing features, and outdated version information. Supports optional in-place updates via --update flag.
Share bugs, ideas, or general feedback.
Guide the process of adding extraction support for a new platform. The result is a working adapter that plugs into the existing extraction pipeline.
src/adapters/ — if an adapter exists, this skill isn't needed.Understand how the target platform works before writing any code.
Figure out how to identify sites on this platform. Check:
.squarespace.com, .webflow.io, .wixsite.com)X-Squarespace-Version, X-Wix-Request-Id)Add detection signals to src/lib/extraction/detect-platform.ts:
URL_PATTERNSdetectFromHttp()Figure out how to find all pages on the site:
sitemap.xml, sitemap_index.xml. Most platforms generate these.extractNavLinks() in src/adapters/shared.ts handles this generically.?format=json or Shopify's /products.json).Figure out how to get the actual content from each page:
.post-body, article, .content, main).launchBrowser() from src/adapters/shared.ts.If the platform has an admin dashboard or uses client-side API calls, use liberate_map_apis to automatically discover all API endpoints:
--remote-debugging-port=9222 and log in to their account on the target platformliberate_map_apis with the CDP port, the site URL, and optionally a list of admin dashboard URLs to crawlThis is the fastest way to reverse-engineer a platform's API surface. The output tells you exactly which endpoints return content data, what auth is needed, and what the response shapes look like — everything you need to write the adapter's extractPage function.
You can also call liberate_probe to inspect window globals, localStorage, cookies, and platform identity fields on any page — useful for understanding what data the platform exposes client-side.
Document everything you find. This is research — take notes on endpoints, selectors, quirks.
Create src/adapters/<platform>.ts following the existing pattern. Read src/adapters/webflow.ts as a reference — it's the simplest adapter.
Every adapter must:
<platform>Adapter object implementing PlatformAdapter from src/types.ts<Platform>AdapterOpts extending Record<string, unknown> with: delay?, resume?, dryRun?, verbose?, outputDir?<Platform>Inventory with: siteUrl, discoveredAt, siteMeta (title, tagline, language), navigation, counts, urlsRequired methods:
id — lowercase platform name (e.g. 'ghost')detect(url) — return true if the URL belongs to this platformdiscover(url, opts) — fetch sitemap + navigation, classify URLs, return inventoryextract(inventory, wxr, opts, context) — call runExtractionLoop() from src/adapters/shared.ts with an extractPage functionThis is where platform-specific extraction lives. For each URL:
ExtractedPage object (defined in src/adapters/shared.ts)Use the shared helpers from src/adapters/shared.ts:
extractMeta(html, property) — read meta tagsextractTitle(html) — read <title> tagextractHeading(html) — read <h1> with title fallbackextractNavLinks(html, baseUrl) — parse nav linksIMAGE_EXTENSIONS — regex for image file detectionCheck during reconnaissance whether the platform has e-commerce (product pages, a store, a shop section).
Generic detection (automatic): The shared extraction loop in src/adapters/shared.ts automatically detects products via JSON-LD @type: Product on any page classified as product type. This works out of the box if:
Platform-specific detection (optional but recommended): If the platform has a richer product API or non-standard product markup, provide a custom extractProduct function to runExtractionLoop():
const result = await runExtractionLoop({
// ...other opts
csvBuilder,
extractProduct: (url: string, html: string) => {
// Try platform-specific product extraction first
// Return WooProduct or null
},
});
The custom extractor is called before the generic JSON-LD fallback, so it takes priority.
What to extract for products (see WooProduct type in src/lib/import/woo-product-csv.ts):
name (required), description, shortDescriptionregularPrice, salePriceskuimages — array of image URLscategories, tagsweight, length, width, heightinStock, stockattributes — array of { name, values[], visible, global } for product options (size, color, etc.)type — 'simple', 'variable', 'grouped', 'external', or 'variation'parentSku — for variations, the parent product's SKUVariable products: If the platform supports product variants (sizes, colors), generate one variable parent row plus variation child rows with parentSku linking them. See shopifyProductToWoo() in src/adapters/shopify.ts for the pattern.
CSV streaming: The adapter should create a WooProductCsvBuilder, call openStream(outputDir) before extraction, and closeStream() after. The shared loop calls csvBuilder.addProduct() automatically when it detects products. See the Shopify or Wix adapters for the wiring pattern.
src/mcp-server.ts and add it to the adapters arraysrc/ui/discover.tsx and add to the adapters arraysrc/ui/inspect.tsx and add to the adapters arrayCreate fixture files in test/fixtures/ with sample HTML and/or JSON from the platform. Sanitize any PII.
Create test/adapters/<platform>.test.ts. Test:
Run extraction against the user's live site:
npx tsx src/cli.ts <site-url> --dry-run --verbose
Check the output for quality: are titles correct? Is content complete? Are media URLs captured?
README.mdDISCOVERIES.md documenting what you learned about the platformAGENTS.md if any non-obvious details are worth noting