From Claude-Data-Wrangler
Reconcile a canonical upstream data file with a downstream project that has diverged — the downstream has added enrichments, renamed columns, changed types, or restructured the data, so fresh upstream rows can't be loaded incrementally without transformation. Builds a mapping between upstream and downstream representations and generates an idempotent incremental sync script that ingests only new/changed rows from upstream, applies the transformation, and appends/upserts them into the downstream dataset. Use when a project's working data has evolved beyond the source and new source data needs to flow in without clobbering the divergence.
```
npx claudepluginhub danielrosehill/claude-code-plugins --plugin Claude-Data-Wrangler
```

This skill uses the workspace's default tool permissions.
Bridge an upstream source and a downstream-project dataset that has diverged from it, then keep them in sync incrementally.
This skill's job is to (1) map upstream↔downstream and (2) produce a repeatable script that loads only new/changed upstream rows into downstream while preserving the divergence.
Each column falls into one of three buckets: `upstream.col_x → downstream.col_y`, `upstream.col_x → (dropped)`, or `downstream.col_z → (added by project, not in upstream)`, plus the `_source_*` shadow columns (see below). Record the mapping in `divergent_pipe_mapping.md`: columns, transformations, key, change-detection strategy, collision rules.
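For concreteness, a minimal sketch of how the mapping might be carried in code, assuming a hypothetical orders dataset; every column name and the `MAPPING`/`ENRICHMENTS` shapes here are invented for illustration, not prescribed by the skill:

```python
# Hypothetical mapping for an "orders" dataset; nothing here is prescribed.
# upstream column -> downstream column; None means the column is dropped.
MAPPING = {
    "order_id": "order_id",      # shared key, unchanged
    "created": "created_at",     # renamed downstream
    "amount_usd": "amount",      # renamed (and retyped) downstream
    "internal_flag": None,       # dropped downstream
}

# Downstream-only enrichments, added by the project and absent upstream;
# the sync must never overwrite these.
ENRICHMENTS = ["customer_segment", "margin_estimate"]
```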
To detect updates to rows that were already loaded, the downstream needs to remember the upstream state it last saw. Add shadow columns alongside the enriched downstream fields:

- `_source_ingested_at` — timestamp of the last successful ingest for this row.
- `_source_row_hash` — hash of the upstream columns that map into downstream (exclude enrichment-only downstream columns).
- `_source_primary_key` — the upstream key (if different from the downstream PK).

If the downstream is missing these, generate them from the current upstream once (best-effort reconstruction) and flag rows where reconstruction is ambiguous.
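A minimal sketch of computing `_source_row_hash`, assuming it is acceptable to hash the stringified values of the mapped upstream columns; the function name and field delimiter are arbitrary choices:

```python
import hashlib

import pandas as pd

def source_row_hash(row: pd.Series, mapped_cols: list[str]) -> str:
    """Hash only the upstream columns that map into downstream, so
    edits to enrichment-only columns never change the hash."""
    payload = "\x1f".join(str(row[c]) for c in mapped_cols)  # unit separator
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Usage on an upstream frame (column names are illustrative):
# upstream["_source_row_hash"] = upstream.apply(
#     source_row_hash, axis=1, mapped_cols=["order_id", "created", "amount_usd"])
```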
Emit `sync_from_upstream.py` (or `.sh` wrapping SQL for DB-backed downstreams). Structure (a pandas sketch follows the list):
1. Read upstream (full or since-last-sync via a watermark).
2. Compute the row hash per upstream row.
3. Left-join against downstream on the upstream key:
- no match -> new row: apply transformation, insert.
- match, same hash -> skip.
- match, diff hash -> update: apply transformation to *upstream fields only*,
preserve downstream enrichments and manual edits.
4. For keys present downstream but absent upstream:
- soft-delete (flag a `_source_deleted_at` column) or skip, per policy.
5. Update `_source_row_hash` and `_source_ingested_at` on touched rows.
6. Write a run log: counts of inserted / updated / skipped / soft-deleted / errored.
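A minimal pandas sketch of steps 2 through 6, assuming the key column has the same name on both sides, that the mapping maps the key to itself, and that the downstream already carries the shadow columns; it reuses the hypothetical `source_row_hash` helper and `MAPPING` shape from above, and leaves out the step-1 watermark and real I/O:

```python
from datetime import datetime, timezone

import pandas as pd

def sync(upstream: pd.DataFrame, downstream: pd.DataFrame,
         key: str, mapping: dict, dry_run: bool = True):
    """One incremental pass: insert new rows, update changed rows,
    skip unchanged rows, soft-delete rows that vanished upstream."""
    now = datetime.now(timezone.utc).isoformat()
    mapped_cols = [c for c, tgt in mapping.items() if tgt is not None]

    up = upstream.copy()
    up["_source_row_hash"] = up.apply(source_row_hash, axis=1, mapped_cols=mapped_cols)

    down = downstream.set_index(key, drop=False)  # working copy; caller untouched
    counts = {"inserted": 0, "updated": 0, "skipped": 0, "soft_deleted": 0}
    new_rows = []

    for _, row in up.iterrows():
        k = row[key]
        vals = {mapping[c]: row[c] for c in mapped_cols}  # upstream fields only
        if k not in down.index:                                 # no match -> insert
            vals["_source_row_hash"] = row["_source_row_hash"]
            vals["_source_ingested_at"] = now
            new_rows.append(vals)
            counts["inserted"] += 1
        elif down.at[k, "_source_row_hash"] == row["_source_row_hash"]:
            counts["skipped"] += 1                              # same hash -> skip
        else:                                                   # changed -> update
            for col, v in vals.items():  # enrichments and manual edits untouched
                down.at[k, col] = v
            down.at[k, "_source_row_hash"] = row["_source_row_hash"]
            down.at[k, "_source_ingested_at"] = now
            counts["updated"] += 1

    gone = down.index.difference(up[key])  # downstream keys absent upstream
    down.loc[gone, "_source_deleted_at"] = now                  # soft-delete policy
    counts["soft_deleted"] = len(gone)

    result = pd.concat([down, pd.DataFrame(new_rows)], ignore_index=True)
    print(("plan:" if dry_run else "applied:"), counts)         # run log
    return (downstream if dry_run else result), counts
```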
The script must be idempotent: re-running it yields the same downstream state, and `--dry-run` prints the plan without writing (a sketch of the flag wiring follows the install commands below).

Before running for real:

- Read `CONVENTIONS.md` and confirm a downstream backup, or create one, before the first real run.
- Run in `--dry-run` mode and show the user the planned counts.
- Update `CHANGELOG.md` (via add-changelog).

Dependencies:

```
pip install pandas pyarrow
# if downstream lives in a SQL DB
pip install SQLAlchemy "psycopg[binary]"  # or mysql / sqlite etc.
```
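And a sketch of wiring the `--dry-run` flag, reusing the hypothetical `sync` and `MAPPING` sketches above; the file paths and key name are placeholders:

```python
import argparse

import pandas as pd

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Incremental upstream -> downstream sync")
    parser.add_argument("--dry-run", action="store_true",
                        help="print planned insert/update/skip/soft-delete counts; write nothing")
    args = parser.parse_args()

    upstream = pd.read_csv("upstream.csv")      # placeholder paths
    downstream = pd.read_csv("downstream.csv")
    result, counts = sync(upstream, downstream, key="order_id",
                          mapping=MAPPING, dry_run=args.dry_run)
    if not args.dry_run:
        result.to_csv("downstream.csv", index=False)  # only a real run writes
```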
This skill writes into a dataset that represents real project investment. Follow `CONVENTIONS.md` rigorously.