Plan and execute data migrations — ETL pipelines, zero-downtime cutover, validation, rollback strategies, and legacy system integration
npx claudepluginhub cure-consulting-group/productengineeringskillsThis skill uses the workspace's default tool permissions.
Before starting, gather project context silently:
Generates design tokens/docs from CSS/Tailwind/styled-components codebases, audits visual consistency across 10 dimensions, detects AI slop in UI.
Records polished WebM UI demo videos of web apps using Playwright with cursor overlay, natural pacing, and three-phase scripting. Activates for demo, walkthrough, screen recording, or tutorial requests.
Delivers idiomatic Kotlin patterns for null safety, immutability, sealed classes, coroutines, Flows, extensions, DSL builders, and Gradle DSL. Use when writing, reviewing, refactoring, or designing Kotlin code.
Before starting, gather project context silently:
PORTFOLIO.md if it exists in the project root or parent directories for product/team contextcat package.json 2>/dev/null || cat build.gradle.kts 2>/dev/null || cat Podfile 2>/dev/null to detect stackgit log --oneline -5 2>/dev/null for recent changesls src/ app/ lib/ functions/ 2>/dev/null to understand project structurePlans and executes production-grade data migrations across databases, cloud platforms, and legacy systems. Covers ETL pipeline design, zero-downtime cutover patterns, validation frameworks, rollback strategies, and Firestore-specific migration tooling. Every migration is reversible, validated, and monitored.
Hard rules:
| Type | Characteristics | Typical Duration | Risk Level |
|---|---|---|---|
| Schema Evolution | Same database, structural changes (add/rename/drop columns) | Minutes–hours | Low–Medium |
| Database-to-Database | Same engine, different instance (e.g., dev → prod restore) | Hours | Medium |
| Platform Switch | Different engines (MySQL → PostgreSQL, MongoDB → Firestore) | Days–weeks | High |
| Cloud Migration | On-premise → cloud or cloud → cloud | Weeks–months | High |
| Legacy System Integration | Mainframe, CSV exports, SOAP APIs → modern stack | Weeks–months | Very High |
| Firestore Restructuring | Collection/document model changes within Firestore | Hours–days | Medium–High |
Before planning, confirm:
When to use:
- Small datasets (<1M records)
- Short maintenance window available (< 4 hours)
- Simple schema mapping (1:1 field correspondence)
- Non-critical system or acceptable downtime
Process:
1. Announce maintenance window
2. Stop all writes to source
3. Run full export from source
4. Transform data
5. Load into target
6. Validate (row counts, checksums, spot checks)
7. Switch application to target
8. Monitor for 1 hour
9. Decommission source after 7-day hold
Rollback: switch application back to source (source is read-only during hold period)
Risk: all-or-nothing — failure means full rollback
When to use:
- Large datasets (>1M records)
- Zero-downtime required
- Source system supports CDC (PostgreSQL WAL, MySQL binlog, Firestore listeners)
Process:
1. Set up CDC pipeline (Debezium, Firestore onSnapshot, PostgreSQL logical replication)
2. Run initial bulk load of historical data
3. CDC catches up on changes made during bulk load
4. Validate source-target consistency
5. Switch reads to target (behind feature flag)
6. Switch writes to target
7. Decommission CDC pipeline after 7-day monitoring period
Rollback: reverse CDC direction or switch reads/writes back to source
Risk: CDC lag can cause temporary inconsistency; monitor lag continuously
When to use:
- Critical systems where cutover must be instant
- Both source and target can run simultaneously
- Application supports database connection switching
Process:
1. Stand up target (green) alongside source (blue)
2. Sync data from blue → green (initial + ongoing CDC)
3. Run validation suite against green
4. Switch traffic to green (DNS, connection string, feature flag)
5. Monitor green for anomalies
6. Keep blue running for 7 days as fallback
7. Decommission blue
Rollback: switch traffic back to blue (instant)
Risk: cost of running two systems simultaneously; data sync complexity
When to use:
- Legacy system replacement over months
- Gradual feature-by-feature migration
- Cannot afford big-bang risk
Process:
1. Identify migration order (lowest risk features first)
2. For each feature:
a. Build new data model in target
b. Implement dual-read (read from target, fallback to source)
c. Implement dual-write (write to both)
d. Migrate historical data for this feature
e. Validate feature data in target
f. Switch to target-only reads
g. Stop writing to source for this feature
3. Repeat until all features migrated
4. Decommission source
Rollback: per-feature rollback by reverting to dual-read/source-only
Risk: long duration increases complexity; dual-write bugs can cause divergence
Firestore:
- Full export: gcloud firestore export gs://bucket/path
- Streaming: onSnapshot listeners for real-time CDC
- Batch read: getAll() with pagination (500 docs per batch)
- BigQuery export: scheduled export via Data Transfer Service
PostgreSQL:
- Full dump: pg_dump --format=custom --compress=9
- Logical replication: CREATE PUBLICATION / CREATE SUBSCRIPTION
- COPY command: COPY table TO STDOUT WITH CSV HEADER
- CDC: Debezium connector reading WAL
Legacy systems:
- CSV/Excel: parse with streaming reader (not full-file load into memory)
- SOAP API: paginated requests with exponential backoff
- FTP drops: scheduled pickup with file validation (checksum, row count header)
- Direct database: read-only replica connection with query timeout
Mapping document template:
┌─────────────────────┬──────────────────────┬──────────────────────────┐
│ Source Field │ Target Field │ Transformation │
├─────────────────────┼──────────────────────┼──────────────────────────┤
│ user.first_name │ users.displayName │ CONCAT(first, ' ', last) │
│ user.created_date │ users.createdAt │ PARSE_DATE('MM/dd/yyyy') │
│ order.total_cents │ orders.totalAmount │ DIVIDE by 100, DECIMAL │
│ user.status = 'A' │ users.isActive │ MAP('A'→true, else false)│
│ (no equivalent) │ users.migratedAt │ CURRENT_TIMESTAMP │
│ user.ssn │ (do not migrate) │ SKIP — not needed │
└─────────────────────┴──────────────────────┴──────────────────────────┘
Transformation rules:
- NULL handling: define explicit default for every nullable source field
- Type coercion: document every type change (string→number, date format changes)
- Encoding: normalize to UTF-8; detect and convert from source encoding
- Deduplication: define merge strategy for duplicate source records
- Derived fields: document calculation formulas for computed fields
- Anonymization: PII fields that should not migrate to non-production targets
Batch loading:
- Firestore: batched writes (max 500 ops per batch), with retry and exponential backoff
- PostgreSQL: COPY command for bulk inserts (10-100x faster than INSERT)
- Batch size: 1000–5000 records; tune based on target system capacity
Streaming loading:
- Process source changes in near-real-time
- Buffer writes to reduce target system load (batch micro-writes)
- Monitor lag between source event and target write
Idempotency rules:
- Use deterministic IDs (hash of source primary key + migration version)
- Use UPSERT / ON CONFLICT for SQL targets
- Use set() with merge for Firestore targets
- Log every record processed; support resumption from last checkpoint
Before starting:
1. Source data quality audit:
- [ ] NULL percentage per column (flag >10% unexpected NULLs)
- [ ] Data type consistency (strings in numeric columns, invalid dates)
- [ ] Referential integrity (orphaned foreign keys)
- [ ] Duplicate detection (primary key uniqueness, business key uniqueness)
- [ ] Value range validation (negative ages, future dates, impossible amounts)
2. Schema compatibility check:
- [ ] All source fields have a target mapping or explicit skip justification
- [ ] Target schema can accommodate maximum source field lengths
- [ ] Target constraints (NOT NULL, UNIQUE, CHECK) are satisfiable by source data
- [ ] Character encoding is compatible
Real-time monitoring:
- Record count: source extracted vs. target loaded (should match or have documented delta)
- Error rate: failures per batch (alert if >0.1%)
- Throughput: records/second (alert if drops below baseline by >50%)
- Lag: time between source change and target write (for CDC migrations)
- Checkpointing: last successfully processed record ID (for resumption)
Error handling:
- Dead letter queue for failed records (do not skip silently)
- Categorize errors: data quality, network, capacity, permission
- Auto-retry transient errors (network, rate limit) with exponential backoff
- Halt migration if error rate exceeds threshold (configurable, default 1%)
Completeness checks:
- [ ] Row/document count matches (source vs. target, per table/collection)
- [ ] Checksum comparison on critical columns (SUM, HASH of concatenated values)
- [ ] Null count comparison per column
Accuracy checks:
- [ ] Random sample validation (1% or 1000 records, whichever is larger)
- [ ] Business rule validation (calculated fields produce correct results)
- [ ] Referential integrity in target (no orphaned foreign keys)
- [ ] Date/time values preserved correctly (timezone handling)
Functional checks:
- [ ] Application smoke tests pass against target database
- [ ] Critical user flows work end-to-end
- [ ] Reporting queries produce same results against target
- [ ] Search/filter functionality returns expected results
Before migration:
1. Take a labeled backup of the target database
- PostgreSQL: pg_dump with timestamp label
- Firestore: gcloud firestore export with date prefix
- Room/SQLite: copy database file before migration object runs
2. Record the pre-migration state:
- Row counts per table/collection
- Schema version identifier
- Application version deployed
- Feature flag states
Rollback execution:
1. Stop writes to target
2. Restore from pre-migration backup
3. Revert application to previous version (or toggle feature flag)
4. Verify restored state matches pre-migration records
5. Resume normal operations
6. Conduct post-mortem on migration failure
Recovery Time Objective (RTO): document and test — target <30 minutes for critical systems
Recovery Point Objective (RPO): zero data loss during rollback (backup is pre-migration state)
During dual-write period:
- Both source and target receive all writes
- Reads default to source (target is warming up)
- Validation runs continuously comparing source and target
Rollback trigger conditions:
- Error rate >1% on target writes
- Data divergence detected between source and target
- Application latency increases >50% due to dual-write overhead
- Target system instability (connection failures, timeouts)
Rollback steps:
1. Disable dual-write (stop writing to target)
2. Revert read path to source-only
3. Target becomes stale but source is authoritative
4. No data loss — source was always receiving all writes
Migration phases controlled by feature flags:
FF: migration_read_from_target (default: false)
FF: migration_write_to_target (default: false)
FF: migration_dual_write (default: false)
Rollout sequence:
Phase 1: migration_dual_write = true (write to both)
Phase 2: migration_read_from_target = true (read from target, write to both)
Phase 3: migration_write_to_target = true (write to target only)
Phase 4: Remove flags, decommission source
Rollback: set all flags to false → instant revert to source-only
Application Layer:
┌─────────────┐
│ App Server │
│ / Client │
└──────┬───────┘
│ write
┌────┴────┐
▼ ▼
┌────────┐ ┌────────┐
│ Source │ │ Target │
│ DB │ │ DB │
└────────┘ └────────┘
Implementation:
- Write to source first (authoritative)
- Write to target second (async or sync depending on consistency needs)
- If target write fails: log to retry queue, do NOT fail the user request
- Read from source until validation confirms target consistency
- Gradual read traffic shift: 0% → 10% → 50% → 100% target
┌────────┐ ┌─────────┐ ┌───────────┐ ┌────────┐
│ Source │───▶│ CDC │───▶│ Transform │───▶│ Target │
│ DB │ │ (Debezium│ │ Service │ │ DB │
└────────┘ │ / WAL) │ └───────────┘ └────────┘
└─────────┘
│
▼
┌────────────┐
│ Dead Letter │
│ Queue │
└────────────┘
PostgreSQL CDC setup:
ALTER SYSTEM SET wal_level = logical;
CREATE PUBLICATION migration_pub FOR TABLE users, orders, products;
-- On target:
CREATE SUBSCRIPTION migration_sub
CONNECTION 'host=source dbname=app'
PUBLICATION migration_pub;
Firestore CDC:
// Cloud Function triggered on document changes
exports.syncToTarget = functions.firestore
.document('{collection}/{docId}')
.onWrite(async (change, context) => {
// Transform and write to target
});
Scenario: flattening nested subcollections into top-level collections
Before:
users/{userId}/orders/{orderId} (subcollection)
After:
orders/{orderId} (top-level, with userId field)
Migration script (Cloud Function):
const BATCH_SIZE = 500;
async function migrateOrders() {
const usersSnap = await db.collection('users').get();
for (const userDoc of usersSnap.docs) {
const ordersSnap = await db
.collection('users').doc(userDoc.id)
.collection('orders').get();
let batch = db.batch();
let count = 0;
for (const orderDoc of ordersSnap.docs) {
const newRef = db.collection('orders').doc(orderDoc.id);
batch.set(newRef, {
...orderDoc.data(),
userId: userDoc.id,
migratedAt: admin.firestore.FieldValue.serverTimestamp(),
}, { merge: true }); // merge for idempotency
count++;
if (count % BATCH_SIZE === 0) {
await batch.commit();
batch = db.batch();
}
}
if (count % BATCH_SIZE !== 0) {
await batch.commit();
}
}
}
Scenario: converting string timestamps to Firestore Timestamps
Migration approach (expand-contract):
Phase 1: Add new field (createdAtTs: Timestamp) alongside old field (createdAt: string)
Phase 2: Backfill new field from old field
Phase 3: Update application to write both, read new
Phase 4: Drop old field in cleanup migration
Backfill script:
async function backfillTimestamps(collectionName: string) {
let lastDoc = null;
let processed = 0;
while (true) {
let query = db.collection(collectionName)
.where('createdAtTs', '==', null)
.limit(500);
if (lastDoc) {
query = query.startAfter(lastDoc);
}
const snap = await query.get();
if (snap.empty) break;
const batch = db.batch();
for (const doc of snap.docs) {
const dateStr = doc.data().createdAt;
const timestamp = admin.firestore.Timestamp.fromDate(new Date(dateStr));
batch.update(doc.ref, { createdAtTs: timestamp });
}
await batch.commit();
lastDoc = snap.docs[snap.docs.length - 1];
processed += snap.size;
console.log(`Processed ${processed} documents`);
}
}
Rules for Firestore backfill scripts:
1. Always paginate with startAfter — never load entire collection
2. Use batched writes (max 500 operations per batch)
3. Add a migratedAt timestamp to every modified document
4. Support resumption: log last processed document ID
5. Rate limit: add delays between batches if hitting Firestore write limits
6. Dry run mode: first run should log what would change without writing
7. Validation: count documents before and after, verify sample
8. Idempotent: use set with merge or check-before-write patterns
9. Run during low-traffic windows (check Firebase console for traffic patterns)
10. Monitor Firestore usage dashboard during execution
Immediately after cutover:
- [ ] Row/document counts match expected values
- [ ] Application health checks passing
- [ ] Error rate at or below pre-migration baseline
- [ ] Latency at or below pre-migration baseline
- [ ] All critical user flows tested manually
- [ ] Search and filter results verified
- [ ] Reporting queries producing correct output
24 hours after cutover:
- [ ] No data drift detected (scheduled consistency checks)
- [ ] Backup of target database confirmed
- [ ] No increase in support tickets or error reports
- [ ] CDC pipeline (if used) lag is consistently <5 seconds
7 days after cutover:
- [ ] Source system decommission decision made
- [ ] Migration scripts archived in version control
- [ ] Documentation updated (architecture diagrams, runbooks)
- [ ] Lessons learned captured
Metrics to watch post-migration:
- Query latency (p50, p95, p99) — compare to pre-migration baseline
- Error rate per endpoint — compare to pre-migration baseline
- Database CPU/memory/IOPS — watch for capacity issues
- Connection pool utilization — watch for exhaustion
- Replication lag (if applicable) — alert if >10 seconds
- Firestore read/write/delete counts — watch for unexpected spikes
- Application-level health checks — all green
Alert thresholds:
- Latency p95 >2x pre-migration baseline → page on-call
- Error rate >1% → page on-call
- Replication lag >30 seconds → alert migration team
- Database CPU >80% sustained → alert infrastructure team
DATA MIGRATION REPORT
Migration: [SOURCE] → [TARGET]
Date: [TODAY]
Engineer: [NAME]
MIGRATION SUMMARY
┌──────────────────────┬────────────────────────────────────┐
│ Field │ Value │
├──────────────────────┼────────────────────────────────────┤
│ Migration Type │ [From Step 1 classification] │
│ Strategy │ [From Step 3 selection] │
│ Total Records │ [count] │
│ Duration │ [HH:MM] │
│ Downtime │ [HH:MM or "zero"] │
│ Error Rate │ [percentage] │
│ Rollback Tested │ [Yes/No] │
│ Validation Passed │ [Yes/No — with details] │
└──────────────────────┴────────────────────────────────────┘
DELIVERABLES GENERATED:
- [ ] Migration strategy document
- [ ] ETL pipeline (extract, transform, load scripts)
- [ ] Field mapping document
- [ ] Validation framework (pre, during, post checks)
- [ ] Rollback plan (tested)
- [ ] Monitoring dashboard configured
- [ ] Post-migration verification completed
CROSS-REFERENCES:
- /database-architect — for schema design and indexing strategy
- /firebase-architect — for Firestore-specific patterns and security rules
- /infrastructure-scaffold — for cloud infrastructure provisioning
- /incident-response — for migration failure runbook
Generate migration infrastructure using Write:
migrations/{timestamp}_{name}.ts with up/down functionsscripts/validate-migration.ts — pre/post migration data integrity checksscripts/rollback-migration.ts — reverses last applied migrationscripts/backup-before-migration.sh — snapshot before migration.github/workflows/migration-test.yml — runs migration against test DBBefore generating, Glob for existing migrations (**/migrations/**) and Read them to match format.