Converts data files between formats and applies structural transformations.
Install:

```
/plugin marketplace add majesticlabs-dev/majestic-marketplace
/plugin install majestic-data@majestic-marketplace
```

Usage:

```
/data:transform <source> --to <target> [--flatten] [--schema <path>] [--columns <list>] [--filter <expr>] [--sheet <name>]
```
Input: $ARGUMENTS
Parse arguments:

- `--to <path>`: Output file path (format inferred from extension)
- `--flatten`: Flatten nested JSON structures
- `--schema <path>`: Apply schema/type casting
- `--columns <list>`: Select specific columns
- `--filter <expr>`: Filter rows by expression
- `--sheet <name>`: Worksheet to read when the source is an Excel file

A minimal parsing sketch follows the conversion table below.

Supported conversions:

| From | To | Notes |
|---|---|---|
| CSV | Parquet | Recommended for analytics |
| CSV | JSON | Writes JSON Lines |
| JSON | CSV | Flattens if needed |
| JSON | Parquet | With optional flatten |
| Parquet | CSV | For compatibility |
| Excel | CSV/Parquet | Specify sheet with --sheet |
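As referenced above, here is a minimal sketch of parsing these flags out of `$ARGUMENTS`; the `argparse` setup and the `arguments` variable are illustrative assumptions, not the plugin's actual parser:

```python
import argparse
import shlex

# Hypothetical parser mirroring the documented flags (assumption)
parser = argparse.ArgumentParser(prog="/data:transform")
parser.add_argument("source")
parser.add_argument("--to", dest="target", required=True)
parser.add_argument("--flatten", action="store_true")
parser.add_argument("--schema")
parser.add_argument("--columns")   # comma-separated, e.g. "id,name,value"
parser.add_argument("--filter")    # pandas-style row expression
parser.add_argument("--sheet")     # Excel worksheet name

args = parser.parse_args(shlex.split(arguments))  # `arguments` holds $ARGUMENTS
```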
Determine source format and structure:

```python
import json
from pathlib import Path

source_ext = Path(source).suffix.lower()
target_ext = Path(target).suffix.lower()

if source_ext == '.json':
    # Parse the file and inspect the first record for nesting.
    # (json.loads on truncated text would raise, so parse the whole document.)
    data = json.loads(Path(source).read_text())
    first = data[0] if isinstance(data, list) else data
    is_nested = any(isinstance(v, (dict, list)) for v in first.values())
```
Simple format conversion:

```python
import pandas as pd

if source_ext == '.csv' and target_ext == '.parquet':
    df = pd.read_csv(source)
    df.to_parquet(target, compression='snappy', index=False)
```
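The other cells of the conversion matrix follow the same pattern. As one more example, a sketch of the Excel path, assuming `pd.read_excel` has an engine available (e.g. openpyxl) and that `sheet` holds the parsed `--sheet` value:

```python
import pandas as pd

# Hypothetical Excel branch; `sheet` is the parsed --sheet value (assumption)
if source_ext in ('.xlsx', '.xls') and target_ext == '.csv':
    df = pd.read_excel(source, sheet_name=sheet if sheet else 0)
    df.to_csv(target, index=False)
```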
With flattening (JSON):

```python
Task(subagent_type="majestic-data:transform:json-flattener",
     prompt="Flatten nested JSON and convert to target format")
```
With schema application:

```python
Task(subagent_type="majestic-data:research:schema-discoverer",
     prompt="Infer and apply schema, then convert")
```
Run basic validation on output:
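For example, a minimal check assuming a Parquet target; row counts should match unless `--filter` dropped rows:

```python
import pandas as pd

# Re-read the output and confirm no rows were lost in conversion
out_df = pd.read_parquet(target)
assert len(out_df) == len(df), f"row count mismatch: {len(df)} in, {len(out_df)} out"
```

Then report results in this format: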
# Transform Complete
**Source:** events.json (1.2 GB, 5.2M rows)
**Target:** events.parquet
## Transformation Applied
- Format: JSON → Parquet
- Flattening: Yes (depth 3 → flat)
- Compression: Snappy
## Column Mapping
| Original Path | Output Column | Type |
|---------------|---------------|------|
| id | id | int64 |
| user.id | user_id | int64 |
| user.email | user_email | string |
| event.type | event_type | string |
| event.properties | properties_json | string |
| timestamp | timestamp | datetime64 |
## Metrics
| Metric | Value |
|--------|-------|
| Source size | 1.2 GB |
| Output size | 245 MB |
| Compression ratio | 4.9x |
| Source rows | 5,200,000 |
| Output rows | 5,200,000 |
| Duration | 45 seconds |
## Notes
- Nested `event.properties` kept as JSON string (too varied to flatten)
- Timestamps converted from Unix epoch to datetime
- 3 columns with >99% nulls dropped: legacy_field_1, legacy_field_2, debug_info
Examples:

```
# Simple CSV to Parquet
/data:transform orders.csv --to orders.parquet

# Flatten nested JSON
/data:transform events.json --to events.parquet --flatten

# Select specific columns
/data:transform data.csv --to subset.parquet --columns "id,name,value"

# Filter rows
/data:transform orders.csv --to active.parquet --filter "status == 'active'"

# Excel to CSV (specific sheet)
/data:transform report.xlsx --to data.csv --sheet "Q4 Data"
```
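Under the hood, `--columns` and `--filter` map naturally onto pandas primitives; this mapping is an assumption inferred from the flag syntax (the filter expression above is valid `DataFrame.query` syntax), and the variable names are hypothetical:

```python
import pandas as pd

# --columns "id,name,value": read only the requested columns
df = pd.read_csv(source, usecols=[c.strip() for c in columns.split(',')])

# --filter "status == 'active'": keep matching rows via DataFrame.query
df = df.query(filter_expr)
```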