Analyzes data source characteristics — volume patterns, update frequency, schema stability, and quality baselines — for ETL pipeline planning. Outputs YAML extraction specs.
```
npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-data
```
**Audience:** Data engineers planning ETL pipelines.
**Goal:** Characterize data sources for extraction planning by analyzing volume, update patterns, schema, and quality baselines.
Example extraction spec for a PostgreSQL source:

```yaml
source:
  name: customer_orders
  type: postgresql
  connection: orders_db

extraction:
  method: incremental
  key_column: updated_at
  batch_size: 100000
  frequency: hourly

schema:
  columns:
    - name: order_id
      type: bigint
      nullable: false
      primary_key: true
    - name: customer_id
      type: bigint
      nullable: false
      foreign_key: customers.id
    - name: order_date
      type: date
      nullable: false
    - name: total_amount
      type: decimal(10,2)
      nullable: false
    - name: status
      type: varchar(20)
      nullable: true  # nulls exist in pre-2023 rows (see known_issues)
      values: [pending, confirmed, shipped, delivered, cancelled]
    - name: updated_at
      type: timestamp
      nullable: false
      incremental_key: true

volume:
  current_rows: 5_200_000
  daily_growth: 15_000
  peak_hours: [10, 14, 18]

quality:
  known_issues:
    - "status can be null for orders before 2023"
    - "total_amount occasionally negative (refunds)"
  null_rates:
    customer_id: 0%
    order_date: 0%
    status: 0.5%

recommendations:
  - "Use updated_at for incremental loads"
  - "Add check constraint for status values"
  - "Consider partitioning by order_date"
```
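A spec like this can drive extraction code directly. The sketch below is illustrative only — `build_incremental_query` and its watermark parameter are hypothetical names, not part of the skill's output — and shows how the `extraction` block's key column and batch size translate into an incremental batch query:

```python
# Minimal sketch: derive an incremental batch query from a parsed spec.
# The spec dict mirrors the YAML above; loading/connection handling omitted.
spec = {
    "source": {"name": "customer_orders"},
    "extraction": {
        "method": "incremental",
        "key_column": "updated_at",
        "batch_size": 100000,
    },
}

def build_incremental_query(spec, last_watermark):
    """Select rows changed since the last watermark, in key order, one batch at a time."""
    src = spec["source"]["name"]
    ext = spec["extraction"]
    return (
        f"SELECT * FROM {src} "
        f"WHERE {ext['key_column']} > '{last_watermark}' "
        f"ORDER BY {ext['key_column']} "
        f"LIMIT {ext['batch_size']}"
    )

print(build_incremental_query(spec, "2024-01-01 00:00:00"))
# → SELECT * FROM customer_orders WHERE updated_at > '2024-01-01 00:00:00'
#   ORDER BY updated_at LIMIT 100000
```

In production code the watermark should be passed as a bound query parameter rather than interpolated into the string, to avoid SQL injection.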
When analyzing multiple related sources:
| Attribute | Orders | Order_Items |
|---|---|---|
| Row count | 5.2M | 18.7M |
| Daily growth | 15K | 52K |
| Key column | order_id | item_id |
| Join key | order_id | order_id |
| Update lag | < 1 hour | < 1 hour |
- Relationship: 1:N (avg 3.6 items per order)
- Join strategy: hash join on order_id
- Load order: Orders first, then Order_Items
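The "parents before children" load order generalizes to deeper source graphs via a topological sort. A minimal sketch, assuming dependencies are declared as child → list of parent tables (table names here are illustrative):

```python
# Order table loads so every parent is loaded before its children.
def load_order(deps):
    """deps: {table: [parent tables it joins to]}. Returns a safe load sequence."""
    ordered, seen = [], set()

    def visit(table):
        if table in seen:
            return
        seen.add(table)
        for parent in deps.get(table, []):
            visit(parent)          # load parents first
        ordered.append(table)

    for table in deps:
        visit(table)
    return ordered

print(load_order({"order_items": ["orders"], "orders": []}))
# → ['orders', 'order_items']
```

This sketch assumes the dependency graph is acyclic; cyclic foreign keys would need cycle detection or deferred constraint checks.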
For API-based sources:
```yaml
api_source:
  name: stripe_payments
  base_url: https://api.stripe.com/v1
  auth: bearer_token
  endpoints:
    - path: /charges
      method: GET
      pagination: cursor
      rate_limit: 100/sec
      params:
        "created[gte]": "{last_sync}"

extraction:
  strategy: cursor_pagination
  page_size: 100
  sync_frequency: 15min
  full_refresh_weekly: true

data_characteristics:
  avg_response_size: 50KB
  records_per_page: 100
  typical_daily_volume: 5000

error_handling:
  retry_codes: [429, 500, 502, 503]
  max_retries: 3
  backoff: exponential
```
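The `error_handling` block maps onto a retry loop like the sketch below. This is illustrative, not Stripe's actual client API: `fetch_page` is a stand-in for a real HTTP call and is assumed to return `(status, records, next_cursor)`:

```python
import time

RETRYABLE = {429, 500, 502, 503}  # matches retry_codes in the spec

def fetch_all(fetch_page, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Drain a cursor-paginated endpoint, retrying transient errors with exponential backoff."""
    records, cursor = [], None
    while True:
        for attempt in range(max_retries + 1):
            status, page, next_cursor = fetch_page(cursor)
            if status == 200:
                break
            if status not in RETRYABLE or attempt == max_retries:
                raise RuntimeError(f"giving up after HTTP {status}")
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        records.extend(page)
        if next_cursor is None:
            return records
        cursor = next_cursor
```

The `sleep` parameter is injected so the backoff can be faked in tests; a real extractor would also honor the endpoint's `rate_limit` between successful pages.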