Install

Install the plugin:

npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-data

Want just this skill? Add it to a custom plugin, then install with one command.
Description
Analyze data source characteristics including update frequency, volume patterns, and schema stability.
Tool Access
This skill is limited to using the following tools:
Read, Grep, Glob, Bash
Skill Content
Source Analyzer
Audience: Data engineers planning ETL pipelines.
Goal: Characterize data sources for extraction planning by analyzing volume, update patterns, schema, and quality baselines.
Analysis Dimensions
Volume Analysis
- Current row count
- Historical growth rate (if multiple snapshots available)
- Projected growth
- Peak vs average volume
- Seasonality patterns
Update Patterns
- Full refresh vs incremental
- Update frequency (hourly, daily, weekly)
- Batch arrival times
- Late-arriving data windows
Schema Characteristics
- Column count and types
- Nullable patterns
- Primary key candidates
- Foreign key relationships
- Schema change history
Data Quality Baseline
- Typical null rates per column
- Expected value ranges
- Cardinality expectations
- Known data issues
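These baselines can be collected with a short profiling script. Below is a minimal sketch using Python's built-in sqlite3 for illustration; the table, columns, and sample rows are hypothetical, and a production profiler would run equivalent queries against the real source:

```python
import sqlite3

def profile_columns(conn, table, columns):
    """Compute row count, per-column null rate, and cardinality."""
    cur = conn.cursor()
    total = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    profile = {"row_count": total, "columns": {}}
    for col in columns:
        nulls = cur.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
        ).fetchone()[0]
        # COUNT(DISTINCT ...) ignores NULLs, so this is non-null cardinality
        distinct = cur.execute(
            f"SELECT COUNT(DISTINCT {col}) FROM {table}"
        ).fetchone()[0]
        profile["columns"][col] = {
            "null_rate": nulls / total if total else 0.0,
            "cardinality": distinct,
        }
    return profile

# Hypothetical sample data for demonstration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "pending"), (2, "shipped"), (3, None), (4, "shipped")],
)
print(profile_columns(conn, "orders", ["status"]))
```

Running the same profile on snapshots from different dates lets you compare baselines over time rather than trusting a single sample.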
Workflow
- Inventory the source: file location or database connection, access patterns (API, file drop, DB query), authentication requirements
- Sample multiple time periods: analyze data from different dates; identify temporal patterns
- Profile the schema: document all columns, identify stable vs volatile columns, note computed/derived columns
- Assess quality: baseline quality metrics, known quirks, required transformations
- Document extraction requirements: optimal extraction method, incremental key columns, filtering criteria
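Sampling multiple time periods reduces to simple arithmetic once you have row counts from a few snapshots. A minimal sketch; the snapshot dates and counts are invented for illustration:

```python
from datetime import date

def daily_growth(snapshots):
    """Estimate average daily row growth from (date, row_count) snapshots."""
    snaps = sorted(snapshots)
    (d0, n0), (d1, n1) = snaps[0], snaps[-1]
    days = (d1 - d0).days
    return (n1 - n0) / days if days else 0.0

# Hypothetical snapshots taken a week apart
snaps = [(date(2026, 3, 1), 5_095_000), (date(2026, 3, 8), 5_200_000)]
print(daily_growth(snaps))  # 15000.0
```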
Output: Source Specification
```yaml
source:
  name: customer_orders
  type: postgresql
  connection: orders_db

extraction:
  method: incremental
  key_column: updated_at
  batch_size: 100000
  frequency: hourly

schema:
  columns:
    - name: order_id
      type: bigint
      nullable: false
      primary_key: true
    - name: customer_id
      type: bigint
      nullable: false
      foreign_key: customers.id
    - name: order_date
      type: date
      nullable: false
    - name: total_amount
      type: decimal(10,2)
      nullable: false
    - name: status
      type: varchar(20)
      nullable: false
      values: [pending, confirmed, shipped, delivered, cancelled]
    - name: updated_at
      type: timestamp
      nullable: false
      incremental_key: true

volume:
  current_rows: 5_200_000
  daily_growth: 15_000
  peak_hours: [10, 14, 18]

quality:
  known_issues:
    - "status can be null for orders before 2023"
    - "total_amount occasionally negative (refunds)"
  null_rates:
    customer_id: 0%
    order_date: 0%
    status: 0.5%

recommendations:
  - "Use updated_at for incremental loads"
  - "Add check constraint for status values"
  - "Consider partitioning by order_date"
```
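A spec like this can drive extraction directly. Below is a minimal sketch that turns the extraction section into a batched incremental SELECT; the spec is held as a plain dict here (in practice it would be parsed from the YAML file), and the watermark value is a placeholder:

```python
def build_incremental_query(spec, last_watermark):
    """Build a batched incremental extraction query from a source spec."""
    ext = spec["extraction"]
    # Illustration only: real code should bind the watermark as a query
    # parameter instead of interpolating it into the SQL string.
    return (
        f"SELECT * FROM {spec['source']['name']} "
        f"WHERE {ext['key_column']} > '{last_watermark}' "
        f"ORDER BY {ext['key_column']} "
        f"LIMIT {ext['batch_size']}"
    )

spec = {
    "source": {"name": "customer_orders"},
    "extraction": {"key_column": "updated_at", "batch_size": 100000},
}
print(build_incremental_query(spec, "2026-03-15 00:00:00"))
```

Each batch's maximum updated_at becomes the next run's watermark, which is why the spec flags that column as the incremental key.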
Comparison Analysis
When analyzing multiple related sources:
| Attribute | Orders | Order_Items |
|---|---|---|
| Row count | 5.2M | 18.7M |
| Daily growth | 15K | 52K |
| Key column | order_id | item_id |
| Join key | order_id | order_id |
| Update lag | < 1 hour | < 1 hour |
Relationship: 1:N (avg 3.6 items per order)
Join strategy: hash join on order_id
Load order: Orders first, then Order_Items
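The relationship figure follows directly from the row counts in the table above. A minimal check:

```python
def items_per_parent(parent_rows, child_rows):
    """Average child rows per parent row in a 1:N relationship."""
    return child_rows / parent_rows

# Row counts from the comparison table: Orders 5.2M, Order_Items 18.7M
ratio = items_per_parent(5_200_000, 18_700_000)
print(round(ratio, 1))  # 3.6
```

A ratio far from the expected value after a load is a cheap signal that one side is missing rows, which is also why parents load before children.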
API Source Analysis
For API-based sources:
```yaml
api_source:
  name: stripe_payments
  base_url: https://api.stripe.com/v1
  auth: bearer_token
  endpoints:
    - path: /charges
      method: GET
      pagination: cursor
      rate_limit: 100/sec
      params:
        created[gte]: "{last_sync}"

extraction:
  strategy: cursor_pagination
  page_size: 100
  sync_frequency: 15min
  full_refresh_weekly: true

data_characteristics:
  avg_response_size: 50KB
  records_per_page: 100
  typical_daily_volume: 5000

error_handling:
  retry_codes: [429, 500, 502, 503]
  max_retries: 3
  backoff: exponential
```
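The extraction and error_handling settings combine into a single fetch loop. Below is a minimal sketch with the HTTP call abstracted behind a fetch(cursor) callable so it can be exercised without a network; the retry codes and exponential backoff mirror the spec, and the stubbed responses are invented for illustration:

```python
import time

RETRY_CODES = {429, 500, 502, 503}

def paginate(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Walk a cursor-paginated endpoint, retrying transient errors
    with exponential backoff."""
    cursor, records = None, []
    while True:
        for attempt in range(max_retries + 1):
            status, page, next_cursor = fetch(cursor)
            if status == 200:
                break
            if status not in RETRY_CODES or attempt == max_retries:
                raise RuntimeError(f"giving up after status {status}")
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        records.extend(page)
        if next_cursor is None:
            return records
        cursor = next_cursor

# Stubbed fetch: two pages, with one transient 429 before the second page
responses = [(200, [1, 2], "c1"), (429, [], None), (200, [3], None)]
fetch = lambda cursor: responses.pop(0)
print(paginate(fetch, sleep=lambda s: None))  # [1, 2, 3]
```

Injecting the sleep function keeps tests fast; a real extractor would also honor the 100/sec rate limit between successful requests.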