Infers a semantic ontology from working-directory files: entities, attributes, relationships, taxonomy, and join paths. Profiles tabular data (CSV/Excel/Parquet) via qsv tools. Outputs `ONTOLOGY.md`.
From qsv-data-wrangling: `npx claudepluginhub dathere/qsv --plugin qsv-data-wrangling`

This skill is limited to using the following tools:
Scan all files in the current working directory, profile each one, then synthesize a semantic ontology describing the entities, their attributes, the relationships between files, and the domain taxonomy.
Cowork note: If relative paths don't resolve, call `qsv_get_working_dir` and `qsv_set_working_dir` to sync the working directory.
1. Sync working directory: Call `qsv_get_working_dir` to confirm the current path. If needed, call `qsv_set_working_dir`.
2. List files: Call `qsv_list_files` to get all files in the working directory. Classify each file:
   Tabular (handled natively by qsv MCP Server — auto-converted to CSV if needed):
   - CSV family (`.csv`, `.tsv`, `.ssv`, `.tab`, and `.sz` compressed variants)
   - Excel/ODS (`.xlsx`, `.xls`, `.xlsm`, `.xlsb`, `.ods`)
   - JSON Lines (`.jsonl`, `.ndjson`)
   - Parquet (`.parquet`, `.pq`)

   Non-tabular (best-effort extraction):
   - Markdown (`.md`), Text (`.txt`), README files — read for domain context

3. Run data-profile on each tabular file: For every tabular file discovered in step 2, execute the full /data-profile workflow (steps 1-13). This produces for each file:
   - Summary statistics (`.stats.csv` cache)
   - Frequency tables
   - Data Dictionary
   - Tags

   Run profiles sequentially to avoid overwhelming the MCP server. After each profile completes, retain the key outputs (stats, frequencies, Data Dictionary, Tags) for cross-file analysis.
4. Extract context from non-tabular files: For each non-tabular file:
5. Identify shared columns: Compare column names across all profiled files. Flag columns that appear in multiple files (exact name match or close variants like `customer_id` / `cust_id` / `customerid`).
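The close-variant flagging can be sketched as follows. The normalization rules (lowercasing, stripping separators) are illustrative assumptions, not part of the skill; they won't catch abbreviations like `cust_id`, which need an alias table:

```python
def normalize(name: str) -> str:
    """Normalize a column name for fuzzy matching: lowercase and strip
    separators so customer_id, CustomerID and customerid collapse to one key."""
    return name.lower().replace("_", "").replace("-", "")

def shared_columns(schemas: dict[str, list[str]]) -> dict[str, list[tuple[str, str]]]:
    """Map normalized key -> [(file, original column name), ...] for every
    column that appears in two or more files."""
    seen: dict[str, list[tuple[str, str]]] = {}
    for file, cols in schemas.items():
        for col in cols:
            seen.setdefault(normalize(col), []).append((file, col))
    # keep only keys present in more than one distinct file
    return {k: v for k, v in seen.items() if len({f for f, _ in v}) > 1}

schemas = {
    "customers.csv": ["customer_id", "name", "email"],
    "orders.csv": ["order_id", "CustomerID", "total"],
}
# customer_id / CustomerID collapse to the same key and get flagged
print(shared_columns(schemas))
```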
6. Validate join candidates: For each pair of files sharing column names, read `.stats.csv` and check:
   - `type` from stats — both sides must have the same type; if mismatched, the relationship is invalid unless one side is cast
   - `cardinality` — a foreign-key column typically has cardinality ≤ the primary key's cardinality
   - `uniqueness_ratio` — a value of 1.0 identifies the primary-key side of the relationship
   - `nullcount` / `sparsity` — join columns with sparsity > 0.3 are unreliable join candidates (nulls don't match)
   - value ranges (`min`/`max`) — overlapping ranges suggest a real relationship
   - `skewness` — highly skewed join columns (|skewness| > 2) may indicate data quality issues masking relationships

   Use `qsv_sqlp` to test overlap when needed:

   ```sql
   SELECT COUNT(DISTINCT a.col) AS overlap
   FROM read_csv('file1.csv') a
   INNER JOIN read_csv('file2.csv') b ON a.col = b.col
   ```
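When the stats cache is not at hand, the same signals can be approximated directly on column values. This is an illustrative Python sketch, not part of the qsv toolset; the field names mirror the checks above:

```python
def join_candidate_report(left: list, right: list) -> dict:
    """Evaluate a column pair as a join candidate using the signals above:
    uniqueness ratio, sparsity, and value overlap between the two sides."""
    def profile(values):
        non_null = [v for v in values if v is not None]
        return {
            "cardinality": len(set(non_null)),
            "uniqueness_ratio": len(set(non_null)) / len(non_null) if non_null else 0.0,
            "sparsity": 1 - len(non_null) / len(values) if values else 0.0,
        }
    l, r = profile(left), profile(right)
    l_set = {v for v in left if v is not None}
    r_set = {v for v in right if v is not None}
    return {
        "left": l,
        "right": r,
        # fraction of right-side values that find a partner on the left
        "overlap": len(l_set & r_set) / len(r_set) if r_set else 0.0,
        # the PK side has uniqueness_ratio == 1.0; the FK side repeats values
        "likely_pk_side": ("left" if l["uniqueness_ratio"] == 1.0
                           else "right" if r["uniqueness_ratio"] == 1.0
                           else None),
        # sparsity > 0.3 on either side makes the join unreliable
        "reliable": l["sparsity"] <= 0.3 and r["sparsity"] <= 0.3,
    }

report = join_candidate_report([1, 2, 3, 4], [1, 1, 2, None, 3])
```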
7. Detect relationship types: Classify each validated relationship as 1:1, 1:N, or N:M.
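The classification follows from the uniqueness ratios gathered in step 6. An illustrative sketch (the N:1 case is simply 1:N with the sides swapped when written to the Relationships table):

```python
def classify_relationship(left_unique_ratio: float, right_unique_ratio: float) -> str:
    """Classify a validated relationship from the uniqueness ratio of the
    join column on each side (1.0 means every value is distinct)."""
    left_unique = left_unique_ratio == 1.0
    right_unique = right_unique_ratio == 1.0
    if left_unique and right_unique:
        return "1:1"
    if left_unique:
        return "1:N"   # left is the PK side; right repeats its values
    if right_unique:
        return "N:1"   # record as 1:N with source and target swapped
    return "N:M"       # neither side unique; usually joined via a bridge table
```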
8. Define entities: Each tabular file represents one entity (or several if it is a denormalized/wide table). For each entity:
   - Derive a singular entity name from the filename (e.g., `customers.csv` → Customer)

9. Define relationships: From the cross-file analysis (steps 5-7), document each relationship:
10. Infer domain taxonomy: Using all collected information (column names, value distributions, Tags from each file, non-tabular file context):
11. Assess data quality across the collection: Summarize cross-cutting quality concerns.

Write the synthesized ontology to `ONTOLOGY.md` using the following template:
# Ontology: {Domain Name}
> Auto-generated ontology inferred from {N} files in `{working_directory}`.
> Generated: {date}
## Overview
{3-5 sentence summary of the dataset collection: what domain it covers, how many
entities, total rows across all files, key relationships, and overall data quality.}
## Domain Taxonomy
**Domain**: {Primary domain}
**Subdomains**: {Comma-separated list}
**Temporal scope**: {Earliest date} to {Latest date} (if applicable)
**Tags**: {Merged and deduplicated tags from all file profiles}
## Entities
### {Entity Name} (`{filename}`)
{Dataset Description from data-profile}
| Field | Type | Label | Description | Nullable | Cardinality | Key |
|-------|------|-------|-------------|----------|-------------|-----|
| ... | ... | ... | ... | ... | ... | PK/FK/— |
**Quality notes**: {Any quality flags from profiling — PII, high nulls, injection, etc.}
{Repeat for each entity}
## Relationships
| Source | Target | Source Column | Target Column | Type | Overlap | Join Expression |
|--------|--------|---------------|---------------|------|---------|-----------------|
| ... | ... | ... | ... | 1:N | 98.5% | `joinp --left file1.csv file2.csv --columns col` |
### Relationship Diagram
{ASCII or text-based entity-relationship diagram showing entities as boxes
and relationships as labeled arrows. Example:}
```
[Customer] 1──→N [Order] N←──1 [Product]
│ │
└──────── N:M ─────────────────┘
(via OrderItem)
```
## Controlled Vocabularies
{For columns with low cardinality that serve as classification dimensions}
### {Column Name} (`{filename}`)
| Value | Frequency | Description |
|-------|-----------|-------------|
| ... | ... | ... |
## Cross-Collection Quality Summary
| Dimension | Status | Details |
|-----------|--------|---------|
| Naming consistency | {OK/Warning} | {Details} |
| Referential integrity | {OK/Warning} | {Orphaned keys found in ...} |
| PII/PHI exposure | {OK/Warning} | {Columns flagged in ...} |
| Type consistency | {OK/Warning} | {Mismatched types for ...} |
| Completeness | {OK/Warning} | {High-null columns in ...} |
## Non-Tabular Context
{Summary of information extracted from non-tabular files that informed the ontology}
| File | Type | Contribution |
|------|------|-------------|
| ... | ... | {What domain context or definitions it provided} |
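The Cross-Collection Quality Summary rows can be rendered mechanically from per-file findings. A minimal sketch (the dimension names follow the template; the findings shown are hypothetical):

```python
def quality_summary(flags: dict[str, list[str]]) -> str:
    """Render the Cross-Collection Quality Summary as a markdown table.
    `flags` maps dimension name -> list of findings; an empty list means OK."""
    lines = [
        "| Dimension | Status | Details |",
        "|-----------|--------|---------|",
    ]
    for dim, findings in flags.items():
        status = "Warning" if findings else "OK"
        details = "; ".join(findings) if findings else "-"
        lines.append(f"| {dim} | {status} | {details} |")
    return "\n".join(lines)

# hypothetical findings from profiling two files
print(quality_summary({
    "Naming consistency": [],
    "Referential integrity": ["orphaned customer_id values in orders.csv"],
}))
```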
Note: Parquet files are queried directly via `sqlp` with `read_parquet()` rather than converted to CSV.