add-data-dictionary | Claude-Data-Wrangler

Stats

Actions

Tags

add-data-dictionary | Claude-Data-Wrangler

Add Data Dictionary

Generate a data dictionary file alongside a dataset.

When to invoke

Dataset has no data_dictionary.{md,yaml,json,csv} in its folder.
User asks to "document the dataset", "create a schema", "add a data dictionary".
Other skills in this plugin (text-to-numeric, add-iso3166, etc.) need a dictionary to update and find none — offer this skill.

Output format

Default to data_dictionary.md in the dataset's folder. Offer alternatives:

data_dictionary.yaml for structured/programmatic use.
data_dictionary.json for schema-validator workflows.
data_dictionary.csv for spreadsheet editing.

Markdown template

# Data Dictionary: <dataset filename>

**Source**: <file path or URL>
**Format**: <csv|json|jsonl|parquet|xlsx>
**Rows**: <count>
**Generated**: <YYYY-MM-DD>
**Generated by**: Claude-Data-Wrangler add-data-dictionary skill

## Columns

| Column | Type | Description | Units | Nullable | Example | Notes |
|---|---|---|---|---|---|---|
| country | string | Country name as supplied in source | — | No | "France" | Standardised via standardise-country-names skill |
| iso3166_alpha2 | string | ISO 3166-1 alpha-2 country code | — | No | "FR" | Derived column |
| revenue_numeric | float | Revenue parsed from original text | USD | Yes | 4270000.0 | Original format: `$4.27M`, scale suffix applied |
| ... | ... | ... | ... | ... | ... | ... |

## Provenance / Transformations

- <YYYY-MM-DD>: Source file ingested.
- <YYYY-MM-DD>: Country names standardised (skill: standardise-country-names).
- <YYYY-MM-DD>: ISO 3166 codes added (skill: add-iso3166).
- <YYYY-MM-DD>: Currency enrichment applied (skill: enrich-with-currency).

## Known issues / limitations

- <list unresolved rows, ambiguous mappings, etc.>

Procedure

Locate the dataset. Confirm path and format.
Profile each column using pandas:
- dtype
- Nullability (has any NaN?)
- Unique value count
- Sample value (first non-null)
- For numeric: min, max, mean (optional, ask user).
Ask the user for descriptions — the profile gives type and examples, but what the column means usually needs human input. Offer two modes:
- Interactive: go column-by-column, ask for description.
- Stub: generate entries with <TODO: describe> placeholders and let the user fill in later.
Include provenance — if invoked as part of a pipeline (other wrangler skills have run), log each transformation with date.
Write the dictionary file to the same folder as the dataset.
Report where it was written and remind the user to update it when the dataset changes (via update-data-dictionary).

Dependencies

pip install pandas pyarrow openpyxl

Edge cases

Nested JSON — flatten one level for the dictionary, note nesting in the description, and link to the json-restructure skill if the user wants restructuring.
Very wide datasets (>50 columns) — default to stub mode; interactive mode is tedious.
Binary/blob columns — record type as bytes and note "binary data, not profiled".