From the data-annotation plugin
Main entry point for turning raw data into a dataset. Examines a source (GitHub repo, local path, URL), works with the user to define the target task, then plans and executes a prep pipeline — profiling, PII handling, column curation, format normalization, splits, and annotation schema design. Delegates specialized work to subagents. Use when the user says "prep this data", "turn this into a dataset", "get this ready for annotation", or "build a dataset from X".
Install with `npx claudepluginhub danielrosehill/claude-code-plugins --plugin data-annotation`. This skill uses the workspace's default tool permissions.
This is the top-level workflow for taking raw data and getting it into shape for a dataset — typically one destined for annotation and publication on Hugging Face. It does **not** try to enumerate every possible operation as a separate skill. Instead, it inspects the data, understands the user's goal, and decides which preparatory steps to run, delegating each to a specialized subagent.
If the data is not already staged locally, use the `ingest-source` skill to stage it first. Each dataset gets a workspace at `$CLAUDE_USER_DATA/data-annotation/workspaces/<dataset-name>/`, with the root resolved as `DATA_ROOT="${CLAUDE_USER_DATA:-${XDG_DATA_HOME:-$HOME/.local/share}/claude-plugins}/data-annotation"`.
If the user wants the workspace under a path they own (e.g. `~/repos/...`), store only the pointer to that path in `$DATA_ROOT/config.json` and write artifacts there.
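The resolution rule above can be sketched as a small POSIX shell snippet; `my-dataset` is a placeholder name, not something the skill defines:

```shell
#!/bin/sh
# Resolve the data root: prefer CLAUDE_USER_DATA, else XDG_DATA_HOME,
# else ~/.local/share (matching the expansion shown above).
DATA_ROOT="${CLAUDE_USER_DATA:-${XDG_DATA_HOME:-$HOME/.local/share}/claude-plugins}/data-annotation"

# "my-dataset" is a hypothetical dataset name for illustration.
WORKSPACE="$DATA_ROOT/workspaces/my-dataset"
mkdir -p "$WORKSPACE"
echo "$WORKSPACE"
```

Because `${var:-default}` only substitutes when the variable is unset or empty, a user-set `CLAUDE_USER_DATA` always wins over the XDG fallback.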
If not already on disk, invoke the `ingest-source` skill. End state: a single directory with the raw data files visible.
Delegate to the `data-profiler` subagent. Expect back: file inventory, format(s), row/record counts, schema (column names + inferred types), null/dup rates, encoding issues, and a small sample.
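As an illustration only, a profile report covering those fields might look like the following; the exact field names are hypothetical, not a contract defined by the subagent:

```json
{
  "files": [{ "path": "data/reviews.csv", "format": "csv", "rows": 12000 }],
  "schema": { "id": "string", "text": "string", "rating": "int" },
  "null_rate": { "text": 0.01 },
  "duplicate_rate": 0.03,
  "encoding_issues": ["2 rows with invalid UTF-8"],
  "sample": [{ "id": "r-0001", "text": "Great product", "rating": 5 }]
}
```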
Based on the profile and the target task, write a short prep plan to the workspace as `prep-plan.md`. Typical line items:
- PII handling: `pii-scanner` subagent.
- Column curation: `column-curator` subagent.
- Format normalization: `format-normalizer` subagent.
- Records missing a stable `id` — queue `format-normalizer` for this too.
- Splits: `format-normalizer`.
- Annotation schema design: `schema-designer` subagent after reshape.

Show the plan to the user. Do not execute steps the user hasn't approved. Let them strike, reorder, or add items.
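For a small CSV source, the resulting `prep-plan.md` might read like this sketch (the step details are invented for illustration):

```markdown
# Prep plan: <dataset-name>

1. PII scan (pii-scanner): input raw/, output stages/01-pii-scan/
2. Column curation (column-curator): drop columns unused by the task
3. Format normalization (format-normalizer): CSV to JSONL, add stable `id`
4. Splits (format-normalizer): 80/10/10 train/validation/test
5. Annotation schema (schema-designer): after reshape
```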
Run each approved step by invoking the corresponding subagent with a precise brief: input path, output path, the relevant slice of the profile, and the target task. After each step, re-profile briefly so the next step works on current data.
Persistent intermediate artifacts go in `<workspace>/stages/<NN>-<step>/`. The final cleaned dataset lands in `<workspace>/final/`.
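A minimal sketch of that layout for a two-step plan; the stage names are illustrative, and a temporary directory stands in for the real workspace:

```shell
#!/bin/sh
# Hypothetical workspace path for the demo; real runs use the resolved workspace.
WORKSPACE="$(mktemp -d)"

# Stage directories follow <NN>-<step>; final output lives in final/.
mkdir -p "$WORKSPACE/stages/01-pii-scan" \
         "$WORKSPACE/stages/02-format-normalize" \
         "$WORKSPACE/final"
ls "$WORKSPACE/stages"
```

Numbering the stage directories keeps them sorted in execution order, so re-profiling after each step always reads from the latest stage.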
When prep is done, suggest the next skill based on scale and feedback needs:
- `annotate-with-claude` — Claude annotates each record interactively in-session, with the user supervising.
- `scaffold-annotation-env` — sets up Gemini batch inference over the prepared data.
- `hf-setup` — creates the HF dataset repo, copies data over, generates the card.

If annotation comes first, run `hf-setup` once labels are populated. Persist the annotation schema as `<workspace>/schema.json` and refer back to it from every later stage.
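As an illustration, a persisted `schema.json` for a classification task might look like this; every field and label name here is a placeholder, not part of the skill's contract:

```json
{
  "task": "text-classification",
  "id_column": "id",
  "text_column": "text",
  "label_column": "label",
  "labels": ["positive", "negative", "neutral"]
}
```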