From data-annotation
Stage raw data from a GitHub repo, local path, or remote URL into a working directory so downstream prep skills can operate on it. Use when the user provides a data source and wants it pulled in, or when `shape-dataset` needs to ingest before profiling.
npx claudepluginhub danielrosehill/claude-code-plugins --plugin data-annotation
This skill uses the workspace's default tool permissions.
Unified entry point for getting raw data onto disk in a known location. Does not transform the data — just stages it and reports what was staged.
DATA_ROOT="${CLAUDE_USER_DATA:-${XDG_DATA_HOME:-$HOME/.local/share}/claude-plugins}/data-annotation"
WORKSPACE="$DATA_ROOT/workspaces/<dataset-name>/raw"
If the user prefers a path they own (e.g. ~/repos/... or ~/Documents/), use that and store the pointer in $DATA_ROOT/config.json.
For a GitHub repo (github.com/<owner>/<repo> or <owner>/<repo>), clone shallow:
git clone --depth 1 https://github.com/<owner>/<repo>.git "$WORKSPACE"
Then scan the working tree for data files (.csv, .tsv, .json, .jsonl, .parquet, .arrow, .txt, .md, .yaml, .xml, .xlsx, image/audio dirs). Report a tree of candidates by extension with byte sizes. Do not assume the whole repo is data — README and source code are common.
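The scan described above could be sketched as follows. `list_candidates` is a hypothetical helper name, not part of the skill, and the sketch covers only the file extensions listed above (image and audio directories would need their own patterns). It prints extension, byte size, and relative path, and skips `.git`.

```shell
# Sketch only: list candidate data files under a directory with byte sizes.
# list_candidates is a hypothetical name chosen for this example.
list_candidates() {
  local root="$1"
  find "$root" -type d -name .git -prune -o -type f \( \
      -name '*.csv' -o -name '*.tsv' -o -name '*.json' -o -name '*.jsonl' \
      -o -name '*.parquet' -o -name '*.arrow' -o -name '*.txt' -o -name '*.md' \
      -o -name '*.yaml' -o -name '*.xml' -o -name '*.xlsx' \) -print |
  while IFS= read -r f; do
    # Columns: extension, size in bytes, path relative to $root.
    printf '%s\t%s\t%s\n' "${f##*.}" "$(wc -c < "$f" | tr -d ' ')" "${f#"$root"/}"
  done | sort
}
```

Calling `list_candidates "$WORKSPACE"` groups the output by extension because of the final `sort`, which makes it easier to tell data files apart from READMEs and source code.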
For a GitHub blob URL (github.com/<owner>/<repo>/blob/<ref>/<path>), convert to the raw URL and download:
curl -L https://raw.githubusercontent.com/<owner>/<repo>/<ref>/<path> -o "$WORKSPACE/<basename>"
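The blob-to-raw conversion is a plain string rewrite, which a small helper can illustrate. `blob_to_raw` is a hypothetical name; the sketch assumes the input is a github.com blob URL, with or without the `https://` prefix.

```shell
# Sketch only: turn a GitHub blob URL into its raw.githubusercontent.com form.
# blob_to_raw is a hypothetical name chosen for this example.
blob_to_raw() {
  local rest="${1#https://}"
  rest="${rest#github.com/}"
  local owner="${rest%%/*}"; rest="${rest#*/}"
  local repo="${rest%%/*}";  rest="${rest#*/}"
  case "$rest" in
    blob/*) rest="${rest#blob/}" ;;   # what remains is <ref>/<path>
    *) echo "not a blob URL: $1" >&2; return 1 ;;
  esac
  printf 'https://raw.githubusercontent.com/%s/%s/%s\n' "$owner" "$repo" "$rest"
}
```

Because the raw URL keeps `<ref>/<path>` in the same order, the helper never has to decide where a ref containing slashes ends.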
For a local path, copy (don't move) the contents into $WORKSPACE/. Preserve directory structure.
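One way to copy while keeping the directory structure is `cp -Rp` with a trailing `/.` on the source. This is a sketch; `stage_local` is a hypothetical helper name.

```shell
# Sketch only: copy a local file or directory into the workspace.
# stage_local is a hypothetical name chosen for this example.
stage_local() {
  local src="$1" dest="$2"
  mkdir -p "$dest"
  if [ -d "$src" ]; then
    # "src/." copies the directory's contents (dotfiles included) rather than
    # the directory itself; -R recurses, -p keeps modes and timestamps.
    cp -Rp "$src"/. "$dest"/
  else
    cp -p "$src" "$dest"/
  fi
}
```

The source is left untouched, so a failed or unwanted staging run can be discarded without affecting the user's files.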
For a remote URL pointing at a single file, download it:
curl -L -o "$WORKSPACE/<basename>" <url>
Then verify content-length and content-type after download.
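A possible way to do the post-download check is to have curl dump the response headers with `-D` and compare the last `Content-Length` against the file on disk. `header_value` and `fetch_checked` are hypothetical helper names; the sketch assumes the server sends the body uncompressed.

```shell
# Sketch only: download a file and check it against the response headers.
# header_value and fetch_checked are hypothetical names for this example.

# Print the last occurrence of a header from a curl -D dump.
# With -L there is one header block per redirect, so the last one wins.
header_value() {
  grep -i "^$2:" "$1" | tail -n 1 | cut -d: -f2- | sed 's/^ *//' | tr -d '\r'
}

fetch_checked() {
  local url="$1" out="$2" hdr want got
  hdr="$(mktemp)"
  curl -fsSL -D "$hdr" -o "$out" "$url" || { rm -f "$hdr"; return 1; }
  want="$(header_value "$hdr" content-length)"
  got="$(wc -c < "$out" | tr -d ' ')"
  echo "content-type: $(header_value "$hdr" content-type)"
  echo "bytes on disk: $got"
  rm -f "$hdr"
  # Servers may omit Content-Length (chunked transfer); compare only if present.
  [ -z "$want" ] || [ "$want" = "$got" ]
}
```

A content-type such as `text/html` where a CSV was expected is a common sign that the URL returned an error or login page instead of the data.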
For a remote archive (.zip, .tar.gz, .tar.bz2, .7z), download, then extract into $WORKSPACE/. After extraction, delete the archive unless the user wants to keep it.
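Extraction can be dispatched on the file extension. The sketch below assumes `unzip`, `tar`, and `7z` are installed as needed; `extract_archive` is a hypothetical helper name, and its third argument (`yes` to keep the archive) is an illustrative convention rather than part of the skill.

```shell
# Sketch only: extract an archive into a directory, then remove the archive.
# extract_archive is a hypothetical name chosen for this example.
extract_archive() {
  local archive="$1" dest="$2" keep="${3:-no}"
  mkdir -p "$dest"
  case "$archive" in
    *.zip)     unzip -q "$archive" -d "$dest" ;;
    *.tar.gz)  tar -xzf "$archive" -C "$dest" ;;
    *.tar.bz2) tar -xjf "$archive" -C "$dest" ;;
    *.7z)      7z x -o"$dest" "$archive" >/dev/null ;;
    *) echo "unsupported archive: $archive" >&2; return 1 ;;
  esac || return 1
  # Delete the archive only after a successful extraction.
  [ "$keep" = "yes" ] || rm -f "$archive"
}
```

Since the archive is removed only when extraction succeeds, a corrupt download stays on disk for inspection.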
Write $WORKSPACE/_ingest.json with:
- source: original source string
- source_type: one of github-repo, github-blob, local, url-file, url-archive
- staged_at: ISO timestamp
- tree: list of files with sizes
- notes: anything the user should be aware of (binary blobs, unusual extensions, archive layout)

Then briefly summarize for the user: "Staged N files, total X MB, candidates appear to be ..." and hand off to shape-dataset (or whatever the user wants next).
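A minimal sketch of writing the manifest from the shell follows. `write_manifest` is a hypothetical helper name, and the per-file keys `path` and `bytes` are an assumed layout for `tree`, since the skill only says "list of files with sizes". The sketch does no JSON escaping, so it assumes the source string, notes, and file names contain no double quotes, backslashes, or control characters.

```shell
# Sketch only: write _ingest.json for a staged workspace.
# write_manifest and the "path"/"bytes" keys are assumptions for this example.
write_manifest() {
  local src="$1" source_type="$2" ws="$3" notes="${4:-}"
  local out="$ws/_ingest.json" first=1
  {
    printf '{\n  "source": "%s",\n  "source_type": "%s",\n' "$src" "$source_type"
    printf '  "staged_at": "%s",\n  "tree": [' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    find "$ws" -type d -name .git -prune -o -type f ! -name _ingest.json -print | sort |
    while IFS= read -r f; do
      # Emit a comma before every entry except the first.
      [ "$first" = 1 ] || printf ','
      first=0
      printf '\n    {"path": "%s", "bytes": %s}' "${f#"$ws"/}" "$(wc -c < "$f" | tr -d ' ')"
    done
    printf '\n  ],\n  "notes": "%s"\n}\n' "$notes"
  } > "$out"
}
```

For example, `write_manifest "$SOURCE" local "$WORKSPACE" "contains one binary blob"` would record a local staging run, where `$SOURCE` stands for whatever string the user originally supplied. A JSON-aware tool would be the safer choice when file names cannot be trusted.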
If git clone fails, suggest gh repo clone (which uses the user's gh auth) instead of asking for credentials.
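The fallback could look like the sketch below, which tries the shallow clone and, on failure, prints a `gh repo clone` suggestion instead of prompting for credentials. `clone_or_suggest` is a hypothetical helper name; passing `--depth 1` after `--` relies on gh forwarding extra flags to git clone.

```shell
# Sketch only: shallow-clone a repo, suggesting gh on failure.
# clone_or_suggest is a hypothetical name chosen for this example.
clone_or_suggest() {
  local slug="$1" dest="$2"   # slug is <owner>/<repo>
  if git clone --depth 1 "https://github.com/$slug.git" "$dest"; then
    return 0
  fi
  # Do not ask for credentials; point the user at their existing gh login.
  echo "git clone failed; try: gh repo clone $slug \"$dest\" -- --depth 1" >&2
  return 1
}
```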