Help us improve
Share bugs, ideas, or general feedback.
From training
Deterministically rebuild a dataset from its manifest and verify fixity equivalence
npx claudepluginhub jmagly/aiwg-trainingHow this skill is triggered — by the user, by Claude, or both
Slash command
/training:dataset-reproduceThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Deterministically rebuild a training dataset from a published manifest and verify that the rebuild matches the original via SHA-256 fixity comparison. Used to validate that a dataset published by `dataset-version` is genuinely reproducible, and to let third parties reconstruct a dataset from its manifest without access to the original artifacts.
Verifies ML experiment reproducibility: checks random seeds, library versions, data hashes, git commits, environment files, and result determinism. Use when reviewing before shipping.
Version ML datasets using DVC with remote storage backends, build reproducible data pipelines, and track data lineage alongside Git. Use for large datasets, experiment reproducibility, and compliance auditing.
Create and manage Hugging Face Hub datasets: initialize repos, configure prompts/metadata, stream row updates, and query/transform data with DuckDB SQL.
Share bugs, ideas, or general feedback.
Deterministically rebuild a training dataset from a published manifest and verify that the rebuild matches the original via SHA-256 fixity comparison. Used to validate that a dataset published by dataset-version is genuinely reproducible, and to let third parties reconstruct a dataset from its manifest without access to the original artifacts.
Per the ML Reproducibility Checklist (REF-475), a dataset is "reproducible" only if an independent rebuild from the manifest produces byte-identical fixity hashes. This skill is the verifier.
dataset-version publishes, rebuild and compare to catch non-determinism before the dataset is used for training<manifest-path> (required)Path to a datasets/<version>.yaml manifest published by dataset-version. The YAML is authoritative — sibling .json exports are ignored by this skill.
--compare-fixity (optional)Compare rebuilt SHA-256 against the original fixity_manifest. Default: true. Disable only for partial rebuilds where comparison is not meaningful.
--workdir <path> (optional)Scratch directory for the rebuild. Default: .aiwg/training/reproduce/<version>-<timestamp>/. Must be empty; the skill refuses to overwrite.
<manifest-path> as YAML; validate against the schema at @agentic/code/frameworks/training-complete/schemas/dataset-manifest.yaml. Resolve sources[], reproduction_recipe, seed, and split_counts.reproduction_recipe.aiwg_version and training_complete_version against the current runtime. On mismatch, emit a WARN — reproducibility across versions is not guaranteed. Proceeding is allowed but flagged in the report.sources[], invoke acquire-training-source using the declared ref_id, license, and format. Fixity of each acquired source is checked against any upstream checksum; mismatch is a hard failure (the source has drifted).reproduction_recipe step-for-step: generator_configs via synthetic-data-generator and example-synthesizer; preference_config via preference-generator; filter_thresholds via quality filters; decontamination_thresholds via decontamination-check. Format exports via the declared format_exports adapters.seed. No new entropy is introduced. The same seed + same inputs + same configs MUST produce the same outputs.integrity-verification to emit a fresh SHA-256 manifest over the rebuilt dataset.fixity_manifest. Emit per-file match/mismatch with a summary verdict (MATCH, PARTIAL, MISMATCH). Write the report.A mismatch does not always mean a bug. Known non-determinism sources to document in the report:
reproduction_recipe.generator_configs.created_at in a per-example record will always differ; these are excluded from fixity scope.The report's "Mismatch Analysis" section classifies each divergence against this list.
reports/reproduce-<version>-<timestamp>.md containing:
MATCH / PARTIAL / MISMATCH) with example countsacquire-training-source (training-complete)example-synthesizer, synthetic-data-generator (training-complete)preference-generator (training-complete)format-adapter-alpaca, format-adapter-sharegpt, format-adapter-chatml, format-adapter-jsonl, format-adapter-parquetdecontamination-check (training-complete)@agentic/code/frameworks/media-curator/skills/integrity-verification/SKILL.md# Self-verify a freshly-published dataset
dataset-reproduce datasets/2026.4.0.yaml
# Reproduce with explicit workdir and no fixity comparison (partial rebuild)
dataset-reproduce datasets/2026.4.0.yaml \
--workdir /tmp/repro-2026.4.0 \
--compare-fixity false
@agentic/code/frameworks/training-complete/schemas/dataset-manifest.yaml — manifest schema and validation rules