Search everything...

Stats

Actions

Available In

training

Corpus-to-dataset pipeline for AI training data curation. Ingests sources, synthesizes examples, generates preference pairs, applies decontamination, and exports to Alpaca/ShareGPT/ChatML/JSONL/Parquet with provenance and reproducibility. Grounded in 485 research REFs covering DPO/KTO/ORPO/SimPO, Self-Instruct/Evol/Orca/Phi/PersonaHub/STaR/ReST, Model Collapse guard, Datasheets/Model Cards/Data Statements, HF Datasets/Arrow+Parquet.

ClaudePluginHub fallbackgenerated from the plugin's indexed repository (jmagly/aiwg-training) — no native marketplace selected

npx claudepluginhub jmagly/aiwg-training

Popularity

Stars

Above avg

Med: 0·Avg: 862

Copy clicks

Med: 0·Avg: 2

What's Inside

Agents7

dataset-evaluator-agent

/dataset-evaluator-agent

Computes dataset-level metrics (diversity, difficulty, domain balance, quality grade distribution) and prepares the matric-eval handoff package for model evaluation.

dataset-publication-agent

/dataset-publication-agent

Coordinates dataset versioning, datasheet/model card generation, integrity manifests, and the publication gate including override escalation paths.

decontamination-agent

/decontamination-agent

Runs exact, fuzzy, and semantic contamination checks against eval-set targets and feeds the publication gate.

example-synthesizer-agent

/example-synthesizer-agent

Generates SFT training examples from admitted sources using self-instruct, evol-instruct, squad, and STaR patterns with per-example provenance.

format-converter-agent

/format-converter-agent

Runs mechanical format adapters (alpaca, sharegpt, chatml, jsonl, parquet) with round-trip validation and sidecar metadata.

Skills15

acquire-training-source

/acquire-training-source

Acquire a training data source with license validation and delegate ingest to the semantic memory kernel

dataset-docs

/dataset-docs

Generate Datasheet, Model Card, and Data Statement from a dataset manifest

dataset-reproduce

/dataset-reproduce

Deterministically rebuild a dataset from its manifest and verify fixity equivalence

dataset-version

/dataset-version

Create a versioned training dataset with manifest, fixity, provenance, and archive snapshot

decontamination-check

/decontamination-check

Detect training-eval overlap against benchmark sets before dataset publication

Stats

Version1.0.0

Stars1

MaintenanceExcellent

LicenseMIT

Last CommitApr 16, 2026

AddedApr 16, 2026

Actions

View on GitHub View README Plugin Marketplace JSON

Own this plugin?

Verify ownership to unlock analytics, metadata editing, and a verified badge. GitHub access is read-only (username + org membership).

Safety Signals

Caution

Uses power tools

Uses Bash, Write, or Edit tools

README

aiwg-training

Corpus-to-dataset pipeline for AI training data curation

15 skills, 7 agents, 5 format adapters, 3 decontamination modes, 6 default benchmark targets, 485 research REFs. Agentic surface that works out of the box + optional Python runtime for scale.

/plugin install training@aiwg      # Claude Code plugin install
aiwg use training                   # or via AIWG CLI

Get Started · What You Get · Architecture · Research · Docs

What aiwg-training Is

aiwg-training is a marketplace plugin for AIWG that turns any corpus — research papers, code repositories, conversation logs, documentation sites — into training-ready datasets for fine-tuning language models. It produces datasets suitable for SFT, DPO, KTO, ORPO, SimPO, and GRPO training workflows, with full provenance, license inheritance, benchmark decontamination, and byte-for-byte reproducibility.

If you have tried to build a fine-tuning dataset and ended up with ad-hoc scripts, manually curated JSONL files, mystery licenses, and hope-this-doesn't-contaminate-the-eval vibes, aiwg-training is the missing infrastructure layer. It implements every published best practice from dataset methodology research (Self-Instruct, Evol-Instruct, Orca, PersonaHub, STaR), preference-optimization research (DPO, KTO, ORPO, SimPO), governance standards (Datasheets for Datasets, Model Cards, Data Statements, ML Reproducibility Checklist), and safety research (Benchmark Contamination, Model Collapse, Llama Guard) behind a single cohesive framework.

Unlike HuggingFace datasets (storage format) or Axolotl (training orchestrator), aiwg-training is a curation pipeline. It ingests, assesses, synthesizes, filters, formats, decontaminates, versions, and documents — the work that happens before you invoke trainer.train() and the part that determines whether your fine-tune actually learns anything useful.

What Problems aiwg-training Solves

Building a fine-tuning dataset is hard in ways that don't show up in tutorials. Four failure modes dominate:

1. No Provenance, No Reproducibility

Typical dataset scripts produce JSONL files with no record of where each example came from, what license governs it, what transformations were applied, or how to rebuild the same dataset again next week. When something goes wrong — a model overfits a biased subsample, a source is later retracted, a license changes — there's no way to trace or fix it.

Without aiwg-training: 70%+ of published fine-tuning datasets fail the ML Reproducibility Checklist (Pineau et al. 2020). Lineage from raw source to trained model is almost always missing.

With aiwg-training: Every example traces back to its source via W3C PROV (REF-062). Every dataset version ships with a SHA-256 fixity manifest + deterministic seed + reproduction recipe. aiwg-training dataset reproduce byte-reproduces any prior version.

2. Benchmark Contamination

Most fine-tuning datasets accidentally include examples from the benchmarks you'll later use to evaluate the model. Your "HumanEval 67.2%" score is meaningless if 40% of HumanEval was in your training data. Published papers have been retracted over this.

Without aiwg-training: Benchmark leakage is detected post-hoc, if ever. REF-442 (Sainz et al. 2023) shows ChatGPT reproduces CoNLL-2003 verbatim — pervasive contamination across major benchmarks.

With aiwg-training: Decontamination is a first-class pipeline stage that blocks publication. Three detection modes (exact 13-gram per REF-442, fuzzy edit-distance, semantic embedding similarity). Six default targets (MMLU, GSM8K, HumanEval, HELM, MT-Bench, AlpacaEval) extensible to any benchmark. The decontamination-gate lint rule makes override explicit with triple audit trail (manifest + activity log + report appendix).

3. License Laundering

View full README on GitHub

training

Popularity

What's Inside

Confidence

README

aiwg-training

What aiwg-training Is

What Problems aiwg-training Solves

1. No Provenance, No Reproducibility

2. Benchmark Contamination

3. License Laundering

Similar Plugins

sdg-hub

superml

training-hub

ml-model-trainer

data-scientist

book-training

More by jmagly

marketing

sdlc

voice

writing

utils

aiwg-training

What aiwg-training Is

What Problems aiwg-training Solves

1. No Provenance, No Reproducibility

2. Benchmark Contamination

3. License Laundering

Popularity

Health & Quality

More by jmagly

marketing

sdlc

voice

writing

utils

Similar Plugins

sdg-hub

superml

training-hub

ml-model-trainer

data-scientist

book-training