golden-dataset

Install

Install the plugin:

$ npx claudepluginhub yonatangross/orchestkit --plugin ork

Want just this skill? Add it to a custom plugin, then install with one command.

Description

Golden dataset lifecycle patterns for curation, versioning, quality validation, and CI integration. Use when building evaluation datasets, managing dataset versions, validating quality scores, or integrating golden tests into pipelines.

Tool Access

This skill is limited to using the following tools:

Read, Glob, Grep, WebFetch, WebSearch
Supporting Assets
View in Repository
checklists/backup-restore-checklist.md
examples/orchestkit-dataset-workflow.md
metadata.json
references/annotation-patterns.md
references/backup-restore.md
references/quality-metrics.md
references/selection-criteria.md
references/storage-patterns.md
references/validation-contracts.md
references/validation-rules.md
references/versioning.md
rules/_sections.md
rules/_template.md
rules/curation-add-workflow.md
rules/curation-annotation.md
rules/curation-collection.md
rules/curation-diversity.md
rules/management-ci.md
rules/management-storage.md
rules/management-versioning.md
Skill Content

Golden Dataset

Comprehensive patterns for building, managing, and validating golden datasets for AI/ML evaluation. Each category has individual rule files in rules/ loaded on-demand.

Quick Reference

Category | Rules | Impact | When to Use
Curation | 3 | HIGH | Content collection, annotation pipelines, diversity analysis
Management | 3 | HIGH | Versioning, backup/restore, CI/CD automation
Validation | 3 | CRITICAL | Quality scoring, drift detection, regression testing
Add Workflow | 1 | HIGH | 9-phase curation, quality scoring, bias detection, silver-to-gold

Total: 10 rules across 4 categories

Curation

Content collection, multi-agent annotation, and diversity analysis for golden datasets.

Rule | File | Key Pattern
Collection | rules/curation-collection.md | Content type classification, quality thresholds, duplicate prevention
Annotation | rules/curation-annotation.md | Multi-agent pipeline, consensus aggregation, Langfuse tracing
Diversity | rules/curation-diversity.md | Difficulty stratification, domain coverage, balance guidelines
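The diversity rule's difficulty stratification can be sketched as a simple balance check. This is a minimal illustration, not the skill's implementation: the per-level minimums follow the Difficulty balance row in the Key Decisions table, and the `difficulty` field name on entries is an assumption.

```python
from collections import Counter

# Minimum entries per difficulty bucket (per the Key Decisions table).
MIN_PER_DIFFICULTY = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}

def check_difficulty_balance(entries: list[dict]) -> dict:
    """Report difficulty buckets that fall below the minimum counts."""
    counts = Counter(e.get("difficulty", "unknown") for e in entries)
    gaps = {
        level: minimum - counts.get(level, 0)
        for level, minimum in MIN_PER_DIFFICULTY.items()
        if counts.get(level, 0) < minimum
    }
    return {"balanced": not gaps, "gaps": gaps, "counts": dict(counts)}
```

A coverage gap report like this can feed the balance guidelines directly: any non-empty `gaps` dict tells curators exactly which difficulty levels need more entries.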

Management

Versioning, storage, and CI/CD automation for golden datasets.

Rule | File | Key Pattern
Versioning | rules/management-versioning.md | JSON backup format, embedding regeneration, disaster recovery
Storage | rules/management-storage.md | Backup strategies, URL contract, data integrity checks
CI Integration | rules/management-ci.md | GitHub Actions automation, pre-deployment validation, weekly backups
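The JSON backup pattern these rules describe might look like the sketch below: embeddings are stripped before writing (they are regenerated on restore), and the output is deterministic so diffs stay reviewable in version control. The payload shape and field names are illustrative assumptions.

```python
import json

def backup_dataset(entries: list[dict], path: str) -> None:
    """Write a version-controlled JSON backup, excluding embedding vectors."""
    stripped = [{k: v for k, v in e.items() if k != "embedding"} for e in entries]
    with open(path, "w", encoding="utf-8") as f:
        # sort_keys + indent keep diffs stable across backup runs
        json.dump({"version": 1, "documents": stripped}, f, indent=2, sort_keys=True)
```

Excluding embeddings keeps backups small and portable; the trade-off is that restore must re-run embedding generation before the dataset is usable.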

Validation

Quality scoring, drift detection, and regression testing for golden datasets.

Rule | File | Key Pattern
Quality | rules/validation-quality.md | Schema validation, content quality, referential integrity
Drift | rules/validation-drift.md | Duplicate detection, semantic similarity, coverage gap analysis
Regression | rules/validation-regression.md | Difficulty distribution, pre-commit hooks, full dataset validation
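The duplicate-detection thresholds from the Key Decisions table (>= 0.90 similarity blocks, >= 0.85 warns) might be applied over cosine similarity of embeddings like this sketch; the function names and plain-list vectors are assumptions for illustration.

```python
import math

BLOCK_THRESHOLD = 0.90  # similarity at or above this blocks the entry
WARN_THRESHOLD = 0.85   # similarity at or above this raises a warning

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def duplicate_verdict(candidate: list[float], existing: list[list[float]]) -> str:
    """Compare a candidate embedding against the dataset and classify it."""
    top = max((cosine_similarity(candidate, e) for e in existing), default=0.0)
    if top >= BLOCK_THRESHOLD:
        return "block"
    if top >= WARN_THRESHOLD:
        return "warn"
    return "ok"
```

The "warn" band between 0.85 and 0.90 leaves near-duplicates to human review rather than rejecting them outright.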

Add Workflow

Structured workflow for adding new documents to the golden dataset.

Rule | File | Key Pattern
Add Document | rules/curation-add-workflow.md | 9-phase curation, parallel quality analysis, bias detection

Quick Start Example

from app.shared.services.embeddings import embed_text

async def validate_before_add(document: dict, source_url_map: dict) -> dict:
    """Pre-addition validation for golden dataset entries."""
    errors = []

    # 1. URL contract check
    if "placeholder" in document.get("source_url", ""):
        errors.append("URL must be canonical, not a placeholder")

    # 2. Content quality
    if len(document.get("title", "")) < 10:
        errors.append("Title too short (min 10 chars)")

    # 3. Tag requirements
    if len(document.get("tags", [])) < 2:
        errors.append("At least 2 domain tags required")

    return {"valid": len(errors) == 0, "errors": errors}

Key Decisions

Decision | Recommendation
Backup format | JSON (version controlled, portable)
Embedding storage | Exclude from backup (regenerate on restore)
Quality threshold | >= 0.70 quality score for inclusion
Confidence threshold | >= 0.65 for auto-include
Duplicate threshold | >= 0.90 similarity blocks, >= 0.85 warns
Min tags per entry | 2 domain tags
Min test queries | 3 per document
Difficulty balance | Trivial 3, Easy 3, Medium 5, Hard 3 minimum
CI frequency | Weekly automated backup (Sunday 2am UTC)
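Taken together, the inclusion thresholds above could be wired into a single admission check, sketched below. The entry field names (`quality`, `tags`, `max_similarity`, `confidence`) are illustrative assumptions, not the skill's actual schema.

```python
# Threshold constants mirroring the Key Decisions table.
QUALITY_THRESHOLD = 0.70     # minimum quality score for inclusion
CONFIDENCE_THRESHOLD = 0.65  # auto-include at or above this confidence
DUP_BLOCK = 0.90             # similarity at or above this blocks the entry
MIN_TAGS = 2                 # minimum domain tags per entry

def admit(entry: dict) -> str:
    """Classify a candidate entry against the inclusion thresholds."""
    if entry["quality"] < QUALITY_THRESHOLD or len(entry["tags"]) < MIN_TAGS:
        return "reject"
    if entry["max_similarity"] >= DUP_BLOCK:
        return "reject"
    if entry["confidence"] >= CONFIDENCE_THRESHOLD:
        return "auto-include"
    return "review"
```

Entries that clear the quality, tag, and duplicate gates but fall below the confidence threshold land in "review" rather than being auto-included.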

Common Mistakes

  1. Using placeholder URLs instead of canonical source URLs
  2. Skipping embedding regeneration after restore
  3. Not validating referential integrity between documents and queries
  4. Over-indexing on articles (neglecting tutorials, research papers)
  5. Missing difficulty distribution balance in test queries
  6. Not running verification after backup/restore operations
  7. Testing restore procedures in production instead of staging
  8. Committing SQL dumps instead of JSON (not version-control friendly)

Evaluations

See test-cases.json for 9 test cases across all categories.

Related Skills

  • ork:rag-retrieval - Retrieval evaluation using golden dataset
  • langfuse-observability - Tracing patterns for curation workflows
  • ork:testing-unit - Unit testing patterns and strategies
  • ai-native-development - Embedding generation for restore

Capability Details

curation

Keywords: golden dataset, curation, content collection, annotation, quality criteria

Solves:

  • Classify document content types for golden dataset
  • Run multi-agent quality analysis pipelines
  • Generate test queries for new documents

management

Keywords: golden dataset, backup, restore, versioning, disaster recovery

Solves:

  • Backup and restore golden datasets with JSON
  • Regenerate embeddings after restore
  • Automate backups with CI/CD

validation

Keywords: golden dataset, validation, schema, duplicate detection, quality metrics

Solves:

  • Validate entries against document schema
  • Detect duplicate or near-duplicate entries
  • Analyze dataset coverage and distribution gaps
Stats

Stars: 128 · Forks: 14 · Last commit: Mar 15, 2026