Golden dataset lifecycle patterns for curation, versioning, quality validation, and CI integration. Use when building evaluation datasets, managing dataset versions, validating quality scores, or integrating golden tests into pipelines.

    /plugin marketplace add yonatangross/orchestkit
    /plugin install orkl@orchestkit

This skill inherits all available tools. When active, it can use any tool Claude has access to.
- checklists/backup-restore-checklist.md
- examples/orchestkit-dataset-workflow.md
- metadata.json
- references/annotation-patterns.md
- references/backup-restore.md
- references/quality-metrics.md
- references/selection-criteria.md
- references/storage-patterns.md
- references/validation-contracts.md
- references/validation-rules.md
- references/versioning.md
- rules/_sections.md
- rules/_template.md
- rules/curation-add-workflow.md
- rules/curation-annotation.md
- rules/curation-collection.md
- rules/curation-diversity.md
- rules/management-ci.md
- rules/management-storage.md
- rules/management-versioning.md

Comprehensive patterns for building, managing, and validating golden datasets for AI/ML evaluation. Each category has individual rule files in rules/, loaded on demand.

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Curation | 3 | HIGH | Content collection, annotation pipelines, diversity analysis |
| Management | 3 | HIGH | Versioning, backup/restore, CI/CD automation |
| Validation | 3 | CRITICAL | Quality scoring, drift detection, regression testing |
| Add Workflow | 1 | HIGH | 9-phase curation, quality scoring, bias detection, silver-to-gold |
Total: 10 rules across 4 categories
Content collection, multi-agent annotation, and diversity analysis for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Collection | rules/curation-collection.md | Content type classification, quality thresholds, duplicate prevention |
| Annotation | rules/curation-annotation.md | Multi-agent pipeline, consensus aggregation, Langfuse tracing |
| Diversity | rules/curation-diversity.md | Difficulty stratification, domain coverage, balance guidelines |
Versioning, storage, and CI/CD automation for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Versioning | rules/management-versioning.md | JSON backup format, embedding regeneration, disaster recovery |
| Storage | rules/management-storage.md | Backup strategies, URL contract, data integrity checks |
| CI Integration | rules/management-ci.md | GitHub Actions automation, pre-deployment validation, weekly backups |
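
The "JSON backup format" and "embedding regeneration" patterns above can be illustrated with a minimal sketch. It is not the code from rules/management-versioning.md: the field names (`embedding`, `documents`, `count`, `version`) and the `embed` callable are assumptions chosen for illustration. The idea is to drop embeddings when writing the backup and recompute them on restore.

```python
import json


def to_backup_json(entries: list[dict], version: str) -> str:
    """Serialize entries to JSON without the (assumed) 'embedding' field."""
    stripped = [
        {key: value for key, value in entry.items() if key != "embedding"}
        for entry in entries
    ]
    payload = {"version": version, "count": len(stripped), "documents": stripped}
    # sort_keys keeps the output stable, which gives cleaner diffs in version control
    return json.dumps(payload, indent=2, sort_keys=True)


def from_backup_json(text: str, embed) -> list[dict]:
    """Restore entries and regenerate embeddings with the supplied callable."""
    payload = json.loads(text)
    if payload["count"] != len(payload["documents"]):
        raise ValueError("backup count does not match number of documents")
    # 'embed' is a hypothetical function mapping text to a vector
    return [{**doc, "embedding": embed(doc["title"])} for doc in payload["documents"]]
```

Keeping embeddings out of the file keeps backups small and portable; the cost is one embedding call per document at restore time.
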
Quality scoring, drift detection, and regression testing for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Quality | rules/validation-quality.md | Schema validation, content quality, referential integrity |
| Drift | rules/validation-drift.md | Duplicate detection, semantic similarity, coverage gap analysis |
| Regression | rules/validation-regression.md | Difficulty distribution, pre-commit hooks, full dataset validation |
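
As a rough illustration of duplicate detection by semantic similarity, the sketch below compares a candidate embedding against existing ones. It assumes cosine similarity as the measure (the rule file may use something else) and applies the block and warn thresholds listed in the decisions table (0.90 and 0.85). The function names are hypothetical.

```python
import math

BLOCK_THRESHOLD = 0.90  # at or above this similarity, the entry is blocked
WARN_THRESHOLD = 0.85   # at or above this similarity, the entry triggers a warning


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def duplicate_status(candidate: list[float], existing: list[list[float]]) -> str:
    """Return 'block', 'warn', or 'ok' based on the closest existing embedding."""
    best = max((cosine_similarity(candidate, e) for e in existing), default=0.0)
    if best >= BLOCK_THRESHOLD:
        return "block"
    if best >= WARN_THRESHOLD:
        return "warn"
    return "ok"
```
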
Structured workflow for adding new documents to the golden dataset.
| Rule | File | Key Pattern |
|---|---|---|
| Add Document | rules/curation-add-workflow.md | 9-phase curation, parallel quality analysis, bias detection |

    # Project-specific import; not used in this excerpt.
    from app.shared.services.embeddings import embed_text


    async def validate_before_add(document: dict, source_url_map: dict) -> dict:
        """Pre-addition validation for golden dataset entries.

        source_url_map is accepted for the URL contract but not consulted here.
        """
        errors: list[str] = []

        # 1. URL contract check: reject placeholder URLs
        if "placeholder" in document.get("source_url", ""):
            errors.append("URL must be canonical, not a placeholder")

        # 2. Content quality: require a descriptive title
        if len(document.get("title", "")) < 10:
            errors.append("Title too short (min 10 chars)")

        # 3. Tag requirements: at least two domain tags
        if len(document.get("tags", [])) < 2:
            errors.append("At least 2 domain tags required")

        return {"valid": len(errors) == 0, "errors": errors}

| Decision | Recommendation |
|---|---|
| Backup format | JSON (version controlled, portable) |
| Embedding storage | Exclude from backup (regenerate on restore) |
| Quality threshold | >= 0.70 quality score for inclusion |
| Confidence threshold | >= 0.65 for auto-include |
| Duplicate threshold | >= 0.90 similarity blocks, >= 0.85 warns |
| Min tags per entry | 2 domain tags |
| Min test queries | 3 per document |
| Difficulty balance | Trivial 3, Easy 3, Medium 5, Hard 3 minimum |
| CI frequency | Weekly automated backup (Sunday 2am UTC) |
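
The quality, confidence, and difficulty-balance rows above can be read as simple checks. The sketch below is one possible interpretation, not the skill's own implementation: the table does not say what happens to an entry that passes the quality threshold but falls below the auto-include confidence, so the `"manual-review"` outcome is an assumption, as are the function names and the lowercase difficulty labels.

```python
from collections import Counter

QUALITY_MIN = 0.70      # minimum quality score for inclusion
CONFIDENCE_MIN = 0.65   # minimum confidence for auto-include
DIFFICULTY_MINIMUMS = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}


def inclusion_decision(quality: float, confidence: float) -> str:
    """Classify an entry as 'reject', 'auto-include', or 'manual-review'."""
    if quality < QUALITY_MIN:
        return "reject"
    if confidence >= CONFIDENCE_MIN:
        return "auto-include"
    # Assumption: entries with enough quality but low confidence go to a human
    return "manual-review"


def difficulty_shortfalls(difficulties: list[str]) -> dict[str, int]:
    """Return how many entries each difficulty level is missing, if any."""
    counts = Counter(d.lower() for d in difficulties)
    return {
        level: minimum - counts[level]
        for level, minimum in DIFFICULTY_MINIMUMS.items()
        if counts[level] < minimum
    }
```

An empty dictionary from `difficulty_shortfalls` means every level meets its minimum; otherwise the values show how many entries still need to be added per level.
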
See test-cases.json for 9 test cases across all categories.
- ork:rag-retrieval - Retrieval evaluation using golden dataset
- langfuse-observability - Tracing patterns for curation workflows
- ork:testing-patterns - General testing patterns and strategies
- ai-native-development - Embedding generation for restore

Keywords: golden dataset, curation, content collection, annotation, quality criteria
Solves:
Keywords: golden dataset, backup, restore, versioning, disaster recovery
Solves:
Keywords: golden dataset, validation, schema, duplicate detection, quality metrics
Solves: