From the ork plugin
Provides patterns for curating, versioning, and quality-validating golden datasets, and for integrating them into CI pipelines for AI/ML evaluation and LLM testing.
Install:

```bash
npx claudepluginhub yonatangross/orchestkit --plugin ork
```
Comprehensive patterns for building, managing, and validating golden datasets for AI/ML evaluation. Each category has individual rule files in `rules/` loaded on-demand.
Skill files:

- checklists/backup-restore-checklist.md
- examples/orchestkit-dataset-workflow.md
- metadata.json
- references/annotation-patterns.md
- references/backup-restore.md
- references/quality-metrics.md
- references/selection-criteria.md
- references/storage-patterns.md
- references/validation-contracts.md
- references/validation-rules.md
- references/versioning.md
- rules/_sections.md
- rules/_template.md
- rules/curation-add-workflow.md
- rules/curation-annotation.md
- rules/curation-collection.md
- rules/curation-diversity.md
- rules/management-ci.md
- rules/management-storage.md
- rules/management-versioning.md
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Curation | 3 | HIGH | Content collection, annotation pipelines, diversity analysis |
| Management | 3 | HIGH | Versioning, backup/restore, CI/CD automation |
| Validation | 3 | CRITICAL | Quality scoring, drift detection, regression testing |
| Add Workflow | 1 | HIGH | 9-phase curation, quality scoring, bias detection, silver-to-gold |
Total: 10 rules across 4 categories
Content collection, multi-agent annotation, and diversity analysis for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Collection | rules/curation-collection.md | Content type classification, quality thresholds, duplicate prevention |
| Annotation | rules/curation-annotation.md | Multi-agent pipeline, consensus aggregation, Langfuse tracing |
| Diversity | rules/curation-diversity.md | Difficulty stratification, domain coverage, balance guidelines |
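The diversity rule's difficulty stratification can be sketched as a simple balance check. The minimums below mirror the "Difficulty balance" row in the decision table further down; the field name `difficulty` and the function itself are illustrative assumptions, not the skill's actual API:

```python
from collections import Counter

# Illustrative per-stratum minimums (see the "Difficulty balance" decision row).
MIN_PER_DIFFICULTY = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}


def difficulty_gaps(documents: list[dict]) -> dict:
    """Return strata that fall short of the recommended minimum count."""
    counts = Counter(doc.get("difficulty", "unknown") for doc in documents)
    return {
        level: minimum - counts.get(level, 0)
        for level, minimum in MIN_PER_DIFFICULTY.items()
        if counts.get(level, 0) < minimum
    }
```

An empty result means every stratum meets its minimum; otherwise the dict names each underfilled stratum and how many entries it is missing.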
Versioning, storage, and CI/CD automation for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Versioning | rules/management-versioning.md | JSON backup format, embedding regeneration, disaster recovery |
| Storage | rules/management-storage.md | Backup strategies, URL contract, data integrity checks |
| CI Integration | rules/management-ci.md | GitHub Actions automation, pre-deployment validation, weekly backups |
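The versioning rule's JSON backup format (embeddings excluded, regenerated on restore) might look roughly like this. The `version` and `documents` keys and the `embedding` field name are assumptions for illustration, not the skill's documented schema:

```python
import json


def backup_dataset(documents: list[dict], path: str) -> None:
    """Write a version-controllable JSON backup, stripping embedding vectors.

    Embeddings are excluded so backups stay small and diffable; they are
    regenerated from content on restore.
    """
    stripped = [
        {k: v for k, v in doc.items() if k != "embedding"}
        for doc in documents
    ]
    with open(path, "w") as f:
        json.dump({"version": 1, "documents": stripped}, f, indent=2, sort_keys=True)
```

Sorted keys and stable indentation keep diffs between weekly backups readable under version control.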
Quality scoring, drift detection, and regression testing for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Quality | rules/validation-quality.md | Schema validation, content quality, referential integrity |
| Drift | rules/validation-drift.md | Duplicate detection, semantic similarity, coverage gap analysis |
| Regression | rules/validation-regression.md | Difficulty distribution, pre-commit hooks, full dataset validation |
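The drift rule's duplicate detection applies the similarity thresholds from the decision table below (>= 0.90 blocks, >= 0.85 warns). A rough sketch, with a plain-Python `cosine` standing in for whatever embedding similarity backend the skill actually uses:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (stand-in helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def duplicate_verdict(similarity: float) -> str:
    """Map a similarity score to block/warn/allow per the thresholds above."""
    if similarity >= 0.90:
        return "block"
    if similarity >= 0.85:
        return "warn"
    return "allow"
```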
Structured workflow for adding new documents to the golden dataset.
| Rule | File | Key Pattern |
|---|---|---|
| Add Document | rules/curation-add-workflow.md | 9-phase curation, parallel quality analysis, bias detection |
Example pre-addition validation:

```python
from app.shared.services.embeddings import embed_text  # unused in this excerpt


async def validate_before_add(document: dict, source_url_map: dict) -> dict:
    """Pre-addition validation for golden dataset entries."""
    errors = []

    # 1. URL contract check
    if "placeholder" in document.get("source_url", ""):
        errors.append("URL must be canonical, not a placeholder")

    # 2. Content quality
    if len(document.get("title", "")) < 10:
        errors.append("Title too short (min 10 chars)")

    # 3. Tag requirements
    if len(document.get("tags", [])) < 2:
        errors.append("At least 2 domain tags required")

    return {"valid": len(errors) == 0, "errors": errors}
```
| Decision | Recommendation |
|---|---|
| Backup format | JSON (version controlled, portable) |
| Embedding storage | Exclude from backup (regenerate on restore) |
| Quality threshold | >= 0.70 quality score for inclusion |
| Confidence threshold | >= 0.65 for auto-include |
| Duplicate threshold | >= 0.90 similarity blocks, >= 0.85 warns |
| Min tags per entry | 2 domain tags |
| Min test queries | 3 per document |
| Difficulty balance | Trivial 3, Easy 3, Medium 5, Hard 3 minimum |
| CI frequency | Weekly automated backup (Sunday 2am UTC) |
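The quality and confidence thresholds above might combine into a single inclusion gate like this; the routing labels are illustrative, not the skill's actual outputs:

```python
def inclusion_decision(quality_score: float, confidence: float) -> str:
    """Apply the decision-table thresholds: quality >= 0.70 to be eligible,
    confidence >= 0.65 to auto-include; otherwise route to human review."""
    if quality_score < 0.70:
        return "reject"
    if confidence >= 0.65:
        return "auto-include"
    return "human-review"
```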
See `test-cases.json` for 9 test cases across all categories.
Related skills:

- ork:rag-retrieval - Retrieval evaluation using golden dataset
- langfuse-observability - Tracing patterns for curation workflows
- ork:testing-unit - Unit testing patterns and strategies
- ai-native-development - Embedding generation for restore