Use this skill when building or operating a continuous, automated benchmarking system that tracks competitor performance, ingests the latest research, generates improvement hypotheses, runs experiments autonomously, and keeps your solution ranked #1.
```
npx claudepluginhub aviskaar/open-org --plugin auto-benchmark
```

# Auto Benchmark
A continuous, autonomous benchmarking system that monitors the competitive landscape, extracts insights from the latest research, proposes and runs improvement experiments, and keeps your solution ranked #1 — so engineers and researchers can focus on building product rather than running benchmarks manually.
## System Overview

The system operates as a closed loop that runs on a schedule (daily, weekly, or on trigger):
```
┌───────────────────────────────────────────────────────────────┐
│                        CONTINUOUS LOOP                        │
│                                                               │
│  [1] Monitor        [2] Ingest          [3] Hypothesize       │
│    Competitors   →    Research       →    Improvements        │
│    & Leaderboards     Papers              from Gap            │
│         ↑                                      ↓              │
│  [6] Report      ←  [5] Promote      ←  [4] Experiment        │
│    Stakeholders       Winners             Autonomously        │
└───────────────────────────────────────────────────────────────┘
```
Each iteration of the loop answers one question: "What can we do right now to move from our current rank to #1?"
---

## Phase 1 — Competitive Landscape

Do this once at system initialization; update the registry whenever new competitors emerge.

Store a `competitive_registry.yaml` that is the system's single source of truth:
```yaml
domain: memory  # e.g., memory, retrieval, reasoning, vision
target_leaderboards:
  - name: MemoryBench
    url: https://...
    scrape_method: html_table  # or api, rss, manual
    primary_metric: accuracy
    higher_is_better: true
  - name: LongContextEval
    url: https://...
    scrape_method: api
    primary_metric: f1_score
    higher_is_better: true
competitors:
  - name: CompetitorA
    latest_score: 0.847
    source: MemoryBench
    last_updated: 2026-02-01
  - name: CompetitorB
    latest_score: 0.831
    source: MemoryBench
    last_updated: 2026-02-05
our_solution:
  name: OurSystem
  current_scores:
    MemoryBench: 0.823
    LongContextEval: 0.791
promotion_threshold: 0.005  # minimum improvement over current score to promote
```
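The rank and gap computations the loop depends on fall directly out of this registry. A minimal sketch, assuming the registry has already been parsed into a Python dict (the `rank_and_gap` helper is illustrative, not part of the skill):

```python
# Sketch: compute our rank and gap to #1 on one leaderboard.
# Field names mirror competitive_registry.yaml.
registry = {
    "competitors": [
        {"name": "CompetitorA", "latest_score": 0.847, "source": "MemoryBench"},
        {"name": "CompetitorB", "latest_score": 0.831, "source": "MemoryBench"},
    ],
    "our_solution": {"name": "OurSystem", "current_scores": {"MemoryBench": 0.823}},
}

def rank_and_gap(registry, board, higher_is_better=True):
    """Return (our_rank, gap_to_first) for one leaderboard."""
    ours = registry["our_solution"]["current_scores"][board]
    scores = [c["latest_score"] for c in registry["competitors"] if c["source"] == board]
    scores.append(ours)
    scores.sort(reverse=higher_is_better)
    rank = scores.index(ours) + 1
    return rank, round(ours - scores[0], 3)

print(rank_and_gap(registry, "MemoryBench"))  # (3, -0.024)
```

With the registry values above, we sit at rank #3, 0.024 behind CompetitorA.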
State explicitly what "#1" means for each leaderboard. Then put the loop on a schedule:
```yaml
schedule:
  leaderboard_scrape: "0 6 * * *"  # daily at 6am
  research_ingest: "0 7 * * 1"     # weekly on Monday
  experiment_sweep: "0 8 * * *"    # daily at 8am
  report_digest: "0 9 * * 1"       # weekly digest on Monday
```
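These expressions can be evaluated without a full scheduler daemon. A minimal sketch of a cron-field matcher, deliberately simplified (it supports only `*` and bare integers, unlike real cron, which also allows ranges, steps, and lists):

```python
from datetime import datetime

def cron_matches(expr: str, now: datetime) -> bool:
    """Check a 5-field cron expression (minute hour dom month dow)
    against a datetime. Only '*' and bare integers are supported."""
    minute, hour, dom, month, dow = expr.split()
    # cron day-of-week: 0 = Sunday; Python weekday(): 0 = Monday
    actual = [now.minute, now.hour, now.day, now.month, (now.weekday() + 1) % 7]
    for field, value in zip([minute, hour, dom, month, dow], actual):
        if field != "*" and int(field) != value:
            return False
    return True

# "0 9 * * 1": the weekly digest, Monday at 9am (2026-02-09 is a Monday)
print(cron_matches("0 9 * * 1", datetime(2026, 2, 9, 9, 0)))
```

In production a battle-tested library (e.g., croniter) or the system crontab is the safer choice; the point here is only that the schedule table maps cleanly onto job dispatch.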
On each scheduled run, update the competitive landscape before doing anything else.
For each leaderboard in the registry:

1. Fetch the current standings via the configured `scrape_method`.
2. Update `competitive_registry.yaml` with the latest scores.
3. Emit a competitive delta report on any change:
```
[ALERT] CompetitorA improved on MemoryBench: 0.847 → 0.861 (+0.014)
[ALERT] New entrant: StartupX at 0.855 — now ranked #2, ahead of us
[STATUS] Our rank: #3 | Gap to #1: -0.038
```
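Alerts like these come from diffing the previous scrape against the new one. A minimal sketch, where the snapshot format (`{system: score}` per leaderboard) and the function name are assumptions:

```python
def delta_report(old: dict, new: dict) -> list[str]:
    """Diff two {system: score} snapshots of one leaderboard."""
    lines = []
    for name, score in sorted(new.items()):
        if name not in old:
            lines.append(f"[ALERT] New entrant: {name} at {score:.3f}")
        elif score != old[name]:
            lines.append(f"[ALERT] {name} changed: {old[name]:.3f} → {score:.3f} "
                         f"({score - old[name]:+.3f})")
    return lines

old = {"CompetitorA": 0.847, "OurSystem": 0.823}
new = {"CompetitorA": 0.861, "OurSystem": 0.823, "StartupX": 0.855}
for line in delta_report(old, new):
    print(line)
```

Keeping the previous snapshot on disk (alongside the registry) is what makes the diff possible across scheduled runs.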
Produce a ranked table after every scrape:
| Rank | System | MemoryBench | LongContextEval | Δ to Our Score |
|------|--------------|-------------|-----------------|----------------|
| #1 | CompetitorA | 0.861 | 0.812 | -0.038 |
| #2 | StartupX | 0.855 | 0.798 | -0.032 |
| #3 | OurSystem | 0.823 | 0.791 | — |
| #4 | CompetitorB | 0.801 | 0.764 | +0.022 |
The gap to #1 on each leaderboard is the primary input to Phase 3.
## Phase 2 — Research Ingestion

Continuously pull the latest research and translate it into actionable improvement candidates.

Configure sources in `research_config.yaml`:
```yaml
research_sources:
  arxiv_queries:
    - "memory augmented neural networks"
    - "long context transformers 2026"
    - "retrieval augmented generation benchmark"
  venues:
    - ICLR 2026
    - NeurIPS 2025
    - ICML 2026
  competitor_blogs:
    - https://competitor-a.ai/research
  citation_tracking:
    - track papers that cite our core method
```
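The arXiv queries can be run against arXiv's public export API (`http://export.arxiv.org/api/query`). A sketch that only builds the request URLs; fetching and Atom-feed parsing are omitted, and the function name is illustrative:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_query_url(query: str, max_results: int = 20) -> str:
    """Build an arXiv export-API URL for one research_config query,
    newest submissions first."""
    params = {
        "search_query": f"all:{query}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"

print(arxiv_query_url("memory augmented neural networks"))
```

Blog and citation sources need per-site scrapers or a feeds service; the arXiv path is the one with a stable public API.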
For each new paper found, append an entry to `research_log.md`:

```markdown
## [2026-02-10] Paper: "HyperMemory: Hierarchical State Spaces for Long-Context Recall"

**Source:** arXiv:2602.XXXXX
**Relevant leaderboard:** MemoryBench
**Reported gain:** +4.2% on MemoryBench vs prior SOTA
**Techniques extracted:**
- Hierarchical state compression (effort: Medium, impact: High) ← PRIORITIZED
- Cosine decay + warmup schedule (effort: Low, impact: Low)
- Synthetic data augmentation for long-range dependencies (effort: High, impact: Medium)
**Action:** Generate hypothesis for hierarchical state compression — schedule experiment.
```
When a competitor publishes a technical report or open-sources code, treat it like a paper: extract the techniques, log them in `research_log.md`, and queue hypotheses for the promising ones.
## Phase 3 — Hypothesis Generation

Translate the competitive gap and research findings into a ranked experiment queue.

Each hypothesis must state:
```yaml
hypothesis:
  id: H-042
  title: "Hierarchical state compression reduces long-context forgetting"
  claim: "Applying hierarchical state compression to our memory module will
          improve MemoryBench accuracy from 0.823 to ≥0.850"
  source: "arXiv:2602.XXXXX + gap analysis (gap to #1 = 0.038)"
  target_leaderboards: [MemoryBench]
  implementation:
    change: "Replace flat KV cache with 3-level hierarchical compression"
    effort: Medium
  estimated_gain: +0.027
  priority_score: 8.1  # (estimated_gain / effort_score) × urgency_multiplier
  status: queued
```
Maintain an `experiment_queue.yaml` ranked by `priority_score`.

Limit the queue to the top 10 active hypotheses. Archive superseded ones.
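The priority formula from the hypothesis schema, `(estimated_gain / effort_score) × urgency_multiplier`, makes the ranking mechanical. A sketch in which the effort-label-to-number mapping and the default urgency are assumptions:

```python
# Assumed mapping from effort labels to a numeric cost.
EFFORT = {"Low": 1.0, "Medium": 2.0, "High": 4.0}

def priority_score(hyp: dict, urgency: float = 1.0) -> float:
    """priority_score = (estimated_gain / effort_score) × urgency_multiplier"""
    return round(hyp["estimated_gain"] / EFFORT[hyp["effort"]] * urgency, 3)

queue = [
    {"id": "H-042", "estimated_gain": 0.027, "effort": "Medium"},
    {"id": "H-047", "estimated_gain": 0.009, "effort": "Low"},
    {"id": "H-051", "estimated_gain": 0.030, "effort": "High"},
]
# Rank by score and keep only the top 10 active hypotheses.
ranked = sorted(queue, key=priority_score, reverse=True)[:10]
print([h["id"] for h in ranked])  # → ['H-042', 'H-047', 'H-051']
```

The urgency multiplier is the lever for reacting to competitive events: bump it for hypotheses targeting a leaderboard where the gap to #1 just widened.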
For tweaks that don't come from papers (e.g., hyperparameter tuning), generate hypotheses the same way, citing the gap analysis as the source.
## Phase 4 — Autonomous Experimentation

```
experiments/
├── queue/        # pending hypotheses (YAML files)
├── active/       # currently running
├── results/
│   └── <hypothesis_id>/
│       ├── config.yaml
│       ├── metrics.json
│       ├── train.log
│       └── eval_on_leaderboard.json
├── promoted/     # configs promoted to production
└── archived/     # failed or superseded experiments
```
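A hypothesis moving through this layout is a file move at each stage, which keeps the runner's state inspectable with `ls`. A minimal sketch using a temporary directory (the file contents are placeholders):

```python
import json
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "experiments"
for sub in ("queue", "active", "results", "promoted", "archived"):
    (root / sub).mkdir(parents=True)

# A queued hypothesis is a single YAML file.
hyp = root / "queue" / "H-042.yaml"
hyp.write_text("id: H-042\n")

# queue/ → active/ when the runner picks it up
active = hyp.rename(root / "active" / hyp.name)

# On completion, metrics land in results/<hypothesis_id>/metrics.json
result_dir = root / "results" / "H-042"
result_dir.mkdir()
(result_dir / "metrics.json").write_text(json.dumps({"MemoryBench": 0.851}))

# Phase 5 logic then decides promoted/ vs archived/
promoted = active.rename(root / "promoted" / active.name)
print(promoted.exists(), (root / "queue" / "H-042.yaml").exists())  # True False
```

Because each directory holds exactly one lifecycle stage, a crashed runner can resume by scanning `active/` on startup.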
The automated runner:

1. Pulls the highest-priority hypothesis from `queue/` and moves it to `active/`.
2. Runs a fast validation pass first; configs that fail are moved to `archived/` with failure notes. Do not waste full compute on broken configs.
3. Runs the full experiment and writes metrics to `results/<hypothesis_id>/metrics.json`.
4. Moves the config to `promoted/` or `archived/` based on Phase 5 promotion logic.
5. Pins the exact environment for reproducibility (e.g., `requirements.lock`).

## Phase 5 — Promotion

A new configuration is promoted to production only when all of the following are true:
| Criterion | Requirement |
|---|---|
| Primary metric improvement | ≥ promotion_threshold above current production score |
| Statistical significance | p < 0.05 on paired t-test vs production config |
| No regression on secondary metrics | Latency within 10%, memory within 15% |
| Reproducibility | Consistent across ≥ 3 seeds (std < 0.5% of mean) |
| Leaderboard projection | If promoted, would we reach or exceed #1? |
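These criteria can be checked mechanically before any config touches production. A sketch covering the threshold, reproducibility, and regression checks; the significance test is omitted here (in practice a paired t-test, e.g. `scipy.stats.ttest_rel`, fills that slot), and the function signature is an assumption:

```python
from statistics import mean, stdev

def promotion_check(candidate_scores, production_score, threshold=0.005,
                    latency_ratio=1.0, memory_ratio=1.0):
    """Apply the promotion criteria to per-seed candidate scores.
    Returns (approved, reasons_for_rejection)."""
    reasons = []
    m = mean(candidate_scores)
    if m - production_score < threshold:
        reasons.append(f"improvement {m - production_score:+.4f} below threshold")
    if len(candidate_scores) < 3 or stdev(candidate_scores) >= 0.005 * m:
        reasons.append("not reproducible across ≥3 seeds (std ≥ 0.5% of mean)")
    if latency_ratio > 1.10:
        reasons.append("latency regression > 10%")
    if memory_ratio > 1.15:
        reasons.append("memory regression > 15%")
    return (not reasons, reasons)

ok, why = promotion_check([0.851, 0.852, 0.850], production_score=0.823)
print(ok, why)  # True []
```

Returning the failed criteria as strings gives the rejection note required below for free.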
If promotion is approved:

1. Copy the winning config to `promoted/` with a timestamp.
2. Update `competitive_registry.yaml` → `our_solution.current_scores`.
3. Record the change in `CHANGELOG.md`.

If rejected, write a clear rejection note explaining which criterion failed.
Once at #1, the system switches to defense mode: it keeps scraping leaderboards, tracks the gap to #2, and raises the urgency of queued experiments whenever that gap narrows.
## Phase 6 — Reporting

Produce two report types automatically:
```markdown
# Benchmark Digest — Week of YYYY-MM-DD

## Competitive Position
- MemoryBench: #1 ✅ (our score: 0.871 | gap to #2: +0.010)
- LongContextEval: #2 ⚠️ (our score: 0.812 | gap to #1: -0.006)

## What Changed This Week
- Promoted H-042 (hierarchical compression): +0.031 on MemoryBench
- CompetitorA improved LongContextEval to 0.818 — now ahead of us

## Next Actions (Automated)
- Experiment H-047 (synthetic data aug) queued for LongContextEval — est. gain +0.009
- Research ingest scheduled for Monday

## Experiments Run This Week
- 4 experiments completed | 3 failed fast validation | 1 promoted
```
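The competitive-position lines of such a digest render straight from the registry state. A minimal sketch; the `(rank, score, gap)` tuple format and the function name are assumptions:

```python
def digest_position_lines(positions: dict) -> list[str]:
    """positions maps leaderboard → (rank, our_score, gap):
    gap is to #2 when we lead, to #1 when we trail."""
    lines = []
    for board, (rank, score, gap) in positions.items():
        mark = "✅" if rank == 1 else "⚠️"
        rival = "#2" if rank == 1 else "#1"
        lines.append(f"- {board}: #{rank} {mark} "
                     f"(our score: {score:.3f} | gap to {rival}: {gap:+.3f})")
    return lines

positions = {
    "MemoryBench": (1, 0.871, 0.010),
    "LongContextEval": (2, 0.812, -0.006),
}
for line in digest_position_lines(positions):
    print(line)
```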
The full structured log lives in `TECHNICAL_LOG.md`.
Engineers should be able to review the log in under 5 minutes and understand exactly what the system did and why.
One-time setup:

- `competitive_registry.yaml` populated with leaderboards and competitors
- `research_config.yaml` configured with paper sources and domain queries

Each automated cycle (verify the system does this):

- Scrapes leaderboards and updates the registry (Phase 1)
- Ingests new research on schedule (Phase 2)
- Keeps the hypothesis queue ranked (Phase 3)
- Runs and evaluates queued experiments (Phase 4)
- Promotes or archives results (Phase 5)
- Reports position and changes (Phase 6)
Escalate to humans only when the loop cannot proceed on its own; everything else should run unattended.