From curry-train
Concrete recipe for running an Optuna-driven hyperparameter sweep through Hydra, with TPE/CMA-ES/Hyperband, distributed multi-rank trials, study persistence, and per-trial run journal. Activate when the user asks "set up an Optuna sweep", "run hyperparameter search", "Hydra Optuna sweeper", or "parallel HPO".
`npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`

This skill uses the workspace's default tool permissions.
The plumbing layer beneath `stage4-optuna-integration`. This skill defines **how** the sweep runs (storage, parallelism, journal integration); the *when* and *what* live in the Stage 4 skill.
```shell
pip install hydra-optuna-sweeper
```

```yaml
# configs/search/lr_wd.yaml
defaults:
  - override hydra/sweeper: optuna
  - override hydra/sweeper/sampler: tpe

hydra:
  sweeper:
    direction: minimize
    n_trials: 64
    n_jobs: 4  # parallel trial workers (single-machine)
    study_name: ${experiment.name}
    storage: sqlite:///${paths.runs}/${experiment.name}/study.db
    sampler:
      _target_: optuna.samplers.TPESampler
      multivariate: true
    pruner:
      _target_: optuna.pruners.HyperbandPruner
      min_resource: 200
      max_resource: 5000
    params:
      training.lr: tag(log, interval(1e-5, 1e-2))
      training.weight_decay: tag(log, interval(1e-4, 1e-1))
      training.warmup_steps: int(interval(0, 2000))
      model.dropout: interval(0.0, 0.2)
```
```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="configs", config_name="config")
def main(cfg: DictConfig) -> float:
    """One Optuna trial = one curryTrain run.

    Returns the headline metric to be minimized.
    """
    set_seed(cfg.seed)
    # `fabric` comes from the project's training setup (not shown here).
    with Run(cfg) as run:
        result = train_and_evaluate(cfg, run, fabric)
    return float(result.headline_metric)
```

The function's return value is what Optuna optimizes. Wrapping the trial in `Run(cfg)` ensures every trial is a fully journaled curryTrain run.
```python
import optuna

def report_intermediate(trial: optuna.Trial, step: int, value: float):
    trial.report(value, step)
    if trial.should_prune():
        raise optuna.TrialPruned()
```

Hook this to the trial's eval cadence (e.g., every 500 steps). Pruning typically reduces compute by 2–4×.
For multi-machine HPO, point all workers at the same study:

```shell
# On each machine:
python train.py -m search=lr_wd experiment.name=foo \
  hydra.sweeper.storage=postgresql://hpo@db.example/optuna
```

Optuna's database lock handles concurrent trial proposals. Each worker pulls a trial from the shared study.
```shell
# Resume an interrupted study (same study_name + storage):
python train.py -m search=lr_wd experiment.name=foo
```

Optuna picks up where it left off, including pruned trials.
1. Confirm Hydra is set up (`infra-hydra-config`). Without that, the Optuna sweeper is messy to bolt on.
2. Decide search dimensions with `stage4-optuna-integration` (don't search 12 params at once).
3. Write the search config in `configs/search/<name>.yaml`.
4. Choose pruning. For trials that have meaningful intermediate metrics (most do), enable Hyperband. Skip pruning only if every trial is short.
5. Pick storage: `sqlite:///` for solo work, PostgreSQL for shared studies. Always specify `storage:` so trials persist.
6. Run: `python train.py -m search=<name> experiment.name=...`.
7. After the sweep, generate `optuna.visualization.plot_param_importances(study)` and `plot_optimization_history(study)`. Save these next to the study DB.
If the per-trial variance is significant (typical for stochastic deep learning), wrap each trial over N=2–3 seeds and return the mean:

```python
import copy
import numpy as np
from omegaconf import DictConfig

def main(cfg: DictConfig) -> float:
    losses = []
    for seed in range(cfg.search_n_seeds):
        cfg_s = copy.deepcopy(cfg)
        cfg_s.seed = seed
        losses.append(train_one(cfg_s))
    return float(np.mean(losses))
```

This multiplies the cost per trial by 2–3× but stops Optuna from chasing seed noise.
**Pitfalls**

- `n_jobs` is intra-process; for cross-machine parallelism you need a shared DB and external job launching.
- Optimizing loss over training (not validation) → finds overfit configs.
- `sqlite:///` and many parallel workers → SQLite locks; use Postgres.

**Related skills**

- `skills/stage4-optuna-integration` — strategic when/what to search.
- `skills/infra-hydra-config` — config plumbing.
- `skills/stage5-run-journal` — every trial is a journaled run.