From curry-train
Concrete recipe for running an Optuna-driven hyperparameter sweep through Hydra, with TPE/CMA-ES/Hyperband, distributed multi-rank trials, study persistence, and per-trial run journal. Activate when the user asks "set up an Optuna sweep", "run hyperparameter search", "Hydra Optuna sweeper", or "parallel HPO".
`npx claudepluginhub curryfromuestc/curry-train --plugin curry-train`

This skill uses the workspace's default tool permissions.
The plumbing layer beneath `stage4-optuna-integration`. This skill defines **how** the sweep runs (storage, parallelism, journal integration); the *when* and *what* live in the Stage 4 skill.
```shell
pip install hydra-optuna-sweeper
```

```yaml
# configs/search/lr_wd.yaml
defaults:
  - override hydra/sweeper: optuna
  - override hydra/sweeper/sampler: tpe

hydra:
  sweeper:
    direction: minimize
    n_trials: 64
    n_jobs: 4  # parallel trial workers (single-machine)
    study_name: ${experiment.name}
    storage: sqlite:///${paths.runs}/${experiment.name}/study.db
    sampler:
      _target_: optuna.samplers.TPESampler
      multivariate: true
    pruner:
      _target_: optuna.pruners.HyperbandPruner
      min_resource: 200
      max_resource: 5000
    params:
      training.lr: tag(log, interval(1e-5, 1e-2))
      training.weight_decay: tag(log, interval(1e-4, 1e-1))
      training.warmup_steps: int(interval(0, 2000))
      model.dropout: interval(0.0, 0.2)
```
```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="configs", config_name="config")
def main(cfg: DictConfig) -> float:
    """One Optuna trial = one curryTrain run.

    Returns the headline metric to be minimized.
    """
    set_seed(cfg.seed)
    # `fabric` comes from the project's training setup (not shown here).
    with Run(cfg) as run:
        result = train_and_evaluate(cfg, run, fabric)
    return float(result.headline_metric)
```

The function's return value is what Optuna optimizes. Wrapping the trial in `Run(cfg)` ensures every trial is a fully journaled curryTrain run.
```python
import optuna

def report_intermediate(trial: optuna.Trial, step: int, value: float):
    trial.report(value, step)
    if trial.should_prune():
        raise optuna.TrialPruned()
```

Hook this to the trial's eval cadence (e.g., every 500 steps). Pruning typically reduces compute by 2–4×.
For multi-machine HPO, point all workers at the same study:

```shell
# On each machine:
python train.py -m search=lr_wd experiment.name=foo \
  hydra.sweeper.storage=postgresql://hpo@db.example/optuna
```

Optuna's database lock handles concurrent trial proposals. Each worker pulls a trial from the shared study.
```shell
# Resume an interrupted study (same study_name + storage):
python train.py -m search=lr_wd experiment.name=foo
```

Optuna picks up where it left off, including pruned trials.
1. Confirm Hydra is set up (`infra-hydra-config`). Without that, the Optuna sweeper is messy to bolt on.
2. Decide search dimensions with `stage4-optuna-integration` (don't search 12 params at once).
3. Write the search config in `configs/search/<name>.yaml`.
4. Choose pruning. For trials that have meaningful intermediate metrics (most do), enable Hyperband. Skip pruning only if every trial is short.
5. Pick storage: `sqlite:///` for solo work, PostgreSQL for shared studies. Always specify `storage:` so trials persist.
6. Run: `python train.py -m search=<name> experiment.name=...`.
7. After the sweep, generate `optuna.visualization.plot_param_importances(study)` and `plot_optimization_history(study)`. Save these next to the study DB.
If the per-trial variance is significant (typical for stochastic deep learning), wrap each trial over N=2–3 seeds and return the mean:

```python
import copy
import numpy as np
from omegaconf import DictConfig

def main(cfg: DictConfig) -> float:
    losses = []
    for seed in range(cfg.search_n_seeds):
        cfg_s = copy.deepcopy(cfg)
        cfg_s.seed = seed
        losses.append(train_one(cfg_s))
    return float(np.mean(losses))
```

This multiplies the cost per trial by 2–3× but stops Optuna from chasing seed noise.
**Pitfalls**

- `n_jobs` is intra-process; for cross-machine parallelism you need a shared DB and external job launching.
- Optimizing loss over training (not validation) → finds overfit configs.
- `sqlite:///` and many parallel workers → SQLite locks; use Postgres.

**Related skills**

- `skills/stage4-optuna-integration` — strategic when/what to search.
- `skills/infra-hydra-config` — config plumbing.
- `skills/stage5-run-journal` — every trial is a journaled run.