From curry-train
Estimate the compute and dollar cost of a proposed training run before launching it, and compare against the expected gain from the small-scale ablation. Activate when the user asks "how much will this cost", "is this run worth the compute", "compute budget estimator", "how long will this take", or considers launching a multi-day run.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-trainThis skill uses the workspace's default tool permissions.
Before launching a long run, estimate its compute cost and weigh it against the *expected* gain (from small-scale ablation + scaling fit). This is the most under-used Stage 3 step in practice and the single biggest source of regretted compute.
Guides Next.js Cache Components and Partial Prerendering (PPR): 'use cache' directives, cacheLife(), cacheTag(), revalidateTag() for caching, invalidation, static/dynamic optimization. Auto-activates on cacheComponents: true.
Processes PDFs: extracts text/tables/images, merges/splits/rotates pages, adds watermarks, creates/fills forms, encrypts/decrypts, OCRs scans. Activates on PDF mentions or output requests.
Share bugs, ideas, or general feedback.
Before launching a long run, estimate its compute cost and weigh it against the expected gain (from small-scale ablation + scaling fit). This is the most under-used Stage 3 step in practice and the single biggest source of regretted compute.
"Given the predicted improvement, is this large run economically rational compared to the alternatives?"
For each candidate run:
N, training tokens T: FLOPs ≈ 6 · N · T (Kaplan 2020 approximation).wall_time ≈ FLOPs / (GPU_count · GPU_FLOPS · MFU), where MFU (model FLOPs utilization) is typically 0.3–0.5 for dense transformers, 0.15–0.3 for MoE, 0.1–0.2 for CNN-heavy or SNN models.wall_time · GPU_count · $/GPU-hour.Compute is justified only when:
stage3-scaling-fit) is large enough to be worth the dollar cost. Sounds obvious; rarely actually computed.If predicted gain at scale is ambiguous, the right move is usually to collect more small-scale data rather than gamble.
A user wants to fine-tune a 7B parameter model on 100B tokens.
N = 7e9
T = 1e11
FLOPs = 6 · 7e9 · 1e11 = 4.2e21
GPUs = 8 × A100 (each ~ 312 TFLOPs FP16 ≈ 3.12e14)
MFU = 0.4
wall_time = 4.2e21 / (8 · 3.12e14 · 0.4)
≈ 4.2e21 / 1e15
≈ 4.2e6 s ≈ 50 days
$ at $1.50/A100-hr
$ ≈ 50 · 24 · 8 · 1.50 = $14,400
For the user to justify $14,400, the predicted gain (from a Stage 3 scaling fit) must be both quantified and tied to a concrete benefit (publication, deployable improvement, downstream contract). "I think this will help" does not justify it.
Lives at template/curry_train/prevalidate/compute_budget.py. Sketch:
def estimate_run_cost(N_params, T_tokens, gpus, gpu_tflops, mfu, dollar_per_gpu_hour):
flops = 6 * N_params * T_tokens
aggregate_tflops = gpus * gpu_tflops * mfu
wall_seconds = flops / (aggregate_tflops * 1e12)
wall_hours = wall_seconds / 3600
cost = wall_hours * gpus * dollar_per_gpu_hour
return {
"flops": flops,
"wall_hours": wall_hours,
"wall_days": wall_hours / 24,
"dollar_cost": cost,
}
Add a decision_summary(estimate, predicted_gain, gain_uncertainty, baseline_to_beat) helper that produces a one-line verdict.
Get the four inputs from the user: N, T, GPU type/count, and MFU (0.3–0.5 default for dense transformer, lower for unconventional).
Compute and present the cost estimate as a small markdown table: FLOPs, wall hours, days, $.
Ask what predicted gain the run is supposed to produce — pull from stage3-scaling-fit or stage3-small-scale-ablation.
Render the decision: gain per dollar, comparison to alternatives. Be honest if the answer is "the predicted gain isn't worth this cost".
Suggest cheaper alternatives if the answer is no:
stage3-kill-criterion skill. Even with a good budget, every run needs an abort condition.skills/stage3-scaling-fit — produces the predicted gain that compute budget weighs against.skills/stage3-kill-criterion — defines when to abort partway through.skills/stage4-capacity-sweep — determines N.skills/stage5-checkpoint-cadence — checkpoint cadence affects whether a partial run has salvageable value.template/curry_train/prevalidate/compute_budget.py.