Skill

stage3-compute-budget

Estimate the compute and dollar cost of a proposed training run before launching it, and compare against the expected gain from the small-scale ablation. Activate when the user asks "how much will this cost", "is this run worth the compute", "compute budget estimator", "how long will this take", or considers launching a multi-day run.

npx claudepluginhub curryfromuestc/curry-train --plugin curry-train

Tool Access

This skill uses the workspace's default tool permissions.

Last commit: May 4, 2026

Stage 3 · Pre-validate · Compute budget estimator

Before launching a long run, estimate its compute cost and weigh it against the expected gain (from the small-scale ablation and the scaling fit). This is the most under-used Stage 3 step in practice, and skipping it is the single biggest source of regretted compute.

Stage question

"Given the predicted improvement, is this large run economically rational compared to the alternatives?"

What to estimate

For each candidate run:

FLOPs: total floating-point operations.
- For a transformer with parameter count N trained on T tokens: FLOPs ≈ 6 · N · T (the approximation from Kaplan et al., 2020).
Wall time: wall_time ≈ FLOPs / (GPU_count · GPU_FLOPS · MFU), where GPU_FLOPS is the peak per-GPU throughput in FLOP/s and MFU (model FLOPs utilization) is typically 0.3–0.5 for dense transformers, 0.15–0.3 for MoE, and 0.1–0.2 for CNN-heavy or SNN models.
Memory: peak activation + parameter + optimizer state memory; check that it fits on the hardware before committing.
Dollar cost: wall_time · GPU_count · $/GPU-hour, with wall_time expressed in hours.
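The memory item above can be given a rough first number before any benchmarking. The sketch below is illustrative only: the function name and the 16-bytes-per-parameter figure are assumptions (a common rule of thumb for mixed-precision Adam), activations are deliberately left out, and even sharding across GPUs is assumed.

```python
def estimate_static_memory_gb(n_params: float, n_shards: int = 1,
                              bytes_per_param: float = 16.0) -> float:
    """Rough per-GPU memory for weights, gradients, and optimizer state.

    Assumption: mixed-precision Adam at about 16 bytes per parameter
    (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights
    and the two Adam moments). Activation memory is NOT included.
    """
    total_bytes = n_params * bytes_per_param
    # Assumes the state is split evenly across n_shards GPUs.
    return total_bytes / n_shards / 1e9


unsharded_gb = estimate_static_memory_gb(7e9)
sharded_gb = estimate_static_memory_gb(7e9, n_shards=8)
print(f"{unsharded_gb:.0f} GB unsharded, {sharded_gb:.0f} GB per GPU across 8 shards")
```

Activation memory depends on batch size, sequence length, and parallelism layout, so this number is only a floor; a one-step benchmark remains the real check.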

The decision

Compute is justified only when all of the following hold:

The predicted gain (from stage3-scaling-fit) is large enough to be worth the dollar cost. This sounds obvious but is rarely actually computed.
The predicted gain exceeds its own uncertainty, as estimated from run-to-run variance at small scale.
The run does not block more promising experiments (opportunity cost).

If predicted gain at scale is ambiguous, the right move is usually to collect more small-scale data rather than gamble.

Worked example

A user wants to fine-tune a 7B parameter model on 100B tokens.

N = 7e9
T = 1e11
FLOPs = 6 · 7e9 · 1e11 = 4.2e21

GPUs = 8 × A100 (each ~312 TFLOPS peak FP16 = 3.12e14 FLOP/s)
MFU = 0.4
wall_time = 4.2e21 / (8 · 3.12e14 · 0.4)
          ≈ 4.2e21 / 1e15
          ≈ 4.2e6 s ≈ 49 days (rounded to 50 below)
price = $1.50 per A100-hour
cost ≈ 50 days · 24 h/day · 8 GPUs · $1.50 = $14,400

For the user to justify $14,400, the predicted gain (from a Stage 3 scaling fit) must be both quantified and tied to a concrete benefit (publication, deployable improvement, downstream contract). "I think this will help" does not justify it.
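Because MFU is the least certain input, it can help to rerun the arithmetic above across the plausible range rather than at a single value. This is a minimal sketch; `cost_at_mfu` is a hypothetical helper, and the 0.3 and 0.5 endpoints are simply the dense-transformer range quoted earlier.

```python
SECONDS_PER_HOUR = 3600


def cost_at_mfu(flops, gpus, peak_flops_per_gpu, mfu, dollar_per_gpu_hour):
    """Return (wall_days, dollar_cost) for one assumed MFU value."""
    wall_hours = flops / (gpus * peak_flops_per_gpu * mfu) / SECONDS_PER_HOUR
    return wall_hours / 24, wall_hours * gpus * dollar_per_gpu_hour


flops = 6 * 7e9 * 1e11  # 7B parameters, 100B tokens
results = {
    mfu: cost_at_mfu(flops, gpus=8, peak_flops_per_gpu=3.12e14,
                     mfu=mfu, dollar_per_gpu_hour=1.50)
    for mfu in (0.3, 0.4, 0.5)
}
for mfu, (days, dollars) in results.items():
    print(f"MFU {mfu:.1f}: {days:5.1f} days, ${dollars:,.0f}")
```

The unrounded figure at MFU 0.4 lands slightly under $14,400, since the worked example rounds the wall time up to 50 days.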

Recommended implementation

Lives at template/curry_train/prevalidate/compute_budget.py. Sketch:

def estimate_run_cost(N_params, T_tokens, gpus, gpu_tflops, mfu, dollar_per_gpu_hour):
    """Estimate FLOPs, wall time, and dollar cost; gpu_tflops is peak per-GPU TFLOPS."""
    flops = 6 * N_params * T_tokens  # FLOPs ≈ 6 · N · T
    aggregate_tflops = gpus * gpu_tflops * mfu  # sustained TFLOPS across all GPUs
    wall_seconds = flops / (aggregate_tflops * 1e12)  # convert TFLOPS to FLOP/s
    wall_hours = wall_seconds / 3600
    cost = wall_hours * gpus * dollar_per_gpu_hour  # billed per GPU-hour
    return {
        "flops": flops,
        "wall_hours": wall_hours,
        "wall_days": wall_hours / 24,
        "dollar_cost": cost,
    }

Add a decision_summary(estimate, predicted_gain, gain_uncertainty, baseline_to_beat) helper that produces a one-line verdict.
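One possible shape for that helper is sketched below. Only the signature comes from the text; the decision rules, the verdict wording, and the meaning given to each gain argument are assumptions made for illustration. In particular, `baseline_to_beat` is read here as the smallest gain that would make the run worthwhile (for example, what a cheaper alternative already delivers), and all gains are assumed to share one metric where higher is better.

```python
def decision_summary(estimate, predicted_gain, gain_uncertainty, baseline_to_beat):
    """Return a one-line verdict for a proposed run (hypothetical rules).

    `estimate` is a dict with "dollar_cost" and "wall_days" keys.
    """
    cost = estimate["dollar_cost"]
    days = estimate["wall_days"]
    if predicted_gain <= gain_uncertainty:
        # The gain cannot be distinguished from small-scale noise.
        verdict = "HOLD: gain is within noise, collect more small-scale data"
    elif predicted_gain - gain_uncertainty < baseline_to_beat:
        # Even a real gain may not clear the bar in the pessimistic case.
        verdict = "HOLD: pessimistic gain does not clear the bar"
    else:
        verdict = "GO: gain clears both the noise and the bar"
    gain_per_kusd = predicted_gain / (cost / 1000) if cost > 0 else float("inf")
    return (f"{verdict} | ${cost:,.0f} over {days:.1f} days | "
            f"gain {predicted_gain:+.3g} ± {gain_uncertainty:.3g} | "
            f"{gain_per_kusd:.3g} per $1k")


estimate = {"flops": 4.2e21, "wall_hours": 1200.0,
            "wall_days": 50.0, "dollar_cost": 14400.0}
summary = decision_summary(estimate, predicted_gain=0.8,
                           gain_uncertainty=1.0, baseline_to_beat=0.5)
print(summary)
```

Testing the pessimistic gain (prediction minus uncertainty) against the bar is a deliberately conservative choice; a real implementation might weigh opportunity cost as well.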

Procedure when assisting a user

Get the five inputs from the user: N, T, GPU type and count, MFU (default 0.3–0.5 for a dense transformer, lower for unconventional architectures), and $/GPU-hour.
Compute and present the cost estimate as a small markdown table: FLOPs, wall hours, days, $.
Ask what predicted gain the run is supposed to produce — pull from stage3-scaling-fit or stage3-small-scale-ablation.
Render the decision: gain per dollar, comparison to alternatives. Be honest if the answer is "the predicted gain isn't worth this cost".
Suggest cheaper alternatives if the answer is no:
- Smaller scale (for example 4× cheaper; the same gap is probably still visible).
- Shorter run (try 30B tokens before 100B).
- Distillation from an existing larger model.
- Different idea (use the compute for something with a better expected gain).
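The table in step 2 and the "shorter run" alternative in step 5 can be produced together. The sketch below repeats the estimator so that it runs on its own; `to_markdown_table` and the candidate labels are hypothetical names, not part of the template.

```python
def estimate_run_cost(N_params, T_tokens, gpus, gpu_tflops, mfu, dollar_per_gpu_hour):
    """Same arithmetic as the recommended implementation."""
    flops = 6 * N_params * T_tokens
    wall_hours = flops / (gpus * gpu_tflops * mfu * 1e12) / 3600
    return {
        "flops": flops,
        "wall_hours": wall_hours,
        "wall_days": wall_hours / 24,
        "dollar_cost": wall_hours * gpus * dollar_per_gpu_hour,
    }


def to_markdown_table(candidates):
    """Render one row per candidate run: FLOPs, wall hours, days, $."""
    rows = ["| Run | FLOPs | Wall hours | Days | Cost |",
            "| --- | --- | --- | --- | --- |"]
    for name, est in candidates.items():
        rows.append(f"| {name} | {est['flops']:.2e} | {est['wall_hours']:,.0f} "
                    f"| {est['wall_days']:.1f} | ${est['dollar_cost']:,.0f} |")
    return "\n".join(rows)


candidates = {
    "7B, 100B tokens": estimate_run_cost(7e9, 1e11, 8, 312, 0.4, 1.50),
    "7B, 30B tokens": estimate_run_cost(7e9, 3e10, 8, 312, 0.4, 1.50),
}
table = to_markdown_table(candidates)
print(table)
```

Putting the cheaper alternative in the same table makes the comparison in step 4 concrete: under this model, cost scales linearly with T.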

Boundaries

Estimates are upper bounds on speed and lower bounds on cost — real MFU is often less than estimated, especially for the first run on a new system.
This skill does not replace the stage3-kill-criterion skill. Even with a good budget, every run needs an abort condition.
Memory estimation is non-trivial for sequence-parallel and pipeline-parallel setups; for V1, use rough approximations and sanity-check them with a one-step benchmark.

Common mistakes

Using peak GPU FLOPS (e.g., 312 TFLOPS for A100) without applying MFU → 2–4× overestimate of throughput.
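Forgetting what 6 · N · T covers: it counts one forward pass (about 2 · N · T) and one backward pass (about 4 · N · T) with no activation recomputation. Full recomputation adds roughly another forward pass, about 8 · N · T of hardware FLOPs in total, which shows up as lower achieved MFU; multi-token sampling pushes the cost higher still.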
Quoting only wall time, not dollar cost — wall time alone hides the opportunity cost.
Not budgeting for failures — assume 1.5× nominal cost to cover a re-run.
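Two of these mistakes are easy to make visible in numbers. The sketch below uses hypothetical helper names and the worked-example figures; it contrasts a peak-FLOPS wall-time estimate with an MFU-adjusted one, then pads the nominal cost by the 1.5× factor suggested above.

```python
SECONDS_PER_DAY = 86_400


def wall_days(flops, gpus, peak_flops_per_gpu, mfu=1.0):
    """Wall time in days; leaving mfu at 1.0 reproduces the peak-FLOPS mistake."""
    return flops / (gpus * peak_flops_per_gpu * mfu) / SECONDS_PER_DAY


def budget_with_contingency(nominal_cost, rerun_multiplier=1.5):
    """Pad the nominal dollar cost to cover a failed run."""
    return nominal_cost * rerun_multiplier


flops = 6 * 7e9 * 1e11
naive_days = wall_days(flops, 8, 3.12e14)               # peak FLOPS, no MFU
realistic_days = wall_days(flops, 8, 3.12e14, mfu=0.4)  # MFU applied
padded_cost = budget_with_contingency(14_400)

print(f"peak-only: {naive_days:.1f} days, with MFU 0.4: {realistic_days:.1f} days")
print(f"nominal $14,400 -> budget ${padded_cost:,.0f}")
```

At MFU 0.4 the peak-only estimate is optimistic by a factor of 2.5, which falls inside the 2–4× range quoted above.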

Related

skills/stage3-scaling-fit: produces the predicted gain that the compute budget is weighed against.
skills/stage3-kill-criterion: defines when to abort partway through.
skills/stage4-capacity-sweep: determines N.
skills/stage5-checkpoint-cadence: checkpoint cadence affects whether a partial run has salvageable value.
template/curry_train/prevalidate/compute_budget.py: location of the recommended implementation.
