Explores weight quantization and palettization for PyTorch models using coreai-opt to compare accuracy vs size tradeoffs.
Install: `npx claudepluginhub apple/coreai-models --plugin coreai-skills`

Slash command: `/coreai-skills:model-compression-exploration`
Systematically explore weight-only compression configurations for a PyTorch model using `coreai_opt`. The goal is to present the user with a clear overview of accuracy-vs-size tradeoff options across quantization and palettization, organized into three experiment groups.
| File | Contents |
|---|---|
| compression_patterns.md | Empirical patterns: what works, what doesn't, and why |
| size_estimation.md | How to compute theoretical compressed model size |
| experiment_runner.md | Memory-safe experiment loop, helpers, average bitwidth |
| output_report.md | How to format and organize the output produced |
The deterministic helpers are unit-tested and importable. Prefer them over hand-rolled equivalents — they encode formulas and edge cases that have already been debugged.
| Script | Purpose |
|---|---|
| scripts/compression_metrics.py | Theoretical size, average bitwidth, divisibility, parametrize walk |
| scripts/quality_metrics.py | PSNR / SNR / IoU and a per-output dispatcher |
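For orientation only, here is a minimal sketch of what PSNR, SNR, IoU, and a per-output dispatcher compute. It is not the code in `scripts/quality_metrics.py`; the function names, the flat-list inputs, and the choice of peak value (the largest absolute reference value) are all assumptions made for illustration. The tested helpers remain the ones to use in practice.

```python
import math

def psnr(reference, output):
    """PSNR in dB. Assumption: the peak is max |reference|."""
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf
    peak = max(abs(r) for r in reference)
    if peak == 0:
        return -math.inf
    return 10 * math.log10(peak ** 2 / mse)

def snr(reference, output):
    """SNR in dB: reference power over error power."""
    noise = sum((r - o) ** 2 for r, o in zip(reference, output))
    if noise == 0:
        return math.inf
    signal = sum(r ** 2 for r in reference)
    if signal == 0:
        return -math.inf
    return 10 * math.log10(signal / noise)

def iou(reference, output):
    """Intersection over union of two binary masks given as flat sequences."""
    inter = sum(1 for r, o in zip(reference, output) if r and o)
    union = sum(1 for r, o in zip(reference, output) if r or o)
    return 1.0 if union == 0 else inter / union

METRICS = {"psnr": psnr, "snr": snr, "iou": iou}

def score_outputs(metric_names, references, outputs):
    """Apply the metric chosen for each output, in order."""
    return [METRICS[m](ref, out)
            for m, ref, out in zip(metric_names, references, outputs)]
```

The dispatcher takes one metric name per model output, which mirrors the per-output choice among {"psnr", "snr", "iou"} that the user is asked to make.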
CoreAI Opt (coreai-opt) is a package that helps with model compression and model optimization in a hardware-aware manner.
For the full coreai-opt documentation, fetch:
https://apple.github.io/coreai-optimization/llms-full.txt
Check that the package is installed in the current Python environment (venv, conda env). The package is distributed as coreai-opt and imported as coreai_opt. If it is not installed, prompt the user to install it.
For API verification at runtime, use help(coreai_opt) or inspect to confirm current signatures.
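The installation check can be done without importing the package. The sketch below uses only the standard library; the helper name `package_status` is hypothetical, and the only facts taken from this document are the distribution name `coreai-opt` and the import name `coreai_opt`.

```python
import importlib.util
from importlib import metadata

def package_status(import_name, dist_name):
    """Report whether a package is importable and, if known, its version."""
    if importlib.util.find_spec(import_name) is None:
        return {"installed": False, "version": None}
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # Importable but without distribution metadata (e.g. a local checkout).
        version = None
    return {"installed": True, "version": version}

status = package_status("coreai_opt", "coreai-opt")
if not status["installed"]:
    print("coreai-opt is not installed in this environment; ask the user to install it.")
```

Once the package is confirmed present, `help(coreai_opt)` or `inspect.signature(...)` on the specific callables covers the signature check.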
The user must provide four things: how to load the model, how to run a forward pass, which inputs to use, and how to check the quality of the outputs. Get this from the user rather than assuming it, because every model is different and wrong assumptions lead to meaningless results. For example, using random inputs instead of valid ones makes the output quality numbers meaningless.
- `get_model() -> nn.Module`: how to load or create the model (imports, weights, model class).
- `get_reference_data() -> tuple[torch.Tensor, ...] | dict[str, torch.Tensor]`: a representative batch of real inputs. Even 1-3 real samples suffice; random inputs produce meaningless PSNR because they don't exercise learned weight structure.
- Confirm that `get_model()(*get_reference_data())` (or the dict-spread equivalent `get_model()(**get_reference_data())`) actually runs. If neither works, ask the user how to invoke the model end-to-end.
- `get_quality_metric(model_out) -> list[str]`: which metric to use when checking each output against the uncompressed output. For each output the model produces, ask the user to choose one of {"psnr", "snr", "iou"}. For example, PSNR is the wrong metric for a mask output; IoU is appropriate.

This information is required to proceed to the next step.
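The tuple-versus-dict invocation described above can be sketched as a small dispatcher. This is an illustration only: `call_with_reference` is a hypothetical name, and it is not the helper shipped in references/experiment_runner.md.

```python
from collections.abc import Mapping

def call_with_reference(model, reference_data):
    """Call `model` with tuple-shaped (positional) or dict-shaped (keyword) inputs."""
    if isinstance(reference_data, Mapping):
        return model(**reference_data)
    return model(*reference_data)

# Stand-in callable used only to show both call shapes.
def toy_model(x, y=0):
    return x + y

print(call_with_reference(toy_model, (1, 2)))            # positional inputs
print(call_with_reference(toy_model, {"x": 1, "y": 2}))  # keyword inputs
```

Routing every model call through one function like this keeps the sweep code independent of how the user's reference data is shaped.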
The `_call_model` helper in references/experiment_runner.md handles both tuple- and dict-shaped reference data; reuse it everywhere you call the model.

Take the default global weight quantization preset `QuantizerConfig.presets.w8()` (graph mode is the default). Apply it to a fresh model and time a single forward pass through the prepared model. If `Quantizer.prepare(...)` errors (e.g., a torch.export guard failure or dynamic control flow), fall back to `QuantizerConfig.presets.w8(execution_mode=ExecutionMode.EAGER)` and time again. The mode that succeeds here is the mode to use for the entire sweep, so that the timing reflects real wall-clock cost. Record this mode and reuse it.
The single elapsed time becomes avg_quant_time. Pass quantizer=quantizer to extract_layer_specs(...) so it can read graph-mode FQ metadata via Quantizer._get_fake_quantize_modules(); otherwise the walker won't see graph-mode quantization and would misreport every layer as fp16.
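The try-graph-then-eager timing step can be sketched generically. The helper below knows nothing about `coreai_opt`: each attempt is a zero-argument callable that the caller builds (for example, one that prepares the model in a given mode and runs the forward pass). The name `time_first_working` and the decision to time the whole callable are assumptions for illustration.

```python
import time

def time_first_working(attempts):
    """Run attempts in order; return (mode, elapsed_seconds, errors) for the first success.

    attempts: list of (mode_name, zero_arg_callable) pairs, in preference order.
    """
    errors = {}
    for mode, run in attempts:
        start = time.perf_counter()
        try:
            run()
        except Exception as exc:  # any failure triggers the next fallback
            errors[mode] = repr(exc)
            continue
        return mode, time.perf_counter() - start, errors
    raise RuntimeError(f"all modes failed: {errors}")

# Stand-in callables: graph mode fails, eager mode succeeds.
def graph_attempt():
    raise RuntimeError("export guard failure (simulated)")

def eager_attempt():
    sum(range(1000))

mode, avg_quant_time, errors = time_first_working(
    [("graph", graph_attempt), ("eager", eager_attempt)]
)
print(mode, "graph" in errors)
```

Returning the mode alongside the elapsed time makes it easy to record the mode once and reuse it for every config in the sweep.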
Take the default palettization preset KMeansPalettizerConfig.presets.w6() (6-bit, per-grouped-channel, group_size=16). Apply the palettization config and time a single forward pass through the palettized model. This elapsed time becomes avg_palett_time, the estimated cost of one palettization run. Palettization is eager-only, so there is no graph/eager fallback to handle here.
Below, we enumerate 3 groups of config options, totaling roughly 15 quantization configs and 15 palettization configs. Estimate the time required as avg_quant_time * 15 + avg_palett_time * 15.
Ask the user whether this estimate is in line with their expectations before proceeding. Use the AskUserQuestion tool to present the estimate and ask whether to proceed or to shorten the sweep.
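The estimate is simple arithmetic; a short sketch makes it easy to turn the two timings into a figure the user can react to. The function names and the example timings (12 s and 40 s) are placeholders, not measurements.

```python
def estimate_sweep_seconds(avg_quant_time, avg_palett_time, n_quant=15, n_palett=15):
    """avg_quant_time * 15 + avg_palett_time * 15, with the counts adjustable."""
    return avg_quant_time * n_quant + avg_palett_time * n_palett

def format_duration(seconds):
    """Render a duration in seconds as 'Hh MMm SSs'."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"

total = estimate_sweep_seconds(avg_quant_time=12.0, avg_palett_time=40.0)
print(format_duration(total))  # 12*15 + 40*15 = 780 s
```

If the user wants a shorter run, lowering `n_quant` or `n_palett` shows the effect of dropping configs before anything is launched.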
Use coreai_opt.quantization.Quantizer for the quantization groups (1a and 1b) and coreai_opt.palettization.KMeansPalettizer for the palettization group (2). Run the loop in references/experiment_runner.md (memory-safe, plus the canonical extract_layer_specs(prepared, quantizer=compressor) pattern that works in both modes). The execution mode was decided once in Step 3; use that mode for every config in the sweep.
For each config:
- `prepare()` by loading the config
- Theoretical size: references/size_estimation.md
- Forward pass in `eval()` mode with `no_grad`
- Record in `results.jsonl` (Output Report section)
- `finalize()`. Calibration is not needed for weight-only compression.

Build configs through `QuantizerConfig.presets` / `KMeansPalettizerConfig.presets` where the shape matches; for the variations they don't cover (asymmetric, symmetric_with_clipping, alternative block sizes, `enable_per_channel_scale=True`), see references/experiment_runner.md for the spec-construction patterns. Verify the preset namespace at runtime with `dir(QuantizerConfig.presets)` and `dir(KMeansPalettizerConfig.presets)`, since new presets are added over time.
Cross-product of {int8, int4} × {symmetric, asymmetric, symmetric_with_clipping}, all per-channel. The two symmetric corners match QuantizerConfig.presets.w8() and .w4() directly; the other four are variations that swap qscheme=.
Cross-product of {block_size: 16, 32, 128} × {symmetric, asymmetric, symmetric_with_clipping}, all int4 per-block. The block_size=32, symmetric corner matches QuantizerConfig.presets.w4_per_block(block_size=32); the rest swap block_size= and qscheme=.
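The two quantization grids above can be enumerated mechanically before any config objects are built. In this sketch the dictionaries are plain descriptions; their keys are hypothetical labels, not `coreai_opt` config fields, and turning each one into a real config is left to the presets and spec-construction patterns.

```python
from itertools import product

QSCHEMES = ("symmetric", "asymmetric", "symmetric_with_clipping")

# Per-channel grid: {int8, int4} x three qschemes.
per_channel = [
    {"dtype": dtype, "qscheme": qscheme, "granularity": "per_channel"}
    for dtype, qscheme in product(("int8", "int4"), QSCHEMES)
]

# Per-block grid: int4 only, {16, 32, 128} x three qschemes.
per_block = [
    {"dtype": "int4", "qscheme": qscheme, "granularity": "per_block", "block_size": bs}
    for bs, qscheme in product((16, 32, 128), QSCHEMES)
]

print(len(per_channel), len(per_block))  # 6 and 9, i.e. 15 quantization configs
```

Listing the grid up front also gives the denominator for progress reporting (completed/total per group).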
Scale overhead reminder: per-block stores one fp16 scale per block. At block_size=16 with int4, effective bitwidth is ~5 bits/weight — account for this in compute_average_bitwidth (it already does).
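The overhead figure follows from amortizing one fp16 scale over each block. A minimal sketch, assuming scales are the only per-block metadata (a stored zero-point for asymmetric schemes would add more and is not modeled here); `effective_bits` is a hypothetical name, not the `compute_average_bitwidth` helper.

```python
def effective_bits(weight_bits, block_size, scale_bits=16):
    """Bits per weight once one scale of `scale_bits` is shared by `block_size` weights."""
    return weight_bits + scale_bits / block_size

for block_size in (16, 32, 128):
    bits = effective_bits(4, block_size)
    # Compression ratio relative to an fp16 baseline (16 bits per weight).
    print(block_size, bits, 16 / bits)
```

For int4 this gives 5.0, 4.5, and 4.125 bits per weight at block sizes 16, 32, and 128, which is why smaller blocks buy accuracy at a visible size cost.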
Cross-product of {(8-bit, per-tensor), (6-bit, per-tensor), (6-bit, gs=4|8|16), (4-bit, gs=4|8|16)} × {enable_per_channel_scale: True, False} minus the one undefined entry (8-bit per-tensor with enable_per_channel_scale=True is sometimes folded into the 8-bit per-tensor row — keep both for completeness, totaling 15). The (8-bit, per-tensor, False) corner matches KMeansPalettizerConfig.presets.w8(); (6-bit, gs=16, False) matches presets.w6(); (4-bit, gs=16, False) matches presets.w4().
After the main sweep within a group:
- Apply `set_module_name` overrides on top of the seed's preset. Refinement runs inherit any divisibility overrides from the seed; don't rebuild the config from scratch.
- Use `model.named_children()` to enumerate top-level submodules. Boundary layers exist within each submodule; the "first/last layer" skip should consider entry/exit projections of each major child, not only the outermost first/last of the whole model.

Use a JSON structure to track all the details of the experiment. We want to track the following:
{
  "group": "2",
  "config": {
    "name": "palette_grouped_gs4_6bit_pcs0_skip-Embedding",
    "path": "path/to/config"
  },
  "time_taken": 1000,
  "output_quality_metrics": [
    {"name": "bbox", "metric": "iou", "value": 0.7},
    {"name": "logits", "metric": "psnr", "value": 16}
  ],
  "compression_metrics": {
    "average_bitwidth": 5,
    "compression_ratio": 1.7,
    "theoretical_model_size": 402
  }
}
After all sweeps complete, the JSONL holds 40-50 records — too many to surface to the user verbatim. For each group, pick exactly 5 configs that span the accuracy-vs-size tradeoff and put only those in the report. Concretely, after filtering out configs that errored or fell below the floor (PSNR < 10 dB / IoU < 0.1):
- Configs whose (quality, ratio) points are furthest from the line connecting (1) and (2). If two configs have nearly identical (quality, ratio), prefer the one with the simpler config (fewer overrides, larger block/group size).

The goal is that a reader scanning the table can see the shape of the tradeoff in one glance: not 30 indistinguishable rows, but 5 anchors covering the frontier from "barely compressed, near-perfect quality" to "maximum compression, quality at the floor". If a group has fewer than 5 survivors, surface them all and note the count.
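The furthest-from-the-line selection can be sketched with a point-to-line distance. This sketch makes two assumptions that the text does not confirm: that (1) is the highest-quality survivor and (2) the highest-compression survivor, and that both axes are min-max normalized so that dB and ratio units are comparable. The function name and the sample points are illustrative placeholders, not results.

```python
import math

def pick_anchors(points, k=5):
    """points: list of (name, quality, ratio). Returns up to k config names."""
    if len(points) <= k:
        return [name for name, _, _ in points]
    qs = [q for _, q, _ in points]
    rs = [r for _, _, r in points]
    q_span = (max(qs) - min(qs)) or 1.0
    r_span = (max(rs) - min(rs)) or 1.0
    norm = {name: ((q - min(qs)) / q_span, (r - min(rs)) / r_span)
            for name, q, r in points}
    first = max(points, key=lambda p: p[1])[0]   # assumed anchor (1): best quality
    second = max(points, key=lambda p: p[2])[0]  # assumed anchor (2): best ratio
    (x1, y1), (x2, y2) = norm[first], norm[second]
    length = math.hypot(x2 - x1, y2 - y1)

    def distance(name):
        x, y = norm[name]
        if length == 0:  # both anchors coincide: fall back to point distance
            return math.hypot(x - x1, y - y1)
        return abs((x2 - x1) * (y1 - y) - (x1 - x) * (y2 - y1)) / length

    chosen = [first] if first == second else [first, second]
    rest = sorted((n for n in norm if n not in chosen), key=distance, reverse=True)
    return chosen + rest[: k - len(chosen)]

# Placeholder (name, PSNR dB, compression ratio) points for illustration only.
sample = [("a", 50, 1.0), ("b", 20, 8.0), ("c", 44, 6.6), ("d", 35, 4.5),
          ("e", 29, 2.4), ("f", 41, 5.2), ("g", 38, 3.8)]
print(pick_anchors(sample))
```

Points that sit on the straight line between the two anchors add no information about the curve's shape, which is why the configs furthest from it are the ones worth showing. The tie-break toward simpler configs is not modeled here.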
Produce one report per group with these columns:
| Config | PSNR (dB) | Avg Bitwidth | Compression Ratio |
|---|---|---|---|
Refer to references/output_report.md for more details on output formatting. Following this format consistently keeps the qualitative picture comparable across runs; consumers grep these tables to compare model variants.
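Because downstream consumers grep these tables, generating them from the records rather than typing them by hand keeps the format stable. A minimal sketch, assuming each selected record has already been flattened into the four column values; the dictionary keys, the number formatting, and the sample row are hypothetical, and references/output_report.md stays the authority on the format.

```python
def render_report(rows):
    """Render selected configs as a Markdown table with the four report columns."""
    lines = [
        "| Config | PSNR (dB) | Avg Bitwidth | Compression Ratio |",
        "|---|---|---|---|",
    ]
    for row in rows:
        lines.append(
            f"| {row['config']} | {row['psnr']:.1f} | "
            f"{row['avg_bitwidth']:.2f} | {row['compression_ratio']:.2f} |"
        )
    return "\n".join(lines)

# Placeholder row for illustration only; not a measured result.
example_rows = [
    {"config": "w8_per_channel", "psnr": 42.0, "avg_bitwidth": 8.0, "compression_ratio": 2.0},
]
print(render_report(example_rows))
```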
Generate a PSNR-vs-compression-ratio scatter plot (matplotlib) with annotated config names. Save as compression_exploration.png and include the link in the report.
The full sweep costs roughly (30 main-sweep configs + 30 refinement configs) × per-config time. Launch one subagent per group (1a, 1b, 2) so they run in parallel.
Each agent appends to a shared results.jsonl file (one JSON record per line). JSONL append is safe in practice when each agent writes one complete line at a time — use a flush after each write. The main agent uses /loop 5m (a slash command in this repo's plugin set that re-runs a prompt on a schedule) to read results.jsonl and report per-group completed/total to the user. Long sweeps look hung without progress signals; surface counts so the user can ctrl-C if something is clearly broken.
See references/experiment_runner.md for the append_record() and status_snapshot() helpers.
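To show the write-one-line-then-flush pattern and the per-group count, here is a standard-library sketch. It is not the `append_record()` or `status_snapshot()` implementation from references/experiment_runner.md; the names `append_jsonl` and `completed_by_group` are hypothetical, and skipping unparsable lines is an assumption about how to tolerate a record that is still being written.

```python
import json
import os
from collections import Counter

def append_jsonl(path, record):
    """Append one record as a single complete line, flushed to disk."""
    line = json.dumps(record) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())

def completed_by_group(path):
    """Count finished records per group, skipping blank or partial lines."""
    counts = Counter()
    if not os.path.exists(path):
        return counts
    with open(path, encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw:
                continue
            try:
                counts[json.loads(raw)["group"]] += 1
            except (json.JSONDecodeError, KeyError):
                continue  # tolerate a record that is mid-write or malformed
    return counts
```

A periodic status check can then report `completed_by_group("results.jsonl")` against the known totals for each group.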
Read references/compression_patterns.md for the full list. Critical ones:
- `check_divisibility()` and override with per-channel.
- `execution_mode=ExecutionMode.EAGER` for the whole sweep. Always pass `quantizer=` to `extract_layer_specs(...)` so it works in either mode.
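The divisibility pattern can be illustrated on weight shapes alone. This sketch is not the `check_divisibility()` helper: the function name is hypothetical, and it assumes blocks run along axis 1 of each weight (the input-channel axis for a typical linear layer), which should be verified against the actual helper before relying on it.

```python
def layers_needing_per_channel(weight_shapes, block_size, axis=1):
    """Return layer names whose blocked axis is not a multiple of block_size.

    weight_shapes: mapping of layer name -> weight shape tuple.
    axis: assumed blocked axis (input channels); an assumption, not a confirmed detail.
    """
    return [
        name
        for name, shape in weight_shapes.items()
        if shape[axis] % block_size != 0
    ]

# Hypothetical layer shapes for illustration.
shapes = {"proj_in": (256, 100), "mlp.fc1": (1024, 256), "head": (10, 48)}
print(layers_needing_per_channel(shapes, block_size=32))
```

Layers returned by a check like this are the ones to switch to per-channel in the config, so that a single indivisible layer does not fail the whole run.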