From coreai-skills
Provides empirical rules for authoring PyTorch models targeting on-device execution on Apple platforms (Neural Engine, GPU). Covers op compatibility, BC1S layout, KV cache patterns, correctness testing via PSNR, and common debugging issues.
npx claudepluginhub apple/coreai-models --plugin coreai-skills
How this skill is triggered (by the user, by Claude, or both):
Slash command
/coreai-skills:model-authoring
The summary Claude sees in its skill listing, used to decide when to auto-load this skill:
This skill contains the hard-won empirical knowledge for making PyTorch models compile and run correctly on Apple hardware via Core AI. The rules here are stable across Core AI releases — they reflect hardware behavior, not API shapes.
Use these resources on-demand — do not read all files upfront. Consult the relevant reference when the user's task requires specific patterns for a target platform, or when debugging.
| Resource | When to consult |
|---|---|
| neural_engine_rules.md | Neural Engine patterns: BC1S layout, Conv2d projections, per-head attention, KV cache readonly pattern, stride/dilation/pooling rules, causal mask, RoPE, chunked prefill |
| gpu_rules.md | GPU patterns: fused QKV, native SDPA, KV cache stateful pattern, MoE (GatherMM/SwitchLinear), memory-efficient loading, RMSNorm variants |
| common_issues.md | Debugging: PSNR issues, compilation errors, runtime problems, stale flags |
| coreai-models repo | Complete working reference implementations for LLMs, vision, audio, diffusion. Explore primitives/ and models/ directories. |
For complex models (LLMs, MoE, multimodal, diffusion), explore the coreai-models repo before writing primitives from scratch. It has complete authoring primitives for both GPU and Neural Engine, including advanced patterns like iOS embedding quantization, MoE routing, and memory-efficient weight loading for large models. If the user has a local clone, explore it directly. If not, suggest cloning it.
Online docs: coreai-torch composite ops | externalization | composite ops API
Model optimization decisions (precision, compression, device compatibility) are resolved by the working-with-coreai skill.
Use Skill("coreai-skills:working-with-coreai") before authoring.

| User talks about… | Likely compute unit | Why |
|---|---|---|
| Energy efficiency, battery life, iOS, iPhone, iPad, always-on | Neural Engine | Most energy-efficient compute unit |
| Max performance, throughput, macOS, large batches, flexibility | GPU | GPU excels at throughput and flexible workloads |
| Correctness testing, debugging, reference implementation | CPU | CPU runs everything, good for validation |
If the user explicitly names an accelerator (Neural Engine, GPU, CPU), use their choice. Otherwise, infer from context and use outcome-oriented language in your responses — say "optimized for energy-efficient inference on iPhone" rather than "targets Neural Engine". Mirror the user's vocabulary: if they say Neural Engine, match them.
| Compute unit | Strengths | Key authoring constraint |
|---|---|---|
| Neural Engine | Energy-efficient, battery-friendly, static workloads | BC1S layout, fp16 only, static shapes, limited op set |
| GPU | High throughput, large models, flexible ops | Standard PyTorch layout, supports fp32 |
| CPU | Small models, low overhead, low latency, correctness testing, fallback | Runs all ops, good for validation |
Quick reference for the key authoring differences. Consult neural_engine_rules.md or gpu_rules.md for full details.
| Aspect | Neural Engine | GPU |
|---|---|---|
| Tensor layout | BC1S (B, H*D, 1, S) | Standard (B, S, D) |
| Projections | nn.Conv2d(kernel_size=1) | nn.Linear (fused QKV on GPU) |
| Embedding shape | (V, 1, D) — externalized | Standard nn.Embedding |
| Attention | Per-head sequential | Fused native SDPA |
| Float precision | fp16 only — no fp32 literals anywhere | fp16 weights, fp32 intermediates OK |
| Shapes | Fully static | Dynamic shapes supported |
| Weight conversion | unsqueeze(-1).unsqueeze(-1) for Conv2d | No reshape needed |
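The layout, projection, and weight-conversion rows can be checked together in plain PyTorch. The sketch below is a minimal illustration, not code from the coreai-models repo: the helper names `linear_to_conv2d`, `to_bc1s`, and `from_bc1s` are hypothetical, and it assumes a single hidden dimension D stands in for H*D.

```python
import torch
import torch.nn as nn


def linear_to_conv2d(linear: nn.Linear) -> nn.Conv2d:
    """Build a 1x1 Conv2d computing the same projection as `linear` (hypothetical helper)."""
    conv = nn.Conv2d(
        linear.in_features,
        linear.out_features,
        kernel_size=1,
        bias=linear.bias is not None,
    )
    with torch.no_grad():
        # (out, in) -> (out, in, 1, 1), as in the "Weight conversion" row.
        conv.weight.copy_(linear.weight.unsqueeze(-1).unsqueeze(-1))
        if linear.bias is not None:
            conv.bias.copy_(linear.bias)
    return conv


def to_bc1s(x: torch.Tensor) -> torch.Tensor:
    """Standard (B, S, D) -> BC1S (B, D, 1, S)."""
    return x.transpose(1, 2).unsqueeze(2)


def from_bc1s(x: torch.Tensor) -> torch.Tensor:
    """BC1S (B, D, 1, S) -> standard (B, S, D)."""
    return x.squeeze(2).transpose(1, 2)


torch.manual_seed(0)
B, S, D_in, D_out = 2, 5, 8, 12
linear = nn.Linear(D_in, D_out)
conv = linear_to_conv2d(linear)

x = torch.randn(B, S, D_in)
y_gpu_layout = linear(x)            # (B, S, D_out)
y_ne_layout = conv(to_bc1s(x))      # (B, D_out, 1, S)

# Both layouts should agree up to float rounding.
max_err = (from_bc1s(y_ne_layout) - y_gpu_layout).abs().max().item()
print(y_ne_layout.shape, max_err)
```

The comparison runs in fp32 so that any mismatch points at the layout transformation and not at precision.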
Run code, don't read code. Running gives ground truth instantly.
- register_forward_hook: capture intermediates

Author in this order; each depends on the previous:
| Comparison | Threshold | Meaning |
|---|---|---|
| Re-authored vs source (torch) | > 70 dB | Implementation correct |
| Neural Engine layout vs GPU layout (torch) | > 70 dB | Layout transformation correct |
| Compiled vs torch | >= 40 dB | Compilation precision (fp16 + optimizations) |
| After 4-bit palettization | >= 35 dB | Compression acceptable |
Verify each primitive individually before composing the full model. Also compare the full re-authored model's outputs against a baseline export (direct from HuggingFace without re-authoring) — both in Python and after compilation on device — to confirm end-to-end parity.
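The thresholds above need a concrete PSNR definition to be comparable across runs. This skill does not spell one out here, so the sketch below assumes the peak is the largest absolute value in the reference tensor. The function name `psnr_db` and the dictionary keys are hypothetical.

```python
import math
import torch


def psnr_db(reference: torch.Tensor, candidate: torch.Tensor) -> float:
    """PSNR in dB, using the reference's peak magnitude (an assumed convention)."""
    ref = reference.detach().double().flatten()
    cand = candidate.detach().double().flatten()
    mse = torch.mean((ref - cand) ** 2).item()
    if mse == 0.0:
        return math.inf             # identical tensors
    peak = ref.abs().max().item()
    if peak == 0.0:
        return -math.inf            # all-zero reference with nonzero error
    return 10.0 * math.log10(peak ** 2 / mse)


# Thresholds copied from the table above; the keys are hypothetical names.
MIN_PSNR_DB = {
    "reauthored_vs_source": 70.0,
    "ne_layout_vs_gpu_layout": 70.0,
    "compiled_vs_torch": 40.0,
    "after_4bit_palettization": 35.0,
}

torch.manual_seed(0)
reference = torch.randn(4, 64)
fp16_roundtrip = reference.half().float()               # mimics fp16 storage error
noisy = reference + 0.1 * torch.randn_like(reference)   # a deliberately bad candidate

psnr_fp16 = psnr_db(reference, fp16_roundtrip)
psnr_noisy = psnr_db(reference, noisy)
print(f"fp16 round trip: {psnr_fp16:.1f} dB")
print(f"noisy copy:      {psnr_noisy:.1f} dB")
print("passes compiled-vs-torch gate:", psnr_fp16 >= MIN_PSNR_DB["compiled_vs_torch"])
```

Running the same function on each primitive's output, and then on the full model, keeps the per-stage numbers directly comparable.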
from_source_model classmethod
Every re-authored model gets a factory classmethod, with no hardcoded constants:
```python
@classmethod
def from_source_model(cls, source_model) -> "MyDecoder":
    # Read every hyperparameter from the source config; nothing is hardcoded.
    cfg = source_model.config
    model = cls(
        n_layers=cfg.num_hidden_layers,
        hidden=cfg.hidden_size,
        n_heads=cfg.num_attention_heads,
        # ...
    )
    model.load_weights_from(source_model.state_dict())
    return model
```
Both Neural Engine and GPU require explicit KV cache management, but the patterns differ:
| Compute unit | Cache shape | Sequence dim | Pattern | Details |
|---|---|---|---|---|
| Neural Engine | [n_layers, B, H_kv*D, 1, max_S] | dim 4 | Readonly functional I/O — model has no cache writes, returns new K/V tokens as outputs | neural_engine_rules.md |
| GPU | [n_layers, B, H_kv, max_S, D] | dim 3 | Stateful export wrapper — register_buffer for KV, hoistToArg at compile | gpu_rules.md |
Key rule: Do not use stateful transforms for token generation — state resets between inference calls. Use the readonly KV I/O pattern (Neural Engine) or the stateful export wrapper (GPU) instead.
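The readonly KV I/O pattern can be sketched in plain PyTorch. This is a toy stand-in under stated assumptions, not the implementation in neural_engine_rules.md: `ne_decode_step` is a hypothetical function, the K projection and "attention" are placeholders, and only the K cache is shown. What it illustrates is the division of labour: the step function reads the cache and returns the new token's K, and the caller writes that token into the cache along the sequence dimension (dim 4).

```python
import torch

N_LAYERS, B, H_KV, D, MAX_S = 2, 1, 2, 4, 8


def ne_decode_step(x: torch.Tensor, k_cache: torch.Tensor, pos: int):
    """Toy decode step: reads k_cache, never writes it (hypothetical helper).

    x:       (B, H_KV*D, 1, 1) current token in BC1S layout
    k_cache: (N_LAYERS, B, H_KV*D, 1, MAX_S), treated as read-only
    returns: (hidden, new_k) with new_k shaped (N_LAYERS, B, H_KV*D, 1, 1)
    """
    new_k = []
    h = x
    for layer in range(N_LAYERS):
        k_tok = h * 0.5                          # placeholder for the K projection
        past = k_cache[layer, :, :, :, :pos]     # tokens written so far
        # Placeholder for attention over past tokens plus the current one.
        h = h + past.sum(dim=-1, keepdim=True) + k_tok
        new_k.append(k_tok)
    return h, torch.stack(new_k)


# The caller owns the cache and performs every write.
k_cache = torch.zeros(N_LAYERS, B, H_KV * D, 1, MAX_S, dtype=torch.float16)
x = torch.ones(B, H_KV * D, 1, 1, dtype=torch.float16)

for pos in range(3):
    snapshot = k_cache.clone()
    x, new_k = ne_decode_step(x, k_cache, pos)
    assert torch.equal(k_cache, snapshot)        # the step left the cache untouched
    k_cache[:, :, :, :, pos:pos + 1] = new_k     # write along sequence dim 4

print(new_k.shape, k_cache.dtype)
```

For the GPU layout in the table, the equivalent write would slice dim 3 of a `[n_layers, B, H_kv, max_S, D]` buffer instead.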
Apply compression after the authored float16 model passes verification and before Core AI export.
For compression exploration and configuration, use Skill("coreai-skills:model-compression-exploration") which covers coreai-opt quantization and palettization sweeps.
Key facts for authoring:
- ._data, ._lut, ._indices suffixes after compression
- .aimodel, not in the PyTorch checkpoint

| Bits | Size reduction | Typical PSNR | Flag if below |
|---|---|---|---|
| 8-bit | ~2x | > 55 dB | 50 dB |
| 4-bit | ~4x | ~40 dB | 35 dB |
| 2-bit | ~8x | ~25-35 dB | Usually unacceptable |
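To make the size and quality trade-off concrete, the sketch below palettizes a weight tensor into a lookup table plus per-weight indices using plain 1-D k-means. It is an illustration only: the algorithm coreai-opt actually uses is not described here, `palettize` is a hypothetical helper, and the size ratio ignores the small lookup table. It reports weight reconstruction error, which is not the same quantity as the model-output PSNR in the table.

```python
import torch


def palettize(weight: torch.Tensor, bits: int, iters: int = 10):
    """1-D k-means palettization (illustrative): returns (lut, indices)."""
    flat = weight.detach().float().flatten()
    k = 2 ** bits
    # Start the centroids at evenly spaced quantiles of the weights.
    lut = torch.quantile(flat, torch.linspace(0, 1, k))
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = (flat[:, None] - lut[None, :]).abs().argmin(dim=1)
        # Move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = flat[idx == c]
            if members.numel() > 0:
                lut[c] = members.mean()
    idx = (flat[:, None] - lut[None, :]).abs().argmin(dim=1)
    return lut, idx.reshape(weight.shape).to(torch.uint8)


torch.manual_seed(0)
weight = torch.randn(64, 64)

errors = {}
for bits in (8, 4, 2):
    lut, indices = palettize(weight, bits)
    restored = lut[indices.long()]               # uint8 indices must be cast before lookup
    errors[bits] = torch.mean((restored - weight) ** 2).item()
    ratio = 16 / bits                            # versus fp16 storage, ignoring the LUT
    print(f"{bits}-bit: ~{ratio:.0f}x smaller, weight MSE {errors[bits]:.2e}")
```

Fewer bits mean fewer lookup-table entries and a larger reconstruction error, which is why the acceptable PSNR floor drops as the bit width shrinks.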