Use when deploying custom ML models on-device, converting PyTorch models, compressing models, implementing LLM inference, or optimizing CoreML performance. Covers model conversion, compression, stateful models, KV-cache, multi-function models, MLTensor.
/plugin marketplace add CharlesWiltgen/Axiom
/plugin install axiom@axiom-marketplace

This skill inherits all available tools. When active, it can use any tool Claude has access to.
CoreML enables on-device machine learning inference across all Apple platforms. It abstracts hardware details while leveraging Apple Silicon's CPU, GPU, and Neural Engine for high-performance, private, and efficient execution.
Key principle: Start with the simplest approach, then optimize based on profiling. Don't over-engineer compression or caching until you have real performance data.
Need on-device ML?
├─ Text generation (LLM)?
│ ├─ Simple prompts, structured output? → Foundation Models (ios-ai skill)
│ └─ Custom model, fine-tuned, specific architecture? → CoreML
├─ Custom trained model?
│ └─ Yes → CoreML
├─ Image/audio/sensor processing?
│ └─ Yes → CoreML
└─ Apple's built-in intelligence?
└─ Yes → Foundation Models (ios-ai skill)
Use this skill when you see: PyTorch models that need to run on-device, model compression or quantization requests, on-device LLM inference (KV-cache, stateful models), multi-function models, or CoreML performance problems.
The standard PyTorch → CoreML workflow.
import coremltools as ct
import torch
# Trace the model
model.eval()
traced_model = torch.jit.trace(model, example_input)
# Convert to CoreML
mlmodel = ct.convert(
traced_model,
inputs=[ct.TensorType(shape=example_input.shape)],
minimum_deployment_target=ct.target.iOS18
)
# Save
mlmodel.save("MyModel.mlpackage")
Critical: Always set minimum_deployment_target to enable latest optimizations.
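Before compressing anything, it is worth sanity-checking the converted model. A minimal sketch, reusing mlmodel and example_input from the snippet above; it reads the input name from the model spec rather than assuming one, and prediction through coremltools only runs on macOS:

import numpy as np

# Read the model's input name from its spec instead of guessing it.
spec = mlmodel.get_spec()
input_name = spec.description.input[0].name

# Run one prediction with random data of the original example shape (macOS only).
sample = np.random.rand(*example_input.shape).astype(np.float32)
output = mlmodel.predict({input_name: sample})
print(output.keys())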
Three techniques, each with different tradeoffs:
Clusters weights into lookup tables. Use per-grouped-channel for better accuracy.
from coremltools.optimize.coreml import (
OpPalettizerConfig,
OptimizationConfig,
palettize_weights
)
# 4-bit with grouped channels (iOS 18+)
op_config = OpPalettizerConfig(
mode="kmeans",
nbits=4,
granularity="per_grouped_channel",
group_size=16
)
config = OptimizationConfig(global_config=op_config)
compressed_model = palettize_weights(model, config)
| Bits | Compression | Accuracy Impact |
|---|---|---|
| 8-bit | 2x | Minimal |
| 6-bit | 2.7x | Low |
| 4-bit | 4x | Moderate (use grouped channels) |
| 2-bit | 8x | High (requires training-time) |
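To verify the ratio you actually get, compare on-disk sizes. A small sketch assuming the Float16 baseline was saved as MyModel.mlpackage in the conversion step and the palettized model (compressed_model above) is saved alongside it; an .mlpackage is a directory, so its file sizes are summed:

from pathlib import Path

compressed_model.save("MyModel_4bit.mlpackage")

def package_size_mb(path):
    # An .mlpackage is a directory; total up the files inside it.
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1e6

print(package_size_mb("MyModel.mlpackage"))       # Float16 baseline
print(package_size_mb("MyModel_4bit.mlpackage"))  # after 4-bit palettization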
Linear mapping to INT8/INT4. Use per-block for better accuracy.
from coremltools.optimize.coreml import (
OpLinearQuantizerConfig,
OptimizationConfig,
linear_quantize_weights
)
# INT4 per-block quantization (iOS 18+)
op_config = OpLinearQuantizerConfig(
mode="linear",
dtype="int4",
granularity="per_block",
block_size=32
)
config = OptimizationConfig(global_config=op_config)
compressed_model = linear_quantize_weights(model, config)
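If a single layer dominates the accuracy loss, the global config can be overridden per op instead of loosening everything. A sketch assuming OptimizationConfig's op_name_configs override; the op name "lm_head_weight" is hypothetical, so look up real op names in your converted program:

from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights
)

# int4 everywhere, but keep one sensitive op at int8.
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear", dtype="int4", granularity="per_block", block_size=32
    ),
    op_name_configs={
        "lm_head_weight": OpLinearQuantizerConfig(mode="linear", dtype="int8")
    }
)
compressed_model = linear_quantize_weights(model, config)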
Sets weights to zero for sparse representation. Can combine with palettization.
from coremltools.optimize.coreml import (
OpMagnitudePrunerConfig,
OptimizationConfig,
prune_weights
)
op_config = OpMagnitudePrunerConfig(
target_sparsity=0.4 # 40% zeros
)
config = OptimizationConfig(global_config=op_config)
sparse_model = prune_weights(model, config)
When post-training compression loses too much accuracy, fine-tune with compression.
from coremltools.optimize.torch.palettization import (
DKMPalettizerConfig,
DKMPalettizer
)
# Configure 4-bit palettization
config = DKMPalettizerConfig(global_config={"n_bits": 4})
# Prepare model
palettizer = DKMPalettizer(model, config)
prepared_model = palettizer.prepare()
# Fine-tune (your training loop)
for epoch in range(num_epochs):
train_epoch(prepared_model, data_loader)
palettizer.step()
# Finalize
final_model = palettizer.finalize()
Tradeoff: Better accuracy than post-training, but requires training data and time.
Middle ground: uses calibration data without full training.
from coremltools.optimize.torch.pruning import (
MagnitudePrunerConfig,
LayerwiseCompressor
)
# Configure
config = MagnitudePrunerConfig(
target_sparsity=0.4,
n_samples=128 # Calibration samples
)
# Create pruner
pruner = LayerwiseCompressor(model, config)
# Calibrate
sparse_model = pruner.compress(calibration_data_loader)
For transformer models, use state to avoid recomputing key/value vectors.
import torch
import torch.nn as nn

class StatefulLLM(nn.Module):
def __init__(self):
super().__init__()
# Register state buffers
self.register_buffer("keyCache", torch.zeros(batch, heads, seq_len, dim))
self.register_buffer("valueCache", torch.zeros(batch, heads, seq_len, dim))
def forward(self, input_ids, causal_mask):
# Update caches in-place during forward
# ... attention with KV-cache ...
return logits
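The in-place writes to those registered buffers are what coremltools turns into state reads and writes. A minimal sketch of that update as a standalone helper, with hypothetical names and a pos argument for the current token position (not the skill's actual attention code):

import torch

def update_kv_cache(key_cache, value_cache, new_k, new_v, pos):
    # Write the new key/value slices into the pre-allocated buffers in place.
    t = new_k.shape[2]
    key_cache[:, :, pos:pos + t, :] = new_k
    value_cache[:, :, pos:pos + t, :] = new_v
    # Attention then reads everything cached up to the current position.
    return key_cache[:, :, :pos + t, :], value_cache[:, :, :pos + t, :]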
import coremltools as ct
mlmodel = ct.convert(
traced_model,
inputs=[
ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 2048))),
ct.TensorType(name="causal_mask", shape=(1, 1, ct.RangeDim(1, 2048), ct.RangeDim(1, 2048)))
],
states=[
ct.StateType(name="keyCache", ...),
ct.StateType(name="valueCache", ...)
],
minimum_deployment_target=ct.target.iOS18
)
// Create state from model
let state = model.makeState()
// Run prediction with state (updated in-place)
let output = try model.prediction(from: input, using: state)
Performance: 1.6x speedup on Mistral-7B (M3 Max) compared to manual KV-cache I/O.
Deploy multiple adapters in a single model, sharing base weights.
from coremltools.utils import MultiFunctionDescriptor, save_multifunction
# Convert individual models
sticker_model = ct.convert(sticker_adapter_model, ...)
storybook_model = ct.convert(storybook_adapter_model, ...)
# Save individually
sticker_model.save("sticker.mlpackage")
storybook_model.save("storybook.mlpackage")
# Merge with shared weights
desc = MultiFunctionDescriptor()
desc.add_function("sticker", "sticker.mlpackage")
desc.add_function("storybook", "storybook.mlpackage")
save_multifunction(desc, "MultiAdapter.mlpackage")
let config = MLModelConfiguration()
config.functionName = "sticker" // or "storybook"
let model = try MLModel(contentsOf: modelURL, configuration: config)
Simplifies the computation that runs between model executions (decoding, post-processing).
import CoreML
// Create tensors
let scores = MLTensor(shape: [1, vocabSize], scalars: logits)
// Operations (executed asynchronously on Apple Silicon)
let topK = scores.topK(k: 10)
let probs = (topK.values / temperature).softmax()
// Sample from distribution
let sampled = probs.multinomial(numSamples: 1)
// Materialize to access data (blocks until complete)
let shapedArray = await sampled.shapedArray(of: Int32.self)
Key insight: MLTensor operations are async. Call shapedArray() to materialize results.
Thread-safe concurrent predictions for throughput.
import CoreML
import CoreGraphics

class ImageProcessor {
let model: MLModel
func processImages(_ images: [CGImage]) async throws -> [Output] {
try await withThrowingTaskGroup(of: Output.self) { group in
for image in images {
group.addTask {
// Check cancellation before expensive work
try Task.checkCancellation()
let input = try self.prepareInput(image)
// Async prediction - thread safe!
return try await self.model.prediction(from: input)
}
}
return try await group.reduce(into: []) { $0.append($1) }
}
}
}
Warning: Limit concurrent predictions to avoid memory pressure from multiple input/output buffers.
// Limit concurrency
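// AsyncSemaphore is not a standard library type - assume a custom or package-provided
// async-aware semaphore (DispatchSemaphore.wait() would block the thread).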
let semaphore = AsyncSemaphore(value: 2)
for image in images {
group.addTask {
await semaphore.wait()
defer { semaphore.signal() }
return try await process(image)
}
}
// BAD - blocks UI
class AppDelegate {
let model = try! MLModel(contentsOf: url) // Blocks!
}
// GOOD - lazy async loading
class ModelManager {
private var model: MLModel?
func getModel() async throws -> MLModel {
if let model { return model }
model = try await Task.detached {
try MLModel(contentsOf: url)
}.value
return model!
}
}
// BAD - reloads every time
func predict(_ input: Input) throws -> Output {
let model = try MLModel(contentsOf: url) // Expensive!
return try model.prediction(from: input)
}
// GOOD - keep model loaded
class Predictor {
private let model: MLModel
func predict(_ input: Input) throws -> Output {
try model.prediction(from: input)
}
}
# BAD - blind compression
compressed = palettize_weights(model, two_bit_config)  # May break accuracy!
# GOOD - profile, then compress iteratively
# 1. Profile the Float16 baseline
# 2. Try 8-bit → check accuracy
# 3. Try 6-bit → check accuracy
# 4. Try 4-bit with grouped channels → check accuracy
# 5. Only use 2-bit with training-time compression
# BAD - misses optimizations
mlmodel = ct.convert(traced_model, inputs=[...])
# GOOD - enables SDPA fusion, per-block quantization, etc.
mlmodel = ct.convert(
traced_model,
inputs=[...],
minimum_deployment_target=ct.target.iOS18
)
Wrong approach: Jump straight to 2-bit palettization.
Right approach:
1. Profile the Float16 baseline
2. Try 8-bit, then 6-bit → check accuracy
3. Try 4-bit with per_grouped_channel → check accuracy
4. Only use 2-bit with training-time compression

Wrong approach: Try different compute units randomly.
Right approach: Profile first, then set MLModelConfiguration.computeUnits based on real performance data.
Wrong approach: Ship separate models for each adapter.
Right approach:
- Merge the adapters with MultiFunctionDescriptor so they share base weights
- Select the function at runtime with config.functionName

Before deploying a CoreML model:
- Set minimum_deployment_target to the latest iOS you can support

WWDC: 2023-10047, 2023-10049, 2024-10159, 2024-10161
Docs: /coreml, /coreml/mlmodel, /coreml/mltensor
Skills: coreml-ref, coreml-diag, axiom-ios-ai (Foundation Models)
This skill should be used when the user asks to "create an agent", "add an agent", "write a subagent", "agent frontmatter", "when to use description", "agent examples", "agent tools", "agent colors", "autonomous agent", or needs guidance on agent structure, system prompts, triggering conditions, or agent development best practices for Claude Code plugins.
This skill should be used when the user asks to "create a slash command", "add a command", "write a custom command", "define command arguments", "use command frontmatter", "organize commands", "create command with file references", "interactive command", "use AskUserQuestion in command", or needs guidance on slash command structure, YAML frontmatter fields, dynamic arguments, bash execution in commands, user interaction patterns, or command development best practices for Claude Code.
This skill should be used when the user asks to "create a hook", "add a PreToolUse/PostToolUse/Stop hook", "validate tool use", "implement prompt-based hooks", "use ${CLAUDE_PLUGIN_ROOT}", "set up event-driven automation", "block dangerous commands", or mentions hook events (PreToolUse, PostToolUse, Stop, SubagentStop, SessionStart, SessionEnd, UserPromptSubmit, PreCompact, Notification). Provides comprehensive guidance for creating and implementing Claude Code plugin hooks with focus on advanced prompt-based hooks API.