Use when deploying custom ML models on-device, converting PyTorch models, compressing models, implementing LLM inference, or optimizing CoreML performance. Covers model conversion, compression, stateful models, KV-cache, multi-function models, MLTensor.
/plugin marketplace add CharlesWiltgen/Axiom
/plugin install axiom@axiom-marketplace

This skill inherits all available tools. When active, it can use any tool Claude has access to.
CoreML enables on-device machine learning inference across all Apple platforms. It abstracts hardware details while leveraging Apple Silicon's CPU, GPU, and Neural Engine for high-performance, private, and efficient execution.
Key principle: Start with the simplest approach, then optimize based on profiling. Don't over-engineer compression or caching until you have real performance data.
Need on-device ML?
├─ Text generation (LLM)?
│ ├─ Simple prompts, structured output? → Foundation Models (ios-ai skill)
│ └─ Custom model, fine-tuned, specific architecture? → CoreML
├─ Custom trained model?
│ └─ Yes → CoreML
├─ Image/audio/sensor processing?
│ └─ Yes → CoreML
└─ Apple's built-in intelligence?
└─ Yes → Foundation Models (ios-ai skill)
Use this skill when you see: PyTorch models that need to run on-device, model compression or quantization requests, on-device LLM inference (KV-cache, stateful models), multi-function models, or CoreML performance problems.
The standard PyTorch → CoreML workflow.
import coremltools as ct
import torch
# Trace the model
model.eval()
traced_model = torch.jit.trace(model, example_input)
# Convert to CoreML
mlmodel = ct.convert(
traced_model,
inputs=[ct.TensorType(shape=example_input.shape)],
minimum_deployment_target=ct.target.iOS18
)
# Save
mlmodel.save("MyModel.mlpackage")
Critical: Always set minimum_deployment_target to enable latest optimizations.
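Before compressing anything, it is worth sanity-checking the converted model. A minimal sketch, reusing mlmodel and example_input from the snippet above; it reads the input name from the model spec rather than assuming one, and prediction through coremltools only runs on macOS:

import numpy as np

# Read the model's input name from its spec instead of guessing it.
spec = mlmodel.get_spec()
input_name = spec.description.input[0].name

# Run one prediction with random data of the original example shape (macOS only).
sample = np.random.rand(*example_input.shape).astype(np.float32)
output = mlmodel.predict({input_name: sample})
print(output.keys())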
Three techniques, each with different tradeoffs:
Clusters weights into lookup tables. Use per-grouped-channel for better accuracy.
from coremltools.optimize.coreml import (
OpPalettizerConfig,
OptimizationConfig,
palettize_weights
)
# 4-bit with grouped channels (iOS 18+)
op_config = OpPalettizerConfig(
mode="kmeans",
nbits=4,
granularity="per_grouped_channel",
group_size=16
)
config = OptimizationConfig(global_config=op_config)
compressed_model = palettize_weights(model, config)
| Bits | Compression | Accuracy Impact |
|---|---|---|
| 8-bit | 2x | Minimal |
| 6-bit | 2.7x | Low |
| 4-bit | 4x | Moderate (use grouped channels) |
| 2-bit | 8x | High (requires training-time) |
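To verify the ratio you actually get, compare on-disk sizes. A small sketch assuming the Float16 baseline was saved as MyModel.mlpackage in the conversion step and the palettized model (compressed_model above) is saved alongside it; an .mlpackage is a directory, so its file sizes are summed:

from pathlib import Path

compressed_model.save("MyModel_4bit.mlpackage")

def package_size_mb(path):
    # An .mlpackage is a directory; total up the files inside it.
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1e6

print(package_size_mb("MyModel.mlpackage"))       # Float16 baseline
print(package_size_mb("MyModel_4bit.mlpackage"))  # after 4-bit palettization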
Linear mapping to INT8/INT4. Use per-block for better accuracy.
from coremltools.optimize.coreml import (
OpLinearQuantizerConfig,
OptimizationConfig,
linear_quantize_weights
)
# INT4 per-block quantization (iOS 18+)
op_config = OpLinearQuantizerConfig(
mode="linear",
dtype="int4",
granularity="per_block",
block_size=32
)
config = OptimizationConfig(global_config=op_config)
compressed_model = linear_quantize_weights(model, config)
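If a single layer dominates the accuracy loss, the global config can be overridden per op instead of loosening everything. A sketch assuming OptimizationConfig's op_name_configs override; the op name "lm_head_weight" is hypothetical, so look up real op names in your converted program:

from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights
)

# int4 everywhere, but keep one sensitive op at int8.
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear", dtype="int4", granularity="per_block", block_size=32
    ),
    op_name_configs={
        "lm_head_weight": OpLinearQuantizerConfig(mode="linear", dtype="int8")
    }
)
compressed_model = linear_quantize_weights(model, config)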
Sets weights to zero for sparse representation. Can combine with palettization.
from coremltools.optimize.coreml import (
OpMagnitudePrunerConfig,
OptimizationConfig,
prune_weights
)
op_config = OpMagnitudePrunerConfig(
target_sparsity=0.4 # 40% zeros
)
config = OptimizationConfig(global_config=op_config)
sparse_model = prune_weights(model, config)
When post-training compression loses too much accuracy, fine-tune with compression.
from coremltools.optimize.torch.palettization import (
DKMPalettizerConfig,
DKMPalettizer
)
# Configure 4-bit palettization
config = DKMPalettizerConfig(global_config={"n_bits": 4})
# Prepare model
palettizer = DKMPalettizer(model, config)
prepared_model = palettizer.prepare()
# Fine-tune (your training loop)
for epoch in range(num_epochs):
train_epoch(prepared_model, data_loader)
palettizer.step()
# Finalize
final_model = palettizer.finalize()
Tradeoff: Better accuracy than post-training, but requires training data and time.
Middle ground: uses calibration data without full training.
from coremltools.optimize.torch.pruning import (
MagnitudePrunerConfig,
LayerwiseCompressor
)
# Configure
config = MagnitudePrunerConfig(
target_sparsity=0.4,
n_samples=128 # Calibration samples
)
# Create pruner
pruner = LayerwiseCompressor(model, config)
# Calibrate
sparse_model = pruner.compress(calibration_data_loader)
For transformer models, use state to avoid recomputing key/value vectors.
import torch
import torch.nn as nn

class StatefulLLM(nn.Module):
def __init__(self):
super().__init__()
# Register state buffers
self.register_buffer("keyCache", torch.zeros(batch, heads, seq_len, dim))
self.register_buffer("valueCache", torch.zeros(batch, heads, seq_len, dim))
def forward(self, input_ids, causal_mask):
# Update caches in-place during forward
# ... attention with KV-cache ...
return logits
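The in-place writes to those registered buffers are what coremltools turns into state reads and writes. A minimal sketch of that update as a standalone helper, with hypothetical names and a pos argument for the current token position (not the skill's actual attention code):

import torch

def update_kv_cache(key_cache, value_cache, new_k, new_v, pos):
    # Write the new key/value slices into the pre-allocated buffers in place.
    t = new_k.shape[2]
    key_cache[:, :, pos:pos + t, :] = new_k
    value_cache[:, :, pos:pos + t, :] = new_v
    # Attention then reads everything cached up to the current position.
    return key_cache[:, :, :pos + t, :], value_cache[:, :, :pos + t, :]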
import coremltools as ct
mlmodel = ct.convert(
traced_model,
inputs=[
ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 2048))),
ct.TensorType(name="causal_mask", shape=(1, 1, ct.RangeDim(1, 2048), ct.RangeDim(1, 2048)))
],
states=[
ct.StateType(name="keyCache", ...),
ct.StateType(name="valueCache", ...)
],
minimum_deployment_target=ct.target.iOS18
)
// Create state from model
let state = model.makeState()
// Run prediction with state (updated in-place)
let output = try model.prediction(from: input, using: state)
Performance: 1.6x speedup on Mistral-7B (M3 Max) compared to manual KV-cache I/O.
Deploy multiple adapters in a single model, sharing base weights.
from coremltools.utils import MultiFunctionDescriptor, save_multifunction
# Convert individual models
sticker_model = ct.convert(sticker_adapter_model, ...)
storybook_model = ct.convert(storybook_adapter_model, ...)
# Save individually
sticker_model.save("sticker.mlpackage")
storybook_model.save("storybook.mlpackage")
# Merge with shared weights
desc = MultiFunctionDescriptor()
desc.add_function("sticker", "sticker.mlpackage")
desc.add_function("storybook", "storybook.mlpackage")
save_multifunction(desc, "MultiAdapter.mlpackage")
let config = MLModelConfiguration()
config.functionName = "sticker" // or "storybook"
let model = try MLModel(contentsOf: modelURL, configuration: config)
Simplifies the computation that runs between model executions (decoding, post-processing).
import CoreML
// Create tensors
let scores = MLTensor(shape: [1, vocabSize], scalars: logits)
// Operations (executed asynchronously on Apple Silicon)
let topK = scores.topK(k: 10)
let probs = (topK.values / temperature).softmax()
// Sample from distribution
let sampled = probs.multinomial(numSamples: 1)
// Materialize to access data (blocks until complete)
let shapedArray = await sampled.shapedArray(of: Int32.self)
Key insight: MLTensor operations are async. Call shapedArray() to materialize results.
Thread-safe concurrent predictions for throughput.
import CoreML
import CoreGraphics

class ImageProcessor {
let model: MLModel
func processImages(_ images: [CGImage]) async throws -> [Output] {
try await withThrowingTaskGroup(of: Output.self) { group in
for image in images {
group.addTask {
// Check cancellation before expensive work
try Task.checkCancellation()
let input = try self.prepareInput(image)
// Async prediction - thread safe!
return try await self.model.prediction(from: input)
}
}
return try await group.reduce(into: []) { $0.append($1) }
}
}
}
Warning: Limit concurrent predictions to avoid memory pressure from multiple input/output buffers.
// Limit concurrency
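// AsyncSemaphore is not a standard library type - assume a custom or package-provided
// async-aware semaphore (DispatchSemaphore.wait() would block the thread).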
let semaphore = AsyncSemaphore(value: 2)
for image in images {
group.addTask {
await semaphore.wait()
defer { semaphore.signal() }
return try await process(image)
}
}
// BAD - blocks UI
class AppDelegate {
let model = try! MLModel(contentsOf: url) // Blocks!
}
// GOOD - lazy async loading
class ModelManager {
private var model: MLModel?
func getModel() async throws -> MLModel {
if let model { return model }
model = try await Task.detached {
try MLModel(contentsOf: url)
}.value
return model!
}
}
// BAD - reloads every time
func predict(_ input: Input) throws -> Output {
let model = try MLModel(contentsOf: url) // Expensive!
return try model.prediction(from: input)
}
// GOOD - keep model loaded
class Predictor {
private let model: MLModel
func predict(_ input: Input) throws -> Output {
try model.prediction(from: input)
}
}
# BAD - blind compression
compressed = palettize_weights(model, two_bit_config)  # May break accuracy!
# GOOD - profile, then compress iteratively
# 1. Profile the Float16 baseline
# 2. Try 8-bit → check accuracy
# 3. Try 6-bit → check accuracy
# 4. Try 4-bit with grouped channels → check accuracy
# 5. Only use 2-bit with training-time compression
# BAD - misses optimizations
mlmodel = ct.convert(traced_model, inputs=[...])
# GOOD - enables SDPA fusion, per-block quantization, etc.
mlmodel = ct.convert(
traced_model,
inputs=[...],
minimum_deployment_target=ct.target.iOS18
)
Wrong approach: Jump straight to 2-bit palettization.
Right approach:
1. Profile the Float16 baseline
2. Try 8-bit, then 6-bit → check accuracy
3. Try 4-bit with per_grouped_channel → check accuracy
4. Only use 2-bit with training-time compression

Wrong approach: Try different compute units randomly.
Right approach: Profile first, then set MLModelConfiguration.computeUnits based on real performance data.
Wrong approach: Ship separate models for each adapter.
Right approach:
- Merge the adapters with MultiFunctionDescriptor so they share base weights
- Select the function at runtime with config.functionName

Before deploying a CoreML model:
- Set minimum_deployment_target to the latest iOS you can support

WWDC: 2023-10047, 2023-10049, 2024-10159, 2024-10161
Docs: /coreml, /coreml/mlmodel, /coreml/mltensor
Skills: coreml-ref, coreml-diag, axiom-ios-ai (Foundation Models)
This skill should be used when the user asks to "create an agent", "add an agent", "write a subagent", "agent frontmatter", "when to use description", "agent examples", "agent tools", "agent colors", "autonomous agent", or needs guidance on agent structure, system prompts, triggering conditions, or agent development best practices for Claude Code plugins.
This skill should be used when the user asks to "create a slash command", "add a command", "write a custom command", "define command arguments", "use command frontmatter", "organize commands", "create command with file references", "interactive command", "use AskUserQuestion in command", or needs guidance on slash command structure, YAML frontmatter fields, dynamic arguments, bash execution in commands, user interaction patterns, or command development best practices for Claude Code.
This skill should be used when the user asks to "create a hook", "add a PreToolUse/PostToolUse/Stop hook", "validate tool use", "implement prompt-based hooks", "use ${CLAUDE_PLUGIN_ROOT}", "set up event-driven automation", "block dangerous commands", or mentions hook events (PreToolUse, PostToolUse, Stop, SubagentStop, SessionStart, SessionEnd, UserPromptSubmit, PreCompact, Notification). Provides comprehensive guidance for creating and implementing Claude Code plugin hooks with focus on advanced prompt-based hooks API.