From apple-kit-skills
Guide for selecting and deploying on-device AI on Apple platforms: Foundation Models, Core ML, MLX Swift, and llama.cpp. Covers model conversion, quantization, structured output, and Neural Engine optimization.
npx claudepluginhub dpearson2699/swift-ios-skills --plugin swiftui-skills
Slash command
/apple-kit-skills:apple-on-device-ai
Guide for selecting, deploying, and optimizing on-device ML models. Covers Apple Foundation Models, Core ML, MLX Swift, and llama.cpp.
Use this decision tree to pick the right framework for your use case.
Apple Foundation Models
When to use: Text generation, summarization, entity extraction, structured output, and short dialog on iOS 26+ / macOS 26+ devices with Apple Intelligence enabled. Zero setup -- no API keys, no network, no model downloads.
Best for:
- @Generable types
- Tool protocol
Not suited for: Complex math, code generation, factual accuracy tasks, or apps targeting pre-iOS 26 devices.
Core ML
When to use: Deploying custom trained models (vision, NLP, audio) across all Apple platforms. Converting models from PyTorch, TensorFlow, or scikit-learn with coremltools.
Best for:
MLX Swift
When to use: Running specific open-source LLMs (Llama, Mistral, Qwen, Gemma) on Apple Silicon with maximum throughput. Research and prototyping.
Best for:
- mlx-community
llama.cpp
When to use: Cross-platform LLM inference using GGUF model format. Production deployments needing broad device support.
Best for:
| Scenario | Framework |
|---|---|
| Text generation, zero setup (iOS 26+) | Foundation Models |
| Structured output from on-device LLM | Foundation Models (@Generable) |
| Image classification, object detection | Core ML |
| Custom model from PyTorch/TensorFlow | Core ML + coremltools |
| Running specific open-source LLMs | MLX Swift or llama.cpp |
| Maximum throughput on Apple Silicon | MLX Swift |
| Cross-platform LLM inference | llama.cpp |
| OCR and text recognition | Vision framework |
| Sentiment analysis, NER, tokenization | Natural Language framework |
| Training custom classifiers on device | Create ML |
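The table above can also be expressed as a lookup in code. The following is a minimal sketch, assuming hypothetical `Scenario` and `Framework` enums and a hypothetical `recommendedFramework(for:)` function; none of these names come from an Apple SDK. The "Running specific open-source LLMs" row is left out because the table allows either MLX Swift or llama.cpp there.

```swift
// Hypothetical types for illustration only; they are not part of any Apple SDK.
enum Framework: String {
    case foundationModels = "Foundation Models"
    case coreML = "Core ML"
    case mlxSwift = "MLX Swift"
    case llamaCpp = "llama.cpp"
    case vision = "Vision framework"
    case naturalLanguage = "Natural Language framework"
    case createML = "Create ML"
}

enum Scenario {
    case zeroSetupTextGeneration      // iOS 26+
    case structuredOutput
    case imageClassification
    case customConvertedModel         // PyTorch/TensorFlow via coremltools
    case maximumThroughputLLM
    case crossPlatformLLM
    case textRecognition
    case sentimentOrNER
    case onDeviceClassifierTraining
}

// Mirrors the rows of the scenario table one-to-one.
func recommendedFramework(for scenario: Scenario) -> Framework {
    switch scenario {
    case .zeroSetupTextGeneration, .structuredOutput: return .foundationModels
    case .imageClassification, .customConvertedModel: return .coreML
    case .maximumThroughputLLM: return .mlxSwift
    case .crossPlatformLLM: return .llamaCpp
    case .textRecognition: return .vision
    case .sentimentOrNER: return .naturalLanguage
    case .onDeviceClassifierTraining: return .createML
    }
}

print(recommendedFramework(for: .crossPlatformLLM).rawValue)
```

Because the switch has no `default` case, adding a new scenario forces a compile-time decision about which framework it maps to.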
On-device language model optimized for Apple Silicon. Available on devices supporting Apple Intelligence (iOS 26+, macOS 26+).
- contextSize for the context window limit
- supportedLanguages for supported locales
Always check availability before using the model. Never crash on unavailability.
import FoundationModels

switch SystemLanguageModel.default.availability {
case .available:
    // Proceed with model usage
    break
case .unavailable(.appleIntelligenceNotEnabled):
    // Guide user to enable Apple Intelligence in Settings
    break
case .unavailable(.modelNotReady):
    // Model is downloading; show loading state
    break
case .unavailable(.deviceNotEligible):
    // Device cannot run Apple Intelligence; use fallback
    break
default:
    // Graceful fallback for any other reason
    break
}
// Basic session
let session = LanguageModelSession()

// Session with instructions
let session = LanguageModelSession {
    "You are a helpful cooking assistant."
}

// Session with tools
let session = LanguageModelSession(
    tools: [weatherTool, recipeTool]
) {
    "You are a helpful assistant with access to tools."
}
Key rules:
- A session handles one request at a time (check session.isResponding)
- Call session.prewarm() before user interaction for faster first response
- Restore a saved conversation with LanguageModelSession(model: model, tools: [], transcript: savedTranscript)
The @Generable macro creates compile-time schemas for type-safe output:
@Generable
struct Recipe {
    @Guide(description: "The recipe name")
    var name: String

    @Guide(description: "Cooking steps", .count(3))
    var steps: [String]

    @Guide(description: "Prep time in minutes", .range(1...120))
    var prepTime: Int
}

let response = try await session.respond(
    to: "Suggest a quick pasta recipe",
    generating: Recipe.self
)
print(response.content.name)
@Guide Constraints
| Constraint | Purpose |
|---|---|
| description: | Natural language hint for generation |
| .anyOf([values]) | Restrict to enumerated string values |
| .count(n) | Fixed array length |
| .range(min...max) | Numeric range |
| .minimum(n) / .maximum(n) | One-sided numeric bound |
| .minimumCount(n) / .maximumCount(n) | Array length bounds |
| .constant(value) | Always returns this value |
| .pattern(regex) | String format enforcement |
| .element(guide) | Guide applied to each array element |
Properties generate in declaration order. Place foundational data before dependent data for better results.
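As an illustration of the ordering rule, the sketch below declares the fields that other fields depend on first and leaves the summary for last. `TripPlan` and its properties are hypothetical and chosen only for this example; the macros are used the same way as in the `Recipe` type above.

```swift
import FoundationModels

// Hypothetical type for illustration; not taken from the source.
@Generable
struct TripPlan {
    // Foundational data first: later fields can build on it.
    @Guide(description: "The destination city")
    var destination: String

    @Guide(description: "Activities to do at the destination", .count(3))
    var activities: [String]

    // Declared last so it is generated after the fields it summarizes.
    @Guide(description: "One-sentence summary of the whole plan")
    var summary: String
}
```

Swapping `summary` to the top would ask the model to summarize activities it has not produced yet, which is the situation the ordering rule is meant to avoid.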
let stream = session.streamResponse(
    to: "Suggest a recipe",
    generating: Recipe.self
)
for try await snapshot in stream {
    // snapshot.content is Recipe.PartiallyGenerated (all properties optional)
    if let name = snapshot.content.name { updateNameLabel(name) }
}
struct WeatherTool: Tool {
    let name = "weather"
    let description = "Get current weather for a city."

    @Generable
    struct Arguments {
        @Guide(description: "The city name")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        let weather = try await fetchWeather(arguments.city)
        return weather.description
    }
}
Register tools at session creation. The model invokes them autonomously.
do {
    let response = try await session.respond(to: prompt)
} catch let error as LanguageModelSession.GenerationError {
    switch error {
    case .guardrailViolation(let context):
        // Content triggered safety filters
        break
    case .exceededContextWindowSize(let context):
        // Too many tokens; summarize and retry
        break
    case .concurrentRequests(let context):
        // Another request is in progress on this session
        break
    case .unsupportedLanguageOrLocale(let context):
        // Current locale not supported
        break
    case .unsupportedGuide(let context):
        // A @Guide constraint is not supported
        break
    case .assetsUnavailable(let context):
        // Model assets not available on device
        break
    case .refusal(let refusal, _):
        // Model refused; stream refusal.explanation for details
        break
    case .rateLimited(let context):
        // Too many requests; back off and retry
        break
    case .decodingFailure(let context):
        // Response could not be decoded into the expected type
        break
    default: break
    }
}
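The "back off and retry" comment for `.rateLimited` can be sketched as a small generic helper. This is one possible approach, not something FoundationModels provides: `retryWithBackoff`, its parameters, and the doubling delay are all assumptions made for illustration.

```swift
// Hypothetical helper; not part of FoundationModels.
// Retries `operation` with a doubling delay while `shouldRetry` approves the error.
func retryWithBackoff<T>(
    maxAttempts: Int = 3,
    initialDelay: Duration = .milliseconds(500),
    shouldRetry: (Error) -> Bool,
    operation: () async throws -> T
) async throws -> T {
    precondition(maxAttempts >= 1, "maxAttempts must be at least 1")
    var delay = initialDelay
    var attempt = 1
    while true {
        do {
            return try await operation()
        } catch {
            // Give up on the last attempt or on errors that should not be retried.
            if attempt >= maxAttempts || !shouldRetry(error) { throw error }
            try await Task.sleep(for: delay)
            delay = delay * 2
            attempt += 1
        }
    }
}

// Possible use with a session (assumes `session` and `prompt` exist):
// let response = try await retryWithBackoff(
//     shouldRetry: { error in
//         if let e = error as? LanguageModelSession.GenerationError,
//            case .rateLimited = e { return true }
//         return false
//     },
//     operation: { try await session.respond(to: prompt) }
// )
```

Only transient failures are worth retrying this way. A guardrail violation or an unsupported locale will fail again with the same input, so `shouldRetry` should return `false` for those.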
let options = GenerationOptions(
    sampling: .random(top: 40),
    temperature: 0.7,
    maximumResponseTokens: 512
)
let response = try await session.respond(to: prompt, options: options)
Sampling modes: .greedy, .random(top:seed:), .random(probabilityThreshold:seed:).
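The three modes might be configured as follows. This is a sketch based on the signatures listed above; the constant names and the specific values (40, 0.9, seed 42) are arbitrary choices for illustration.

```swift
import FoundationModels

// Greedy: always picks the most likely next token.
let greedyOptions = GenerationOptions(sampling: .greedy)

// Top-k: samples among the 40 most likely tokens. The seed is optional and
// is passed here so repeated runs are more likely to match.
let topKOptions = GenerationOptions(
    sampling: .random(top: 40, seed: 42),
    temperature: 0.7
)

// Probability threshold: samples from the most likely tokens whose combined
// probability reaches the threshold.
let thresholdOptions = GenerationOptions(
    sampling: .random(probabilityThreshold: 0.9, seed: 42)
)
```

Greedy sampling is a reasonable fit for extraction and tagging, where stable output matters more than variety, while the random modes suit creative text.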
- tokenCount(for:) to monitor the context window budget
- [descriptive example]
Foundation Models supports specialized use cases via SystemLanguageModel.UseCase:
- .general -- Default for text generation, summarization, dialog
- .contentTagging -- Optimized for categorization and labeling tasks
Load fine-tuned adapters for specialized behavior (requires entitlement):
let adapter = try SystemLanguageModel.Adapter(name: "my-adapter")
try await adapter.compile()
let model = SystemLanguageModel(adapter: adapter, guardrails: .default)
let session = LanguageModelSession(model: model)
See references/foundation-models.md for the complete Foundation Models API reference.
Apple's framework for deploying trained models. Automatically dispatches to the optimal compute unit (CPU, GPU, or Neural Engine).
| Extension | Format | When to Use |
|---|---|---|
| .mlpackage | Directory (mlprogram) | All new models (iOS 15+) |
| .mlmodel | Single file (neuralnetwork) | Legacy only (iOS 11-14) |
| .mlmodelc | Compiled | Pre-compiled for faster loading |
Always use mlprogram (.mlpackage) for new work.
import torch
import coremltools as ct

# PyTorch conversion (torch.jit.trace)
# `model` is your trained torch.nn.Module; `example_input` is a sample tensor
# matching the declared input shape, e.g. torch.rand(1, 3, 224, 224).
model.eval()  # CRITICAL: always call eval() before tracing
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 3, 224, 224), name="image")],
    minimum_deployment_target=ct.target.iOS18,
    convert_to="mlprogram",
)
mlmodel.save("Model.mlpackage")
| Technique | Size Reduction | Accuracy Impact | Best Compute Unit |
|---|---|---|---|
| INT8 per-channel | ~4x | Low | CPU/GPU |
| INT4 per-block | ~8x | Medium | GPU |
| Palettization 4-bit | ~8x | Low-Medium | Neural Engine |
| W8A8 (weights+activations) | ~4x | Low | ANE (A17 Pro/M4+) |
| Pruning 75% | ~4x | Medium | CPU/ANE |
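The size-reduction column follows from simple arithmetic on bits per weight. The sketch below assumes a 32-bit float baseline, which is what makes 8-bit come out at about 4x and 4-bit at about 8x; the baseline, the function name, and the 3B parameter count are assumptions for illustration. It ignores overhead such as per-block scales and palette lookup tables.

```swift
/// Approximate weight storage in gigabytes (10^9 bytes).
/// Ignores quantization metadata and any layers left unquantized.
func estimatedWeightSizeGB(parameterCount: Double, bitsPerWeight: Double) -> Double {
    let bytes = parameterCount * bitsPerWeight / 8.0
    return bytes / 1_000_000_000
}

let parameters = 3_000_000_000.0  // hypothetical 3B-parameter model

let float32Size = estimatedWeightSizeGB(parameterCount: parameters, bitsPerWeight: 32)
let int8Size = estimatedWeightSizeGB(parameterCount: parameters, bitsPerWeight: 8)
let int4Size = estimatedWeightSizeGB(parameterCount: parameters, bitsPerWeight: 4)

print(float32Size, int8Size, int4Size)               // prints "12.0 3.0 1.5"
print(float32Size / int8Size, float32Size / int4Size) // prints "4.0 8.0"
```

If the unquantized model is stored as 16-bit floats instead, the same arithmetic gives half these ratios, so it is worth confirming the baseline before quoting a reduction factor.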
let config = MLModelConfiguration()
config.computeUnits = .all
let model = try MLModel(contentsOf: modelURL, configuration: config)
// Async prediction (iOS 17+)
let output = try await model.prediction(from: input)
Swift type for multidimensional array operations:
import CoreML
let tensor = MLTensor([1.0, 2.0, 3.0, 4.0])
let reshaped = tensor.reshaped(to: [2, 2])
let result = tensor.softmax()
See references/coreml-conversion.md for the full conversion pipeline and references/coreml-optimization.md for optimization techniques.
Apple's ML framework for Swift. Highest sustained generation throughput on Apple Silicon via unified memory architecture.
import MLX
import MLXLLM

let config = ModelConfiguration(id: "mlx-community/Mistral-7B-Instruct-v0.3-4bit")
let model = try await LLMModelFactory.shared.loadContainer(configuration: config)
try await model.perform { context in
    let input = try await context.processor.prepare(
        input: UserInput(prompt: "Hello")
    )
    let stream = try generate(
        input: input,
        parameters: GenerateParameters(temperature: 0.0),
        context: context
    )
    for await part in stream {
        print(part.chunk ?? "", terminator: "")
    }
}
| Device | RAM | Recommended Model | RAM Usage |
|---|---|---|---|
| iPhone 12-14 | 4-6 GB | SmolLM2-135M or Qwen 2.5 0.5B | ~0.3 GB |
| iPhone 15 Pro+ | 8 GB | Gemma 3n E4B 4-bit | ~3.5 GB |
| Mac 8 GB | 8 GB | Llama 3.2 3B 4-bit | ~3 GB |
| Mac 16 GB+ | 16 GB+ | Mistral 7B 4-bit | ~6 GB |
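The table can drive model selection at runtime. The following is a rough sketch: `recommendedModel(physicalMemoryGB:isMac:)` is a hypothetical helper, and the one-gigabyte margin below each tier is an assumption, added because the operating system may report somewhat less memory than the marketed capacity.

```swift
import Foundation

// Hypothetical helper; model names mirror the table above.
// Thresholds sit 1 GB under each tier (an assumption, see the lead-in).
func recommendedModel(physicalMemoryGB: Double, isMac: Bool) -> String {
    if isMac {
        return physicalMemoryGB >= 15 ? "Mistral 7B 4-bit" : "Llama 3.2 3B 4-bit"
    }
    return physicalMemoryGB >= 7 ? "Gemma 3n E4B 4-bit" : "SmolLM2-135M or Qwen 2.5 0.5B"
}

// physicalMemory is reported in bytes; convert to GiB.
let physicalMemoryGB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824

#if os(macOS)
let isMac = true
#else
let isMac = false
#endif

print(recommendedModel(physicalMemoryGB: physicalMemoryGB, isMac: isMac))
```

Installed RAM is only an upper bound. The memory an app may actually use is lower, especially on iOS, so the choice should still be validated on real devices.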
MLX.GPU.set(cacheLimit: 512 * 1024 * 1024)
See references/mlx-swift.md for full MLX Swift patterns and llama.cpp integration.
When an app needs multiple AI backends (e.g., Foundation Models + MLX fallback):
func respond(to prompt: String) async throws -> String {
    if SystemLanguageModel.default.isAvailable {
        return try await foundationModelsRespond(prompt)
    } else if canLoadMLXModel() {
        return try await mlxRespond(prompt)
    } else {
        throw AIError.noBackendAvailable
    }
}
Serialize all model access through a coordinator actor to prevent contention:
actor ModelCoordinator {
    func withExclusiveAccess<T>(_ work: () async throws -> T) async rethrows -> T {
        try await work()
    }
}
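One caveat with the actor above: Swift actors are reentrant, so while one call is suspended inside `work()`, the actor can start another call. When strict one-at-a-time execution is required, a common pattern is to chain each job onto the previous one. The sketch below shows that pattern; `SerialModelCoordinator` is a hypothetical name and this is not the source's implementation.

```swift
// Hypothetical alternative: each job waits for the previously queued job to finish.
actor SerialModelCoordinator {
    // The most recently queued job, with its result and error erased.
    private var tail: Task<Void, Never>?

    func withExclusiveAccess<T: Sendable>(
        _ work: @escaping @Sendable () async throws -> T
    ) async throws -> T {
        let previous = tail
        let job = Task<T, Error> {
            _ = await previous?.value   // wait for the job queued before this one
            return try await work()
        }
        // `tail` is read and replaced with no suspension point in between,
        // so jobs are chained in the order the actor accepts them.
        tail = Task { _ = try? await job.value }
        return try await job.value
    }
}

let coordinator = SerialModelCoordinator()
let reply = try await coordinator.withExclusiveAccess { "done" }
print(reply)  // prints "done"
```

A failed job does not block the queue, because the chained task discards the error. Cancelling the caller does not cancel a queued job in this sketch, so add cancellation handling if long generations need to be abandoned.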
- Call session.prewarm() for Foundation Models before user interaction
- Pre-compile Core ML models to .mlmodelc for faster loading
- [...] perform() call
- Calling LanguageModelSession() without checking SystemLanguageModel.default.availability crashes on unsupported devices.
- Monitor the context window with tokenCount(for:) and summarize when needed.
- LanguageModelSession supports one request at a time. Check session.isResponding or serialize access.
- Call model.eval() before Core ML tracing. PyTorch models must be in eval mode before torch.jit.trace. Training-mode artifacts corrupt output.
- Use mlprogram (.mlpackage) for new Core ML models. The legacy neuralnetwork format is deprecated.
- [...] scenePhase == .background.
- @Generable properties in logical generation order
- [...] contextSize)
- [...] Sendable-conformant or @MainActor-isolated
- [...] @Generable, tool calling, prompt design