Running and fine-tuning LLMs on Apple Silicon with MLX. Use when working with models locally on Mac, converting Hugging Face models to MLX format, fine-tuning with LoRA/QLoRA on Apple Silicon, or serving models via HTTP API.
Install the skill with:

```bash
npx claudepluginhub joshuarweaver/cascade-ai-ml-engineering --plugin itsmostafa-llm-engineering-skills
```

The skill uses the workspace's default tool permissions.
MLX-LM is a Python package for running large language models on Apple Silicon, leveraging the MLX framework for optimized performance with unified memory architecture.
| Aspect | PyTorch on Mac | MLX |
|---|---|---|
| Memory | Separate CPU/GPU copies | Unified memory, no copies |
| Optimization | Generic Metal backend | Apple Silicon native |
| Model loading | Slower, more memory | Lazy loading, efficient |
| Quantization | Limited support | Built-in 4/8-bit |
MLX arrays live in shared memory, accessible by both CPU and GPU without data transfer overhead.
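As a quick illustration of unified memory, here is a minimal sketch using `mlx.core` directly (not mlx-lm); note there is no `.to(device)` step:

```python
import mlx.core as mx

# Arrays live in unified memory: CPU and GPU see the same buffer
a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

c = a @ b                        # computed on the default device (GPU)
d = mx.add(a, b, stream=mx.cpu)  # same arrays, computed on the CPU

mx.eval(c, d)  # MLX is lazy; force evaluation here
```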
MLX-LM supports most popular architectures: Llama, Mistral, Qwen, Phi, Gemma, Cohere, and many more. Check the mlx-community organization on Hugging Face for pre-converted models.
```bash
pip install mlx-lm
```
Requires macOS 13.5+ and Apple Silicon (M1/M2/M3/M4).
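A quick sanity check that MLX sees the GPU (the exact output format may vary by version):

```bash
python -c "import mlx.core as mx; print(mx.default_device())"
# should report a gpu device on Apple Silicon, e.g. Device(gpu, 0)
```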
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load model (from the HF Hub or a local path)
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Generate text; recent mlx-lm releases take sampling parameters
# via a sampler object rather than a temp= kwarg on generate()
response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms:",
    max_tokens=256,
    sampler=make_sampler(temp=0.7),
)
print(response)
```
```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a haiku about programming:"

for response in stream_generate(model, tokenizer, prompt, max_tokens=100):
    print(response.text, end="", flush=True)
print()
```
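In recent mlx-lm versions the yielded response objects also carry throughput stats; the field names below come from mlx-lm's `GenerationResponse` and may differ across versions:

```python
# Keep the last yielded response to read generation stats after streaming
last = None
for last in stream_generate(model, tokenizer, prompt, max_tokens=100):
    print(last.text, end="", flush=True)
print()

if last is not None:
    print(f"{last.generation_tps:.1f} tokens/sec, peak memory {last.peak_memory:.2f} GB")
```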
```python
from mlx_lm import load, batch_generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

prompts = [
    "What is machine learning?",
    "Explain neural networks:",
    "Define deep learning:",
]

responses = batch_generate(
    model,
    tokenizer,
    prompts,
    max_tokens=100,
)

for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n")
```
```bash
# Basic generation
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "Explain recursion:" \
  --max-tokens 256

# With sampling parameters
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --prompt "Write a poem about AI:" \
  --temp 0.8 \
  --top-p 0.95

# Start chat REPL (context preserved between turns)
mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit
```
```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
```
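To keep multi-turn context, append each reply to the message list and re-apply the template before the next turn (a sketch continuing the example above; the follow-up question is illustrative):

```python
# Append the assistant's reply, add the next user turn,
# and rebuild the prompt over the full history
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "And what is its population?"})

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
follow_up = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(follow_up)
```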
Convert Hugging Face models to MLX format:
```bash
# Convert with 4-bit quantization (the default for -q)
mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct -q

# With specific quantization settings
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
  -q \
  --q-bits 8 \
  --q-group-size 64

# Upload to the Hugging Face Hub
mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct \
  -q \
  --upload-repo your-username/Llama-3.2-1B-Instruct-4bit-mlx
```
```python
from mlx_lm import convert

convert(
    hf_path="meta-llama/Llama-3.2-3B-Instruct",
    mlx_path="./llama-3.2-3b-mlx",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```
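The converted model then loads like any other local path:

```python
from mlx_lm import load, generate

# Path written by convert() above
model, tokenizer = load("./llama-3.2-3b-mlx")
print(generate(model, tokenizer, prompt="Hello!", max_tokens=32))
```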
| Option | Default | Description |
|---|---|---|
| `--q-bits` | 4 | Quantization bits (4 or 8) |
| `--q-group-size` | 64 | Group size for quantization |
| `--dtype` | float16 | Data type for non-quantized weights |
MLX supports multiple quantization methods for different use cases:
| Method | Best For | Command |
|---|---|---|
| Basic | Quick conversion | `mlx_lm.convert -q` |
| DWQ | Quality-preserving | `mlx_lm.dwq` |
| AWQ | Activation-aware | `mlx_lm.awq` |
| Dynamic | Per-layer precision | `mlx_lm.dynamic_quant` |
| GPTQ | Established method | `mlx_lm.gptq` |
```bash
# 4-bit quantization during conversion
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q

# 8-bit for higher quality
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q --q-bits 8
```
For detailed coverage of each method, see reference/quantization.md.
MLX supports LoRA and QLoRA fine-tuning for efficient adaptation on Apple Silicon.
```bash
# Prepare training data (JSONL format), one record per line:
#   {"text": "Your training text here"}
# or chat format:
#   {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

# Fine-tune with LoRA
mlx_lm.lora --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --train \
  --data ./data \
  --iters 1000

# Generate with the trained adapter
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --adapter-path ./adapters \
  --prompt "Your prompt here"

# Merge LoRA weights into the base model
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --adapter-path ./adapters \
  --save-path ./fused-model

# Or export to GGUF
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --adapter-path ./adapters \
  --export-gguf
```
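`mlx_lm.lora` expects the `--data` directory to contain `train.jsonl` and `valid.jsonl` (optionally `test.jsonl`). A minimal sketch for writing chat-format data; the sample row is a placeholder:

```python
import json
from pathlib import Path

data_dir = Path("./data")
data_dir.mkdir(exist_ok=True)

samples = [
    {"messages": [
        {"role": "user", "content": "What is MLX?"},
        {"role": "assistant", "content": "MLX is Apple's array framework for Apple Silicon."},
    ]},
]

# mlx_lm.lora looks for train.jsonl and valid.jsonl inside --data;
# in practice, write distinct samples to each split
for name in ("train.jsonl", "valid.jsonl"):
    with open(data_dir / name, "w") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")
```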
For detailed LoRA configuration and training patterns, see reference/fine-tuning.md.
```bash
# Start the server
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080

# Query the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'
```
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain MLX in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```
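The server also handles streamed responses through the same client (standard OpenAI streaming; a sketch):

```python
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Count to five."}],
    max_tokens=50,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
print()
```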
- **Use pre-quantized models**: Download from the mlx-community organization on Hugging Face for immediate use.
- **Match quantization to your hardware**: On an 8 GB M1/M2, use 4-bit; on M2/M3 Pro/Max with more memory, use 8-bit for quality.
- **Leverage unified memory**: Unlike CUDA, MLX models can use all system RAM and exceed "GPU memory" by falling back to swap (slower, but it works).
- **Use streaming for UX**: `stream_generate` provides responsive output for interactive applications.
- **Cache prompt prefixes**: Use `mlx_lm.cache_prompt` for repeated prompts with varying suffixes (see the sketch after this list).
- **Batch similar requests**: `batch_generate` is more efficient than sequential generation.
- **Start with 4-bit quantization**: Good quality/size tradeoff; move to 8-bit if you see quality issues.
- **Fuse adapters for deployment**: After fine-tuning, fuse adapters for faster inference without loading them separately.
- **Monitor memory with Activity Monitor**: Watch memory pressure to avoid swap thrashing.
- **Use chat templates**: Always apply `tokenizer.apply_chat_template()` for instruction-tuned models.
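For the prompt-caching tip, the flow looks roughly like this (flag names per mlx-lm's CLI; double-check them against your installed version):

```bash
# Cache a long shared prefix once...
mlx_lm.cache_prompt --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "You are a support agent. Answer using the policy below." \
  --prompt-cache-file prefix_cache.safetensors

# ...then reuse it for each varying suffix
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt-cache-file prefix_cache.safetensors \
  --prompt "Customer question: how do I reset my password?"
```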
See `reference/` for detailed documentation:

- `quantization.md` - Detailed quantization methods and when to use each
- `fine-tuning.md` - Complete LoRA/QLoRA training guide with data formats and configuration