From aradotso-trending-skills-37
Implements lossless DFlash speculative decoding for MLX on Apple Silicon, accelerating LLM inference 1.7–4x via block diffusion drafting with Qwen model pairs. Useful for faster generation and OpenAI-compatible servers.
npx claudepluginhub joshuarweaver/cascade-ai-ml-agents-misc-1 --plugin aradotso-trending-skills-37

This skill uses the workspace's default tool permissions.
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
DFlash implements lossless speculative decoding for MLX on Apple Silicon. A small draft model (~1B params) generates 16 tokens in parallel using block diffusion; the target model verifies all 16 in a single forward pass. Tokens are only emitted after target verification — output is lossless (every token is the target model's greedy argmax).
Typical speedups: 1.7x–4.1x over baseline mlx_lm depending on model size and context length. Acceptance rates hover around 87–90% for Qwen3.5 models.
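The draft-and-verify loop can be sketched with toy stand-in models. None of the names below are the dflash_mlx API; the point is to show why verification makes the output lossless — every emitted token is exactly the target model's greedy argmax.

```python
# Toy sketch of lossless draft-then-verify speculative decoding.
# target_argmax / draft_block are illustrative stand-ins, not dflash_mlx.

def target_argmax(context):
    """Stand-in target model: deterministic greedy next token."""
    return (sum(context) + len(context)) % 100

def draft_block(context, block_size=16):
    """Stand-in draft model: proposes block_size tokens in parallel.
    Here it matches the target except for an injected mismatch."""
    tokens = []
    ctx = list(context)
    for i in range(block_size):
        t = target_argmax(ctx) if i % 8 != 7 else 0  # inject a mismatch
        tokens.append(t)
        ctx.append(t)
    return tokens

def verify(context, drafted):
    """Accept drafted tokens only while each equals the target's argmax;
    on the first mismatch, emit the target's own token and stop.
    Output is therefore identical to pure greedy decoding."""
    accepted = []
    ctx = list(context)
    for t in drafted:
        expected = target_argmax(ctx)
        if t != expected:
            accepted.append(expected)  # correction token from the target
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

context = [1, 2, 3]
out = verify(context, draft_block(context))
```

The win comes from the target checking all drafted tokens in one forward pass instead of one pass per token; in the real system the draft block is produced by block diffusion rather than sequentially.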
pip install dflash-mlx
# or isolated install
pipx install dflash-mlx
Requires Python 3.10+, MLX 0.31.1+, Apple Silicon Mac.
# Auto-resolve draft model from registry
dflash --model Qwen/Qwen3.5-9B --prompt "Explain backpropagation"
# Explicit draft model
dflash --model Qwen/Qwen3.5-9B \
--draft z-lab/Qwen3.5-9B-DFlash \
--prompt "Explain backpropagation"
# Disable EOS (useful for benchmarking fixed token counts)
dflash --model Qwen/Qwen3.5-9B --prompt "..." --max-tokens 1024 --no-eos
# Basic server
dflash-serve --model Qwen/Qwen3.5-9B --port 8000
# With explicit draft
dflash-serve --model Qwen/Qwen3.5-9B \
--draft z-lab/Qwen3.5-9B-DFlash \
--port 8000
# Disable thinking/reasoning tokens (Qwen3.5 thinking models)
dflash-serve --model Qwen/Qwen3.5-9B --port 8000 \
--chat-template-args '{"enable_thinking": false}'
# Raise fallback threshold for longer prompts (large models)
dflash-serve --model mlx-community/Qwen3.5-35B-A3B-4bit --port 8000 \
--chat-template-args '{"enable_thinking": false}' \
--dflash-max-ctx 16384
dflash-benchmark \
--model Qwen/Qwen3.5-9B \
--draft z-lab/Qwen3.5-9B-DFlash \
--prompt "The function f satisfies..." \
--max-tokens 1024 \
--repeat 3 \
--no-eos
Outputs per-run JSON reports with tok/s, acceptance rate, and speedup vs baseline.
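To post-process those reports programmatically, a parse could look like the following. The field names here are assumptions about the report schema, not confirmed dflash-benchmark output — check an actual report file for the real keys.

```python
import json

# Hypothetical per-run report shape; actual field names may differ.
report = json.loads("""{
  "tokens_per_second": 101.4,
  "baseline_tokens_per_second": 38.2,
  "acceptance_rate": 0.89,
  "total_tokens": 1024
}""")

# Speedup vs baseline mlx_lm is just the ratio of throughputs.
speedup = report["tokens_per_second"] / report["baseline_tokens_per_second"]
print(f"speedup: {speedup:.2f}x, acceptance: {report['acceptance_rate']:.0%}")
```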
| Target Model | Draft Model |
|---|---|
| Qwen/Qwen3.5-4B | z-lab/Qwen3.5-4B-DFlash |
| Qwen/Qwen3.5-9B | z-lab/Qwen3.5-9B-DFlash |
| mlx-community/Qwen3.5-27B-4bit | z-lab/Qwen3.5-27B-DFlash |
| mlx-community/Qwen3.5-35B-A3B-4bit | z-lab/Qwen3.5-35B-A3B-DFlash |
Draft models are auto-resolved from a registry — no --draft flag needed for listed pairs. Models without a matching draft are rejected at startup.
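The resolution behavior can be pictured as a simple lookup. The dictionary below mirrors the supported-pairs table, but this function is an illustration of the behavior, not dflash_mlx internals.

```python
# Illustrative target -> draft registry mirroring the supported-pairs
# table above; not the actual dflash_mlx registry implementation.
DRAFT_REGISTRY = {
    "Qwen/Qwen3.5-4B": "z-lab/Qwen3.5-4B-DFlash",
    "Qwen/Qwen3.5-9B": "z-lab/Qwen3.5-9B-DFlash",
    "mlx-community/Qwen3.5-27B-4bit": "z-lab/Qwen3.5-27B-DFlash",
    "mlx-community/Qwen3.5-35B-A3B-4bit": "z-lab/Qwen3.5-35B-A3B-DFlash",
}

def resolve_draft(target, explicit=None):
    """Return an explicit draft if given, else consult the registry;
    unknown targets are rejected, matching the startup behavior."""
    if explicit is not None:
        return explicit
    try:
        return DRAFT_REGISTRY[target]
    except KeyError:
        raise SystemExit(f"No DFlash draft found for model '{target}'")
```

Passing --draft on the CLI corresponds to the `explicit` path here, which is why it bypasses the registry check entirely.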
from dflash_mlx import DFlashRuntime
runtime = DFlashRuntime.from_pretrained(
model="Qwen/Qwen3.5-9B",
draft="z-lab/Qwen3.5-9B-DFlash", # optional, auto-resolved
)
prompt = "Explain the Pythagorean theorem step by step."
for token_text in runtime.stream_generate(
prompt=prompt,
max_tokens=512,
use_chat_template=True,
):
print(token_text, end="", flush=True)
print()
from dflash_mlx import DFlashRuntime
runtime = DFlashRuntime.from_pretrained(model="Qwen/Qwen3.5-9B")
result = runtime.generate(
prompt="What is speculative decoding?",
max_tokens=256,
use_chat_template=True,
)
print(result.text)
print(f"Tokens/sec: {result.tokens_per_second:.2f}")
print(f"Acceptance rate: {result.acceptance_rate:.2%}")
print(f"Total tokens: {result.total_tokens}")
from dflash_mlx import DFlashRuntime, DFlashConfig
config = DFlashConfig(
draft_block_size=16, # tokens drafted per speculative step
max_ctx=8192, # max context length before fallback
enable_tape_replay=True, # GatedDeltaNet recurrent rollback
jit_sdpa=True, # custom Metal SDPA for long contexts
)
runtime = DFlashRuntime.from_pretrained(
model="mlx-community/Qwen3.5-27B-4bit",
config=config,
)
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed", # dflash-serve does not require auth by default
)
# Non-streaming
response = client.chat.completions.create(
model="Qwen/Qwen3.5-9B",
messages=[
{"role": "user", "content": "Explain gradient descent."}
],
max_tokens=512,
)
print(response.choices[0].message.content)
# Streaming
stream = client.chat.completions.create(
model="Qwen/Qwen3.5-9B",
messages=[{"role": "user", "content": "Write a haiku about silicon."}],
max_tokens=128,
stream=True,
)
for chunk in stream:
delta = chunk.choices[0].delta.content
if delta:
print(delta, end="", flush=True)
print()
import json
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
},
"required": ["city"],
},
},
}
]
response = client.chat.completions.create(
model="Qwen/Qwen3.5-9B",
messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
tools=tools,
tool_choice="auto",
)
tool_call = response.choices[0].message.tool_calls[0]
print(f"Function: {tool_call.function.name}")
print(f"Args: {json.loads(tool_call.function.arguments)}")
PYTHONPATH=. python3 -m examples.demo --mode dflash \
--target-model Qwen/Qwen3.5-9B \
--draft-model z-lab/Qwen3.5-9B-DFlash \
--prompt "Solve: f(x) + f(y) = f(x+y) - xy - 1" \
--max-tokens 2048 \
--no-eos
dflash-serve --model Qwen/Qwen3.5-9B --port 8000

Set the client's base URL to http://localhost:8000/v1 and select Qwen/Qwen3.5-9B in the chat UI. Works the same for Continue, aider, OpenCode, and any OpenAI-compatible client.
# Force a custom draft — bypasses registry check
dflash --model my-org/MyCustomModel \
--draft my-org/MyCustomModel-DFlash \
--prompt "Hello"
# CLI
dflash --model Qwen/Qwen3.5-9B \
--chat-template-args '{"enable_thinking": false}' \
--prompt "What is 2+2?"
# Server
dflash-serve --model Qwen/Qwen3.5-9B \
--chat-template-args '{"enable_thinking": false}' \
--port 8000
Model rejected at startup
Error: No DFlash draft found for model 'org/ModelName'
→ Pass --draft org/ModelName-DFlash explicitly, or use a model from the supported pairs table.
Low acceptance rate (< 80%)
→ Pass --dflash-max-ctx 8192 to extend the fallback threshold.
Numerical divergence / output differs from pure AR
→ Check the installed MLX version: python -c "import mlx; print(mlx.__version__)"
Server not accepting connections
# Check port is not in use
lsof -i :8000
# Bind to all interfaces for network access
dflash-serve --model Qwen/Qwen3.5-9B --port 8000 --host 0.0.0.0
Out of memory with large models
→ Use a quantized model such as mlx-community/Qwen3.5-27B-4bit instead of the full model.
Benchmark results JSON location
ls benchmark/results/
# Per-run JSON with tok/s, acceptance rate, repeat measurements