Runs a free 35B local AI coding agent on Apple Silicon Macs using llama.cpp or MLX backends, with web search, shell execution, and file tools for offline coding assistance.
npx claudepluginhub joshuarweaver/cascade-ai-ml-agents-misc-1 --plugin aradotso-trending-skills-37
This skill uses the workspace's default tool permissions.
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
Run a 35B reasoning model locally on your Mac for $0/month. mac-code is a CLI AI coding agent (Claude Code alternative) that routes tasks — web search, shell commands, file edits, chat — through a local LLM. Supports llama.cpp (30 tok/s) and MLX (64K context, persistent KV cache) backends on Apple Silicon.
The agent classifies each prompt as search, shell, or chat and routes it accordingly; KV cache keys/values are quantized to q4_0 to keep memory usage low.
Install the dependencies:
brew install llama.cpp
pip3 install rich ddgs huggingface-hub mlx-lm --break-system-packages
git clone https://github.com/walter-grace/mac-code
cd mac-code
35B MoE — fast daily driver (10.6 GB, fits in 16 GB RAM):
mkdir -p ~/models
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
'unsloth/Qwen3.5-35B-A3B-GGUF',
'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
local_dir='$HOME/models/'
)
"
9B — 64K context, long documents (5.3 GB):
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
'unsloth/Qwen3.5-9B-GGUF',
'Qwen3.5-9B-Q4_K_M.gguf',
local_dir='$HOME/models/'
)
"
Option A (default): serve the 35B MoE with llama.cpp:
llama-server \
--model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
--port 8000 --host 127.0.0.1 \
--flash-attn on --ctx-size 12288 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 --reasoning off -np 1 -t 4
Or serve the 9B model with its full 64K context:
llama-server \
--model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
--port 8000 --host 127.0.0.1 \
--flash-attn on --ctx-size 65536 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 --reasoning off -t 4
Option B: MLX backend. Starts the server on port 8000 and downloads the model on first run:
python3 mlx/mlx_engine.py
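Either backend exposes an OpenAI-compatible chat API on port 8000, so you can smoke-test it before launching the agent. A minimal sketch using requests (assumed installed, as in the client examples further down):
import requests

# One-shot completion to confirm the local server is up and answering
r = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "stream": False
})
print(r.json()["choices"][0]["message"]["content"])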
Start the agent:
python3 agent.py
Inside the agent REPL, type / for all commands:
| Command | Action |
|---|---|
| /agent | Agent mode with tools (default) |
| /raw | Direct streaming, no tools |
| /model 9b | Switch to 9B model (64K context) |
| /model 35b | Switch to 35B MoE |
| /search <query> | Quick DuckDuckGo search |
| /bench | Run speed benchmark |
| /stats | Session statistics |
| /cost | Show cost savings vs cloud |
| /good / /bad | Grade the last response |
| /improve | View response grading stats |
| /clear | Reset conversation |
| /quit | Exit |
> find all Python files modified in the last 7 days
→ routes to "shell", generates: find . -name "*.py" -mtime -7
> who won the NBA finals
→ routes to "search", queries DuckDuckGo, summarizes
> explain how attention works
→ routes to "chat", streams directly
The MLX engine exposes a REST API on localhost:8000.
curl -X POST localhost:8000/v1/context/save \
-H "Content-Type: application/json" \
-d "{\"name\": \"my-project\", \"prompt\": $(jq -Rs . < README.md)}"
# jq -Rs JSON-escapes the file contents (requires jq); $(cat ...) would not expand inside single quotes
curl -X POST localhost:8000/v1/context/load \
-H "Content-Type: application/json" \
-d '{"name": "my-project"}'
# Requires R2 credentials in environment
export R2_ACCOUNT_ID=your_account_id
export R2_ACCESS_KEY_ID=your_key_id
export R2_SECRET_ACCESS_KEY=your_secret
export R2_BUCKET=your_bucket_name
curl -X POST localhost:8000/v1/context/download \
-H "Content-Type: application/json" \
-d '{"name": "my-project"}'
import requests
response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "Write a Python quicksort"}],
    "stream": False
})
print(response.json()["choices"][0]["message"]["content"])
import requests, json
with requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "Explain transformers"}],
    "stream": True
}, stream=True) as r:
    for line in r.iter_lines():
        # Skip keep-alives and the final "data: [DONE]" sentinel
        if line.startswith(b"data: ") and line != b"data: [DONE]":
            chunk = json.loads(line[6:])
            delta = chunk["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)
Compress context 4x with 99.3% similarity:
from mlx.turboquant import compress_kv_cache
from mlx.kv_cache import save_kv_cache, load_kv_cache
# After building a KV cache from a long document
compressed = compress_kv_cache(kv_cache, bits=4) # 26.6 MB → 6.7 MB
save_kv_cache(compressed, "my-project-compressed")
# Load later
kv = load_kv_cache("my-project-compressed")
For models larger than your RAM (research mode):
cd research/flash-streaming
# Run 35B MoE Expert Sniper (22 GB model, 1.42 GB RAM)
python3 moe_expert_sniper.py
# Run 32B dense flash stream (18.4 GB model, 4.5 GB RAM)
python3 flash_stream_v2.py
import os, fcntl
# Open model file bypassing macOS Unified Buffer Cache
fd = os.open("model.bin", os.O_RDONLY)
fcntl.fcntl(fd, fcntl.F_NOCACHE, 1) # bypass page cache
# Aligned read (16KB boundary for DART IOMMU)
ALIGN = 16384
offset = (layer_offset // ALIGN) * ALIGN
data = os.pread(fd, layer_size + ALIGN, offset)
weights = data[layer_offset - offset : layer_offset - offset + layer_size]
# Router predicts which 8 of 256 experts activate per token
active_experts = router_forward(hidden_state) # returns [8] indices
# Load only those experts from SSD (8 threads, parallel pread)
from concurrent.futures import ThreadPoolExecutor
def load_expert(expert_idx):
    offset = expert_offsets[expert_idx]
    return os.pread(fd, expert_size, offset)

with ThreadPoolExecutor(max_workers=8) as pool:
    expert_weights = list(pool.map(load_expert, active_experts))
# ~14 MB loaded per layer instead of 221 MB (dense)
import requests
BASE = "http://localhost:8000/v1"
def ask(prompt: str, system: str = "You are a helpful coding assistant.") -> str:
    r = requests.post(f"{BASE}/chat/completions", json={
        "model": "local",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ]
    })
    return r.json()["choices"][0]["message"]["content"]
# Examples
print(ask("Write a Python function to parse JSON safely"))
print(ask("Explain this error: AttributeError: NoneType has no attribute split"))
from mlx.paged_inference import PagedInference
engine = PagedInference(model="mlx-community/Qwen3.5-9B-4bit")
with open("large_codebase.txt") as f:
content = f.read() # beyond single context window
# Automatically pages through content
result = engine.summarize(content, question="What does this codebase do?")
print(result)
Launch the dashboard:
python3 dashboard.py
| Your Mac RAM | Best Option | Command |
|---|---|---|
| 8 GB | 9B Q4_K_M | --model ~/models/Qwen3.5-9B-Q4_K_M.gguf --ctx-size 4096 |
| 16 GB | 35B IQ2_M (30 tok/s) | Default Option A above |
| 16 GB (quality) | 35B Q4 Expert Sniper | python3 research/flash-streaming/moe_expert_sniper.py |
| 48 GB | 35B Q4_K_M native | Download full Q4, --n-gpu-layers 99 |
| 192 GB | 397B frontier | Any large GGUF, full offload |
# Check if server is running
curl http://localhost:8000/health
# Check what's on port 8000
lsof -i :8000
# Restart llama-server with verbose logging
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
--port 8000 --verbose
# Resume interrupted download
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
'unsloth/Qwen3.5-35B-A3B-GGUF',
'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
local_dir='$HOME/models/',
resume_download=True
)
"
# Reduce context size to free RAM (4096 instead of the default 12288)
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
--port 8000 --ctx-size 4096 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 -t 4
# Or switch to 9B for lower RAM usage
python3 agent.py
# Then: /model 9b
# MLX uses unified memory — check pressure
vm_stat | grep "Pages free"
# Reduce batch size in mlx_engine.py
# Edit: max_batch_size = 512 → max_batch_size = 128
# Verify F_NOCACHE is active
import fcntl, os
fd = os.open(model_path, os.O_RDONLY)
result = fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)
assert result == 0, "F_NOCACHE failed — check macOS version and SIP status"
ddgs search fails:
pip3 install --upgrade ddgs --break-system-packages
# ddgs uses DuckDuckGo — no API key required, but may rate-limit
# Retry after 60 seconds if you get a 202 response
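If rate limiting persists, a small retry wrapper around the same ddgs call the agent uses can help (a sketch; the broad except is deliberate since ddgs raises different exception types across versions):
import time
from ddgs import DDGS

def search_with_retry(query, retries=3, wait=60):
    # Retry after the suggested 60-second cool-down when DuckDuckGo rate-limits
    for attempt in range(retries):
        try:
            return DDGS().text(query, max_results=5)
        except Exception as exc:
            if attempt == retries - 1:
                raise
            print(f"search failed ({exc}), retrying in {wait}s")
            time.sleep(wait)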
# GGUF tensors are column-major — correct reshape:
weights = dequantized_flat.reshape(ne[1], ne[0]) # CORRECT
# NOT: dequantized_flat.reshape(ne[0], ne[1]).T # WRONG
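A toy numpy check of why the order matters; ne stands for the tensor's dimension array from the GGUF header, and the values below are made up:
import numpy as np

ne = (2, 3)                            # hypothetical dims: ne[0]=2, ne[1]=3
flat = np.arange(ne[0] * ne[1])        # stand-in for dequantized tensor data

right = flat.reshape(ne[1], ne[0])     # [[0 1] [2 3] [4 5]]
wrong = flat.reshape(ne[0], ne[1]).T   # [[0 3] [1 4] [2 5]], same shape but reordered
print(np.array_equal(right, wrong))    # False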
agent.py
├── Intent classification → "search" | "shell" | "chat"
├── search → ddgs.DDGS().text() → summarize
├── shell → generate command → subprocess.run()
└── chat → stream directly
Backends (both expose OpenAI-compatible API on :8000)
├── llama.cpp → fast, standard, no persistence
└── mlx/ → KV cache save/load/compress/sync
Flash Streaming (research/)
├── moe_expert_sniper.py → 35B Q4, 1.42 GB RAM
├── flash_stream_v2.py → 32B dense, 4.5 GB RAM
└── F_NOCACHE + pread + 16KB alignment
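A condensed sketch of that agent loop (illustrative only; classify_intent is a placeholder heuristic, not mac-code's actual classifier, and ask() is the helper defined earlier):
import subprocess
from ddgs import DDGS

def classify_intent(prompt: str) -> str:
    # mac-code asks the local LLM to classify; a keyword heuristic stands in here
    text = prompt.lower()
    if any(w in text for w in ("find", "list files", "grep", "run")):
        return "shell"
    if any(w in text for w in ("who won", "latest", "news")):
        return "search"
    return "chat"

def handle(prompt: str) -> str:
    intent = classify_intent(prompt)
    if intent == "search":
        hits = DDGS().text(prompt, max_results=5)            # same call agent.py uses
        return "\n".join(h["title"] for h in hits)
    if intent == "shell":
        cmd = ask(f"Reply with one shell command only for: {prompt}")
        return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
    return ask(prompt)                                        # plain chat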