npx claudepluginhub jamie-bitflight/claude_skills --plugin llamafile

This skill uses the workspace's default tool permissions.
Configure and manage Mozilla Llamafile - a cross-platform executable distribution format that runs LLMs locally with an OpenAI-compatible API.
Use this skill when:
- Downloading or running llamafile binaries and pre-packaged llamafile models
- Starting a local llamafile server with an OpenAI-compatible API
- Integrating llamafile with LiteLLM or the OpenAI Python SDK
- Troubleshooting server startup, API connectivity, or inference performance
Llamafile combines llama.cpp with Cosmopolitan Libc to create single-file executables that:
- Run LLMs locally with an OpenAI-compatible API
- Provide a /health endpoint for monitoring
- Support embeddings via the --embedding flag

Llamafile exposes these OpenAI-compatible endpoints when running with --server:
| Endpoint | Description | Requirements |
|---|---|---|
| http://localhost:8080/v1/chat/completions | Chat completions (primary) | Server mode |
| http://localhost:8080/v1/completions | Text completions | Server mode |
| http://localhost:8080/v1/embeddings | Generate embeddings | --embedding flag |
| http://localhost:8080/health | Health check | Server mode |
Critical Detail: All OpenAI-compatible endpoints require the /v1 prefix in the URL path.
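To illustrate the /v1 requirement end to end, here is a minimal Python sketch using httpx, assuming a server is already running on the default port 8080: the health check lives at the root path, while chat completions need the /v1 prefix.

```python
import httpx

BASE = "http://localhost:8080"

# Health check lives at the root, no /v1 prefix
health = httpx.get(f"{BASE}/health", timeout=5)
print(health.status_code)  # 200 once the model is loaded

# OpenAI-compatible endpoints require the /v1 prefix
resp = httpx.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "local",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 50,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```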
# Download llamafile v0.9.3 binary
curl -L -o llamafile https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3
# Make executable
chmod 755 llamafile
# Verify version
./llamafile --version
Alternative download sources:
- https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3
- https://sourceforge.net/projects/llamafile.mirror/files/0.9.3/

Llamafile supports two approaches: pre-packaged llamafile executables (model embedded) or separate GGUF model files.
Pre-packaged llamafile (easiest):
# Download a llamafile with embedded model
curl -LO https://huggingface.co/mozilla-ai/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile --server --nobrowser
Separate GGUF model (use with ./llamafile --server -m model.gguf):
Download GGUF files from HuggingFace model publishers, then load with the llamafile binary.
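As a rough sketch of that workflow, the snippet below downloads a GGUF with the huggingface_hub package and hands it to the llamafile binary; the repo_id and filename are placeholders, not references to a real model.

```python
import subprocess
from huggingface_hub import hf_hub_download

# Placeholder repo/filename: replace with an actual GGUF publisher and file
model_path = hf_hub_download(
    repo_id="some-publisher/some-model-GGUF",
    filename="some-model.Q4_K_M.gguf",
)

# Launch the llamafile binary against the downloaded GGUF
subprocess.run([
    "./llamafile", "--server",
    "-m", model_path,
    "--nobrowser",
    "--port", "8080",
])
```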
Pre-packaged llamafile models from mozilla-ai:
| Model | Size | Use Case | Download |
|---|---|---|---|
| Qwen3-0.6B | ~500MB | Fast, lower quality | mozilla-ai/Qwen3-0.6B-llamafile |
| Mistral 7B v0.2 | ~4GB | Balanced speed/quality | mozilla-ai/Mistral-7B-Instruct-v0.2-llamafile |
| Llama 3.1 8B | ~5GB | Higher quality, slower | mozilla-ai/Meta-Llama-3.1-8B-Instruct-llamafile |
| LLaVA v1.5 7B | ~4GB | Multimodal (text+image) | mozilla-ai/llava-v1.5-7b-llamafile |
These are self-contained executables — download, chmod +x, and run. No separate llamafile binary needed.
Start llamafile server for local API access:
./llamafile --server \
-m /path/to/model.gguf \
--nobrowser \
--port 8080 \
--host 127.0.0.1
Critical flags explained:
- --server: Required to enable HTTP API endpoints
- -m: Path to GGUF model file (required)
- --nobrowser: Prevents auto-opening the browser on startup
- --port 8080: Default port (note: NOT 8000)
- --host 127.0.0.1: Localhost only (secure default)

For GPU-accelerated inference with higher throughput:
./llamafile --server \
-m /path/to/model.gguf \
--nobrowser \
--port 8080 \
--host 127.0.0.1 \
--ctx-size 4096 \
--n-gpu-layers 99 \
--threads 8 \
--cont-batching \
--parallel 4
Advanced flags:
| Flag | Purpose | Default | When to Use |
|---|---|---|---|
| --ctx-size | Prompt context window size | 512 | Increase for longer conversations |
| --n-gpu-layers | GPU offload layer count | 0 | Set to 99 to offload all layers to GPU |
| --threads | CPU threads for generation | Auto | Set explicitly for consistent performance |
| --threads-batch | Threads for batch processing | Same as --threads | Tune separately for prompt vs generation |
| --cont-batching | Continuous batching | Off | Enable for multiple concurrent requests |
| --parallel | Parallel sequence count | 1 | Increase for concurrent request handling |
| --mlock | Lock model in memory | Off | Prevent swapping on systems with sufficient RAM |
| --embedding | Enable embeddings endpoint | Off | Required for /v1/embeddings API |
To allow connections from other machines (development/testing only):
./llamafile --server \
-m /path/to/model.gguf \
--nobrowser \
--host 0.0.0.0 \
--port 8080
Security warning: Binding to 0.0.0.0 exposes the API to network access. Use only in trusted environments.
LiteLLM provides a unified interface for llamafile and cloud LLM providers.
import litellm
response = litellm.completion(
model="llamafile/gemma-3-3b", # MUST use llamafile/ prefix
messages=[{"role": "user", "content": "Hello, world!"}],
api_base="http://localhost:8080/v1", # MUST include /v1 suffix
temperature=0.3,
max_tokens=200
)
print(response.choices[0].message.content)
Critical requirements for LiteLLM:
- Model name MUST use the llamafile/ prefix for routing
- api_base MUST include the /v1 suffix

Related skill: For comprehensive LiteLLM configuration, activate the litellm skill:
Skill(skill: "litellm:litellm")
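Because a local server can still be loading the model or be down entirely, it can help to wrap litellm.completion in a simple retry-and-fallback helper. The sketch below is illustrative only: the fallback model name is a placeholder and assumes the corresponding provider API key is configured elsewhere.

```python
import litellm

def complete_with_fallback(messages, local_model="llamafile/gemma-3-3b",
                           fallback_model="gpt-4o-mini"):
    """Try the local llamafile server first, then fall back to a cloud model."""
    try:
        return litellm.completion(
            model=local_model,
            messages=messages,
            api_base="http://localhost:8080/v1",  # /v1 suffix required
            num_retries=2,   # retry transient local failures
            timeout=30,
        )
    except Exception:
        # Placeholder fallback; requires that provider's API key to be set
        return litellm.completion(model=fallback_model, messages=messages)

resp = complete_with_fallback([{"role": "user", "content": "Hello"}])
print(resp.choices[0].message.content)
```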
Direct integration with OpenAI SDK for llamafile endpoints:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1", # MUST include /v1
api_key="sk-no-key-required" # Any value works
)
response = client.chat.completions.create(
model="local-model", # Model name is flexible
messages=[
{"role": "user", "content": "Hello, world!"}
],
temperature=0.3,
max_tokens=200
)
print(response.choices[0].message.content)
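Streaming works through the same client. A short sketch, under the same local-server assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# Stream tokens as they arrive instead of waiting for the full response
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Write a haiku about local LLMs."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```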
Verify llamafile server is responding correctly:
# Health check
curl http://localhost:8080/health
# Chat completions
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Hello"}],
"temperature": 0.3,
"max_tokens": 200
}'
# Embeddings (requires --embedding flag on server)
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"input": ["Hello world"]
}'
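The same checks can be made from Python. For example, a small embeddings sketch with the OpenAI client, again assuming the server was launched with --embedding:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# Requires the server to be launched with the --embedding flag
result = client.embeddings.create(
    model="local",
    input=["Hello world", "Llamafile runs models locally"],
)

for item in result.data:
    print(len(item.embedding))  # dimensionality of each embedding vector
```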
Python script to start llamafile as background process with health checking:
import subprocess
import time
import httpx
def start_llamafile(
llamafile_path: str,
model_path: str,
port: int = 8080,
host: str = "127.0.0.1"
) -> subprocess.Popen:
"""Start llamafile server as background process."""
cmd = [
llamafile_path,
"--server",
"-m", model_path,
"--nobrowser",
"--port", str(port),
"--host", host,
]
process = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
)
_wait_for_server(host, port)
return process
def _wait_for_server(host: str, port: int, timeout: int = 30) -> None:
"""Wait for server to respond to health checks."""
url = f"http://{host}:{port}/health"
start = time.time()
while time.time() - start < timeout:
try:
response = httpx.get(url, timeout=2)
if response.status_code == 200:
return
except httpx.RequestError:
pass
time.sleep(0.5)
raise TimeoutError(f"Server did not start within {timeout} seconds")
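A possible way to use the helper above, continuing the same script, with placeholder paths and the background process cleaned up afterwards:

```python
# Placeholder paths: adjust to your local install
process = start_llamafile(
    llamafile_path="/home/user/.local/bin/llamafile",
    model_path="/home/user/models/model.gguf",
)
try:
    response = httpx.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={"model": "local", "messages": [{"role": "user", "content": "Hi"}]},
        timeout=60,
    )
    print(response.json()["choices"][0]["message"]["content"])
finally:
    process.terminate()      # stop the background server
    process.wait(timeout=10)
```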
Example TOML configuration for applications using llamafile:
# ~/.config/app-name/config.toml
[ai]
model = "llamafile/gemma-3-3b" # Must use llamafile/ prefix
temperature = 0.3
max_tokens = 200
[llamafile]
path = "/home/user/.local/bin/llamafile"
model_path = "/home/user/.local/share/app-name/models/gemma-3-3b.gguf"
api_base = "http://127.0.0.1:8080/v1" # Include /v1 suffix
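One way an application might consume this config is sketched below; it assumes Python 3.11+ for tomllib and that the keys match the example above.

```python
import tomllib
from pathlib import Path

import litellm

# Load the TOML config shown above
config_path = Path.home() / ".config" / "app-name" / "config.toml"
with config_path.open("rb") as f:
    config = tomllib.load(f)

response = litellm.completion(
    model=config["ai"]["model"],               # "llamafile/gemma-3-3b"
    messages=[{"role": "user", "content": "Hello"}],
    api_base=config["llamafile"]["api_base"],  # includes the /v1 suffix
    temperature=config["ai"]["temperature"],
    max_tokens=config["ai"]["max_tokens"],
)
print(response.choices[0].message.content)
```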
Check if port is already in use:
# Find process using port 8080
lsof -i :8080
# Kill existing process
kill $(lsof -t -i :8080)
Verify model file exists and is readable:
ls -lh /path/to/model.gguf
Check llamafile binary permissions:
ls -la /path/to/llamafile
# Should show: -rwxr-xr-x (executable)
# Fix permissions if needed
chmod 755 /path/to/llamafile
Verify server is running:
# Check health endpoint
curl http://localhost:8080/health
# Check server is listening
netstat -tlnp | grep 8080
# or
lsof -i :8080
Common causes:
- Server not started with the --server flag
- Missing /v1 in the API URL path
- Server bound to 127.0.0.1 but accessed from another machine

Test basic connectivity:
# Verbose health check
curl -v http://localhost:8080/health
# Test chat completions with verbose output
curl -v http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"test","messages":[{"role":"user","content":"Hi"}]}'
Common API issues:
| Error | Cause | Solution |
|---|---|---|
| 404 Not Found | Missing /v1 in URL | Add /v1 before endpoint path |
| Connection refused | Server not running | Start server with --server flag |
| Timeout | Model loading slowly | Wait longer or use smaller model |
| Invalid model | Wrong model path | Verify -m path to GGUF file |
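Several of these failures can be caught before making API calls with a small pre-flight check. The sketch below assumes httpx and the default host and port:

```python
import httpx

def preflight(api_base: str = "http://localhost:8080/v1") -> list[str]:
    """Return a list of likely problems before making API calls."""
    problems = []
    if not api_base.rstrip("/").endswith("/v1"):
        problems.append("api_base is missing the /v1 suffix (causes 404s)")
    root = api_base.rstrip("/").removesuffix("/v1")
    try:
        r = httpx.get(f"{root}/health", timeout=5)
        if r.status_code != 200:
            problems.append(f"health check returned {r.status_code} (model still loading?)")
    except httpx.RequestError:
        problems.append("connection refused: is the server running with --server?")
    return problems

for issue in preflight():
    print("WARNING:", issue)
```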
Optimize inference speed:
- Offload all layers to the GPU: --n-gpu-layers 99
- Set an explicit thread count: --threads 8
- Enable continuous batching for concurrent requests: --cont-batching
- Reduce the context window if long prompts are not needed: --ctx-size 2048

Check GPU availability:
# NVIDIA GPU
nvidia-smi
# AMD GPU
rocm-smi
# Apple Metal (check activity monitor)
Avoid these frequent errors when using llamafile:
- Missing /v1 in the API URL: Always include the /v1 suffix for OpenAI-compatible endpoints
- Missing llamafile/ prefix: LiteLLM requires the llamafile/ prefix in the model name for proper routing
- Binary not executable: Fix permissions with chmod 755
- GPU flags on CPU-only systems: Setting --n-gpu-layers on CPU-only systems causes errors

Current stable version: 0.9.3 (May 14, 2025)
Version constants:
LLAMAFILE_MAJOR = 0
LLAMAFILE_MINOR = 9
LLAMAFILE_PATCH = 3
Recent changes in 0.9.3:
Skills to activate:
litellm - For unified LLM provider interface and routing
Skill(skill: "litellm:litellm")
External tools: