From huggingface-skills
Searches the Hugging Face Hub for llama.cpp-compatible GGUF models, selects optimal quants, and launches local servers with an OpenAI-compatible API on CPU, Metal, CUDA, or ROCm.
Search the Hugging Face Hub for llama.cpp-compatible GGUF repos, choose the right quant, and launch the model with `llama-cli` or `llama-server`.
Workflow:

- Filter Hub searches with `apps=llama.cpp` to surface compatible repos.
- Confirm compatibility on the repo's `https://huggingface.co/<repo>?local-app=llama.cpp` page.
- List `.gguf` filenames with `https://huggingface.co/api/models/<repo>/tree/main?recursive=true` (a scripted sketch follows the install commands below).
- Launch with `llama-cli -hf <repo>:<QUANT>` or `llama-server -hf <repo>:<QUANT>`.
- Fall back to `--hf-repo` plus `--hf-file` when the repo uses custom file naming.

Install llama.cpp via Homebrew (macOS/Linux), winget (Windows), or from source:

```sh
brew install llama.cpp
```

```sh
winget install llama.cpp
```

```sh
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```

Log in so downloads use your Hub credentials (needed for gated repos and higher rate limits):

```sh
hf auth login
```
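The file-listing step is easy to script. A minimal sketch using `requests`; it assumes the tree endpoint's JSON shape (a list of entries with `type` and `path` fields), and the repo id is just this page's example:

```python
import requests

repo = "unsloth/Qwen3.6-35B-A3B-GGUF"  # example repo id from this page
url = f"https://huggingface.co/api/models/{repo}/tree/main"

# The tree endpoint returns a JSON list of {"type": ..., "path": ..., "size": ...} entries.
entries = requests.get(url, params={"recursive": "true"}, timeout=30).json()

# Keep GGUF weight files; skip mmproj-*.gguf, which are projector weights, not checkpoints.
ggufs = [
    e["path"]
    for e in entries
    if e["type"] == "file"
    and e["path"].endswith(".gguf")
    and "mmproj" not in e["path"]
]
for path in sorted(ggufs):
    print(path)
```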
Example Hub search URLs: trending llama.cpp-compatible models, a name search, and a search capped by parameter count:

```
https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=Qwen3.6&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
```
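Programmatic discovery can be tried against the JSON API with the same parameters. A sketch with `requests`; treat the `apps` and `sort` query parameters as an assumption here, since this page only shows them used in website search URLs:

```python
import requests

# Assumption: /api/models honors the same query parameters as the web search UI.
params = {
    "search": "Qwen3.6",
    "apps": "llama.cpp",
    "sort": "trending",
    "limit": 10,
}
resp = requests.get("https://huggingface.co/api/models", params=params, timeout=30)
for model in resp.json():
    print(model["id"])
```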
Run a model straight from the Hub by repo and quant tag:

```sh
llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
```
When the repo uses custom file naming, point at the exact file instead:

```sh
llama-server \
  --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF \
  --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
  -c 4096
```
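Here `-c 4096` caps the context window at 4096 tokens. To drive this from automation, one option is to start the server as a subprocess and poll its `/health` endpoint before sending requests. A sketch, assuming `llama-server` is on `PATH` and listening on its default port 8080:

```python
import subprocess
import time

import requests

# Launch llama-server with the same example repo/quant used above.
server = subprocess.Popen(
    ["llama-server", "-hf", "unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M", "-c", "4096"]
)

# Poll /health until the model finishes loading (a first run also downloads the weights).
for _ in range(600):
    try:
        if requests.get("http://localhost:8080/health", timeout=2).status_code == 200:
            print("server ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(1)
else:
    server.terminate()
    raise RuntimeError("llama-server did not become ready in time")
```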
If a model has no GGUF build, download the original weights, convert them with the `convert_hf_to_gguf.py` script shipped in the llama.cpp checkout, and quantize:

```sh
hf download <repo-without-gguf> --local-dir ./model-src
python convert_hf_to_gguf.py ./model-src \
  --outfile model-f16.gguf \
  --outtype f16
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
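Before committing to a download or conversion, a rough size estimate is parameter count times bits per weight. A back-of-the-envelope sketch; the ~4.85 bits-per-weight figure for Q4_K_M is an approximation, not something this page specifies:

```python
def approx_gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk size in GB: parameters * bits-per-weight / 8 (ignores metadata)."""
    return params_billion * bits_per_weight / 8

print(f"f16 baseline: ~{approx_gguf_size_gb(35, 16.0):.0f} GB")
print(f"Q4_K_M:       ~{approx_gguf_size_gb(35, 4.85):.0f} GB")  # ~4.85 bpw is approximate
```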
Start the server and exercise its OpenAI-compatible endpoint:

```sh
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
```

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a limerick about exception handling"}
    ]
  }'
```
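The same endpoint also works with the official `openai` Python client pointed at the local base URL; llama-server ignores the API key, so any placeholder works. A sketch:

```python
from openai import OpenAI

# llama-server speaks the OpenAI chat-completions protocol on /v1.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="no-key")

response = client.chat.completions.create(
    model="unused",  # the server hosts a single model; the name is not used for routing
    messages=[{"role": "user", "content": "Write a limerick about exception handling"}],
)
print(response.choices[0].message.content)
```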
Quant selection guidance:

- Verify compatibility on the repo's `?local-app=llama.cpp` page.
- Pass quant labels such as `UD-Q4_K_M` through verbatim instead of normalizing them.
- Default to `Q4_K_M` unless the repo page or hardware profile suggests otherwise.
- Step up to `Q5_K_M` or `Q6_K` for code or technical workloads when memory allows.
- Drop to `Q3_K_M`, `Q4_K_S`, or repo-specific `IQ` / `UD-*` variants for tighter RAM or VRAM budgets.
- Treat `mmproj-*.gguf` files as projector weights, not the main checkpoint.
- Prefer quants built with an importance matrix (`imatrix`) when the repo provides them.

References:

- https://github.com/ggml-org/llama.cpp
- https://huggingface.co/docs/hub/gguf-llamacpp
- https://huggingface.co/docs/hub/main/local-apps
- https://huggingface.co/docs/hub/agents-local
- https://huggingface.co/spaces/ggml-org/gguf-my-repo