llama-tune
Tune llama-server for optimal performance and GPU utilization. Analyzes GPU VRAM and model architecture (dense/MoE), calculates the VRAM budget, and generates a launch command for maximum tok/s.
!`nvidia-smi --query-gpu=index,name,memory.total,memory.free,memory.used,pci.bus_id --format=csv 2>/dev/null || echo "NO_NVIDIA_GPU"`
!`nproc 2>/dev/null && lscpu 2>/dev/null | grep -E "Model name|Core\(s\) per socket|Socket\(s\)|Thread\(s\) per core" || echo "UNKNOWN_CPU"`
!`free -h 2>/dev/null | head -2 || echo "UNKNOWN_RAM"`
!`command -v llama-server 2>/dev/null || find ~/sandbox -maxdepth 5 -name llama-server -type f 2>/dev/null | head -1 || echo "NOT_FOUND"`
!`dirname "$(command -v llama-server 2>/dev/null || find ~/sandbox -maxdepth 5 -name llama-server -type f 2>/dev/null | head -1)" 2>/dev/null | xargs -I{} ls {} 2>/dev/null | grep -E "llama-gguf|llama-cli" || echo "NO_TOOLS"`
Raw arguments: $ARGUMENTS
Parse as follows:
- Model path (positional): the GGUF model to tune. Default: look under ~/sandbox/models/.
- --ctx N: target context length in tokens. Default: auto-detect from the model, capped to what fits in VRAM.
- --slots N: parallel inference slots. Default: 1.
- --port N: server port. Default: 8080.
- --server PATH: llama-server binary path. Default: auto-detected above.
- --launch: start the server immediately after generating the command.
- --mmproj PATH: multimodal projector file for vision models.
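A minimal sketch of how the raw $ARGUMENTS string could be split into these options; the helper name and dict keys are illustrative, not part of the skill:

```python
import shlex

def parse_args(raw: str) -> dict:
    """Illustrative parse of $ARGUMENTS into the options listed above."""
    opts = {"model": None, "ctx": None, "slots": 1, "port": 8080,
            "server": None, "launch": False, "mmproj": None}
    tokens = shlex.split(raw)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--launch":
            opts["launch"] = True
        elif tok in ("--ctx", "--slots", "--port"):
            opts[tok[2:]] = int(tokens[i + 1]); i += 1
        elif tok in ("--server", "--mmproj"):
            opts[tok[2:]] = tokens[i + 1]; i += 1
        else:
            opts["model"] = tok  # positional argument: path to the GGUF model
        i += 1
    return opts
```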
You are an expert at tuning llama-server for maximum tok/s on consumer hardware. Follow these steps precisely. Refer to reference.md for all formulas, flag details, and GPU specs.

Use llama-gguf <model_path> r n (a sibling binary to llama-server) to extract architecture info.
Determine from tensor names:
- ffn_gate_exps or ffn_up_exps tensors present means MoE
- n_layers from the blk.N indices
- expert_bytes_per_layer from the total size of all *exps* tensors per blk.N

If llama-gguf is unavailable, infer from the filename:
- A{N}B suffix = MoE with N billion active params (e.g., 26B-A4B = 26B total, 4B active)

Record: model_size_bytes, n_layers, n_kv_heads, head_dim, is_moe, has_sliding_window, expert_bytes_per_layer (if MoE).
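A minimal sketch of the filename fallback described above; the regex, helper name, and example filename are illustrative assumptions:

```python
import re

def infer_from_filename(filename: str) -> dict:
    """Heuristic fallback when llama-gguf is unavailable.
    An A{N}B suffix (e.g. 26B-A4B) marks a MoE model with N billion active params."""
    m = re.search(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", filename, re.IGNORECASE)
    if m:
        return {"is_moe": True,
                "total_params_b": float(m.group(1)),
                "active_params_b": float(m.group(2))}
    return {"is_moe": False}

# e.g. infer_from_filename("Example-26B-A4B-Q4_K_M.gguf")  # hypothetical filename
# -> {'is_moe': True, 'total_params_b': 26.0, 'active_params_b': 4.0}
```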
Using formulas from reference.md:
- Model weights: model_file_size (full GPU offload)
- KV cache: computed with -ctk q8_0 -ctv q4_0 as the default, multiplied by n_slots
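reference.md holds the authoritative formulas; as a rough sketch, the standard KV-cache estimate and the full-offload fit check look like this (the 34/32 and 18/32 bytes-per-element figures are the llama.cpp q8_0 and q4_0 block sizes; the 512 MB overhead matches the breakdown table below):

```python
# Approximate bytes per KV-cache element
# (llama.cpp block sizes: f16 = 2 B/elem, q8_0 = 34 B per 32 elems, q4_0 = 18 B per 32 elems).
KV_BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, n_slots=1,
                   type_k="q8_0", type_v="q4_0"):
    """Standard estimate: one K and one V vector per layer per token.
    Sliding-window layers may need less; defer to reference.md for that case."""
    per_token = n_kv_heads * head_dim * (
        KV_BYTES_PER_ELEM[type_k] + KV_BYTES_PER_ELEM[type_v])
    return n_layers * ctx * n_slots * per_token

def fits_fully_on_gpu(model_file_size, kv_bytes, free_vram,
                      overhead=512 * 1024**2, target=0.95):
    """Priority 1 check: weights + KV cache + overhead within 95% of free VRAM."""
    return model_file_size + kv_bytes + overhead <= free_vram * target
```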
Try these configurations in priority order until everything fits in available VRAM (target 95% utilization):

Priority 1 - Full GPU offload:
If weights + kv_cache(q8_0/q4_0) + overhead <= available_vram * 0.95:
-ngl 999 -fa on -ctk q8_0 -ctv q4_0

Priority 2 - Aggressive KV quantization:
If Priority 1 doesn't fit, try -ctk q4_0 -ctv q4_0.
Priority 3 - Reduce context:
If the user didn't specify --ctx, reduce context to the largest power-of-2 that fits.
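A sketch of the Priority 3 walk-down under the same budget terms as above; max_ctx here is an illustrative cap, in practice it is the model's detected maximum context:

```python
def largest_ctx_that_fits(free_vram, model_file_size, n_layers, n_kv_heads,
                          head_dim, n_slots=1, max_ctx=131072,
                          kv_bytes_per_elem=34 / 32 + 18 / 32,   # q8_0 K + q4_0 V
                          overhead=512 * 1024**2, target=0.95):
    """Priority 3: step down through power-of-2 context sizes until full offload fits."""
    ctx = max_ctx
    while ctx >= 1024:
        kv = n_layers * ctx * n_slots * n_kv_heads * head_dim * kv_bytes_per_elem
        if model_file_size + kv + overhead <= free_vram * target:
            return ctx
        ctx //= 2
    return None  # nothing fits at full offload; fall through to Priority 4/5
```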
Priority 4 - MoE expert offloading (MoE models only): This is the key optimization for MoE models that don't fit in VRAM:
1. Compute non_expert_vram (overhead + KV cache + non-expert weights on GPU). Estimate non-expert weights as model_file_size - total_expert_bytes.
2. vram_for_experts = available_vram - non_expert_vram - 500MB_safety
3. layers_on_gpu = floor(vram_for_experts / expert_bytes_per_layer)
4. layers_to_offload = n_layers - layers_on_gpu
5. Use -ncmoe <layers_to_offload> to offload that many layers' experts to CPU.
6. Add --no-mmap and an appropriate -t (physical cores).
7. If layers_on_gpu is 0 or negative, fall back to -ncmoe with all layers (equivalent to -cmoe).
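The same arithmetic as a compact sketch (helper name and argument list are illustrative):

```python
def moe_layers_to_offload(free_vram, model_file_size, total_expert_bytes,
                          expert_bytes_per_layer, kv_bytes, n_layers,
                          overhead=512 * 1024**2, safety=500 * 1024**2):
    """Priority 4 math: how many layers' experts to push to the CPU via -ncmoe."""
    non_expert_weights = model_file_size - total_expert_bytes
    non_expert_vram = overhead + kv_bytes + non_expert_weights
    vram_for_experts = free_vram - non_expert_vram - safety
    layers_on_gpu = int(vram_for_experts // expert_bytes_per_layer)
    if layers_on_gpu <= 0:
        return n_layers                     # offload every layer's experts (== -cmoe)
    return max(n_layers - layers_on_gpu, 0) # value to pass to -ncmoe
```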
Priority 5 - Dense partial offload (dense models only):
Reduce -ngl until weights + KV fit. Each removed layer saves approximately model_file_size / (n_layers + 1).
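A sketch of the Priority 5 estimate, treating the KV cache as staying resident on the GPU (helper name is illustrative):

```python
def dense_ngl(free_vram, model_file_size, kv_bytes, n_layers,
              overhead=512 * 1024**2, target=0.95):
    """Priority 5 math: largest -ngl whose weight share plus KV cache still fits."""
    per_layer = model_file_size / (n_layers + 1)  # approx. bytes per offloaded layer
    budget = free_vram * target - kv_bytes - overhead
    ngl = int(budget // per_layer)
    return max(min(ngl, n_layers), 0)
```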
Priority 6 - Suggest smaller quant: If even full CPU offload doesn't work, tell the user they need a smaller quantization and suggest which one would fit.
Always include these flags:
-m <model_path> # model file
-c <context> # context length
-ngl 999 # full layer offload (adjust for dense partial)
-fa on # flash attention (required for KV quant)
-ctk <q8_0|q4_0> -ctv <q4_0|q8_0> # KV cache quantization
-t <physical_cores> # CPU threads = physical cores only
-b 2048 -ub 512 # batch sizes (defaults)
--host 0.0.0.0 --port <port> # network binding
Conditionally add:
- -ncmoe N or -cmoe: MoE expert CPU offloading
- --no-mmap: when using CPU offloading
- -np N: if slots > 1
- --mmproj <path>: if vision model projector specified

For MoE with CPU offload, use -b 4096 -ub 4096 (larger batches amortize PCIe transfer).
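A sketch of how the final invocation could be assembled; the helper is illustrative, and every flag in it comes from the two lists above:

```python
def build_command(server, model, ctx, port, threads, ngl=999,
                  type_k="q8_0", type_v="q4_0", slots=1,
                  ncmoe=None, no_mmap=False, mmproj=None) -> str:
    """Combine the always-on flags with the conditional ones into one command string."""
    batch, ubatch = (4096, 4096) if ncmoe else (2048, 512)  # larger batches for MoE CPU offload
    cmd = [server, "-m", model, "-c", str(ctx), "-ngl", str(ngl),
           "-fa", "on", "-ctk", type_k, "-ctv", type_v,
           "-t", str(threads), "-b", str(batch), "-ub", str(ubatch),
           "--host", "0.0.0.0", "--port", str(port)]
    if ncmoe:
        cmd += ["-ncmoe", str(ncmoe), "--no-mmap"]
    elif no_mmap:
        cmd += ["--no-mmap"]
    if slots > 1:
        cmd += ["-np", str(slots)]
    if mmproj:
        cmd += ["--mmproj", mmproj]
    return " ".join(cmd)
```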
Output a clean summary:
1. Model summary table:
| Property | Value |
|---|---|
| Model | (name) |
| Architecture | Dense / MoE (X experts, Y active) |
| File size | X GB |
| Quantization | (detected) |
| Layers | N |
2. VRAM breakdown:
| Component | Size | Notes |
|---|---|---|
| Model weights (GPU) | X GB | (or partial if offloaded) |
| KV cache | X GB | (type, context, slots) |
| Expert weights (CPU) | X GB | (if MoE offload) |
| Overhead | ~0.5 GB | |
| Total GPU | X / Y GB | (Z% utilization) |
3. The full command in a code block, ready to copy-paste.
4. Expected performance notes.
If --launch was passed, or the user confirms, start the server in the background and verify it loaded correctly by checking VRAM usage and waiting for the "listening" log line.
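A sketch of the launch-and-verify step, assuming the readiness signal is a log line containing "listening" as described above; VRAM usage would still be confirmed separately with nvidia-smi:

```python
import subprocess
import time

def launch_and_verify(cmd: str, timeout_s: int = 300) -> bool:
    """Start llama-server in the background and wait for its "listening" log line."""
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    deadline = time.time() + timeout_s
    for line in proc.stdout:
        print(line, end="")
        if "listening" in line.lower():
            return True  # server is up; check VRAM usage with nvidia-smi next
        if proc.poll() is not None or time.time() > deadline:
            break        # process died or deadline passed (checked between log lines)
    return False
```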