By kengbailey
Tune llama-server for optimal performance and GPU utilization. Analyzes GPU VRAM, model architecture (dense/MoE), and generates launch commands for maximum tok/s.
```
npx claudepluginhub kengbailey/bailey-marketplace --plugin llama-tune
```

Personal Claude Code plugin marketplace.
Add this marketplace to Claude Code:

```
/plugin marketplace add <owner>/bailey-claude-marketplace
```

Then install individual plugins:

```
/plugin install <plugin-name>@bailey-marketplace
```
| Plugin | Description | Source |
|---|---|---|
| claude-mem | Persistent memory system for Claude Code. Captures tool usage, compresses observations with AI, and re-injects relevant context into future sessions. | External (thedotmack/claude-mem) |
| llama-tune | Tune llama-server for optimal performance and GPU utilization. Supports dense and MoE models. | In-repo |
Persistent memory across Claude Code sessions. Automatically captures everything Claude does, compresses it with AI, and provides continuity in future sessions.
Dependencies are auto-installed on first run.

Runtime: a background worker serves `http://localhost:37777`; plugin data lives under `~/.claude-mem/`.

Install:
```
/plugin install claude-mem@bailey-marketplace
```
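Once installed, you can sanity-check that the worker is running. This is a generic reachability probe; no specific claude-mem routes are assumed:

```sh
# Probe the claude-mem worker port. Any HTTP status code in the output
# means the service is listening; specific endpoints are not assumed.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:37777
```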
Tunes llama-server (llama.cpp) launch parameters for maximum tok/s on your hardware. Auto-detects GPU VRAM, CPU cores, and system RAM. Inspects GGUF model files to determine architecture (dense vs MoE), then calculates optimal flags including KV cache quantization, flash attention, expert offloading (MoE), and partial GPU layer placement.
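To make that concrete, here is a sketch of the kind of launch command such tuning produces for a MoE model on a VRAM-limited GPU. The flags are standard llama.cpp llama-server options, but this is illustrative rather than llama-tune's literal output, and flag spellings vary across llama.cpp builds:

```sh
# Illustrative llama-server launch for a MoE model on a VRAM-limited GPU.
# A sketch, not llama-tune's actual output; flags vary by llama.cpp build.
#   -ngl 99                       offload all layers to the GPU
#   --override-tensor "exps=CPU"  keep MoE expert tensors in system RAM
#   --flash-attn                  enable flash attention
#   --cache-type-k/-v q8_0        quantize the KV cache to 8-bit
#   --parallel 2                  serve two concurrent request slots
llama-server -m model.gguf -c 16384 --port 8080 \
  -ngl 99 --override-tensor "exps=CPU" \
  --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 \
  --parallel 2
```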
Dependency: `llama-gguf`

Skill:

```
/llama-tune <model.gguf> [--ctx SIZE] [--slots N] [--port PORT] [--launch]
```
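For example, a run against a local model might look like this (the model path and values are placeholders):

```
/llama-tune ~/models/example-32b-q4_k_m.gguf --ctx 16384 --slots 2 --launch
```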
Install:
```
/plugin install llama-tune@bailey-marketplace
```
In-repo plugins go in the `plugins/` directory. External plugins are referenced by source in `.claude-plugin/marketplace.json`.
License: MIT
Use this when setting up local LLM inference without cloud APIs, running GGUF models locally, needing an OpenAI-compatible API from a local model, building offline or air-gapped AI tools, or troubleshooting local LLM server connections.
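Once llama-server is up, any OpenAI-compatible client can point at it. A minimal check with curl, assuming the server's default port 8080:

```sh
# Minimal request to llama-server's OpenAI-compatible chat endpoint,
# assuming the default port 8080.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```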
Share bugs, ideas, or general feedback.