Provides a complete llama.cpp C/C++ API reference covering model loading, inference, text generation, embeddings, chat, tokenization, sampling, batching, KV cache, LoRA adapters, and state management. Use it for C LLM integration, workflows, and troubleshooting.
Install: npx claudepluginhub datathings/marketplace --plugin llamacpp
Slash command: /llamacpp:llamacpp
Comprehensive reference for the llama.cpp C API, documenting all non-deprecated functions and common usage patterns.
llama.cpp is a C/C++ implementation for LLM inference with minimal dependencies and state-of-the-art performance. This skill provides:
See references/workflows.md for complete working examples. Basic workflow:
1. llama_backend_init() - Initialize backend
2. llama_model_load_from_file() - Load model
3. llama_init_from_model() - Create context
4. llama_tokenize() - Convert text to tokens
5. llama_decode() - Process tokens
6. llama_sampler_sample() - Sample next token
Use this skill when:
Core types:
- llama_model: Loaded model weights and architecture
- llama_context: Inference state (KV cache, compute buffers)
- llama_batch: Input tokens and positions for processing
- llama_sampler: Token sampling configuration
- llama_vocab: Vocabulary and tokenizer
- llama_memory_t: KV cache memory handle
Typical call sequence:
1. llama_backend_init()
2. llama_model_load_from_file()
3. llama_init_from_model()
4. llama_tokenize()
5. llama_encode() or llama_decode()
6. llama_sampler_sample()
The complete API documentation is split across 6 files for efficient targeted loading. Start with references/api-core.md, which links to all other sections.
API Files:
Total: ~197 active functions (b8305) across 6 organized files
Most common: llama_backend_init(), llama_model_load_from_file(), llama_init_from_model(), llama_tokenize(), llama_decode(), llama_sampler_sample(), llama_vocab_is_eog(), llama_memory_clear()
See references/api.md for all ~197 function signatures.
See references/workflows.md for 13 complete working examples: basic text generation, chat, embeddings, batch processing, multi-sequence, LoRA, state save/load, custom sampling (XTC/DRY), encoder-decoder models, model detection, and memory management patterns.
See references/workflows.md for detailed best practices. Key points:
- Default parameter functions (llama_model_default_params(), etc.)
- Context size limit (llama_n_ctx())
- End-of-generation check (llama_vocab_is_eog())
End-of-generation check (llama_vocab_is_eog()), logits retrieval (llama_get_logits_ith()), batch creation (llama_batch_get_one()), tokenization buffer handling. See references/workflows.md for complete code examples.
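Tokenization buffer handling usually follows a two-pass pattern: call the tokenizer with a guessed buffer size and, if it returns a negative value -n, retry with room for n tokens. The sketch below illustrates only that control flow and does not link against llama.cpp. The names token_t, mock_tokenize, and tokenize_with_retry are hypothetical stand-ins, and the mock takes fewer arguments than the real llama_tokenize(). The INT32_MIN guard is included because negating that value would overflow.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef int32_t token_t; /* hypothetical stand-in for llama_token */

/* Hypothetical stand-in for llama_tokenize(): emits one token per byte.
 * It mirrors only the return convention: the token count on success,
 * or the negated required count when the buffer is too small. */
static int32_t mock_tokenize(const char *text, int32_t text_len,
                             token_t *tokens, int32_t n_tokens_max) {
    if (text_len > n_tokens_max) {
        return -text_len;
    }
    for (int32_t i = 0; i < text_len; i++) {
        tokens[i] = (token_t)(unsigned char)text[i];
    }
    return text_len;
}

/* Tokenize with an initial capacity guess. On a negative return value,
 * resize the buffer to exactly -n tokens and retry once.
 * Returns a malloc'd array (caller frees) and stores the count in *n_out,
 * or returns NULL if allocation or tokenization fails. */
static token_t *tokenize_with_retry(const char *text, int32_t initial_cap,
                                    int32_t *n_out) {
    int32_t text_len = (int32_t)strlen(text);
    int32_t cap = initial_cap > 0 ? initial_cap : 1;
    token_t *buf = malloc((size_t)cap * sizeof *buf);
    if (buf == NULL) {
        return NULL;
    }

    int32_t n = mock_tokenize(text, text_len, buf, cap);
    if (n < 0 && n != INT32_MIN) {
        cap = -n; /* required size reported by the tokenizer */
        token_t *grown = realloc(buf, (size_t)cap * sizeof *buf);
        if (grown == NULL) {
            free(buf);
            return NULL;
        }
        buf = grown;
        n = mock_tokenize(text, text_len, buf, cap);
    }
    if (n < 0) {
        free(buf);
        return NULL;
    }
    *n_out = n;
    return buf;
}
```

In real code the two mock_tokenize() calls would become llama_tokenize() calls with the vocab and flag arguments declared in the llama.h of the build in use.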
Model loading fails:
- Reduce n_gpu_layers if GPU memory is insufficient
Tokenization returns negative value:
- Allocate a token buffer of size -n (the negated return value) and retry
Decode/encode returns non-zero:
- Check batch initialization (llama_batch_get_one() or llama_batch_init())
- Check the context size (llama_n_ctx())
Silent failures / no output:
- Check whether llama_vocab_is_eog() immediately returns true
- Enable logging with llama_log_set()
Performance issues:
- Tune n_threads for CPU
- Tune n_gpu_layers for GPU offloading
- Tune n_batch for prompts
Sliding Window Attention (SWA) issues:
- Set ctx_params.swa_full = true to access beyond attention window
- Use llama_model_n_swa(model) to detect SWA size and configuration needs
Per-sequence state errors:
- Use llama_state_seq_load_file(ctx, "file", dest_seq_id, ...)
Model type detection:
- Check llama_model_has_encoder() before assuming decoder-only architecture
- Use the llama_encode() then llama_decode() workflow
For advanced issues: https://github.com/ggerganov/llama.cpp/discussions
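For the non-zero decode case, it can help to turn the return code into an explicit action and to check context capacity before submitting a batch. The sketch below is self-contained and does not call llama.cpp. The decode_action enum and both helpers are hypothetical names. The code meanings (0 success, 1 no KV slot available, 2 aborted, -1 invalid batch, anything else treated as unrecoverable) are an assumption based on the llama_decode() header comments as commonly documented, so verify them against the llama.h of the build in use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical action codes for a caller's error handling. */
typedef enum {
    DECODE_OK,            /* batch processed */
    DECODE_RETRY_SMALLER, /* no KV slot: shrink the batch or grow the context */
    DECODE_ABORTED,       /* processing was aborted */
    DECODE_BAD_BATCH,     /* the input batch was invalid */
    DECODE_OTHER_ERROR    /* anything else: treat as unrecoverable */
} decode_action;

/* Map an assumed llama_decode() return code to an action. */
static decode_action classify_decode_result(int32_t rc) {
    if (rc == 0) {
        return DECODE_OK;
    }
    if (rc == 1) {
        return DECODE_RETRY_SMALLER;
    }
    if (rc == 2) {
        return DECODE_ABORTED;
    }
    if (rc == -1) {
        return DECODE_BAD_BATCH;
    }
    return DECODE_OTHER_ERROR;
}

/* Return 1 when n_new additional tokens fit in a context of n_ctx tokens
 * that already holds n_past tokens. Written to avoid unsigned overflow. */
static int fits_in_context(uint32_t n_ctx, uint32_t n_past, uint32_t n_new) {
    return n_past <= n_ctx && n_new <= n_ctx - n_past;
}
```

A caller would pass the value of llama_n_ctx() as n_ctx and run fits_in_context() before each decode, then use classify_decode_result() to decide whether to retry with a smaller batch or stop.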
New Functions:
- llama_model_init_from_user() - Create models from GGUF metadata with custom tensor data callbacks
New Model Params:
- use_direct_io (bool) - Use direct I/O, takes precedence over use_mmap when supported
- no_alloc (bool) - Only load metadata and simulate memory allocations
New Enum Values:
- LLAMA_VOCAB_TYPE_PLAMO2 = 6 - PLaMo-2 tokenizer based on Aho-Corasick with dynamic programming
- LLAMA_FTYPE_MOSTLY_MXFP4_MOE = 38 - MXFP4 quantization for MoE models
- LLAMA_FTYPE_MOSTLY_NVFP4 = 39 - NVFP4 quantization
Previous (b8191) additions still current:
- llama_model_meta_key enum for sampling metadata keys
- kv_unified, swa_full context params
- LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY and _ext state functions
If you're updating old code:
- Use llama_model_load_from_file() instead of llama_load_model_from_file()
- Use llama_model_free() instead of llama_free_model()
- Use llama_init_from_model() instead of llama_new_context_with_model()
- Use llama_vocab_*() functions instead of llama_token_*()
- Use llama_state_*() functions instead of deprecated state functions
- Use llama_set_adapters_lora() instead of llama_set_adapter_lora() for LoRA adapters
- Use llama_vocab_bos() instead of llama_vocab_cls() (CLS is equivalent to BOS)
- Use llama_sampler_init_grammar_lazy_patterns() instead of llama_sampler_init_grammar_lazy()
See the API reference for complete mappings.
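When auditing an older codebase, the one-to-one renames listed above can be kept in a small lookup table, for example inside a lint or migration script. This is a minimal sketch: rename_entry, RENAMES, and replacement_for are hypothetical names, and the table holds only the exact pairs listed in this section, not the llama_vocab_*() or llama_state_*() families.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One deprecated identifier and its replacement. */
struct rename_entry {
    const char *old_name;
    const char *new_name;
};

/* Exact one-to-one renames taken from the migration list above. */
static const struct rename_entry RENAMES[] = {
    { "llama_load_model_from_file",      "llama_model_load_from_file" },
    { "llama_free_model",                "llama_model_free" },
    { "llama_new_context_with_model",    "llama_init_from_model" },
    { "llama_set_adapter_lora",          "llama_set_adapters_lora" },
    { "llama_vocab_cls",                 "llama_vocab_bos" },
    { "llama_sampler_init_grammar_lazy", "llama_sampler_init_grammar_lazy_patterns" },
};

/* Return the replacement for a deprecated name, or NULL if the name
 * is not in the table (it is current, or not covered here). */
static const char *replacement_for(const char *name) {
    for (size_t i = 0; i < sizeof RENAMES / sizeof RENAMES[0]; i++) {
        if (strcmp(name, RENAMES[i].old_name) == 0) {
            return RENAMES[i].new_name;
        }
    }
    return NULL;
}
```

A script could call replacement_for() on each identifier it finds and report any non-NULL result as a suggested rename.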