Complete llama.cpp C/C++ API reference covering model loading, inference, text generation, embeddings, chat, tokenization, sampling, batching, KV cache, LoRA adapters, and state management. Triggers on: llama.cpp questions, LLM inference code, GGUF models, local AI/ML inference, C/C++ LLM integration, "how do I use llama.cpp", API function lookups, implementation questions, troubleshooting llama.cpp issues, and any llama-cpp or ggerganov/llama.cpp mentions.
From the llamacpp plugin (datathings/marketplace). Install: `npx claudepluginhub datathings/marketplace --plugin llamacpp`. This skill uses the workspace's default tool permissions.
Bundled reference files: references/api-advanced.md, references/api-context.md, references/api-core.md, references/api-inference.md, references/api-model-info.md, references/api-sampling.md, references/api.md, references/workflows.md
Comprehensive reference for the llama.cpp C API, documenting all non-deprecated functions and common usage patterns.
llama.cpp is a C/C++ implementation for LLM inference with minimal dependencies and state-of-the-art performance. This skill provides a complete reference for its C API along with common workflows and troubleshooting guidance.
See references/workflows.md for complete working examples. Basic workflow:
1. llama_backend_init() - Initialize backend
2. llama_model_load_from_file() - Load model
3. llama_init_from_model() - Create context
4. llama_tokenize() - Convert text to tokens
5. llama_decode() - Process tokens
6. llama_sampler_sample() - Sample next token
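A minimal sketch of that workflow, based on recent llama.cpp builds (signatures have shifted across releases, so verify against your llama.h; the model path, buffer sizes, and generation length are placeholders):

```c
#include "llama.h"
#include <stdio.h>
#include <string.h>

int main(void) {
    llama_backend_init();

    // Load the model and create a context with default parameters
    struct llama_model *model = llama_model_load_from_file(
        "model.gguf", llama_model_default_params());   // placeholder path
    if (!model) { return 1; }
    const struct llama_vocab *vocab = llama_model_get_vocab(model);
    struct llama_context *ctx = llama_init_from_model(
        model, llama_context_default_params());

    // Tokenize the prompt (a negative return means the buffer was too small)
    const char *prompt = "Hello";
    llama_token tokens[64];
    int n_tok = llama_tokenize(vocab, prompt, (int) strlen(prompt),
                               tokens, 64, /*add_special=*/true, /*parse_special=*/false);
    if (n_tok < 0) { return 1; }

    // Greedy sampler chain
    struct llama_sampler *smpl =
        llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    // Decode the prompt, then feed each sampled token back in
    struct llama_batch batch = llama_batch_get_one(tokens, n_tok);
    llama_token next;
    for (int i = 0; i < 32; i++) {
        if (llama_decode(ctx, batch) != 0) { break; }
        next = llama_sampler_sample(smpl, ctx, -1);
        if (llama_vocab_is_eog(vocab, next)) { break; }

        char piece[128];
        int np = llama_token_to_piece(vocab, next, piece, sizeof piece, 0, false);
        if (np > 0) { fwrite(piece, 1, (size_t) np, stdout); }

        batch = llama_batch_get_one(&next, 1);
    }

    llama_sampler_free(smpl);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```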
Key concepts:
- llama_model: Loaded model weights and architecture
- llama_context: Inference state (KV cache, compute buffers)
- llama_batch: Input tokens and positions for processing
- llama_sampler: Token sampling configuration
- llama_vocab: Vocabulary and tokenizer
- llama_memory_t: KV cache memory handle

Typical call order: llama_backend_init() → llama_model_load_from_file() → llama_init_from_model() → llama_tokenize() → llama_encode() or llama_decode() → llama_sampler_sample()

For detailed API documentation, the complete API is split across 6 files for efficient targeted loading. Start with references/api-core.md, which links to all other sections.
API Files:
- references/api-core.md (start here)
- references/api-context.md
- references/api-inference.md
- references/api-model-info.md
- references/api-sampling.md
- references/api-advanced.md
Total: ~197 active functions (b8305) across 6 organized files
Most common: llama_backend_init(), llama_model_load_from_file(), llama_init_from_model(), llama_tokenize(), llama_decode(), llama_sampler_sample(), llama_vocab_is_eog(), llama_memory_clear()
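For example, resetting the KV cache between unrelated prompts with the memory handle might look like this (a sketch; llama_get_memory() is assumed from recent builds):

```c
// Fetch the context's KV cache handle and clear it so the next prompt
// starts from an empty cache; `true` also clears the data buffers.
llama_memory_t mem = llama_get_memory(ctx);
llama_memory_clear(mem, true);
```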
See references/api.md for all ~197 function signatures.
See references/workflows.md for 13 complete working examples, including basic text generation, chat, embeddings, batch processing, multi-sequence, LoRA, state save/load, custom sampling (XTC/DRY), encoder-decoder models, model detection, and memory management patterns.
See references/workflows.md for detailed best practices. Key points:
- Initialize parameters from the default helpers (llama_model_default_params(), etc.)
- Keep prompts and generation within the context size (llama_n_ctx())
- Stop generation on end-of-generation tokens (llama_vocab_is_eog())

Common patterns: end-of-generation check (llama_vocab_is_eog()), logits retrieval (llama_get_logits_ith()), batch creation (llama_batch_get_one()), tokenization buffer handling. See references/workflows.md for complete code examples.
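As an illustration of the buffer-handling point, a two-pass tokenization sketch (tokenize_grow is a hypothetical helper; the negative-return convention is how llama_tokenize() reports the required size):

```c
#include <stdlib.h>
#include "llama.h"

// Hypothetical helper: tokenize `text`, growing the buffer when
// llama_tokenize() reports (via a negative count) that it was too small.
// Caller owns and frees *out.
static int tokenize_grow(const struct llama_vocab *vocab,
                         const char *text, int text_len, llama_token **out) {
    int cap = 32;
    llama_token *buf = malloc(cap * sizeof *buf);
    int n = llama_tokenize(vocab, text, text_len, buf, cap, true, false);
    if (n < 0) {                       // -n is the required token count
        cap = -n;
        buf = realloc(buf, cap * sizeof *buf);
        n = llama_tokenize(vocab, text, text_len, buf, cap, true, false);
    }
    *out = buf;
    return n;                          // number of tokens written
}
```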
Troubleshooting:

Model loading fails:
- Reduce n_gpu_layers if GPU memory is insufficient

Tokenization returns negative value:
- Allocate a buffer of -n tokens (the negated return value is the required size) and retry

Decode/encode returns non-zero:
- Check batch construction (llama_batch_get_one() or llama_batch_init())
- Ensure the batch fits in the context window (llama_n_ctx())

Silent failures / no output:
- Check whether llama_vocab_is_eog() immediately returns true
- Enable logging with llama_log_set()
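A sketch of wiring up that logging hook (the callback shape follows ggml's log API in recent builds):

```c
#include <stdio.h>
#include "llama.h"

// Forward every llama.cpp/ggml log line to stderr so errors that would
// otherwise be swallowed become visible; filter on `level` if too noisy.
static void log_to_stderr(enum ggml_log_level level, const char *text, void *user) {
    (void) level; (void) user;
    fputs(text, stderr);
}

// Call once at startup, before loading any model:
//   llama_log_set(log_to_stderr, NULL);
```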
Performance issues:
- Tune n_threads for CPU
- Increase n_gpu_layers for GPU offloading
- Increase n_batch for prompt processing

Sliding Window Attention (SWA) issues:
- Set ctx_params.swa_full = true to access positions beyond the attention window
- Use llama_model_n_swa(model) to detect SWA size and configuration needs
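For example (a sketch; swa_full and llama_model_n_swa() appear in recent context params and model APIs):

```c
// Keep the full KV cache for SWA models so positions beyond the sliding
// window stay addressable (trades extra memory for random access).
struct llama_context_params cparams = llama_context_default_params();
if (llama_model_n_swa(model) > 0) {   // model uses sliding-window attention
    cparams.swa_full = true;
}
struct llama_context *ctx = llama_init_from_model(model, cparams);
```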
Per-sequence state errors:
- Restore a specific sequence with llama_state_seq_load_file(ctx, "file", dest_seq_id, ...)
Model type detection:
- Check llama_model_has_encoder() before assuming a decoder-only architecture
- Encoder-decoder models need the llama_encode() then llama_decode() workflow
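A sketch of that detection pattern (run_step is a hypothetical wrapper; the batch is assumed already prepared):

```c
#include "llama.h"

// Hypothetical wrapper: dispatch one inference step on model architecture.
static int run_step(struct llama_model *model, struct llama_context *ctx,
                    struct llama_batch batch) {
    if (llama_model_has_encoder(model)) {
        // Encoder-decoder (e.g. T5): encode the input first, then start
        // decoding from the model's decoder start token.
        if (llama_encode(ctx, batch) != 0) { return -1; }
        llama_token start = llama_model_decoder_start_token(model);
        return llama_decode(ctx, llama_batch_get_one(&start, 1));
    }
    // Decoder-only: a single decode pass is enough.
    return llama_decode(ctx, batch);
}
```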
For advanced issues: https://github.com/ggerganov/llama.cpp/discussions

What's new (b8305):

New Functions:
- llama_model_init_from_user() - Create models from GGUF metadata with custom tensor data callbacks

New Model Params:
- use_direct_io (bool) - Use direct I/O; takes precedence over use_mmap when supported
- no_alloc (bool) - Only load metadata and simulate memory allocations

New Enum Values:
- LLAMA_VOCAB_TYPE_PLAMO2 = 6 - PLaMo-2 tokenizer based on Aho-Corasick with dynamic programming
- LLAMA_FTYPE_MOSTLY_MXFP4_MOE = 38 - MXFP4 quantization for MoE models
- LLAMA_FTYPE_MOSTLY_NVFP4 = 39 - NVFP4 quantization

Previous (b8191) additions still current:
- llama_model_meta_key enum for sampling metadata keys
- kv_unified, swa_full context params
- LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY and _ext state functions

If you're updating old code:
- Use llama_model_load_from_file() instead of llama_load_model_from_file()
- Use llama_model_free() instead of llama_free_model()
- Use llama_init_from_model() instead of llama_new_context_with_model()
- Use llama_vocab_*() functions instead of llama_token_*()
- Use llama_state_*() functions instead of deprecated state functions
- Use llama_set_adapters_lora() instead of llama_set_adapter_lora() for LoRA adapters
- Use llama_vocab_bos() instead of llama_vocab_cls() (CLS is equivalent to BOS)
- Use llama_sampler_init_grammar_lazy_patterns() instead of llama_sampler_init_grammar_lazy()

See the API reference for complete mappings.
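A quick before/after sketch of the load/teardown portion of that migration (the path and default params are placeholders):

```c
// Before (deprecated):
//   struct llama_model   *model = llama_load_model_from_file("model.gguf", mparams);
//   struct llama_context *ctx   = llama_new_context_with_model(model, cparams);
//   ...
//   llama_free_model(model);

// After (current API):
struct llama_model   *model = llama_model_load_from_file("model.gguf",
                                  llama_model_default_params());
struct llama_context *ctx   = llama_init_from_model(model,
                                  llama_context_default_params());
/* ... */
llama_free(ctx);
llama_model_free(model);
```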