Complete llama.cpp C/C++ API reference covering model loading, inference, text generation, embeddings, chat, tokenization, sampling, batching, KV cache, LoRA adapters, and state management. Triggers on: llama.cpp questions, LLM inference code, GGUF models, local AI/ML inference, C/C++ LLM integration, "how do I use llama.cpp", API function lookups, implementation questions, troubleshooting llama.cpp issues, and any llama-cpp or ggerganov/llama.cpp mentions.
/plugin marketplace add datathings/marketplace
/plugin install llamacpp@datathings

This skill inherits all available tools. When active, it can use any tool Claude has access to.
- references/api-advanced.md
- references/api-context.md
- references/api-core.md
- references/api-inference.md
- references/api-model-info.md
- references/api-sampling.md
- references/api.md
- references/workflows.md

Comprehensive reference for the llama.cpp C API, documenting all non-deprecated functions and common usage patterns.
llama.cpp is a C/C++ implementation for LLM inference with minimal dependencies and state-of-the-art performance. This skill provides a complete reference for the non-deprecated C API, plus working workflow examples and troubleshooting guidance.
See references/workflows.md for complete working examples. Basic workflow:
- llama_backend_init() - Initialize backend
- llama_model_load_from_file() - Load model
- llama_init_from_model() - Create context
- llama_tokenize() - Convert text to tokens
- llama_decode() - Process tokens
- llama_sampler_sample() - Sample next token

Use this skill when:
- Writing or debugging llama.cpp C/C++ inference code
- Working with GGUF models or local LLM inference
- Looking up llama.cpp API functions or usage patterns
- Troubleshooting llama.cpp issues
Key structures:
- llama_model: Loaded model weights and architecture
- llama_context: Inference state (KV cache, compute buffers)
- llama_batch: Input tokens and positions for processing
- llama_sampler: Token sampling configuration
- llama_vocab: Vocabulary and tokenizer
- llama_memory_t: KV cache memory handle

Typical call sequence:
1. llama_backend_init()
2. llama_model_load_from_file()
3. llama_init_from_model()
4. llama_tokenize()
5. llama_encode() or llama_decode()
6. llama_sampler_sample()

For detailed API documentation, the complete API is split across 6 files for efficient targeted loading. Start with references/api-core.md, which links to all other sections.
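The sketch below strings that call sequence into a minimal greedy generation loop. It is a sketch only: the model path, prompt, fixed 512-token buffer, and 64-token generation cap are placeholders, and error handling is reduced to early exits. See references/workflows.md for complete, tested examples.

```c
#include "llama.h"
#include <stdio.h>
#include <string.h>

int main(void) {
    llama_backend_init();

    // Load the model (path is a placeholder)
    struct llama_model *model =
        llama_model_load_from_file("model.gguf", llama_model_default_params());
    if (!model) { return 1; }

    // Create an inference context with default parameters
    struct llama_context *ctx = llama_init_from_model(model, llama_context_default_params());
    if (!ctx) { llama_model_free(model); return 1; }

    const struct llama_vocab *vocab = llama_model_get_vocab(model);

    // Tokenize the prompt into a fixed-size buffer (kept simple for brevity)
    const char *prompt = "Hello";
    llama_token tokens[512];
    int32_t n_tokens = llama_tokenize(vocab, prompt, (int32_t) strlen(prompt),
                                      tokens, 512, /*add_special=*/true, /*parse_special=*/true);
    if (n_tokens < 0) { return 1; }

    // Greedy sampler chain
    struct llama_sampler *smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    // Decode the prompt, then generate token by token until EOG or the cap
    llama_token new_token = 0;
    struct llama_batch batch = llama_batch_get_one(tokens, n_tokens);
    for (int i = 0; i < 64; i++) {
        if (llama_decode(ctx, batch) != 0) { break; }

        new_token = llama_sampler_sample(smpl, ctx, -1);
        if (llama_vocab_is_eog(vocab, new_token)) { break; }

        char piece[128];
        int n = llama_token_to_piece(vocab, new_token, piece, sizeof(piece), 0, true);
        if (n > 0) { fwrite(piece, 1, (size_t) n, stdout); }

        // Feed the sampled token back as the next single-token batch
        batch = llama_batch_get_one(&new_token, 1);
    }
    printf("\n");

    llama_sampler_free(smpl);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```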
API Files:
- references/api-core.md
- references/api-model-info.md
- references/api-context.md
- references/api-inference.md
- references/api-sampling.md
- references/api-advanced.md
Total: 173 active, non-deprecated functions (b7658) across 6 organized files
Most common: llama_backend_init(), llama_model_load_from_file(), llama_init_from_model(), llama_tokenize(), llama_decode(), llama_sampler_sample(), llama_vocab_is_eog(), llama_memory_clear()
See references/api.md for all 172 function signatures and detailed usage.
See references/workflows.md for 13 complete working examples: basic text generation, chat, embeddings, batch processing, multi-sequence, LoRA, state save/load, custom sampling (XTC/DRY), encoder-decoder models, model detection, and memory management patterns.
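As one illustration of the custom-sampling workflows listed above, a sampler chain can be assembled from the stock samplers before generation. The sketch below assumes the standard chain API (llama_sampler_chain_init() / llama_sampler_chain_add()); the top-k, top-p, and temperature values are purely illustrative, not recommendations.

```c
#include "llama.h"

// Build a top-k -> top-p -> temperature -> dist sampler chain.
static struct llama_sampler *make_sampler(void) {
    struct llama_sampler *chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
    llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.90f, 1));
    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
    return chain;  // caller releases it with llama_sampler_free()
}
```

The returned chain is used exactly like the greedy sampler in the earlier sketch: llama_sampler_sample(chain, ctx, -1).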
See references/workflows.md for detailed best practices. Key points:
- Start from the default parameter helpers (llama_model_default_params(), etc.)
- Keep token counts within the context size (llama_n_ctx())
- Always check for end of generation (llama_vocab_is_eog())

Common patterns: end-of-generation check (llama_vocab_is_eog()), logits retrieval (llama_get_logits_ith()), batch creation (llama_batch_get_one()), and tokenization buffer handling. See references/workflows.md for complete code examples.
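For the tokenization buffer pattern specifically, a common approach is a two-pass call: the first call with a NULL buffer reports the required token count as a negative value, and the second call fills an exactly sized buffer. This is a sketch of that pattern under those assumptions, not the only handling shown in workflows.md.

```c
#include "llama.h"
#include <stdlib.h>
#include <string.h>

// Two-pass tokenization: pass 1 (NULL buffer) returns -(required count),
// pass 2 fills a buffer of exactly that size. Caller frees the result.
static llama_token *tokenize_prompt(const struct llama_vocab *vocab,
                                    const char *text, int32_t *n_out) {
    int32_t n = llama_tokenize(vocab, text, (int32_t) strlen(text),
                               NULL, 0, /*add_special=*/true, /*parse_special=*/true);
    if (n < 0) {
        n = -n;  // the negative return value encodes the number of tokens needed
    }

    llama_token *tokens = malloc((size_t) n * sizeof(llama_token));
    if (!tokens) {
        return NULL;
    }

    n = llama_tokenize(vocab, text, (int32_t) strlen(text),
                       tokens, n, true, true);
    if (n < 0) {
        free(tokens);
        return NULL;
    }

    *n_out = n;
    return tokens;
}
```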
Model loading fails:
- Reduce n_gpu_layers if GPU memory is insufficient

Tokenization returns negative value:
- The token buffer is too small; resize it to the -n size that was returned and retry

Decode/encode returns non-zero:
- Verify the batch was created correctly (llama_batch_get_one() or llama_batch_init())
- Keep the total token count within the context size (llama_n_ctx())

Silent failures / no output:
- Check whether llama_vocab_is_eog() immediately returns true
- Enable logging with llama_log_set()

Performance issues:
- Tune n_threads for CPU
- Increase n_gpu_layers for GPU offloading
- Increase n_batch for prompts

Sliding Window Attention (SWA) issues:
- Set ctx_params.swa_full = true to access beyond the attention window
- Use llama_model_n_swa(model) to detect SWA size and configuration needs

Per-sequence state errors:
- Load per-sequence state with llama_state_seq_load_file(ctx, "file", dest_seq_id, ...)

Model type detection:
- Check llama_model_has_encoder() before assuming a decoder-only architecture
- Encoder-decoder models need the llama_encode() then llama_decode() workflow

For advanced issues: https://github.com/ggerganov/llama.cpp/discussions
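For the model-type check above, a small helper can route the prompt through the right entry point. This is a sketch assuming llama_model_has_encoder(), llama_encode(), and llama_model_decoder_start_token() as listed in the API reference; the fall-back to BOS when no decoder start token is defined mirrors the upstream examples.

```c
#include "llama.h"

// Evaluate a prompt batch on either architecture: encoder-decoder models go
// through llama_encode() first and then decode from the decoder start token;
// decoder-only models go straight to llama_decode().
static int eval_prompt(struct llama_context *ctx,
                       const struct llama_model *model,
                       struct llama_batch prompt) {
    if (llama_model_has_encoder(model)) {
        if (llama_encode(ctx, prompt) != 0) {
            return -1;
        }
        llama_token start = llama_model_decoder_start_token(model);
        if (start == LLAMA_TOKEN_NULL) {
            // some models do not define one; fall back to BOS
            start = llama_vocab_bos(llama_model_get_vocab(model));
        }
        struct llama_batch dec = llama_batch_get_one(&start, 1);
        return llama_decode(ctx, dec);
    }
    return llama_decode(ctx, prompt);
}
```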
If you're updating old code:
- Use llama_model_load_from_file() instead of llama_load_model_from_file()
- Use llama_model_free() instead of llama_free_model()
- Use llama_init_from_model() instead of llama_new_context_with_model()
- Use llama_vocab_*() functions instead of llama_token_*()
- Use llama_state_*() functions instead of deprecated state functions

See the API reference for complete mappings.
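To see the renames side by side, the sketch below uses only the current names, with each deprecated counterpart noted in a comment; the path argument is a placeholder.

```c
#include "llama.h"

// Current API calls, with the deprecated names they replace noted alongside.
static void migrated_setup(const char *path) {
    struct llama_model *model =
        llama_model_load_from_file(path, llama_model_default_params()); // was llama_load_model_from_file()
    if (!model) {
        return;
    }

    struct llama_context *ctx =
        llama_init_from_model(model, llama_context_default_params());   // was llama_new_context_with_model()

    const struct llama_vocab *vocab = llama_model_get_vocab(model);
    llama_token eos = llama_vocab_eos(vocab);                            // was llama_token_eos(model)
    (void) eos;

    llama_free(ctx);
    llama_model_free(model);                                             // was llama_free_model()
}
```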
This skill should be used when the user asks to "create an agent", "add an agent", "write a subagent", "agent frontmatter", "when to use description", "agent examples", "agent tools", "agent colors", "autonomous agent", or needs guidance on agent structure, system prompts, triggering conditions, or agent development best practices for Claude Code plugins.
This skill should be used when the user asks to "create a slash command", "add a command", "write a custom command", "define command arguments", "use command frontmatter", "organize commands", "create command with file references", "interactive command", "use AskUserQuestion in command", or needs guidance on slash command structure, YAML frontmatter fields, dynamic arguments, bash execution in commands, user interaction patterns, or command development best practices for Claude Code.
This skill should be used when the user asks to "create a hook", "add a PreToolUse/PostToolUse/Stop hook", "validate tool use", "implement prompt-based hooks", "use ${CLAUDE_PLUGIN_ROOT}", "set up event-driven automation", "block dangerous commands", or mentions hook events (PreToolUse, PostToolUse, Stop, SubagentStop, SessionStart, SessionEnd, UserPromptSubmit, PreCompact, Notification). Provides comprehensive guidance for creating and implementing Claude Code plugin hooks with focus on advanced prompt-based hooks API.