Complete llama.cpp C/C++ API reference covering model loading, inference, text generation, embeddings, chat, tokenization, sampling, batching, KV cache, LoRA adapters, and state management. Triggers on: llama.cpp questions, LLM inference code, GGUF models, local AI/ML inference, C/C++ LLM integration, "how do I use llama.cpp", API function lookups, implementation questions, troubleshooting llama.cpp issues, and any llama-cpp or ggerganov/llama.cpp mentions.
From the llamacpp plugin (datathings/marketplace). Install: `npx claudepluginhub datathings/marketplace --plugin llamacpp`. This skill uses the workspace's default tool permissions.
Bundled reference files: references/api-advanced.md, references/api-context.md, references/api-core.md, references/api-inference.md, references/api-model-info.md, references/api-sampling.md, references/api.md, references/workflows.md
Comprehensive reference for the llama.cpp C API, documenting all non-deprecated functions and common usage patterns.
llama.cpp is a C/C++ implementation for LLM inference with minimal dependencies and state-of-the-art performance. This skill provides a complete reference for its C API along with common workflows and troubleshooting guidance.
See references/workflows.md for complete working examples. Basic workflow:
1. llama_backend_init() - Initialize backend
2. llama_model_load_from_file() - Load model
3. llama_init_from_model() - Create context
4. llama_tokenize() - Convert text to tokens
5. llama_decode() - Process tokens
6. llama_sampler_sample() - Sample next token
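A minimal sketch of that workflow, based on recent llama.cpp builds (signatures have shifted across releases, so verify against your llama.h; the model path, buffer sizes, and generation length are placeholders):

```c
#include "llama.h"
#include <stdio.h>
#include <string.h>

int main(void) {
    llama_backend_init();

    // Load the model and create a context with default parameters
    struct llama_model *model = llama_model_load_from_file(
        "model.gguf", llama_model_default_params());   // placeholder path
    if (!model) { return 1; }
    const struct llama_vocab *vocab = llama_model_get_vocab(model);
    struct llama_context *ctx = llama_init_from_model(
        model, llama_context_default_params());

    // Tokenize the prompt (a negative return means the buffer was too small)
    const char *prompt = "Hello";
    llama_token tokens[64];
    int n_tok = llama_tokenize(vocab, prompt, (int) strlen(prompt),
                               tokens, 64, /*add_special=*/true, /*parse_special=*/false);
    if (n_tok < 0) { return 1; }

    // Greedy sampler chain
    struct llama_sampler *smpl =
        llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    // Decode the prompt, then feed each sampled token back in
    struct llama_batch batch = llama_batch_get_one(tokens, n_tok);
    llama_token next;
    for (int i = 0; i < 32; i++) {
        if (llama_decode(ctx, batch) != 0) { break; }
        next = llama_sampler_sample(smpl, ctx, -1);
        if (llama_vocab_is_eog(vocab, next)) { break; }

        char piece[128];
        int np = llama_token_to_piece(vocab, next, piece, sizeof piece, 0, false);
        if (np > 0) { fwrite(piece, 1, (size_t) np, stdout); }

        batch = llama_batch_get_one(&next, 1);
    }

    llama_sampler_free(smpl);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```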
Key concepts:
- llama_model: Loaded model weights and architecture
- llama_context: Inference state (KV cache, compute buffers)
- llama_batch: Input tokens and positions for processing
- llama_sampler: Token sampling configuration
- llama_vocab: Vocabulary and tokenizer
- llama_memory_t: KV cache memory handle

Typical call order: llama_backend_init() → llama_model_load_from_file() → llama_init_from_model() → llama_tokenize() → llama_encode() or llama_decode() → llama_sampler_sample()

For detailed API documentation, the complete API is split across 6 files for efficient targeted loading. Start with references/api-core.md, which links to all other sections.
API Files:
- references/api-core.md (start here)
- references/api-context.md
- references/api-inference.md
- references/api-model-info.md
- references/api-sampling.md
- references/api-advanced.md
Total: ~197 active functions (b8305) across 6 organized files
Most common: llama_backend_init(), llama_model_load_from_file(), llama_init_from_model(), llama_tokenize(), llama_decode(), llama_sampler_sample(), llama_vocab_is_eog(), llama_memory_clear()
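For example, resetting the KV cache between unrelated prompts with the memory handle might look like this (a sketch; llama_get_memory() is assumed from recent builds):

```c
// Fetch the context's KV cache handle and clear it so the next prompt
// starts from an empty cache; `true` also clears the data buffers.
llama_memory_t mem = llama_get_memory(ctx);
llama_memory_clear(mem, true);
```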
See references/api.md for all ~197 function signatures.
See references/workflows.md for 13 complete working examples, including basic text generation, chat, embeddings, batch processing, multi-sequence, LoRA, state save/load, custom sampling (XTC/DRY), encoder-decoder models, model detection, and memory management patterns.
See references/workflows.md for detailed best practices. Key points:
- Initialize parameters from the default helpers (llama_model_default_params(), etc.)
- Keep prompts and generation within the context size (llama_n_ctx())
- Stop generation on end-of-generation tokens (llama_vocab_is_eog())

Common patterns: end-of-generation check (llama_vocab_is_eog()), logits retrieval (llama_get_logits_ith()), batch creation (llama_batch_get_one()), tokenization buffer handling. See references/workflows.md for complete code examples.
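As an illustration of the buffer-handling point, a two-pass tokenization sketch (tokenize_grow is a hypothetical helper; the negative-return convention is how llama_tokenize() reports the required size):

```c
#include <stdlib.h>
#include "llama.h"

// Hypothetical helper: tokenize `text`, growing the buffer when
// llama_tokenize() reports (via a negative count) that it was too small.
// Caller owns and frees *out.
static int tokenize_grow(const struct llama_vocab *vocab,
                         const char *text, int text_len, llama_token **out) {
    int cap = 32;
    llama_token *buf = malloc(cap * sizeof *buf);
    int n = llama_tokenize(vocab, text, text_len, buf, cap, true, false);
    if (n < 0) {                       // -n is the required token count
        cap = -n;
        buf = realloc(buf, cap * sizeof *buf);
        n = llama_tokenize(vocab, text, text_len, buf, cap, true, false);
    }
    *out = buf;
    return n;                          // number of tokens written
}
```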
Troubleshooting:

Model loading fails:
- Reduce n_gpu_layers if GPU memory is insufficient

Tokenization returns negative value:
- Allocate a buffer of -n tokens (the negated return value is the required size) and retry

Decode/encode returns non-zero:
- Check batch construction (llama_batch_get_one() or llama_batch_init())
- Ensure the batch fits in the context window (llama_n_ctx())

Silent failures / no output:
- Check whether llama_vocab_is_eog() immediately returns true
- Enable logging with llama_log_set()
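A sketch of wiring up that logging hook (the callback shape follows ggml's log API in recent builds):

```c
#include <stdio.h>
#include "llama.h"

// Forward every llama.cpp/ggml log line to stderr so errors that would
// otherwise be swallowed become visible; filter on `level` if too noisy.
static void log_to_stderr(enum ggml_log_level level, const char *text, void *user) {
    (void) level; (void) user;
    fputs(text, stderr);
}

// Call once at startup, before loading any model:
//   llama_log_set(log_to_stderr, NULL);
```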
Performance issues:
- Tune n_threads for CPU
- Increase n_gpu_layers for GPU offloading
- Increase n_batch for prompt processing

Sliding Window Attention (SWA) issues:
- Set ctx_params.swa_full = true to access positions beyond the attention window
- Use llama_model_n_swa(model) to detect SWA size and configuration needs
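For example (a sketch; swa_full and llama_model_n_swa() appear in recent context params and model APIs):

```c
// Keep the full KV cache for SWA models so positions beyond the sliding
// window stay addressable (trades extra memory for random access).
struct llama_context_params cparams = llama_context_default_params();
if (llama_model_n_swa(model) > 0) {   // model uses sliding-window attention
    cparams.swa_full = true;
}
struct llama_context *ctx = llama_init_from_model(model, cparams);
```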
Per-sequence state errors:
- Restore a specific sequence with llama_state_seq_load_file(ctx, "file", dest_seq_id, ...)
Model type detection:
- Check llama_model_has_encoder() before assuming a decoder-only architecture
- Encoder-decoder models need the llama_encode() then llama_decode() workflow
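A sketch of that detection pattern (run_step is a hypothetical wrapper; the batch is assumed already prepared):

```c
#include "llama.h"

// Hypothetical wrapper: dispatch one inference step on model architecture.
static int run_step(struct llama_model *model, struct llama_context *ctx,
                    struct llama_batch batch) {
    if (llama_model_has_encoder(model)) {
        // Encoder-decoder (e.g. T5): encode the input first, then start
        // decoding from the model's decoder start token.
        if (llama_encode(ctx, batch) != 0) { return -1; }
        llama_token start = llama_model_decoder_start_token(model);
        return llama_decode(ctx, llama_batch_get_one(&start, 1));
    }
    // Decoder-only: a single decode pass is enough.
    return llama_decode(ctx, batch);
}
```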
For advanced issues: https://github.com/ggerganov/llama.cpp/discussions

What's new (b8305):

New Functions:
- llama_model_init_from_user() - Create models from GGUF metadata with custom tensor data callbacks

New Model Params:
- use_direct_io (bool) - Use direct I/O; takes precedence over use_mmap when supported
- no_alloc (bool) - Only load metadata and simulate memory allocations

New Enum Values:
- LLAMA_VOCAB_TYPE_PLAMO2 = 6 - PLaMo-2 tokenizer based on Aho-Corasick with dynamic programming
- LLAMA_FTYPE_MOSTLY_MXFP4_MOE = 38 - MXFP4 quantization for MoE models
- LLAMA_FTYPE_MOSTLY_NVFP4 = 39 - NVFP4 quantization

Previous (b8191) additions still current:
- llama_model_meta_key enum for sampling metadata keys
- kv_unified, swa_full context params
- LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY and _ext state functions

If you're updating old code:
- Use llama_model_load_from_file() instead of llama_load_model_from_file()
- Use llama_model_free() instead of llama_free_model()
- Use llama_init_from_model() instead of llama_new_context_with_model()
- Use llama_vocab_*() functions instead of llama_token_*()
- Use llama_state_*() functions instead of deprecated state functions
- Use llama_set_adapters_lora() instead of llama_set_adapter_lora() for LoRA adapters
- Use llama_vocab_bos() instead of llama_vocab_cls() (CLS is equivalent to BOS)
- Use llama_sampler_init_grammar_lazy_patterns() instead of llama_sampler_init_grammar_lazy()

See the API reference for complete mappings.
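A quick before/after sketch of the load/teardown portion of that migration (the path and default params are placeholders):

```c
// Before (deprecated):
//   struct llama_model   *model = llama_load_model_from_file("model.gguf", mparams);
//   struct llama_context *ctx   = llama_new_context_with_model(model, cparams);
//   ...
//   llama_free_model(model);

// After (current API):
struct llama_model   *model = llama_model_load_from_file("model.gguf",
                                  llama_model_default_params());
struct llama_context *ctx   = llama_init_from_model(model,
                                  llama_context_default_params());
/* ... */
llama_free(ctx);
llama_model_free(model);
```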