ggml is a minimalistic C tensor computation library that powers llama.cpp and many other ML inference engines.
Version: v0.9.7 · Language: C (C++ optional) · License: MIT · Repo: https://github.com/ggml-org/ggml
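
    #include "ggml.h"
    #include "ggml-cpu.h"
    #include "ggml-backend.h"

    int main(void) {
        // Memory pool that holds tensor metadata, tensor data, and the graph.
        struct ggml_init_params params = {
            .mem_size   = 64 * 1024 * 1024, // 64 MB context pool
            .mem_buffer = NULL,             // let ggml allocate the pool
            .no_alloc   = false,            // allocate tensor data inside the pool
        };
        struct ggml_context * ctx = ggml_init(params);

        // Two 4-element F32 inputs; fill a->data and b->data before computing.
        struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
        struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);

        // Records the add; nothing is computed yet.
        struct ggml_tensor * c = ggml_add(ctx, a, b);

        // Build the forward graph that ends at c.
        struct ggml_cgraph * gf = ggml_new_graph(ctx);
        ggml_build_forward_expand(gf, c);

        // Run the graph on the CPU backend.
        ggml_backend_t backend = ggml_backend_cpu_init();
        ggml_backend_graph_compute(backend, gf);

        ggml_backend_free(backend);
        ggml_free(ctx);
        return 0;
    }
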
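
The quick start follows a build-then-compute pattern: the add is first recorded as a graph node and only evaluated when the graph is run. The sketch below is a toy model of that pattern in plain C, written for illustration only. It does not use ggml, and every `toy_` name is hypothetical rather than part of the library.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_N         4   /* elements per toy tensor */
#define TOY_MAX_NODES 8   /* capacity of the toy graph */

enum toy_op { TOY_OP_NONE, TOY_OP_ADD };

/* Hypothetical stand-in for a tensor: an op, its two sources, and data. */
struct toy_node {
    enum toy_op      op;
    struct toy_node *src0;
    struct toy_node *src1;
    float            data[TOY_N];
};

/* Hypothetical stand-in for a compute graph: nodes in execution order. */
struct toy_graph {
    struct toy_node *nodes[TOY_MAX_NODES];
    int              n_nodes;
};

/* Record dst = a + b. No arithmetic happens here. */
static void toy_add(struct toy_node *dst, struct toy_node *a, struct toy_node *b) {
    dst->op   = TOY_OP_ADD;
    dst->src0 = a;
    dst->src1 = b;
}

/* Append n and everything it depends on, sources first. Leaves are skipped. */
static void toy_build_forward(struct toy_graph *g, struct toy_node *n) {
    if (n == NULL || n->op == TOY_OP_NONE) {
        return;
    }
    for (int i = 0; i < g->n_nodes; i++) {
        if (g->nodes[i] == n) {
            return; /* already scheduled */
        }
    }
    toy_build_forward(g, n->src0);
    toy_build_forward(g, n->src1);
    assert(g->n_nodes < TOY_MAX_NODES);
    g->nodes[g->n_nodes++] = n;
}

/* Evaluate every recorded node in order. */
static void toy_compute(struct toy_graph *g) {
    for (int i = 0; i < g->n_nodes; i++) {
        struct toy_node *n = g->nodes[i];
        if (n->op == TOY_OP_ADD) {
            for (int k = 0; k < TOY_N; k++) {
                n->data[k] = n->src0->data[k] + n->src1->data[k];
            }
        }
    }
}
```

Calling `toy_add` leaves the destination data untouched; the sums appear only after `toy_build_forward` and `toy_compute` have run, which mirrors the order of calls in the quick start.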
Call ggml_backend_load_all() to discover available hardware.

| Domain | File | Description |
|---|---|---|
| Context, tensors & graphs | api-core.md | Init, create tensors, graph ops, scalar access, constants |
| Arithmetic & matrix ops | api-arithmetic.md | add/mul/matmul, reductions, loss functions, quantize |
| Activations, norms & shapes | api-activations.md | relu/gelu/silu, RMS norm, reshape/permute/concat, custom ops |
| Attention, convolution & RoPE | api-attention.md | Flash Attention, RoPE variants, 1D/2D/3D conv, pooling, padding |
| Backend, memory & scheduler | api-backend.md | Backends, buffer types, scheduler, gallocr, CPU threadpool, F16 conversions |
| GGUF file format | api-gguf.md | Read/write GGUF v3: KV metadata, tensor layout, serialization |
| Optimization & training | api-optimization.md | Datasets, AdamW/SGD optimizer, epoch loop, ggml_opt_fit |
| Working examples | workflows.md | Quick start, GGUF loading, multi-backend, attention, training, quantize |
See references/workflows.md for complete examples.
Quick reference:
- Set mem_size generously; ggml_init fails silently if it is too small
- ne[0] is the innermost (fastest) dimension; for a [rows × cols] matrix use ne0=cols, ne1=rows
- Call ggml_backend_graph_compute() to run the graph
- Call ggml_backend_load_all() at startup; use ggml_backend_init_best() to pick the best available device
- ggml_mul_mat supports mixed precision (e.g. Q4_0 weights × F32 activations) natively
- ggml_add_inplace overwrites tensor a and avoids an allocation; it is only safe when a is not used elsewhere in the graph
- Set the CPU thread count with ggml_backend_cpu_set_n_threads() or a custom threadpool
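
The dimension-order rule is easy to get backwards, so here is a minimal sketch of how a contiguous F32 [rows × cols] matrix maps to memory when ne[0] holds the column count. It is plain C and does not call ggml. The `toy_` struct and functions are hypothetical; they assume the usual contiguous layout, where the byte stride of dimension 0 is the element size and the stride of dimension 1 is one full row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the shape fields of a 2-D F32 tensor:
 * ne[] holds element counts, nb[] holds strides in bytes. */
struct toy_tensor_2d {
    int64_t ne[2];
    size_t  nb[2];
    float  *data;
};

/* Describe a rows x cols matrix: ne[0] = cols (innermost), ne[1] = rows. */
static struct toy_tensor_2d toy_matrix(float *data, int64_t rows, int64_t cols) {
    struct toy_tensor_2d t;
    t.ne[0] = cols;
    t.ne[1] = rows;
    t.nb[0] = sizeof(float);           /* step to the next column */
    t.nb[1] = t.nb[0] * (size_t) cols; /* step to the next row */
    t.data  = data;
    return t;
}

/* Byte offset of element (row, col) from the start of the data. */
static size_t toy_offset(const struct toy_tensor_2d *t, int64_t row, int64_t col) {
    assert(col >= 0 && col < t->ne[0]);
    assert(row >= 0 && row < t->ne[1]);
    return (size_t) col * t->nb[0] + (size_t) row * t->nb[1];
}

/* Read element (row, col) through the byte strides. */
static float toy_get(const struct toy_tensor_2d *t, int64_t row, int64_t col) {
    const char *base = (const char *) t->data;
    return *(const float *) (base + toy_offset(t, row, col));
}
```

For a 2 × 3 matrix stored as `{0, 1, 2, 10, 11, 12}`, `toy_matrix(buf, 2, 3)` yields ne[0] = 3 and ne[1] = 2, and element (row 1, col 2) is the last float in the buffer. Consecutive columns of one row are adjacent in memory, which is what "innermost" means here.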