AI model inference and serving. Activate when: (1) Setting up LocalAI or vLLM, (2) Configuring model serving, (3) Working with GGUF/GGML models, (4) Implementing inference pipelines, or (5) Optimizing model performance.
Sets up local AI model serving with LocalAI or vLLM and configures GGUF models.
Install: `npx claudepluginhub flexnetos/ripple-env`

This skill inherits all available tools. When active, it can use any tool Claude has access to.
This skill covers local AI inference using LocalAI, vLLM, and other model serving frameworks for running LLMs and other AI models.
```bash
# Docker (recommended)
docker run -p 8080:8080 \
  -v $PWD/models:/models \
  localai/localai:latest-cpu

# With GPU support
docker run --gpus all -p 8080:8080 \
  -v $PWD/models:/models \
  localai/localai:latest-gpu-nvidia-cuda-12
```
```yaml
# models/llama.yaml
name: llama
backend: llama-cpp
parameters:
  model: /models/llama-2-7b-chat.Q4_K_M.gguf
  temperature: 0.7
  top_p: 0.9
  top_k: 40
context_size: 4096
threads: 4
gpu_layers: 35  # Offload layers to GPU

# Template for chat
template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}
    {{else if eq .Role "assistant"}}Assistant: {{.Content}}
    {{end}}
    {{end}}
    Assistant:
```
```bash
# Chat completion (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'

# Text completion
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "prompt": "Once upon a time",
    "max_tokens": 100
  }'

# Embeddings
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-ada-002",
    "input": "Hello world"
  }'
```
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # LocalAI doesn't require an API key
)

response = client.chat.completions.create(
    model="llama",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing."}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)
```
```bash
pip install vllm
# Or with a specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

# Start the OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --port 8000 \
  --tensor-parallel-size 1

# With quantization
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-7B-Chat-AWQ \
  --quantization awq
```
```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256
)

# Generate
prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")
```
| Format | Description |
|---|---|
| GGUF | Current format (successor to GGML); recommended |
| GGML | Legacy format, superseded by GGUF |
| Q4_K_M | 4-bit quantization, medium quality |
| Q5_K_M | 5-bit quantization, better quality |
| Q8_0 | 8-bit quantization, near FP16 |
| F16 | Full 16-bit floating point |
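As a rule of thumb, a quantized model's file size scales with its average bits per weight. A minimal sketch of that estimate (the bits-per-weight figures are rough averages, not exact GGUF block-size constants):

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8 bytes.
# Bits-per-weight values are approximate averages (K-quants mix
# several bit widths per block), not exact format constants.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def estimate_size_gb(n_params: float, quant: str) -> float:
    """Approximate model file size in GiB for a given quantization."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1024**3

# A 7B-parameter model at different quantization levels:
for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_size_gb(7e9, quant):.1f} GiB")
```

Add a margin on top of this for the KV cache and activations when sizing GPU memory.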
```bash
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Convert a Hugging Face model to GGUF
# (newer releases name this script convert_hf_to_gguf.py)
python convert.py /path/to/hf-model --outfile model.gguf

# Quantize (the binary is llama-quantize in newer releases)
./quantize model.gguf model-q4_k_m.gguf Q4_K_M

# Run inference (llama-cli in newer releases)
./main -m model-q4_k_m.gguf \
  -p "Hello, how are you?" \
  -n 256 \
  --temp 0.7
```
```python
# vLLM memory management
llm = LLM(
    model="model-name",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory
    max_model_len=4096,          # Max context length
    enforce_eager=False          # Use CUDA graphs
)
```
```python
# Continuous batching (vLLM default):
# requests are batched dynamically for better throughput.

# Static batching: submit a list of prompts in one call
sampling_params = SamplingParams(max_tokens=256)
outputs = llm.generate(
    prompts,  # List of prompts
    sampling_params,
    use_tqdm=True
)
```
```python
# Use a draft model for faster inference (speculative decoding)
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    speculative_model="meta-llama/Llama-2-7b-chat-hf",
    num_speculative_tokens=5
)
```
LocalAI exposes a Prometheus-compatible `/metrics` endpoint:

```yaml
# docker-compose.yaml
services:
  localai:
    image: localai/localai:latest
    ports:
      - "8080:8080"
    environment:
      - METRICS=true
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
```
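A matching `prometheus.yml` that scrapes the LocalAI metrics endpoint might look like this (the job name and scrape interval are illustrative):

```yaml
# prometheus.yml
scrape_configs:
  - job_name: localai
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["localai:8080"]
```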
| Metric | Description |
|---|---|
| request_latency | Time to generate a response |
| tokens_per_second | Generation speed |
| queue_depth | Pending requests |
| gpu_memory_used | GPU memory usage |
| batch_size | Current batch size |
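tokens_per_second is the metric to watch when tuning batch size or quantization, and it can also be measured client-side around any generate call. A minimal stdlib sketch (the timing wrapper and stand-in backend are illustrative):

```python
import time

def tokens_per_second(generate_fn, prompt: str) -> tuple[str, float]:
    """Time a generate call and report throughput.
    generate_fn must return (text, completion_token_count)."""
    start = time.perf_counter()
    text, n_tokens = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return text, n_tokens / elapsed

# Usage with a stand-in backend; swap in a real client call here.
fake_backend = lambda prompt: ("Hello!", 2)
text, tps = tokens_per_second(fake_backend, "Hi")
```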
```toml
# pixi.toml feature flags
[feature.inference-localai]
[feature.inference-localai.dependencies]
# LocalAI-specific dependencies

[feature.inference-vllm]
[feature.inference-vllm.dependencies]
vllm = ">=0.3.0"
torch = ">=2.0"

[environments]
default = { features = ["inference-localai"] }
vllm = { features = ["inference-vllm"] }
```