This skill provides comprehensive knowledge for Ollama integration in the Ahling Command Center, including model management, GPU optimization for AMD RX 7900 XTX, custom Modelfiles, and multi-model orchestration.
```
/plugin marketplace add Lobbi-Docs/claude
/plugin install ahling-command-center@claude-orchestration
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
```yaml
target_gpu: AMD RX 7900 XTX
vram_total: 24GB
rocm_version: ">=6.0"

vram_allocation:
  primary_model: 16GB    # Large models (70B Q4, 34B)
  secondary_model: 4GB   # Fast models (7B, 3B)
  embeddings: 2GB        # nomic-embed-text
  reserved: 2GB          # Whisper, Frigate overlap
```
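This allocation is a planning budget, not something Ollama enforces. A quick sanity check that the plan fits the card is a few lines of Python (the numbers simply mirror the YAML above):

```python
# Sanity-check the VRAM plan against the RX 7900 XTX's 24GB.
VRAM_TOTAL_GB = 24
allocation_gb = {
    "primary_model": 16,   # large models (70B Q4, 34B)
    "secondary_model": 4,  # fast models (7B, 3B)
    "embeddings": 2,       # nomic-embed-text
    "reserved": 2,         # Whisper, Frigate overlap
}
assert sum(allocation_gb.values()) <= VRAM_TOTAL_GB  # 16 + 4 + 2 + 2 = 24
```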
Route requests to appropriate models based on task complexity:
| Task Type | Model | VRAM | Use Case |
|---|---|---|---|
| Complex Reasoning | llama3.2:70b-q4 | 16GB | Planning, analysis, code review |
| Quick Response | llama3.2:7b | 4GB | Simple queries, fast interactions |
| Code Generation | codellama:34b-q4 | 12GB | Code writing, debugging |
| Home Assistant | fixt/home-3b-v3 | 2GB | HA entity control, automation |
| Embeddings | nomic-embed-text | 1GB | Vector generation for RAG |
| Vision | llava:13b | 8GB | Image analysis (when needed) |
```bash
# Text generation (non-streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:70b",
  "prompt": "Explain quantum computing",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "num_ctx": 8192,
    "num_gpu": 99
  }
}'
```
```bash
# Chat (streaming)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:7b",
  "messages": [
    {"role": "system", "content": "You are the Ahling Command Center AI."},
    {"role": "user", "content": "What is the status of my home?"}
  ],
  "stream": true
}'
```
```bash
# Embeddings
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Home Assistant automation for motion-activated lights"
}'
```
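The same calls are available through the official `ollama` Python client; a minimal sketch, assuming `pip install ollama` and the default localhost endpoint:

```python
import ollama

# Chat completion against the fast model
reply = ollama.chat(
    model="llama3.2:7b",
    messages=[{"role": "user", "content": "What is the status of my home?"}],
)
print(reply["message"]["content"])

# Embedding for RAG indexing
emb = ollama.embeddings(
    model="nomic-embed-text",
    prompt="Home Assistant automation for motion-activated lights",
)
print(len(emb["embedding"]))  # vector dimensionality
```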
```bash
# List models
curl http://localhost:11434/api/tags

# Pull model
curl http://localhost:11434/api/pull -d '{"name": "llama3.2:70b"}'

# Delete model (the delete endpoint expects the DELETE verb)
curl -X DELETE http://localhost:11434/api/delete -d '{"name": "old-model"}'

# Show model info
curl http://localhost:11434/api/show -d '{"name": "llama3.2:70b"}'
```
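For scripting, `/api/tags` returns JSON that pipes cleanly through `jq` (assuming `jq` is installed on the host):

```bash
# Print each installed model with its approximate size in GB
curl -s http://localhost:11434/api/tags |
  jq -r '.models[] | "\(.name)\t\(.size / 1e9 | floor)GB"'
```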
```
# Modelfile.ahling-home
FROM fixt/home-3b-v3

PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER stop "<|im_end|>"

SYSTEM """
You are the Ahling Command Center AI, integrated with Home Assistant.
You control a smart home with these capabilities:
- Lights in all rooms (living room, bedroom, office, kitchen, garage)
- Climate control (HVAC, fans)
- Security (cameras, locks, motion sensors)
- Media (TV, speakers)
- Energy monitoring (solar, battery, consumption)

When asked to control devices, respond with the exact service call needed.
Be concise and action-oriented.
"""
```
```
# Modelfile.ahling-coordinator
FROM llama3.2:7b

PARAMETER temperature 0.8
PARAMETER num_ctx 8192
PARAMETER top_p 0.9

SYSTEM """
You are the Ahling Command Center Coordinator, responsible for:
1. Orchestrating multi-agent workflows
2. Synthesizing information from multiple sources
3. Making decisions that affect the entire home system
4. Providing morning briefings and status reports

You have access to:
- Home Assistant for physical control
- Knowledge graph (Neo4j) for context
- Vector database (Qdrant) for semantic search
- Multiple specialist agents

Always think step-by-step and explain your reasoning.
"""
```
```bash
# ROCm for AMD GPU
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export OLLAMA_NUM_GPU=99
export OLLAMA_GPU_OVERHEAD=256m
export OLLAMA_MAX_LOADED_MODELS=3

# Memory optimization
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
```
```yaml
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    devices:
      - /dev/kfd
      - /dev/dri
    volumes:
      - ollama_data:/root/.ollama
      - ./modelfiles:/modelfiles
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.0
      - OLLAMA_NUM_GPU=99
      - OLLAMA_FLASH_ATTENTION=1
    ports:
      - "11434:11434"
    group_add:
      - video
      - render
    security_opt:
      - seccomp:unconfined
    cap_add:
      - SYS_PTRACE

volumes:
  ollama_data:  # named volume for model storage
```
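After `docker compose up -d`, a quick way to confirm the container actually sees the GPU is to load a model and check where its layers landed:

```bash
# Load a small model, then inspect placement
docker exec ollama ollama run llama3.2:7b "hello" > /dev/null
docker exec ollama ollama ps   # output should report the model as running on GPU
```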
```python
class ModelRouter:
    """Route requests to appropriate Ollama models."""

    ROUTING_TABLE = {
        "complex": "llama3.2:70b",    # Complex reasoning
        "fast": "llama3.2:7b",        # Quick responses
        "code": "codellama:34b",      # Code tasks
        "home": "ahling-home",        # Home Assistant
        "embed": "nomic-embed-text",  # Embeddings
        "vision": "llava:13b",        # Image analysis
    }

    def route(self, task_type: str, complexity: float = 0.5) -> str:
        """Select model based on task type and complexity."""
        if task_type == "auto":
            if complexity > 0.7:
                return self.ROUTING_TABLE["complex"]
            else:
                return self.ROUTING_TABLE["fast"]
        return self.ROUTING_TABLE.get(task_type, "llama3.2:7b")
```
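Usage is a single lookup; for example (names exactly as defined in the class above):

```python
router = ModelRouter()
router.route("code")                  # -> "codellama:34b"
router.route("auto", complexity=0.9)  # -> "llama3.2:70b"
router.route("auto", complexity=0.3)  # -> "llama3.2:7b"
```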
```yaml
# Maximum 3 models loaded simultaneously
# Priority order: home (always), fast (high), complex (on-demand)
model_priority:
  1: ahling-home   # Always loaded for HA control
  2: llama3.2:7b   # Fast responses, always ready
  3: llama3.2:70b  # Load on-demand for complex tasks

unload_strategy:
  idle_timeout: 300  # Unload after 5 minutes idle
  priority_keep: 2   # Keep top 2 priority models loaded
```
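Ollama implements idle timeouts through the per-request `keep_alive` parameter, so this strategy maps directly onto request options:

```bash
# Pin the HA model in memory indefinitely
curl http://localhost:11434/api/generate -d '{
  "model": "ahling-home",
  "keep_alive": -1
}'

# Let the large model unload after 5 idle minutes
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:70b",
  "keep_alive": "5m"
}'
```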
```python
from ollama import AsyncClient

ollama = AsyncClient()  # async client for the local Ollama server

async def ha_control_with_ollama(user_request: str):
    """Process voice command through Ollama for HA control."""
    # Use the home-optimized model
    response = await ollama.chat(
        model="ahling-home",
        messages=[
            {"role": "user", "content": user_request}
        ],
    )

    # Parse the service call from the response (project helper)
    service_call = parse_ha_service(response["message"]["content"])

    # Execute on Home Assistant (project HA client)
    await ha.call_service(**service_call)
```
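A minimal invocation, assuming the `parse_ha_service` helper and `ha` client above are wired up in the service:

```python
import asyncio

asyncio.run(ha_control_with_ollama("Turn off the garage lights"))
```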
```python
from autogen import AssistantAgent

# Register Ollama as LLM backend for AutoGen
config_list = [
    {
        "model": "llama3.2:70b",
        "base_url": "http://ollama:11434/v1",
        "api_type": "ollama",
        "api_key": "ollama",  # Placeholder
    }
]

# Create AutoGen agent with Ollama
coordinator = AssistantAgent(
    name="coordinator",
    llm_config={"config_list": config_list},
    system_message="You are the Ahling Command Center coordinator...",
)
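```

Driving the agent then follows the usual AutoGen pattern; a sketch assuming the `pyautogen` package with no human in the loop:

```python
from autogen import UserProxyAgent

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",      # fully automated exchange
    code_execution_config=False,   # no local code execution
)
user.initiate_chat(coordinator, message="Draft this morning's status briefing.")
```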
```python
async def rag_with_ollama(query: str):
    """RAG query using Ollama embeddings and generation."""
    # Generate query embedding (response carries an "embedding" vector)
    query_embedding = await ollama.embeddings(
        model="nomic-embed-text",
        prompt=query,
    )

    # Search Qdrant (async qdrant-client; the parameter is collection_name)
    results = await qdrant.search(
        collection_name="knowledge",
        query_vector=query_embedding["embedding"],
        limit=5,
    )

    # Generate response with retrieved context
    context = "\n".join(r.payload["text"] for r in results)
    response = await ollama.chat(
        model="llama3.2:70b",
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return response["message"]["content"]
```
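The `ollama` and `qdrant` handles above are async clients; a plausible setup, assuming the default Qdrant port:

```python
from ollama import AsyncClient
from qdrant_client import AsyncQdrantClient

ollama = AsyncClient()
qdrant = AsyncQdrantClient(url="http://localhost:6333")

# Inside any coroutine:
#   answer = await rag_with_ollama("Which automations use the garage motion sensor?")
```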
```bash
# Check ROCm installation
rocm-smi

# Verify device access
ls -la /dev/kfd /dev/dri

# Check Ollama GPU usage
curl http://localhost:11434/api/ps
```
```bash
# Unload unused models
curl http://localhost:11434/api/generate -d '{
  "model": "large-model",
  "keep_alive": 0
}'

# Check current VRAM usage
rocm-smi --showmeminfo vram

# Enable flash attention
export OLLAMA_FLASH_ATTENTION=1

# Reduce context length
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:70b",
  "options": {"num_ctx": 4096}
}'
```