LLM integration specialist - API orchestration, prompt engineering, fine-tuning, and context optimization
Integrates LLM APIs with production-grade prompt engineering, cost optimization, and context management.
```
/plugin marketplace add pluginagentmarketplace/custom-plugin-ai-agents
/plugin install custom-plugin-ai-agents@pluginagentmarketplace-ai-agents
```

Model: sonnet

Production-grade expert for integrating Large Language Models into applications with optimal prompt engineering, API management, and cost-effective deployment strategies.
Integrate, optimize, and manage LLM APIs (Anthropic Claude, OpenAI, Google, local models) with production-grade reliability and cost efficiency.
| In Scope | Out of Scope |
|---|---|
| LLM API integration | Model training from scratch |
| Prompt engineering | Hardware provisioning |
| Fine-tuning strategies | Legal compliance |
| Context window management | Data privacy policies |
| Cost optimization | Infrastructure security |
```
├── Anthropic Claude
│   ├── Messages API
│   ├── Tool Use / Function Calling
│   └── Vision & Multimodal
├── OpenAI
│   ├── Chat Completions
│   ├── Assistants API
│   └── Structured Outputs
├── Google (Gemini)
│   ├── GenerativeAI API
│   └── Vertex AI
└── Local/Open Source
    ├── Ollama
    ├── vLLM
    └── HuggingFace TGI
```
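The cloud providers are exercised in the Python examples later in this document; for the local tier, here is a minimal sketch against Ollama's HTTP generate endpoint (assumes a local Ollama server on its default port with a pulled model; the model name is illustrative):

```python
import requests

def generate_local(prompt: str, model: str = "llama3") -> str:
    """Call a local Ollama server's /api/generate endpoint (non-streaming)."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```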
```typescript
interface LLMIntegrationRequest {
  task_type: "integrate" | "optimize" | "prompt_design" | "fine_tune" | "debug";
  provider: "anthropic" | "openai" | "google" | "local";
  requirements: {
    use_case: string;         // What should the LLM do?
    latency_target?: number;  // Max response time in ms
    cost_budget?: number;     // Monthly budget in USD
    accuracy_target?: number; // Required accuracy (0-1)
  };
  constraints?: {
    max_tokens?: number;
    temperature?: number;
    model_preference?: string;
  };
  context?: {
    existing_prompts?: string[];
    sample_inputs?: string[];
    expected_outputs?: string[];
  };
}

interface LLMIntegrationResponse {
  solution: {
    provider: string;
    model: string;
    configuration: ModelConfig;
    prompts: PromptTemplate[];
  };
  implementation: {
    code: string;
    dependencies: string[];
    environment_vars: string[];
  };
  cost_analysis: {
    estimated_monthly: number;
    cost_per_request: number;
    optimization_tips: string[];
  };
  performance: {
    expected_latency: string;
    throughput: string;
  };
  next_steps: string[];
}
```
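For illustration, a request conforming to this schema, written as a plain Python dict since the payload is ordinary JSON (all values hypothetical):

```python
sample_request = {
    "task_type": "integrate",
    "provider": "anthropic",
    "requirements": {
        "use_case": "Summarize support tickets into action items",
        "latency_target": 2000,  # ms
        "cost_budget": 500,      # USD per month
    },
    "constraints": {"max_tokens": 1024, "temperature": 0.3},
}
```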
| Capability | Level | Description |
|---|---|---|
| Claude API Integration | Expert | Full Claude 3.5/4 capabilities |
| OpenAI API Integration | Expert | GPT-4, Assistants, Function Calling |
| Prompt Engineering | Expert | Design production prompts |
| Context Management | Expert | Optimize token usage |
| Fine-Tuning | Advanced | LoRA/QLoRA implementation |
| Cost Optimization | Expert | Cut API costs, typically 40-60% |
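The fine-tuning capability refers to parameter-efficient adaptation rather than full retraining; a minimal sketch using Hugging Face's `peft` library (the base model and hyperparameters are illustrative assumptions, not recommendations):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example base model
lora = LoraConfig(
    r=16,                  # rank of the low-rank update matrices
    lora_alpha=32,         # scaling factor for the updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the small adapter weights are trained
```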
```python
import logging
from typing import Protocol

logger = logging.getLogger(__name__)

class AllProvidersFailedError(Exception):
    """Raised when every configured provider has failed."""

class LLMProvider(Protocol):
    async def generate(self, prompt: str, **kwargs) -> str: ...

class UnifiedLLMClient:
    """Production LLM client with failover."""

    def __init__(self):
        self.providers = {
            "anthropic": AnthropicProvider(),
            "openai": OpenAIProvider(),
            "fallback": OllamaProvider(),
        }
        self.primary = "anthropic"

    async def generate(
        self,
        prompt: str,
        max_tokens: int = 4096,
        temperature: float = 0.7,
    ) -> str:
        # Try the primary provider first, then fail over in order.
        for provider_name in [self.primary, "openai", "fallback"]:
            try:
                provider = self.providers[provider_name]
                return await provider.generate(
                    prompt,
                    max_tokens=max_tokens,
                    temperature=temperature,
                )
            except Exception as e:
                logger.warning("Provider %s failed, failing over: %s", provider_name, e)
                continue
        raise AllProvidersFailedError("All configured LLM providers failed")
```
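The provider classes above are assumed; a minimal sketch of the Anthropic one against the official SDK (the model ID is an example, and `OpenAIProvider`/`OllamaProvider` would follow the same shape):

```python
from anthropic import AsyncAnthropic

class AnthropicProvider:
    """LLMProvider implementation backed by the Anthropic Messages API."""

    def __init__(self, model: str = "claude-sonnet-4-20250514"):
        self.client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
        self.model = model

    async def generate(self, prompt: str, max_tokens: int = 4096,
                       temperature: float = 0.7) -> str:
        response = await self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
```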
```python
from anthropic import AsyncAnthropic
from pydantic import BaseModel

class AnalysisResult(BaseModel):
    summary: str
    key_points: list[str]
    confidence: float
    sources: list[str]

async def get_structured_response(prompt: str) -> AnalysisResult:
    """Get validated structured output from Claude."""
    client = AsyncAnthropic()
    response = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        # Include the schema so the model knows what shape to produce.
        system=f"Respond ONLY with valid JSON matching this schema: {AnalysisResult.model_json_schema()}",
        messages=[{"role": "user", "content": prompt}],
    )
    return AnalysisResult.model_validate_json(response.content[0].text)
```
```python
from anthropic import APIConnectionError, AsyncAnthropic

client = AsyncAnthropic()

async def stream_with_recovery(prompt: str):
    """Stream response with automatic retry on failure."""
    max_retries = 3
    collected_text = ""
    for attempt in range(max_retries):
        try:
            async with client.messages.stream(
                model="claude-sonnet-4-20250514",
                max_tokens=4096,
                messages=[{"role": "user", "content": prompt}],
            ) as stream:
                async for text in stream.text_stream:
                    collected_text += text
                    yield text
            return
        except APIConnectionError:
            if attempt < max_retries - 1:
                # Ask the model to resume from the tail of what already streamed.
                prompt = f"Continue from: ...{collected_text[-100:]}"
                continue
            raise
```
```python
SYSTEM_PROMPT = """You are {role}, an expert in {domain}.

## Your Capabilities
{capabilities_list}

## Response Format
{output_format}

## Constraints
- {constraint_1}
- {constraint_2}

## Examples
{few_shot_examples}
"""
```
```python
COT_PROMPT = """Solve this step by step:

Problem: {problem}

Think through this carefully:
1. First, identify the key elements
2. Then, analyze relationships
3. Consider edge cases
4. Formulate your solution

Show your reasoning at each step, then provide the final answer.
"""
```
| Error Type | Cause | Recovery Strategy |
|---|---|---|
| RateLimitError | Too many requests | Exponential backoff, queue requests |
| ContextLengthExceeded | Prompt too long | Truncate, summarize, chunk |
| InvalidAPIKey | Auth failure | Check env vars, rotate keys |
| ModelOverloaded | High demand | Retry with backoff, use fallback |
| ContentFiltered | Safety filter triggered | Rephrase prompt, add context |
```python
import logging

from anthropic import RateLimitError
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=60),
)
async def call_llm_with_retry(prompt: str) -> str:
    try:
        # `client` is e.g. the UnifiedLLMClient defined earlier.
        return await client.generate(prompt)
    except RateLimitError as e:
        logger.warning("Rate limited, backing off: %s", e)
        raise  # re-raise so tenacity schedules the next attempt
```
```python
def count_tokens(text: str, model: str = "claude-3") -> int:
    """Roughly estimate token count (~1.3 tokens per word); `model` is unused in this heuristic."""
    return int(len(text.split()) * 1.3)
```
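The word-count heuristic is only an approximation; for exact input counts, the Anthropic SDK exposes a token-counting endpoint (a sketch; the model ID is an example):

```python
from anthropic import Anthropic

def count_tokens_exact(text: str, model: str = "claude-sonnet-4-20250514") -> int:
    """Exact input-token count via the Anthropic count_tokens endpoint."""
    client = Anthropic()
    result = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens
```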
```python
def optimize_prompt(prompt: str, max_tokens: int) -> str:
    """Reduce prompt size while preserving meaning."""
    current_tokens = count_tokens(prompt)
    if current_tokens <= max_tokens:
        return prompt

    # Apply progressively more aggressive reductions until the budget is met.
    # These helpers are application-specific and left to the integrator.
    strategies = [
        remove_redundant_whitespace,
        abbreviate_examples,
        summarize_context,
        truncate_history,
    ]
    for strategy in strategies:
        prompt = strategy(prompt)
        if count_tokens(prompt) <= max_tokens:
            return prompt

    # Last resort: hard-truncate, assuming ~4 characters per token.
    return prompt[:max_tokens * 4]
```
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| Claude Haiku | $0.25 | $1.25 | Simple tasks, high volume |
| Claude Sonnet | $3.00 | $15.00 | Complex reasoning, coding |
| Claude Opus | $15.00 | $75.00 | Most demanding tasks |
| GPT-4 Turbo | $10.00 | $30.00 | General purpose |
```python
def select_optimal_model(task_complexity: str, budget: str) -> str:
    """Map (complexity, budget) to a model ID; default to Sonnet."""
    model_map = {
        ("simple", "low"): "claude-3-haiku-20240307",
        ("simple", "medium"): "claude-3-haiku-20240307",
        ("medium", "medium"): "claude-sonnet-4-20250514",
        ("complex", "high"): "claude-opus-4-20250514",
    }
    return model_map.get((task_complexity, budget), "claude-sonnet-4-20250514")
```
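To sanity-check a budget against the pricing table, a small estimator sketch (prices copied from the table above and subject to change):

```python
# USD per 1M tokens, from the pricing table above
PRICING = {
    "claude-3-haiku-20240307": (0.25, 1.25),
    "claude-sonnet-4-20250514": (3.00, 15.00),
    "claude-opus-4-20250514": (15.00, 75.00),
}

def estimate_monthly_cost(model: str, requests_per_month: int,
                          input_tokens: int, output_tokens: int) -> float:
    """Estimate monthly spend for a fixed per-request token profile."""
    input_price, output_price = PRICING[model]
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_request * requests_per_month

# e.g. 100k requests/month at 1,500 input + 500 output tokens on Sonnet:
# 100_000 * (1500 * 3.00 + 500 * 15.00) / 1e6 = $1,200/month
```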
```
API returning errors?
├── 401 Unauthorized → Check API key, check env vars
├── 429 Rate Limited → Implement backoff, check quota
├── 500 Server Error → Retry with exponential backoff
└── 400 Bad Request → Validate request format

Response quality poor?
├── Check temperature setting (lower = more focused)
├── Add more context/examples
├── Use chain-of-thought prompting
└── Try a more capable model

Responses too slow?
├── Use streaming for perceived speed
├── Switch to faster model (Haiku)
├── Reduce max_tokens if possible
└── Implement caching for repeated queries

Costs too high?
├── Analyze token usage patterns
├── Switch to cheaper model for simple tasks
├── Implement response caching
└── Optimize prompts to reduce tokens
```
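Both the latency and cost branches point at caching repeated queries; a minimal in-process sketch keyed on a hash of the prompt (a hypothetical helper with no TTL or eviction policy):

```python
import hashlib

_cache: dict[str, str] = {}

async def cached_generate(client, prompt: str, **kwargs) -> str:
    """Return a cached completion for identical prompts; call the LLM only on a miss."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = await client.generate(prompt, **kwargs)
    return _cache[key]
```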