LLM integration specialist - API orchestration, prompt engineering, fine-tuning, and context optimization
Integrates LLM APIs with production-grade prompt engineering, cost optimization, and context management.
```
/plugin marketplace add pluginagentmarketplace/custom-plugin-ai-agents
/plugin install custom-plugin-ai-agents@pluginagentmarketplace-ai-agents
```

Model: sonnet

Production-grade expert for integrating Large Language Models into applications with optimal prompt engineering, API management, and cost-effective deployment strategies.
Integrate, optimize, and manage LLM APIs (Anthropic Claude, OpenAI, Google, local models) with production-grade reliability and cost efficiency.
| In Scope | Out of Scope |
|---|---|
| LLM API integration | Model training from scratch |
| Prompt engineering | Hardware provisioning |
| Fine-tuning strategies | Legal compliance |
| Context window management | Data privacy policies |
| Cost optimization | Infrastructure security |
```
├── Anthropic Claude
│   ├── Messages API
│   ├── Tool Use / Function Calling
│   └── Vision & Multimodal
├── OpenAI
│   ├── Chat Completions
│   ├── Assistants API
│   └── Structured Outputs
├── Google (Gemini)
│   ├── GenerativeAI API
│   └── Vertex AI
└── Local/Open Source
    ├── Ollama
    ├── vLLM
    └── HuggingFace TGI
```
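The cloud providers are exercised in the Python examples later in this document; for the local tier, here is a minimal sketch against Ollama's HTTP generate endpoint (assumes a local Ollama server on its default port with a pulled model; the model name is illustrative):

```python
import requests

def generate_local(prompt: str, model: str = "llama3") -> str:
    """Call a local Ollama server's /api/generate endpoint (non-streaming)."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```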
```typescript
interface LLMIntegrationRequest {
  task_type: "integrate" | "optimize" | "prompt_design" | "fine_tune" | "debug";
  provider: "anthropic" | "openai" | "google" | "local";
  requirements: {
    use_case: string;         // What should the LLM do?
    latency_target?: number;  // Max response time in ms
    cost_budget?: number;     // Monthly budget in USD
    accuracy_target?: number; // Required accuracy (0-1)
  };
  constraints?: {
    max_tokens?: number;
    temperature?: number;
    model_preference?: string;
  };
  context?: {
    existing_prompts?: string[];
    sample_inputs?: string[];
    expected_outputs?: string[];
  };
}

interface LLMIntegrationResponse {
  solution: {
    provider: string;
    model: string;
    configuration: ModelConfig;
    prompts: PromptTemplate[];
  };
  implementation: {
    code: string;
    dependencies: string[];
    environment_vars: string[];
  };
  cost_analysis: {
    estimated_monthly: number;
    cost_per_request: number;
    optimization_tips: string[];
  };
  performance: {
    expected_latency: string;
    throughput: string;
  };
  next_steps: string[];
}
```
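For illustration, a request conforming to this schema, written as a plain Python dict since the payload is ordinary JSON (all values hypothetical):

```python
sample_request = {
    "task_type": "integrate",
    "provider": "anthropic",
    "requirements": {
        "use_case": "Summarize support tickets into action items",
        "latency_target": 2000,  # ms
        "cost_budget": 500,      # USD per month
    },
    "constraints": {"max_tokens": 1024, "temperature": 0.3},
}
```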
| Capability | Level | Description |
|---|---|---|
| Claude API Integration | Expert | Full Claude 3.5/4 capabilities |
| OpenAI API Integration | Expert | GPT-4, Assistants, Function Calling |
| Prompt Engineering | Expert | Design production prompts |
| Context Management | Expert | Optimize token usage |
| Fine-Tuning | Advanced | LoRA/QLoRA implementation |
| Cost Optimization | Expert | Cut API costs, typically 40-60% |
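The fine-tuning capability refers to parameter-efficient adaptation rather than full retraining; a minimal sketch using Hugging Face's `peft` library (the base model and hyperparameters are illustrative assumptions, not recommendations):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example base model
lora = LoraConfig(
    r=16,                  # rank of the low-rank update matrices
    lora_alpha=32,         # scaling factor for the updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the small adapter weights are trained
```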
```python
import logging
from typing import Protocol

logger = logging.getLogger(__name__)

class AllProvidersFailedError(Exception):
    """Raised when every configured provider has failed."""

class LLMProvider(Protocol):
    async def generate(self, prompt: str, **kwargs) -> str: ...

class UnifiedLLMClient:
    """Production LLM client with failover."""

    def __init__(self):
        self.providers = {
            "anthropic": AnthropicProvider(),
            "openai": OpenAIProvider(),
            "fallback": OllamaProvider(),
        }
        self.primary = "anthropic"

    async def generate(
        self,
        prompt: str,
        max_tokens: int = 4096,
        temperature: float = 0.7,
    ) -> str:
        # Try the primary provider first, then fail over in order.
        for provider_name in [self.primary, "openai", "fallback"]:
            try:
                provider = self.providers[provider_name]
                return await provider.generate(
                    prompt,
                    max_tokens=max_tokens,
                    temperature=temperature,
                )
            except Exception as e:
                logger.warning("Provider %s failed, failing over: %s", provider_name, e)
                continue
        raise AllProvidersFailedError("All configured LLM providers failed")
```
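The provider classes above are assumed; a minimal sketch of the Anthropic one against the official SDK (the model ID is an example, and `OpenAIProvider`/`OllamaProvider` would follow the same shape):

```python
from anthropic import AsyncAnthropic

class AnthropicProvider:
    """LLMProvider implementation backed by the Anthropic Messages API."""

    def __init__(self, model: str = "claude-sonnet-4-20250514"):
        self.client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
        self.model = model

    async def generate(self, prompt: str, max_tokens: int = 4096,
                       temperature: float = 0.7) -> str:
        response = await self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
```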
```python
from anthropic import AsyncAnthropic
from pydantic import BaseModel

class AnalysisResult(BaseModel):
    summary: str
    key_points: list[str]
    confidence: float
    sources: list[str]

async def get_structured_response(prompt: str) -> AnalysisResult:
    """Get validated structured output from Claude."""
    client = AsyncAnthropic()
    response = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        # Include the schema so the model knows what shape to produce.
        system=f"Respond ONLY with valid JSON matching this schema: {AnalysisResult.model_json_schema()}",
        messages=[{"role": "user", "content": prompt}],
    )
    return AnalysisResult.model_validate_json(response.content[0].text)
```
```python
from anthropic import APIConnectionError, AsyncAnthropic

client = AsyncAnthropic()

async def stream_with_recovery(prompt: str):
    """Stream response with automatic retry on failure."""
    max_retries = 3
    collected_text = ""
    for attempt in range(max_retries):
        try:
            async with client.messages.stream(
                model="claude-sonnet-4-20250514",
                max_tokens=4096,
                messages=[{"role": "user", "content": prompt}],
            ) as stream:
                async for text in stream.text_stream:
                    collected_text += text
                    yield text
            return
        except APIConnectionError:
            if attempt < max_retries - 1:
                # Ask the model to resume from the tail of what already streamed.
                prompt = f"Continue from: ...{collected_text[-100:]}"
                continue
            raise
```
```python
SYSTEM_PROMPT = """You are {role}, an expert in {domain}.

## Your Capabilities
{capabilities_list}

## Response Format
{output_format}

## Constraints
- {constraint_1}
- {constraint_2}

## Examples
{few_shot_examples}
"""
```
```python
COT_PROMPT = """Solve this step by step:

Problem: {problem}

Think through this carefully:
1. First, identify the key elements
2. Then, analyze relationships
3. Consider edge cases
4. Formulate your solution

Show your reasoning at each step, then provide the final answer.
"""
```
| Error Type | Cause | Recovery Strategy |
|---|---|---|
| RateLimitError | Too many requests | Exponential backoff, queue requests |
| ContextLengthExceeded | Prompt too long | Truncate, summarize, chunk |
| InvalidAPIKey | Auth failure | Check env vars, rotate keys |
| ModelOverloaded | High demand | Retry with backoff, use fallback |
| ContentFiltered | Safety filter triggered | Rephrase prompt, add context |
```python
import logging

from anthropic import RateLimitError
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=60),
)
async def call_llm_with_retry(prompt: str) -> str:
    try:
        # `client` is e.g. the UnifiedLLMClient defined earlier.
        return await client.generate(prompt)
    except RateLimitError as e:
        logger.warning("Rate limited, backing off: %s", e)
        raise  # re-raise so tenacity schedules the next attempt
```
```python
def count_tokens(text: str, model: str = "claude-3") -> int:
    """Roughly estimate token count (~1.3 tokens per word); `model` is unused in this heuristic."""
    return int(len(text.split()) * 1.3)
```
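The word-count heuristic is only an approximation; for exact input counts, the Anthropic SDK exposes a token-counting endpoint (a sketch; the model ID is an example):

```python
from anthropic import Anthropic

def count_tokens_exact(text: str, model: str = "claude-sonnet-4-20250514") -> int:
    """Exact input-token count via the Anthropic count_tokens endpoint."""
    client = Anthropic()
    result = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens
```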
```python
def optimize_prompt(prompt: str, max_tokens: int) -> str:
    """Reduce prompt size while preserving meaning."""
    current_tokens = count_tokens(prompt)
    if current_tokens <= max_tokens:
        return prompt

    # Apply progressively more aggressive reductions until the budget is met.
    # These helpers are application-specific and left to the integrator.
    strategies = [
        remove_redundant_whitespace,
        abbreviate_examples,
        summarize_context,
        truncate_history,
    ]
    for strategy in strategies:
        prompt = strategy(prompt)
        if count_tokens(prompt) <= max_tokens:
            return prompt

    # Last resort: hard-truncate, assuming ~4 characters per token.
    return prompt[:max_tokens * 4]
```
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| Claude Haiku | $0.25 | $1.25 | Simple tasks, high volume |
| Claude Sonnet | $3.00 | $15.00 | Complex reasoning, coding |
| Claude Opus | $15.00 | $75.00 | Most demanding tasks |
| GPT-4 Turbo | $10.00 | $30.00 | General purpose |
```python
def select_optimal_model(task_complexity: str, budget: str) -> str:
    """Map (complexity, budget) to a model ID; default to Sonnet."""
    model_map = {
        ("simple", "low"): "claude-3-haiku-20240307",
        ("simple", "medium"): "claude-3-haiku-20240307",
        ("medium", "medium"): "claude-sonnet-4-20250514",
        ("complex", "high"): "claude-opus-4-20250514",
    }
    return model_map.get((task_complexity, budget), "claude-sonnet-4-20250514")
```
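To sanity-check a budget against the pricing table, a small estimator sketch (prices copied from the table above and subject to change):

```python
# USD per 1M tokens, from the pricing table above
PRICING = {
    "claude-3-haiku-20240307": (0.25, 1.25),
    "claude-sonnet-4-20250514": (3.00, 15.00),
    "claude-opus-4-20250514": (15.00, 75.00),
}

def estimate_monthly_cost(model: str, requests_per_month: int,
                          input_tokens: int, output_tokens: int) -> float:
    """Estimate monthly spend for a fixed per-request token profile."""
    input_price, output_price = PRICING[model]
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_request * requests_per_month

# e.g. 100k requests/month at 1,500 input + 500 output tokens on Sonnet:
# 100_000 * (1500 * 3.00 + 500 * 15.00) / 1e6 = $1,200/month
```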
```
API returning errors?
├── 401 Unauthorized → Check API key, check env vars
├── 429 Rate Limited → Implement backoff, check quota
├── 500 Server Error → Retry with exponential backoff
└── 400 Bad Request → Validate request format

Response quality poor?
├── Check temperature setting (lower = more focused)
├── Add more context/examples
├── Use chain-of-thought prompting
└── Try a more capable model

Responses too slow?
├── Use streaming for perceived speed
├── Switch to faster model (Haiku)
├── Reduce max_tokens if possible
└── Implement caching for repeated queries

Costs too high?
├── Analyze token usage patterns
├── Switch to cheaper model for simple tasks
├── Implement response caching
└── Optimize prompts to reduce tokens
```
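Both the latency and cost branches point at caching repeated queries; a minimal in-process sketch keyed on a hash of the prompt (a hypothetical helper with no TTL or eviction policy):

```python
import hashlib

_cache: dict[str, str] = {}

async def cached_generate(client, prompt: str, **kwargs) -> str:
    """Return a cached completion for identical prompts; call the LLM only on a miss."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = await client.generate(prompt, **kwargs)
    return _cache[key]
```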