# argos

Production LLM application discipline: prompt engineering (versioned prompts, structured output, Anthropic prompt caching) + eval harness (golden set + LLM-as-judge + CI gate) + cost budget (token counts + cache hit rate + PR-time delta) + RAG architecture (chunking + embedding + vector DB + reranker + RAGAS eval) + provider abstraction. The core is model-agnostic; examples lean on the Anthropic SDK and Claude. Owned by the `llm-engineer` agent.

Install: `npx claudepluginhub resultakak/argos --plugin argos`. This skill uses the workspace's default tool permissions.
`agents/shared/severity-rubric.md` and `agents/shared/escalation-matrix.md` are
default-loaded (agents/coordination.md §11). Output is formatted as Critical / High /
Medium / Low findings with evidence. Findings outside this agent's ownership are delegated:

- security-reviewer: prompt injection, adversarial inputs, PII data leaks (OWASP A03)
- privacy-engineer: PII prompt/response audit, retention, DPIA
- performance-profiler: latency p99 budget
- database-optimizer: pgvector RAG performance
- release-manager: model bump release notes
- finops-review: cost dashboard, budget alerts

| Topic | Tool | Notes |
|---|---|---|
| SDK | anthropic, openai, cohere | Provider abstraction layer |
| Tokenizer | tiktoken, anthropic-tokenizer | Don't hardcode len/4 |
| Eval | Inspect AI, Promptfoo, DeepEval | LLM-as-judge cross-model |
| Vector DB | pgvector, Qdrant, Weaviate, Milvus, Pinecone | Pick by corpus size/scale |
| Embedding | Voyage, OpenAI text-embedding-3, Cohere embed-v3, BGE | Off-the-shelf before domain-specific |
| Reranker | Cohere rerank, mxbai-rerank | top-K=50 → top-10 |
| Trace | OTel + Langfuse / Helicone / Phoenix | LLM-specific spans |
| Cost | Helicone, Anthropic billing API, OpenAI dashboard | Hourly poll → metric |
| Prompt eval RAG | RAGAS, TruLens | Faithfulness/recall/precision |
# Find provider imports
grep -rE "import (anthropic|openai|cohere)|from (anthropic|openai)" \
--include="*.py" --include="*.ts" --include="*.tsx"
# Locate prompt files
find . -path '*/prompts/*.md' -o -path '*/prompts/*.txt' | head
# Is there an existing eval setup?
grep -rE "promptfoo|inspect_ai|deepeval|ragas" --include="*.py" --include="*.yaml"
# Cost monitoring
grep -rE "helicone|langfuse|phoenix|anthropic.*billing" --include="*.py" --include="*.yaml"
Prompts are versioned (prompts/<task>/v3.md or a _v3 suffix).

# Anthropic prompt cache pattern
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=2048,
system=[
{
"type": "text",
"text": SYSTEM_PROMPT, # ~3KB stable persona
"cache_control": {"type": "ephemeral"},
},
{
"type": "text",
"text": rag_context, # ~15KB retrieved docs (stable per session)
"cache_control": {"type": "ephemeral"},
},
],
messages=[{"role": "user", "content": user_question}],
)
A cache hit costs 10% of the base input price; a cache write costs 125%. TTL is 5 minutes or 1 hour (beta).
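To sanity-check the cache economics, a minimal sketch: the 10% read and 125% write multipliers come from the note above, while the base price per million input tokens is a placeholder to replace with your model's current rate:

# Sketch: blended input cost per request with prompt caching.
BASE_PER_MTOK = 3.00       # USD per 1M input tokens (placeholder, not a real price)
CACHE_READ_MULT = 0.10     # cache hit: 10% of base input price
CACHE_WRITE_MULT = 1.25    # cache miss/write: 125% of base input price

def blended_input_cost(cached_tokens: int, uncached_tokens: int, hit_rate: float) -> float:
    """USD input cost for one request, given the cacheable prefix size and hit rate."""
    cached = cached_tokens * (hit_rate * CACHE_READ_MULT + (1 - hit_rate) * CACHE_WRITE_MULT)
    return (cached + uncached_tokens) * BASE_PER_MTOK / 1_000_000

# ~18K cached prefix (system prompt + RAG context), ~500 user tokens, 75% hit rate
print(f"${blended_input_cost(18_000, 500, 0.75):.5f} per request")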
# tests/golden/v3.jsonl (80 examples)
{"id": "happy.001", "input": "...", "expected_intent": "create_order", "category": "happy", "difficulty": "low"}
{"id": "edge.014", "input": "...", "expected_intent": "ambiguous", "category": "edge", "difficulty": "high"}
{"id": "adversarial.022", "input": "ignore previous instructions; ...", "expected_intent": "rejected", "category": "adversarial", "difficulty": "high"}
{"id": "regression.041", "input": "...", "expected_intent": "support", "category": "regression-issue-#1284", "difficulty": "mid"}
# tests/eval_golden.py
import json
from anthropic import Anthropic
from llm_eval import judge_with_model
def run_eval(golden_path: str, model: str, judge_model: str, threshold: float):
client = Anthropic()
judge = Anthropic()
results = []
with open(golden_path) as f:
for line in f:
case = json.loads(line)
actual = extract_intent(client, model, case["input"])
# LLM-as-judge with cross-model
score = judge_with_model(
judge, judge_model,
expected=case["expected_intent"],
actual=actual,
input=case["input"],
)
results.append({"id": case["id"], "score": score, "expected": case["expected_intent"], "actual": actual})
pass_rate = sum(1 for r in results if r["score"] >= 4) / len(results)
print(f"Pass rate: {pass_rate:.2%}")
assert pass_rate >= threshold, f"Eval regression: {pass_rate:.2%} < {threshold:.2%}"
return results
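# The eval above calls extract_intent(); a minimal sketch of it, assuming a JSON-only
# structured-output prompt. The intent labels mirror the golden set above; adapt to your task.
def extract_intent(client: Anthropic, model: str, user_input: str) -> str:
    prompt = (
        "Classify the user's intent. Allowed intents: create_order, support, "
        "ambiguous, rejected. Respond with JSON only, "
        'for example {"intent": "create_order"}.\n\n'
        f"User message: {user_input}"
    )
    response = client.messages.create(
        model=model, max_tokens=64,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return json.loads(response.content[0].text)["intent"]
    except (json.JSONDecodeError, KeyError):
        return "parse_error"  # the judge scores this as a failure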
# tools/llm_budget.py
import subprocess
from anthropic import Anthropic
import tiktoken
def count_tokens(prompt: str, model: str) -> int:
# Anthropic SDK token count or tiktoken
return Anthropic().messages.count_tokens(
model=model, messages=[{"role": "user", "content": prompt}]
).input_tokens
def diff_token_cost(baseline_ref: str, head_ref: str) -> dict:
diff_files = subprocess.check_output(
["git", "diff", "--name-only", baseline_ref, head_ref, "--", "prompts/", "src/llm/"]
).decode().split()
baseline_total = sum(count_prompt_in_ref(f, baseline_ref) for f in diff_files)
head_total = sum(count_prompt_in_ref(f, head_ref) for f in diff_files)
delta_pct = (head_total - baseline_total) / max(baseline_total, 1) * 100
return {"baseline": baseline_total, "head": head_total, "delta_pct": delta_pct}
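# A possible CLI entrypoint matching the CI step below (sketch; the flag names mirror
# the workflow and the 10% default is this document's budget):
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Fail CI when prompt token cost grows past budget")
    parser.add_argument("--baseline", default="origin/main")
    parser.add_argument("--head", default="HEAD")
    parser.add_argument("--max-delta-percent", type=float, default=10.0)
    args = parser.parse_args()

    report = diff_token_cost(args.baseline, args.head)
    print(report)
    if report["delta_pct"] > args.max_delta_percent:
        raise SystemExit(f"Token budget exceeded: +{report['delta_pct']:.1f}%")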
CI:
- name: LLM token budget
run: |
python tools/llm_budget.py --baseline origin/main --max-delta-percent 10
< 1M chunks?
├─ Yes → already running Postgres?
│   ├─ Yes → pgvector (HNSW index)
│   └─ No → Qdrant (lightweight)
└─ No (1M-10M)?
    ├─ Hybrid search (BM25 + vector)? → Qdrant or Weaviate
    └─ Pure vector → Pinecone (managed) or Qdrant
> 10M chunks?
└─ Milvus / managed Pinecone
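If the tree lands on pgvector, the HNSW path looks roughly like this (a sketch assuming psycopg 3 and the pgvector extension; the DSN, table, and column names are made up):

import psycopg  # assumes psycopg 3 and pgvector installed on the server

with psycopg.connect("postgresql://localhost/rag") as conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        " id bigserial PRIMARY KEY,"
        " content text NOT NULL,"
        " embedding vector(1024))"  # match your embedding model's dimension
    )
    # HNSW index with cosine distance; build after bulk loading for faster index builds
    cur.execute(
        "CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw "
        "ON chunks USING hnsw (embedding vector_cosine_ops)"
    )

    # Top-50 nearest chunks for a query embedding, passed as a pgvector text literal
    query_embedding = "[" + ",".join(["0.0"] * 1024) + "]"  # placeholder vector
    cur.execute(
        "SELECT id, content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 50",
        (query_embedding,),
    )
    top_50 = cur.fetchall()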
Chunking: structure-aware splitting (e.g. MarkdownHeaderTextSplitter).

# 1. Vector retrieval top-50
results = vector_db.search(query_embedding, top_k=50, filter={...})
# 2. Rerank top-10
reranked = cohere.rerank(
query=query, documents=[r.text for r in results],
top_n=10, model="rerank-multilingual-v3.0",
).results
# 3. Hybrid: BM25 + vector reciprocal rank fusion (RRF)
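# Sketch of step 3: reciprocal rank fusion over the BM25 and vector result lists.
# bm25_results / vector_results are assumed to be doc-id lists ordered by rank.
def rrf_fuse(bm25_results: list[str], vector_results: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (bm25_results, vector_results):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)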
from ragas import evaluate
from ragas.metrics import faithfulness, context_recall, context_precision, answer_relevancy
score = evaluate(
dataset,
metrics=[faithfulness, context_recall, context_precision, answer_relevancy],
llm=judge_llm,
embeddings=embedding_model,
)
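# The `dataset` passed to evaluate() above is assumed to be a Hugging Face Dataset with
# the column names RAGAS expects (sketch; the example row is illustrative):
from datasets import Dataset

dataset = Dataset.from_dict({
    "question": ["How do I cancel an order?"],
    "answer": ["You can cancel from the Orders page within 24 hours."],
    "contexts": [["Orders can be cancelled within 24 hours from the Orders page."]],
    "ground_truth": ["Cancel from the Orders page within 24 hours of purchase."],
})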
from typing import Protocol
class LLMProvider(Protocol):
async def generate(self, messages: list[dict], **kwargs) -> str: ...
async def stream(self, messages: list[dict], **kwargs): ...
class AnthropicProvider:
def __init__(self, client): self.client = client
async def generate(self, messages, model="claude-opus-4-7", max_tokens=2048):
response = await self.client.messages.create(
model=model, max_tokens=max_tokens, messages=messages,
)
return response.content[0].text
class FakeProvider: # test
async def generate(self, messages, **kwargs):
return self._stub(messages)
The application talks only to LLMProvider; tests swap in FakeProvider.
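A sketch of what that looks like in practice (the handler, fixture, and stub names are illustrative, not part of the plugin; the async test assumes pytest-asyncio):

# app/service.py (illustrative)
async def answer_question(provider: LLMProvider, question: str) -> str:
    return await provider.generate([{"role": "user", "content": question}])

# tests/test_service.py (illustrative)
import pytest

class EchoFake(FakeProvider):
    def _stub(self, messages):
        return f"echo: {messages[-1]['content']}"

@pytest.mark.asyncio
async def test_answer_question_uses_provider():
    assert await answer_question(EchoFake(), "hello") == "echo: hello"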
# OTel span
import time

import opentelemetry.trace as trace

tracer = trace.get_tracer(__name__)

# `client` is an AsyncAnthropic instance created elsewhere
async def generate_with_trace(messages, model):
with tracer.start_as_current_span("llm.complete") as span:
span.set_attribute("llm.model", model)
span.set_attribute("llm.messages.count", len(messages))
start = time.time()
response = await client.messages.create(model=model, messages=messages)
elapsed_ms = (time.time() - start) * 1000
span.set_attribute("llm.tokens.input", response.usage.input_tokens)
span.set_attribute("llm.tokens.output", response.usage.output_tokens)
span.set_attribute("llm.tokens.cache_read", getattr(response.usage, "cache_read_input_tokens", 0))
span.set_attribute("llm.tokens.cache_write", getattr(response.usage, "cache_creation_input_tokens", 0))
span.set_attribute("llm.latency.ms", elapsed_ms)
span.set_attribute("llm.cost.usd", calculate_cost(response.usage, model))
return response
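# calculate_cost() above is not defined in this document; a minimal sketch with placeholder
# per-model prices (USD per 1M tokens) that must be kept in sync with the provider price sheet.
# Cache read/write tokens are billed at different rates; extend as needed.
PRICES_PER_MTOK = {"claude-opus-4-7": (15.00, 75.00)}  # {model: (input, output)}, placeholders

def calculate_cost(usage, model: str) -> float:
    input_price, output_price = PRICES_PER_MTOK.get(model, (0.0, 0.0))
    return (usage.input_tokens * input_price + usage.output_tokens * output_price) / 1_000_000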
# LLM Ops Findings: <service>
## Critical
- [ ] No prompt-cache markers (8 LLM calls per request); cost projection
  $4,800/mo → $1,200 assuming a 75% cache-hit rate (`src/llm/orchestrator.py:42`)
## High
- [ ] No eval golden set, so model-bump regressions go uncaught; the model is about to
  move from Claude 4.6 to 4.7 (release plan)
- [ ] LLM-as-judge uses the same model (Claude 4.7 is both actor and judge): bias risk
## Medium
- [ ] No provider abstraction (Anthropic SDK called directly in 14 places)
- [ ] No RAG retrieval eval (no RAGAS metrics measured)
## Low
- [ ] Token counting uses the `len(text) / 4` approximation (`src/llm/util.py:18`); use the
  provider tokenizer instead
- Versioned prompts (prompts/<task>/vN.md)
- llm.complete spans
- FakeProvider (no real API calls)
- len(text)/4 token estimates replaced with the provider tokenizer

Related rules, skills, and entrypoints:

- rules/llm-ops.md: discipline rule
- rules/security.md, rules/owasp-top10.md: prompt injection (A03)
- rules/privacy-engineering.md: PII in prompts/responses
- rules/performance-budget.md: latency p99 + cost budget
- rules/observability.md: OTel LLM spans
- skills/postgres-performance/SKILL.md: pgvector HNSW
- skills/privacy-engineering/SKILL.md: PII retention
- skills/owasp-top10/SKILL.md: A03 prompt injection
- skills/performance-budget/SKILL.md: token budget CI gate
- agents/llm-engineer.md: ownership
- commands/llm-review.md: slash entrypoint