# argos

Production LLM application discipline: prompt engineering (versioned prompts, structured output, Anthropic prompt caching) + eval harness (golden set + LLM-as-judge + CI gate) + cost budget (token counts + cache hit rate + PR-time delta) + RAG architecture (chunking + embedding + vector DB + reranker + RAGAS eval) + provider abstraction. The core is model-agnostic; examples lean on the Anthropic SDK and Claude. Owned by the `llm-engineer` agent.

Install: `npx claudepluginhub resultakak/argos --plugin argos`. This skill uses the workspace's default tool permissions.
`agents/shared/severity-rubric.md` and `agents/shared/escalation-matrix.md` are
default-loaded (agents/coordination.md §11). Output is formatted as Critical / High /
Medium / Low findings with evidence. Findings outside this agent's ownership are delegated:

- security-reviewer: prompt injection, adversarial inputs, PII data leaks (OWASP A03)
- privacy-engineer: PII prompt/response audit, retention, DPIA
- performance-profiler: latency p99 budget
- database-optimizer: pgvector RAG performance
- release-manager: model bump release notes
- finops-review: cost dashboard, budget alerts

| Topic | Tool | Notes |
|---|---|---|
| SDK | anthropic, openai, cohere | Provider abstraction layer |
| Tokenizer | tiktoken, anthropic-tokenizer | Don't hardcode len/4 |
| Eval | Inspect AI, Promptfoo, DeepEval | LLM-as-judge cross-model |
| Vector DB | pgvector, Qdrant, Weaviate, Milvus, Pinecone | Pick by corpus size/scale |
| Embedding | Voyage, OpenAI text-embedding-3, Cohere embed-v3, BGE | Off-the-shelf before domain-specific |
| Reranker | Cohere rerank, mxbai-rerank | top-K=50 → top-10 |
| Trace | OTel + Langfuse / Helicone / Phoenix | LLM-specific spans |
| Cost | Helicone, Anthropic billing API, OpenAI dashboard | Hourly poll → metric |
| Prompt eval RAG | RAGAS, TruLens | Faithfulness/recall/precision |
# Find provider imports
grep -rE "import (anthropic|openai|cohere)|from (anthropic|openai)" \
--include="*.py" --include="*.ts" --include="*.tsx"
# Locate prompt files
find . -path '*/prompts/*.md' -o -path '*/prompts/*.txt' | head
# Is there an existing eval setup?
grep -rE "promptfoo|inspect_ai|deepeval|ragas" --include="*.py" --include="*.yaml"
# Cost monitoring
grep -rE "helicone|langfuse|phoenix|anthropic.*billing" --include="*.py" --include="*.yaml"
Prompts are versioned (prompts/<task>/v3.md or a _v3 suffix).

# Anthropic prompt cache pattern
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=2048,
system=[
{
"type": "text",
"text": SYSTEM_PROMPT, # ~3KB stable persona
"cache_control": {"type": "ephemeral"},
},
{
"type": "text",
"text": rag_context, # ~15KB retrieved docs (stable per session)
"cache_control": {"type": "ephemeral"},
},
],
messages=[{"role": "user", "content": user_question}],
)
A cache hit costs 10% of the base input price; a cache write costs 125%. TTL is 5 minutes or 1 hour (beta).
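To sanity-check the cache economics, a minimal sketch: the 10% read and 125% write multipliers come from the note above, while the base price per million input tokens is a placeholder to replace with your model's current rate:

# Sketch: blended input cost per request with prompt caching.
BASE_PER_MTOK = 3.00       # USD per 1M input tokens (placeholder, not a real price)
CACHE_READ_MULT = 0.10     # cache hit: 10% of base input price
CACHE_WRITE_MULT = 1.25    # cache miss/write: 125% of base input price

def blended_input_cost(cached_tokens: int, uncached_tokens: int, hit_rate: float) -> float:
    """USD input cost for one request, given the cacheable prefix size and hit rate."""
    cached = cached_tokens * (hit_rate * CACHE_READ_MULT + (1 - hit_rate) * CACHE_WRITE_MULT)
    return (cached + uncached_tokens) * BASE_PER_MTOK / 1_000_000

# ~18K cached prefix (system prompt + RAG context), ~500 user tokens, 75% hit rate
print(f"${blended_input_cost(18_000, 500, 0.75):.5f} per request")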
# tests/golden/v3.jsonl (80 examples)
{"id": "happy.001", "input": "...", "expected_intent": "create_order", "category": "happy", "difficulty": "low"}
{"id": "edge.014", "input": "...", "expected_intent": "ambiguous", "category": "edge", "difficulty": "high"}
{"id": "adversarial.022", "input": "ignore previous instructions; ...", "expected_intent": "rejected", "category": "adversarial", "difficulty": "high"}
{"id": "regression.041", "input": "...", "expected_intent": "support", "category": "regression-issue-#1284", "difficulty": "mid"}
# tests/eval_golden.py
import json
from anthropic import Anthropic
from llm_eval import judge_with_model
def run_eval(golden_path: str, model: str, judge_model: str, threshold: float):
client = Anthropic()
judge = Anthropic()
results = []
with open(golden_path) as f:
for line in f:
case = json.loads(line)
actual = extract_intent(client, model, case["input"])
# LLM-as-judge with cross-model
score = judge_with_model(
judge, judge_model,
expected=case["expected_intent"],
actual=actual,
input=case["input"],
)
results.append({"id": case["id"], "score": score, "expected": case["expected_intent"], "actual": actual})
pass_rate = sum(1 for r in results if r["score"] >= 4) / len(results)
print(f"Pass rate: {pass_rate:.2%}")
assert pass_rate >= threshold, f"Eval regression: {pass_rate:.2%} < {threshold:.2%}"
return results
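# The eval above calls extract_intent(); a minimal sketch of it, assuming a JSON-only
# structured-output prompt. The intent labels mirror the golden set above; adapt to your task.
def extract_intent(client: Anthropic, model: str, user_input: str) -> str:
    prompt = (
        "Classify the user's intent. Allowed intents: create_order, support, "
        "ambiguous, rejected. Respond with JSON only, "
        'for example {"intent": "create_order"}.\n\n'
        f"User message: {user_input}"
    )
    response = client.messages.create(
        model=model, max_tokens=64,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return json.loads(response.content[0].text)["intent"]
    except (json.JSONDecodeError, KeyError):
        return "parse_error"  # the judge scores this as a failure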
# tools/llm_budget.py
import subprocess
from anthropic import Anthropic
import tiktoken
def count_tokens(prompt: str, model: str) -> int:
# Anthropic SDK token count or tiktoken
return Anthropic().messages.count_tokens(
model=model, messages=[{"role": "user", "content": prompt}]
).input_tokens
def diff_token_cost(baseline_ref: str, head_ref: str) -> dict:
diff_files = subprocess.check_output(
["git", "diff", "--name-only", baseline_ref, head_ref, "--", "prompts/", "src/llm/"]
).decode().split()
baseline_total = sum(count_prompt_in_ref(f, baseline_ref) for f in diff_files)
head_total = sum(count_prompt_in_ref(f, head_ref) for f in diff_files)
delta_pct = (head_total - baseline_total) / max(baseline_total, 1) * 100
return {"baseline": baseline_total, "head": head_total, "delta_pct": delta_pct}
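# A possible CLI entrypoint matching the CI step below (sketch; the flag names mirror
# the workflow and the 10% default is this document's budget):
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Fail CI when prompt token cost grows past budget")
    parser.add_argument("--baseline", default="origin/main")
    parser.add_argument("--head", default="HEAD")
    parser.add_argument("--max-delta-percent", type=float, default=10.0)
    args = parser.parse_args()

    report = diff_token_cost(args.baseline, args.head)
    print(report)
    if report["delta_pct"] > args.max_delta_percent:
        raise SystemExit(f"Token budget exceeded: +{report['delta_pct']:.1f}%")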
CI:
- name: LLM token budget
run: |
python tools/llm_budget.py --baseline origin/main --max-delta-percent 10
< 1M chunks?
├─ Yes → already running Postgres?
│   ├─ Yes → pgvector (HNSW index)
│   └─ No → Qdrant (lightweight)
└─ No (1M-10M)?
    ├─ Hybrid search (BM25 + vector)? → Qdrant or Weaviate
    └─ Pure vector → Pinecone (managed) or Qdrant
> 10M chunks?
└─ Milvus / managed Pinecone
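If the tree lands on pgvector, the HNSW path looks roughly like this (a sketch assuming psycopg 3 and the pgvector extension; the DSN, table, and column names are made up):

import psycopg  # assumes psycopg 3 and pgvector installed on the server

with psycopg.connect("postgresql://localhost/rag") as conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        " id bigserial PRIMARY KEY,"
        " content text NOT NULL,"
        " embedding vector(1024))"  # match your embedding model's dimension
    )
    # HNSW index with cosine distance; build after bulk loading for faster index builds
    cur.execute(
        "CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw "
        "ON chunks USING hnsw (embedding vector_cosine_ops)"
    )

    # Top-50 nearest chunks for a query embedding, passed as a pgvector text literal
    query_embedding = "[" + ",".join(["0.0"] * 1024) + "]"  # placeholder vector
    cur.execute(
        "SELECT id, content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 50",
        (query_embedding,),
    )
    top_50 = cur.fetchall()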
Chunking: structure-aware splitting (e.g. MarkdownHeaderTextSplitter).

# 1. Vector retrieval top-50
results = vector_db.search(query_embedding, top_k=50, filter={...})
# 2. Rerank top-10
reranked = cohere.rerank(
query=query, documents=[r.text for r in results],
top_n=10, model="rerank-multilingual-v3.0",
).results
# 3. Hybrid: BM25 + vector reciprocal rank fusion (RRF)
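# Sketch of step 3: reciprocal rank fusion over the BM25 and vector result lists.
# bm25_results / vector_results are assumed to be doc-id lists ordered by rank.
def rrf_fuse(bm25_results: list[str], vector_results: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (bm25_results, vector_results):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)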
from ragas import evaluate
from ragas.metrics import faithfulness, context_recall, context_precision, answer_relevancy
score = evaluate(
dataset,
metrics=[faithfulness, context_recall, context_precision, answer_relevancy],
llm=judge_llm,
embeddings=embedding_model,
)
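# The `dataset` passed to evaluate() above is assumed to be a Hugging Face Dataset with
# the column names RAGAS expects (sketch; the example row is illustrative):
from datasets import Dataset

dataset = Dataset.from_dict({
    "question": ["How do I cancel an order?"],
    "answer": ["You can cancel from the Orders page within 24 hours."],
    "contexts": [["Orders can be cancelled within 24 hours from the Orders page."]],
    "ground_truth": ["Cancel from the Orders page within 24 hours of purchase."],
})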
from typing import Protocol
class LLMProvider(Protocol):
async def generate(self, messages: list[dict], **kwargs) -> str: ...
async def stream(self, messages: list[dict], **kwargs): ...
class AnthropicProvider:
def __init__(self, client): self.client = client
async def generate(self, messages, model="claude-opus-4-7", max_tokens=2048):
response = await self.client.messages.create(
model=model, max_tokens=max_tokens, messages=messages,
)
return response.content[0].text
class FakeProvider: # test
async def generate(self, messages, **kwargs):
return self._stub(messages)
The application talks only to LLMProvider; tests swap in FakeProvider.
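A sketch of what that looks like in practice (the handler, fixture, and stub names are illustrative, not part of the plugin; the async test assumes pytest-asyncio):

# app/service.py (illustrative)
async def answer_question(provider: LLMProvider, question: str) -> str:
    return await provider.generate([{"role": "user", "content": question}])

# tests/test_service.py (illustrative)
import pytest

class EchoFake(FakeProvider):
    def _stub(self, messages):
        return f"echo: {messages[-1]['content']}"

@pytest.mark.asyncio
async def test_answer_question_uses_provider():
    assert await answer_question(EchoFake(), "hello") == "echo: hello"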
# OTel span
import time

import opentelemetry.trace as trace

tracer = trace.get_tracer(__name__)

# `client` is an AsyncAnthropic instance created elsewhere
async def generate_with_trace(messages, model):
with tracer.start_as_current_span("llm.complete") as span:
span.set_attribute("llm.model", model)
span.set_attribute("llm.messages.count", len(messages))
start = time.time()
response = await client.messages.create(model=model, messages=messages)
elapsed_ms = (time.time() - start) * 1000
span.set_attribute("llm.tokens.input", response.usage.input_tokens)
span.set_attribute("llm.tokens.output", response.usage.output_tokens)
span.set_attribute("llm.tokens.cache_read", getattr(response.usage, "cache_read_input_tokens", 0))
span.set_attribute("llm.tokens.cache_write", getattr(response.usage, "cache_creation_input_tokens", 0))
span.set_attribute("llm.latency.ms", elapsed_ms)
span.set_attribute("llm.cost.usd", calculate_cost(response.usage, model))
return response
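# calculate_cost() above is not defined in this document; a minimal sketch with placeholder
# per-model prices (USD per 1M tokens) that must be kept in sync with the provider price sheet.
# Cache read/write tokens are billed at different rates; extend as needed.
PRICES_PER_MTOK = {"claude-opus-4-7": (15.00, 75.00)}  # {model: (input, output)}, placeholders

def calculate_cost(usage, model: str) -> float:
    input_price, output_price = PRICES_PER_MTOK.get(model, (0.0, 0.0))
    return (usage.input_tokens * input_price + usage.output_tokens * output_price) / 1_000_000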
# LLM Ops Findings: <service>
## Critical
- [ ] No prompt-cache markers (8 LLM calls per request); cost projection
  $4,800/mo → $1,200 assuming a 75% cache-hit rate (`src/llm/orchestrator.py:42`)
## High
- [ ] No eval golden set, so model-bump regressions go uncaught; the model is about to
  move from Claude 4.6 to 4.7 (release plan)
- [ ] LLM-as-judge uses the same model (Claude 4.7 is both actor and judge): bias risk
## Medium
- [ ] No provider abstraction (Anthropic SDK called directly in 14 places)
- [ ] No RAG retrieval eval (no RAGAS metrics measured)
## Low
- [ ] Token counting uses the `len(text) / 4` approximation (`src/llm/util.py:18`); use the
  provider tokenizer instead
- Versioned prompts (prompts/<task>/vN.md)
- llm.complete spans
- FakeProvider (no real API calls)
- len(text)/4 token estimates replaced with the provider tokenizer

Related rules, skills, and entrypoints:

- rules/llm-ops.md: discipline rule
- rules/security.md, rules/owasp-top10.md: prompt injection (A03)
- rules/privacy-engineering.md: PII in prompts/responses
- rules/performance-budget.md: latency p99 + cost budget
- rules/observability.md: OTel LLM spans
- skills/postgres-performance/SKILL.md: pgvector HNSW
- skills/privacy-engineering/SKILL.md: PII retention
- skills/owasp-top10/SKILL.md: A03 prompt injection
- skills/performance-budget/SKILL.md: token budget CI gate
- agents/llm-engineer.md: ownership
- commands/llm-review.md: slash entrypoint