Skill

cost-aware-llm-pipeline

From ecc

Cost optimization patterns for LLM API usage. Covers model routing based on task complexity, budget tracking, retry logic, and prompt caching.

npx claudepluginhub sam42-lab/everything-claude-code-kr

Tool Access

This skill uses the workspace's default tool permissions.

Preview

Patterns for controlling LLM API costs while maintaining quality. Combines model routing, budget tracking, retry logic, and prompt caching into a composable pipeline.

SKILL.md


Last commit: Apr 12, 2026


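Cost-Aware LLM Pipeline

When to Use

When building an application that calls LLM APIs (Claude, GPT, etc.)
When batch-processing items of varying complexity
When API costs must stay within a specific budget
When optimizing cost without sacrificing quality on complex tasks

Core Concepts

1. Model Routing by Task Complexity

Automatically select a cheaper model for simple tasks and reserve the expensive model for complex ones.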

MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

_SONNET_TEXT_THRESHOLD = 10_000  # chars
_SONNET_ITEM_THRESHOLD = 30     # items

def select_model(
    text_length: int,
    item_count: int,
    force_model: str | None = None,
) -> str:
    """Select model based on task complexity."""
    if force_model is not None:
        return force_model
    if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU  # Simple task (3-4x cheaper)

2. Immutable Cost Tracking

Track cumulative cost with frozen dataclasses. Each API call returns a new tracker object and never mutates existing state.

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return new tracker with added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit

3. Narrowly Scoped Retry Logic

Retry only on transient errors. Authentication errors and bad requests fail immediately.

import time

from anthropic import (
    APIConnectionError,
    InternalServerError,
    RateLimitError,
)

# Transient failures worth retrying; everything else propagates immediately
_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3

def call_with_retry(func, *, max_retries: int = _MAX_RETRIES):
    """Retry only on transient errors, fail fast on others."""
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE_ERRORS:
            if attempt == max_retries - 1:
                raise  # Out of attempts: re-raise the last transient error
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
        # AuthenticationError, BadRequestError, etc. are not caught here,
        # so they are raised to the caller on the first attempt
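
To see the retry policy in isolation, the sketch below mirrors the same loop with stand-in exception classes so it runs without the anthropic SDK or network access. `TransientError`, `PermanentError`, `retry_transient`, `make_flaky`, and the injectable `sleep` parameter are hypothetical names used for illustration only; they are not part of any library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for retryable failures (connection, rate limit, server error)."""


class PermanentError(Exception):
    """Stand-in for failures that retrying cannot fix (auth, bad request)."""


def retry_transient(
    func: Callable[[], T],
    *,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying only TransientError with exponential backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last transient error
            sleep(2 ** attempt)  # Waits 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


def make_flaky(failures: int) -> Callable[[], str]:
    """Build a function that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def flaky() -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("temporary outage")
        return f"ok after {state['calls']} calls"

    return flaky


# Record the delays instead of actually sleeping
delays: list[float] = []
result = retry_transient(make_flaky(2), sleep=delays.append)
```

With two simulated failures the call succeeds on the third attempt, and the recorded delays are 1 and 2 seconds. A `PermanentError` raised inside `func` is never caught by the loop, so it reaches the caller without any delay being recorded.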

4. Prompt Caching

Cache long system prompts so the repeated prefix is not reprocessed at full price on every request.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache this
            },
            {
                "type": "text",
                "text": user_input,  # Variable part
            },
        ],
    }
]
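
The pipeline function in the composition section calls a helper named `build_cached_messages`, which the source does not define. A minimal sketch, assuming the helper simply wraps the structure above in a function, could look like this; the body and the sample strings are illustrative assumptions.

```python
from typing import Any


def build_cached_messages(system_prompt: str, user_input: str) -> list[dict[str, Any]]:
    """Return a messages list with the stable prefix marked as cacheable."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": system_prompt,
                    # Marks the end of the prefix that may be cached
                    "cache_control": {"type": "ephemeral"},
                },
                # Varies per request, so it stays after the cached prefix
                {"type": "text", "text": user_input},
            ],
        }
    ]


messages = build_cached_messages("You are a careful summarizer.", "Summarize: ...")
```

Keeping the stable text first and the variable text last matters here, because only the prefix up to the marked block can be reused across requests.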

Composition

Combine the four techniques into a single pipeline function.

def process(text: str, config: Config, tracker: CostTracker) -> tuple[Result, CostTracker]:
    # Config, Result, client, system_prompt, estimated_items, parse_result, and
    # BudgetExceededError are assumed to be defined elsewhere in the application.

    # 1. Route model
    model = select_model(len(text), estimated_items, config.force_model)

    # 2. Check budget before making a paid call
    if tracker.over_budget:
        raise BudgetExceededError(tracker.total_cost, tracker.budget_limit)

    # 3. Call with retry + caching
    response = call_with_retry(lambda: client.messages.create(
        model=model,
        max_tokens=1024,  # Required by the Messages API; the value here is illustrative
        messages=build_cached_messages(system_prompt, text),
    ))

    # 4. Track cost (immutable)
    record = CostRecord(model=model, input_tokens=..., output_tokens=..., cost_usd=...)
    tracker = tracker.add(record)

    return parse_result(response), tracker
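
Because `process` returns a new tracker instead of mutating one, a batch loop has to thread that tracker through each call. The sketch below shows one way to do that and to stop before the next paid call once the budget is exceeded. `MiniTracker`, `run_batch`, and `fake_process` are hypothetical stand-ins so the example runs on its own; `MiniTracker` only imitates the `over_budget` surface of `CostTracker`, and the $0.25 per-call cost is made up.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable


@dataclass(frozen=True)
class MiniTracker:
    """Hypothetical stand-in exposing the same surface as CostTracker."""
    budget_limit: float
    total_cost: float = 0.0

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit


class BudgetExceededError(RuntimeError):
    """Raised when the batch has already spent more than its budget."""


def run_batch(
    items: Iterable[str],
    process_one: Callable[[str, MiniTracker], tuple[str, MiniTracker]],
    tracker: MiniTracker,
) -> tuple[list[str], MiniTracker]:
    """Process items in order, threading the tracker through every call."""
    results: list[str] = []
    for item in items:
        if tracker.over_budget:
            # Fail before the next paid call rather than after it
            raise BudgetExceededError(
                f"spent ${tracker.total_cost:.2f} of ${tracker.budget_limit:.2f}"
            )
        result, tracker = process_one(item, tracker)
        results.append(result)
    return results, tracker


def fake_process(item: str, tracker: MiniTracker) -> tuple[str, MiniTracker]:
    """Stand-in for process(): pretends every call costs $0.25."""
    return item.upper(), replace(tracker, total_cost=tracker.total_cost + 0.25)


results, final = run_batch(["a", "b"], fake_process, MiniTracker(budget_limit=1.00))
```

The caller's original tracker is left untouched, which makes it straightforward to compare spend before and after a batch.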

Pricing Reference (2025-2026)

Model	Input ($/1M tokens)	Output ($/1M tokens)	Relative Cost
Haiku 4.5	$0.80	$4.00	1x
Sonnet 4.6	$3.00	$15.00	~4x
Opus 4.5	$15.00	$75.00	~19x
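
The table above can be turned into a small estimator for the `cost_usd` field of a cost record. This is a rough sketch: the prices are copied from the table as listed, the dictionary keys are illustrative labels rather than API model IDs, `estimate_cost_usd` is a hypothetical helper, and cache read/write pricing is ignored.

```python
# Prices in USD per 1M tokens as (input, output), copied from the table above.
# The keys are illustrative labels, not API model IDs.
PRICES_PER_MTOK: dict[str, tuple[float, float]] = {
    "haiku-4.5": (0.80, 4.00),
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.5": (15.00, 75.00),
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one call from its token counts."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Same request shape on two tiers: 2,000 input tokens and 500 output tokens
haiku = estimate_cost_usd("haiku-4.5", input_tokens=2_000, output_tokens=500)
sonnet = estimate_cost_usd("sonnet-4.6", input_tokens=2_000, output_tokens=500)
```

For this request shape the estimate is about $0.0036 on the Haiku tier and about $0.0135 on the Sonnet tier, a ratio of 3.75 that matches the "~4x" relative cost in the table.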

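Best Practices

Start with the cheapest model and route to a more expensive one only when a complexity threshold is exceeded.
Set an explicit budget limit before processing a batch. Failing early is better than overspending.
Log model selection decisions so thresholds can be tuned against real data.
Cache system prompts longer than 1,024 tokens. This reduces both cost and latency.
Do not retry authentication or validation errors. Retry only transient failures such as network errors, rate limits, and server errors.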
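
One way to record routing decisions for later threshold tuning is to wrap the selector in a logging decorator. The sketch below is an assumption about how that could be done, not something the source prescribes: `with_routing_log`, `toy_selector`, and the logger name are hypothetical, and the stand-in selector reuses the 10,000-character and 30-item thresholds from the routing section.

```python
import logging
from typing import Callable

logger = logging.getLogger("llm_routing")  # Hypothetical logger name

Selector = Callable[[int, int], str]


def with_routing_log(select: Selector) -> Selector:
    """Wrap a model selector so every decision is logged with its inputs."""

    def logged(text_length: int, item_count: int) -> str:
        model = select(text_length, item_count)
        logger.info(
            "routing text_length=%d item_count=%d model=%s",
            text_length, item_count, model,
        )
        return model

    return logged


def toy_selector(text_length: int, item_count: int) -> str:
    """Stand-in selector; a real selector could be wrapped the same way."""
    return "large" if text_length >= 10_000 or item_count >= 30 else "small"


select_logged = with_routing_log(toy_selector)
choice = select_logged(12_500, 4)  # Long text, so the larger tier is chosen
```

Logging the inputs alongside the chosen model makes it possible to check afterwards how many requests sat just above or below each threshold.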

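Anti-Patterns to Avoid

Using the most expensive model for every request regardless of complexity
Retrying on every error (wastes budget on permanent failures)
Keeping cost-tracking state mutable (makes debugging and auditing harder)
Hardcoding model names throughout the codebase (use constants or config instead)
Not applying caching to repeated system prompts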

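Where It Applies

Any application that calls Claude, OpenAI, or similar LLM APIs
Batch-processing pipelines where costs accumulate quickly
Multi-model architectures that need intelligent routing
Production systems that need budget guardrails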