LLM Call Tracing

Instrument LLM API calls to track latency, tokens, costs, and errors.

Core Principle

Every LLM call should capture:

What model was called
How long it took
How many tokens were used
Did it succeed or fail
Why it failed (if applicable)

Essential Span Attributes

# Required (P0)
span.set_attribute("llm.model", "claude-3-opus-20240229")
span.set_attribute("llm.provider", "anthropic")
span.set_attribute("llm.latency_ms", 2340)
span.set_attribute("llm.success", True)

# Token tracking (P1)
span.set_attribute("llm.tokens.input", 1500)
span.set_attribute("llm.tokens.output", 350)
span.set_attribute("llm.tokens.total", 1850)

# Cost (P1)
span.set_attribute("llm.cost_usd", 0.025)

# Configuration (P2)
span.set_attribute("llm.temperature", 0.7)
span.set_attribute("llm.max_tokens", 4096)
span.set_attribute("llm.stop_reason", "end_turn")

# Error context (when applicable)
span.set_attribute("llm.error.type", "rate_limit")
span.set_attribute("llm.error.message", "Rate limit exceeded")
span.set_attribute("llm.retry_count", 2)

What NOT to Log

Never log full prompts/responses:

# BAD - PII risk, storage explosion
span.set_attribute("llm.prompt", messages)
span.set_attribute("llm.response", completion.content)

# GOOD - Safe metadata
span.set_attribute("llm.prompt.message_count", len(messages))
span.set_attribute("llm.prompt.system_length", len(system_prompt))
span.set_attribute("llm.response.length", len(completion.content))

Streaming Considerations

For streaming responses:

Start span when stream begins
Update token counts as chunks arrive
End span when stream completes
Track time-to-first-token (TTFT)

span.set_attribute("llm.streaming", True)
span.set_attribute("llm.ttft_ms", 145)  # Time to first token
span.set_attribute("llm.chunks", 47)     # Number of chunks

Cost Calculation

Calculate cost from tokens and model pricing:

PRICING = {
    "claude-3-opus": {"input": 15.00, "output": 75.00},  # per 1M tokens
    "claude-3-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "gpt-4o": {"input": 5.00, "output": 15.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = PRICING.get(model, {"input": 0, "output": 0})
    input_cost = (input_tokens / 1_000_000) * pricing["input"]
    output_cost = (output_tokens / 1_000_000) * pricing["output"]
    return round(input_cost + output_cost, 6)

Framework-Specific Patterns

LangChain

from langfuse.callback import CallbackHandler

handler = CallbackHandler()
chain.invoke(input, config={"callbacks": [handler]})

Direct Anthropic SDK

from langfuse.decorators import observe

@observe(as_type="generation")
def call_claude(messages):
    response = client.messages.create(...)
    return response

OpenAI SDK

from langfuse.openai import openai

# Automatic instrumentation
client = openai.OpenAI()

Error Handling

Capture errors with context:

try:
    response = client.messages.create(...)
except RateLimitError as e:
    span.set_attribute("llm.error.type", "rate_limit")
    span.set_attribute("llm.error.retry_after", e.retry_after)
    raise
except APIError as e:
    span.set_attribute("llm.error.type", "api_error")
    span.set_attribute("llm.error.status", e.status_code)
    raise

Anti-Patterns

See references/anti-patterns/llm-tracing.md:

Logging full prompts (PII, storage)
Blocking on telemetry (latency)
Missing token counts (cost blindness)
No retry tracking (hidden failures)

Related Skills

token-cost-tracking - Detailed cost attribution
error-retry-tracking - Error handling patterns

llm-call-tracing