Master LLM architecture, tokenization, transformer models, and inference optimization
Explains transformer architecture, tokenization, and inference optimization for LLMs.
/plugin marketplace add pluginagentmarketplace/custom-plugin-ai-engineer
/plugin install pluginagentmarketplace-ai-engineer-plugin@pluginagentmarketplace/custom-plugin-ai-engineer

Model: sonnet

Master the foundational concepts of Large Language Models including architecture, tokenization, and efficient inference.
input:
  type: object
  required: [query]
  properties:
    query:
      type: string
      description: User question about LLM fundamentals
    context:
      type: string
      description: Optional context (code, config, error)
    model_type:
      type: string
      enum: [gpt, bert, llama, mistral, claude, custom]
      default: gpt
output:
  type: object
  properties:
    explanation:
      type: string
      description: Clear technical explanation
    code_example:
      type: string
      description: Working code snippet
    recommendations:
      type: array
      items: string
    references:
      type: array
      items:
        type: object
        properties:
          title: string
          url: string
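For reference, a request/response pair conforming to this schema might look like the sketch below. The concrete values are illustrative only; just the field names come from the spec.

```python
# Illustrative payloads matching the input/output schema above (values are made up).
request = {
    "query": "Why does my 7B model run out of GPU memory at batch size 8?",
    "context": "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB",
    "model_type": "llama",
}

response = {
    "explanation": "Activations scale with batch size; 7B fp16 weights alone need ~14 GB.",
    "code_example": "model = AutoModelForCausalLM.from_pretrained(name, load_in_4bit=True)",
    "recommendations": ["Quantize to 4-bit", "Reduce batch size", "Enable gradient checkpointing"],
    "references": [{"title": "Transformers quantization docs", "url": "https://huggingface.co/docs/transformers"}],
}
```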
error_patterns:
  - error: "CUDA out of memory"
    cause: Model too large for GPU VRAM
    solution: |
      1. Use quantization (4-bit or 8-bit)
      2. Reduce batch size
      3. Enable gradient checkpointing
      4. Use model offloading
    fallback: Switch to CPU inference with llama.cpp
  - error: "Tokenizer not found"
    cause: Model path incorrect or not downloaded
    solution: |
      1. Verify model name spelling
      2. Check HuggingFace token for gated models
      3. Download explicitly with from_pretrained
    fallback: Use compatible tokenizer from same family
  - error: "Context length exceeded"
    cause: Input tokens exceed model max_length
    solution: |
      1. Truncate input intelligently
      2. Use sliding window approach
      3. Summarize long documents first
    fallback: Split into smaller chunks
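The first remedy for the CUDA out-of-memory pattern, 4-bit quantization, is applied at load time. A minimal sketch using the Hugging Face transformers + bitsandbytes integration follows; the model id is a placeholder, and exact flags may vary by library version.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.1-8B"  # placeholder repo id; gated models also need an HF token

# 4-bit NF4 quantization roughly quarters weight memory versus fp16 (remedy 1 above).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # offloads layers to CPU if the GPU is still too small (remedy 4)
)
```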
fallback_chain:
  primary:
    model: "gpt-4"
    provider: openai
    timeout: 30s
  secondary:
    model: "claude-3-sonnet"
    provider: anthropic
    timeout: 30s
  tertiary:
    model: "llama-3.1-8b"
    provider: ollama
    timeout: 60s
  final:
    action: return_cached_response
    max_age: 24h
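One way to consume this chain in application code is to walk the tiers in order and fall back on timeout or error. The sketch below is an assumption about how the config might be wired up; `call_provider` and `get_cached_response` are hypothetical helpers supplied by the application, not part of any listed library.

```python
FALLBACK_CHAIN = [
    {"model": "gpt-4", "provider": "openai", "timeout": 30},
    {"model": "claude-3-sonnet", "provider": "anthropic", "timeout": 30},
    {"model": "llama-3.1-8b", "provider": "ollama", "timeout": 60},
]

def generate_with_fallback(prompt, call_provider, get_cached_response):
    """Try each provider tier in order; serve a cached answer if all of them fail.

    call_provider(entry, prompt, timeout) and get_cached_response(prompt, max_age_s)
    are hypothetical callables provided by the application.
    """
    for entry in FALLBACK_CHAIN:
        try:
            return call_provider(entry, prompt, timeout=entry["timeout"])
        except Exception:
            continue  # timeout or provider error: move to the next tier
    # Final tier: return a cached response no older than 24 hours.
    return get_cached_response(prompt, max_age_s=24 * 3600)
```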
optimization:
  token_limits:
    max_input: 4096
    max_output: 2048
    buffer: 256
  cost_controls:
    max_cost_per_request: $0.50
    daily_budget: $50
    alert_threshold: 80%
  strategies:
    - name: prompt_compression
      enabled: true
      target_reduction: 30%
    - name: caching
      enabled: true
      ttl: 3600
      key_strategy: semantic_hash
    - name: model_routing
      enabled: true
      rules:
        - condition: "len(input) < 1000"
          model: "gpt-3.5-turbo"
        - condition: "complexity == 'high'"
          model: "gpt-4"
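The model_routing rules map onto a small dispatch function. The sketch below assumes the rules are evaluated in listed order and that the application supplies a `complexity` signal; the default when no rule matches is an assumption, since the spec does not define one.

```python
def route_model(user_input: str, complexity: str = "normal") -> str:
    """Apply the model_routing rules in the order they are listed above."""
    if len(user_input) < 1000:
        return "gpt-3.5-turbo"
    if complexity == "high":
        return "gpt-4"
    return "gpt-4"  # assumption: default to the stronger model when no rule matches

print(route_model("Explain the KV cache"))          # short input  -> gpt-3.5-turbo
print(route_model("x" * 2000, complexity="high"))   # long, complex -> gpt-4
```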
logging:
  level: INFO
  format: json
  fields:
    - timestamp
    - request_id
    - model
    - tokens_in
    - tokens_out
    - latency_ms
    - cost_usd
metrics:
  - name: inference_latency
    type: histogram
    buckets: [100, 250, 500, 1000, 2500, 5000]
  - name: token_usage
    type: counter
    labels: [model, direction]
  - name: error_rate
    type: gauge
    labels: [error_type]
tracing:
  enabled: true
  sample_rate: 0.1
  propagation: w3c
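If these metrics are exported with the Prometheus client library, the declarations translate roughly one-to-one. The exporter choice is an assumption, since the spec does not name one.

```python
import time
from prometheus_client import Counter, Gauge, Histogram

# Metric declarations mirroring the metrics spec above (latency buckets in milliseconds).
inference_latency = Histogram(
    "inference_latency", "LLM inference latency in ms",
    buckets=[100, 250, 500, 1000, 2500, 5000],
)
token_usage = Counter("token_usage", "Tokens processed", ["model", "direction"])
error_rate = Gauge("error_rate", "Current error rate", ["error_type"])

# Recording one request:
start = time.perf_counter()
# ... model.generate(...) would run here ...
inference_latency.observe((time.perf_counter() - start) * 1000)
token_usage.labels(model="gpt-4", direction="out").inc(128)
```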
1. [ ] Verify model is loaded correctly
   ```python
   print(model.config)
   print(tokenizer.vocab_size)
   ```
2. [ ] Check GPU memory
   ```bash
   nvidia-smi --query-gpu=memory.used,memory.free --format=csv
   ```
3. [ ] Validate tokenization
   ```python
   tokens = tokenizer.encode("test", add_special_tokens=False)  # skip BOS/EOS so the round trip is exact
   decoded = tokenizer.decode(tokens)
   assert decoded == "test"
   ```
4. [ ] Test inference pipeline
   ```python
   output = model.generate(input_ids, max_new_tokens=10)
   ```
5. [ ] Monitor resource usage
   ```bash
   watch -n 1 nvidia-smi
   ```
### Common Failure Modes
| Symptom | Root Cause | Fix |
|---------|------------|-----|
| Slow inference | No GPU detected | Set CUDA_VISIBLE_DEVICES |
| Gibberish output | Wrong tokenizer | Match tokenizer to model |
| Truncated response | max_tokens too low | Increase max_new_tokens |
| OOM during training | Batch too large | Use gradient accumulation |
| Inconsistent outputs | High temperature | Lower to 0.1-0.3 |
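Several of these fixes come down to generation parameters. A hedged example of applying the last two rows with the transformers generate API, assuming `model`, `tokenizer`, and `input_ids` from the checklist above:

```python
# Addresses "Truncated response" (max_new_tokens) and "Inconsistent outputs" (temperature).
output_ids = model.generate(
    input_ids,
    max_new_tokens=512,   # raise this if responses are cut off mid-sentence
    do_sample=True,
    temperature=0.2,      # 0.1-0.3 gives near-deterministic output
    top_p=0.9,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```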
### Log Interpretation
```yaml
log_patterns:
  "[WARNING] Token indices out of range":
    meaning: Input exceeds vocabulary
    action: Check for special characters
  "Setting `pad_token_id`":
    meaning: Model lacks padding token
    action: Normal for decoder-only models
  "CUDA error: device-side assert":
    meaning: Tensor shape mismatch
    action: Check input dimensions
```
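For the `pad_token_id` message, decoder-only models typically ship without a pad token; reusing EOS is the common mitigation. This is an assumption about the setup, not something the spec mandates.

```python
# GPT-style decoder-only models usually have no pad token; reusing EOS silences the
# "Setting `pad_token_id`" message during batched generation.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.eos_token_id
```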
skills:
- llm-basics (PRIMARY)
- vector-databases (SECONDARY)
agents:
- 03-rag-systems (uses embeddings)
- 04-fine-tuning (extends model knowledge)
external:
- transformers >= 4.36.0
- torch >= 2.0.0
- tiktoken >= 0.5.0
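A quick way to confirm the pinned minimums are satisfied in the current environment (a convenience check, not part of the spec):

```python
from importlib.metadata import version
from packaging.version import Version  # available in most Python environments

# Minimum versions from the external dependency list above.
REQUIRED = {"transformers": "4.36.0", "torch": "2.0.0", "tiktoken": "0.5.0"}

for pkg, minimum in REQUIRED.items():
    installed = version(pkg)
    ok = Version(installed) >= Version(minimum)
    print(f"{pkg}: {installed} (requires >= {minimum}) {'OK' if ok else 'TOO OLD'}")
```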