Master LLM architecture, tokenization, transformer models, and inference optimization
Explains transformer architecture, tokenization, and inference optimization for LLMs.
/plugin marketplace add pluginagentmarketplace/custom-plugin-ai-engineer
/plugin install pluginagentmarketplace-ai-engineer-plugin@pluginagentmarketplace/custom-plugin-ai-engineer

Model: sonnet

Master the foundational concepts of Large Language Models including architecture, tokenization, and efficient inference.
input:
  type: object
  required: [query]
  properties:
    query:
      type: string
      description: User question about LLM fundamentals
    context:
      type: string
      description: Optional context (code, config, error)
    model_type:
      type: string
      enum: [gpt, bert, llama, mistral, claude, custom]
      default: gpt
output:
  type: object
  properties:
    explanation:
      type: string
      description: Clear technical explanation
    code_example:
      type: string
      description: Working code snippet
    recommendations:
      type: array
      items: string
    references:
      type: array
      items:
        type: object
        properties:
          title: string
          url: string
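For reference, a request/response pair conforming to this schema might look like the sketch below. The concrete values are illustrative only; just the field names come from the spec.

```python
# Illustrative payloads matching the input/output schema above (values are made up).
request = {
    "query": "Why does my 7B model run out of GPU memory at batch size 8?",
    "context": "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB",
    "model_type": "llama",
}

response = {
    "explanation": "Activations scale with batch size; 7B fp16 weights alone need ~14 GB.",
    "code_example": "model = AutoModelForCausalLM.from_pretrained(name, load_in_4bit=True)",
    "recommendations": ["Quantize to 4-bit", "Reduce batch size", "Enable gradient checkpointing"],
    "references": [{"title": "Transformers quantization docs", "url": "https://huggingface.co/docs/transformers"}],
}
```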
error_patterns:
  - error: "CUDA out of memory"
    cause: Model too large for GPU VRAM
    solution: |
      1. Use quantization (4-bit or 8-bit)
      2. Reduce batch size
      3. Enable gradient checkpointing
      4. Use model offloading
    fallback: Switch to CPU inference with llama.cpp
  - error: "Tokenizer not found"
    cause: Model path incorrect or not downloaded
    solution: |
      1. Verify model name spelling
      2. Check HuggingFace token for gated models
      3. Download explicitly with from_pretrained
    fallback: Use compatible tokenizer from same family
  - error: "Context length exceeded"
    cause: Input tokens exceed model max_length
    solution: |
      1. Truncate input intelligently
      2. Use sliding window approach
      3. Summarize long documents first
    fallback: Split into smaller chunks
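The first remedy for the CUDA out-of-memory pattern, 4-bit quantization, is applied at load time. A minimal sketch using the Hugging Face transformers + bitsandbytes integration follows; the model id is a placeholder, and exact flags may vary by library version.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.1-8B"  # placeholder repo id; gated models also need an HF token

# 4-bit NF4 quantization roughly quarters weight memory versus fp16 (remedy 1 above).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # offloads layers to CPU if the GPU is still too small (remedy 4)
)
```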
fallback_chain:
  primary:
    model: "gpt-4"
    provider: openai
    timeout: 30s
  secondary:
    model: "claude-3-sonnet"
    provider: anthropic
    timeout: 30s
  tertiary:
    model: "llama-3.1-8b"
    provider: ollama
    timeout: 60s
  final:
    action: return_cached_response
    max_age: 24h
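One way to consume this chain in application code is to walk the tiers in order and fall back on timeout or error. The sketch below is an assumption about how the config might be wired up; `call_provider` and `get_cached_response` are hypothetical helpers supplied by the application, not part of any listed library.

```python
FALLBACK_CHAIN = [
    {"model": "gpt-4", "provider": "openai", "timeout": 30},
    {"model": "claude-3-sonnet", "provider": "anthropic", "timeout": 30},
    {"model": "llama-3.1-8b", "provider": "ollama", "timeout": 60},
]

def generate_with_fallback(prompt, call_provider, get_cached_response):
    """Try each provider tier in order; serve a cached answer if all of them fail.

    call_provider(entry, prompt, timeout) and get_cached_response(prompt, max_age_s)
    are hypothetical callables provided by the application.
    """
    for entry in FALLBACK_CHAIN:
        try:
            return call_provider(entry, prompt, timeout=entry["timeout"])
        except Exception:
            continue  # timeout or provider error: move to the next tier
    # Final tier: return a cached response no older than 24 hours.
    return get_cached_response(prompt, max_age_s=24 * 3600)
```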
optimization:
  token_limits:
    max_input: 4096
    max_output: 2048
    buffer: 256
  cost_controls:
    max_cost_per_request: $0.50
    daily_budget: $50
    alert_threshold: 80%
  strategies:
    - name: prompt_compression
      enabled: true
      target_reduction: 30%
    - name: caching
      enabled: true
      ttl: 3600
      key_strategy: semantic_hash
    - name: model_routing
      enabled: true
      rules:
        - condition: "len(input) < 1000"
          model: "gpt-3.5-turbo"
        - condition: "complexity == 'high'"
          model: "gpt-4"
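The model_routing rules map onto a small dispatch function. The sketch below assumes the rules are evaluated in listed order and that the application supplies a `complexity` signal; the default when no rule matches is an assumption, since the spec does not define one.

```python
def route_model(user_input: str, complexity: str = "normal") -> str:
    """Apply the model_routing rules in the order they are listed above."""
    if len(user_input) < 1000:
        return "gpt-3.5-turbo"
    if complexity == "high":
        return "gpt-4"
    return "gpt-4"  # assumption: default to the stronger model when no rule matches

print(route_model("Explain the KV cache"))          # short input  -> gpt-3.5-turbo
print(route_model("x" * 2000, complexity="high"))   # long, complex -> gpt-4
```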
logging:
  level: INFO
  format: json
  fields:
    - timestamp
    - request_id
    - model
    - tokens_in
    - tokens_out
    - latency_ms
    - cost_usd
metrics:
  - name: inference_latency
    type: histogram
    buckets: [100, 250, 500, 1000, 2500, 5000]
  - name: token_usage
    type: counter
    labels: [model, direction]
  - name: error_rate
    type: gauge
    labels: [error_type]
tracing:
  enabled: true
  sample_rate: 0.1
  propagation: w3c
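If these metrics are exported with the Prometheus client library, the declarations translate roughly one-to-one. The exporter choice is an assumption, since the spec does not name one.

```python
import time
from prometheus_client import Counter, Gauge, Histogram

# Metric declarations mirroring the metrics spec above (latency buckets in milliseconds).
inference_latency = Histogram(
    "inference_latency", "LLM inference latency in ms",
    buckets=[100, 250, 500, 1000, 2500, 5000],
)
token_usage = Counter("token_usage", "Tokens processed", ["model", "direction"])
error_rate = Gauge("error_rate", "Current error rate", ["error_type"])

# Recording one request:
start = time.perf_counter()
# ... model.generate(...) would run here ...
inference_latency.observe((time.perf_counter() - start) * 1000)
token_usage.labels(model="gpt-4", direction="out").inc(128)
```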
1. [ ] Verify model is loaded correctly
   ```python
   print(model.config)
   print(tokenizer.vocab_size)
   ```
2. [ ] Check GPU memory
   ```bash
   nvidia-smi --query-gpu=memory.used,memory.free --format=csv
   ```
3. [ ] Validate tokenization
   ```python
   tokens = tokenizer.encode("test", add_special_tokens=False)  # skip BOS/EOS so the round trip is exact
   decoded = tokenizer.decode(tokens)
   assert decoded == "test"
   ```
4. [ ] Test inference pipeline
   ```python
   output = model.generate(input_ids, max_new_tokens=10)
   ```
5. [ ] Monitor resource usage
   ```bash
   watch -n 1 nvidia-smi
   ```
### Common Failure Modes
| Symptom | Root Cause | Fix |
|---------|------------|-----|
| Slow inference | No GPU detected | Set CUDA_VISIBLE_DEVICES |
| Gibberish output | Wrong tokenizer | Match tokenizer to model |
| Truncated response | max_tokens too low | Increase max_new_tokens |
| OOM during training | Batch too large | Use gradient accumulation |
| Inconsistent outputs | High temperature | Lower to 0.1-0.3 |
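Several of these fixes come down to generation parameters. A hedged example of applying the last two rows with the transformers generate API, assuming `model`, `tokenizer`, and `input_ids` from the checklist above:

```python
# Addresses "Truncated response" (max_new_tokens) and "Inconsistent outputs" (temperature).
output_ids = model.generate(
    input_ids,
    max_new_tokens=512,   # raise this if responses are cut off mid-sentence
    do_sample=True,
    temperature=0.2,      # 0.1-0.3 gives near-deterministic output
    top_p=0.9,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```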
### Log Interpretation
```yaml
log_patterns:
  "[WARNING] Token indices out of range":
    meaning: Input exceeds vocabulary
    action: Check for special characters
  "Setting `pad_token_id`":
    meaning: Model lacks padding token
    action: Normal for decoder-only models
  "CUDA error: device-side assert":
    meaning: Tensor shape mismatch
    action: Check input dimensions
```
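For the `pad_token_id` message, decoder-only models typically ship without a pad token; reusing EOS is the common mitigation. This is an assumption about the setup, not something the spec mandates.

```python
# GPT-style decoder-only models usually have no pad token; reusing EOS silences the
# "Setting `pad_token_id`" message during batched generation.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.eos_token_id
```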
skills:
- llm-basics (PRIMARY)
- vector-databases (SECONDARY)
agents:
- 03-rag-systems (uses embeddings)
- 04-fine-tuning (extends model knowledge)
external:
- transformers >= 4.36.0
- torch >= 2.0.0
- tiktoken >= 0.5.0
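A quick way to confirm the pinned minimums are satisfied in the current environment (a convenience check, not part of the spec):

```python
from importlib.metadata import version
from packaging.version import Version  # available in most Python environments

# Minimum versions from the external dependency list above.
REQUIRED = {"transformers": "4.36.0", "torch": "2.0.0", "tiktoken": "0.5.0"}

for pkg, minimum in REQUIRED.items():
    installed = version(pkg)
    ok = Version(installed) >= Version(minimum)
    print(f"{pkg}: {installed} (requires >= {minimum}) {'OK' if ok else 'TOO OLD'}")
```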