Master LLM fine-tuning techniques including LoRA, QLoRA, and instruction tuning
Implement LoRA, QLoRA, and instruction tuning for efficient LLM adaptation. Prepare datasets, configure training, and evaluate fine-tuned models on specific tasks.
/plugin marketplace add pluginagentmarketplace/custom-plugin-ai-engineer
/plugin install pluginagentmarketplace-ai-engineer-plugin@pluginagentmarketplace/custom-plugin-ai-engineer

model: sonnet

Master the techniques for adapting LLMs to specific tasks through efficient fine-tuning.
```yaml
input:
  type: object
  required: [base_model, task_type, dataset]
  properties:
    base_model:
      type: string
      description: HuggingFace model ID
      examples: ["meta-llama/Llama-3.1-8B", "mistralai/Mistral-7B-v0.1"]
    task_type:
      type: string
      enum: [instruction, chat, classification, generation]
    dataset:
      type: object
      properties:
        path: string
        format: string  # alpaca, sharegpt, custom
        train_split: number
        eval_split: number
    training_config:
      type: object
      properties:
        method: string  # lora, qlora, full
        epochs: integer
        batch_size: integer
        learning_rate: number
    hardware:
      type: object
      properties:
        gpus: integer
        vram_per_gpu: string
```
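For concreteness, a request conforming to this schema might look like the following sketch (all values are illustrative, not defaults):

```python
# Illustrative request matching the input schema above
request = {
    "base_model": "meta-llama/Llama-3.1-8B",
    "task_type": "instruction",
    "dataset": {
        "path": "data/train.json",
        "format": "alpaca",
        "train_split": 0.95,
        "eval_split": 0.05,
    },
    "training_config": {
        "method": "qlora",
        "epochs": 3,
        "batch_size": 4,
        "learning_rate": 2e-4,
    },
    "hardware": {"gpus": 1, "vram_per_gpu": "24GB"},
}
```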
```yaml
output:
  type: object
  properties:
    training_script:
      type: string
      description: Complete training code
    config_files:
      type: object
      description: YAML/JSON configs
    evaluation_plan:
      type: object
      description: How to evaluate the model
    estimated_resources:
      type: object
      properties:
        training_time: string
        gpu_memory: string
        storage: string
```
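As a rough sanity check for `estimated_resources.gpu_memory`, a back-of-envelope heuristic can help; this is a sketch only — real usage also depends on sequence length, batch size, and activation memory:

```python
def estimate_vram_gb(params_billions: float, method: str = "qlora") -> float:
    """Very rough VRAM estimate in GB, ignoring activations and KV cache.

    Rule-of-thumb bytes per parameter:
      full  ~16  (fp16 weights + grads + Adam fp32 states + master weights)
      lora  ~2   (frozen fp16 weights; adapter overhead is negligible)
      qlora ~0.6 (4-bit weights plus quantization constants)
    """
    bytes_per_param = {"full": 16, "lora": 2, "qlora": 0.6}[method]
    return params_billions * bytes_per_param  # 1B params at 1 byte each ~= 1 GB

print(estimate_vram_gb(8, "qlora"))  # ~5 GB for an 8B model, before activations
```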
```yaml
error_patterns:
  - error: "CUDA out of memory"  # a combined mitigation sketch for the first two patterns follows this block
    cause: Model or batch too large for available VRAM
    solution: |
      1. Reduce batch_size
      2. Enable gradient_checkpointing
      3. Use QLoRA instead of LoRA
      4. Reduce max_seq_length
    fallback: Use CPU offloading with DeepSpeed
  - error: "Loss is NaN"
    cause: Learning rate too high or data issues
    solution: |
      1. Lower the learning rate (try 1e-5)
      2. Add gradient clipping (max_norm=1.0)
      3. Check for NaN/Inf values in the dataset
      4. Use bf16 instead of fp16
    fallback: Resume from the last checkpoint before the NaN
  - error: "Model not improving"
    cause: Learning rate too low or poor data quality
    solution: |
      1. Increase the learning rate
      2. Check dataset quality
      3. Verify the data format matches the model's expected template
      4. Increase the LoRA rank
    fallback: Try a different base model
  - error: "Catastrophic forgetting"
    cause: Over-training on a narrow dataset
    solution: |
      1. Reduce epochs
      2. Lower the learning rate
      3. Add regularization
      4. Mix in general-domain data
    fallback: Use a smaller LoRA rank
```
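One way to combine the mitigations for the first two patterns, using `transformers` and `bitsandbytes` (the model ID and hyperparameters are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# 4-bit QLoRA loading roughly halves memory vs. fp16 LoRA; NF4 with double
# quantization is the usual configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trades some speed for large activation-memory savings

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # preserves effective batch size at lower peak VRAM
    learning_rate=1e-5,
    max_grad_norm=1.0,  # gradient clipping against loss spikes / NaN
    bf16=True,          # wider dynamic range than fp16, more numerically stable
)
```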
```yaml
training_fallback:
  memory_pressure:  # applied in order; a retry sketch follows this block
    - action: reduce_batch_size
      factor: 0.5
      min: 1
    - action: enable_gradient_checkpointing
      saves: "~40% VRAM"
    - action: switch_to_qlora
      saves: "~50% VRAM"
    - action: use_deepspeed_offload
      saves: "~60% VRAM"
  training_instability:
    - action: reduce_learning_rate
      factor: 0.1
    - action: add_warmup_steps
      ratio: 0.03
    - action: enable_gradient_clipping
      max_norm: 1.0
    - action: switch_to_bf16
      reason: More stable than fp16
```
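The `memory_pressure` chain can be automated. A minimal sketch, where `build_trainer` is a hypothetical factory you would supply:

```python
import torch

def train_with_fallback(build_trainer, batch_size: int = 16, min_batch: int = 1):
    """Retry training with halved batch size on CUDA OOM (factor 0.5, floor 1, as above)."""
    while batch_size >= min_batch:
        try:
            return build_trainer(per_device_train_batch_size=batch_size).train()
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
            batch_size //= 2
    raise RuntimeError("OOM at batch size 1: enable checkpointing, QLoRA, or offloading")
```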
```yaml
optimization:
  hardware_costs:
    a100_80gb:
      hourly: "$4.00"
      typical_training: 2-8 hours
    a10g_24gb:
      hourly: "$1.50"
      typical_training: 4-16 hours
    rtx_4090:
      hourly: "$0.50 (owned, amortized)"
      qlora_capable: true
  training_efficiency:
    qlora_vs_lora:
      memory_savings: "~50%"
      speed_impact: "~10% slower"
      quality_impact: "2-5% lower on benchmarks"
    unsloth_optimization:
      memory_savings: "~40%"
      speed_improvement: "~2x"
      compatible_models: [llama, mistral, gemma]
  dataset_optimization:  # a dedup/quality-filter sketch follows this block
    deduplication: true
    quality_filter: true
    optimal_size: 10K-100K examples
    diminishing_returns_after: 500K examples
```
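A sketch of the deduplication and quality-filter passes using `datasets` (the `"text"` field name and the length threshold are assumptions about your data):

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="train.json", split="train")

# Exact deduplication on the raw text (single-process; hashes keep memory bounded)
seen = set()
def unseen(example):
    key = hash(example["text"])
    if key in seen:
        return False
    seen.add(key)
    return True

ds = ds.filter(unseen)
# Crude quality filter: drop near-empty examples
ds = ds.filter(lambda ex: len(ex["text"].strip()) > 50)
```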
```yaml
training_metrics:
  loss:
    - train_loss
    - eval_loss
    - gradient_norm
  learning:
    - learning_rate
    - epoch
    - step
  resources:
    - gpu_memory_allocated
    - gpu_utilization
    - throughput_samples_per_second
  quality:
    - eval_accuracy
    - eval_perplexity  # exp(eval_loss); see the snippet after this block
    - benchmark_scores
```
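`eval_perplexity` is derived from the cross-entropy loss. One way to compute it from a `transformers` Trainer already in scope (assuming a causal-LM loss in nats):

```python
import math

metrics = trainer.evaluate()  # returns a dict including "eval_loss"
metrics["eval_perplexity"] = math.exp(metrics["eval_loss"])
trainer.log(metrics)
```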
```yaml
logging:  # wired into a Trainer in the sketch after this block
  wandb:
    enabled: true
    project: fine-tuning
    log_model: checkpoint
  tensorboard:
    enabled: true
    log_dir: ./runs
  callbacks:
    - early_stopping
    - model_checkpoint
    - lr_scheduler
```
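Wiring this logging config into a `transformers` Trainer might look like the following sketch (paths, step counts, and patience are illustrative; `model`, `train_ds`, and `eval_ds` are assumed in scope):

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="./out",
    report_to=["wandb", "tensorboard"],  # wandb reads the project from WANDB_PROJECT
    logging_dir="./runs",
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,         # required for early stopping
    metric_for_best_model="eval_loss",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```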
### Pre-flight Checklist
1. [ ] Verify dataset format
   ```python
   from datasets import load_dataset
   ds = load_dataset("json", data_files="train.json")
   print(ds["train"][0])  # Check structure
   ```
2. [ ] Validate tokenization
   ```python
   sample = tokenizer(ds["train"][0]["text"])
   print(f"Tokens: {len(sample['input_ids'])}")
   decoded = tokenizer.decode(sample["input_ids"])
   print(decoded[:200])  # Round-trip check
   ```
3. [ ] Check model loading
   ```python
   print(model)
   trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
   print(f"Trainable params: {trainable:,}")
   ```
4. [ ] Monitor the first batch
   ```python
   for batch in dataloader:
       outputs = model(**batch)
       print(f"Loss: {outputs.loss.item()}")  # Should be finite, not NaN/Inf
       break
   ```
5. [ ] Verify checkpointing
   ```bash
   ls -la output_dir/
   # Should see checkpoint-* directories
   ```
### Common Failure Modes
| Symptom | Root Cause | Fix |
|---------|------------|-----|
| OOM at start | Model too large | Use QLoRA + 4-bit |
| OOM mid-training | Gradient accumulation | Clear cache, reduce batch |
| Loss spikes | Bad data sample | Filter outliers (sketch below) |
| No improvement | LR too low | Increase 10x |
| Overfitting | Too many epochs | Early stopping |
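For the "Loss spikes" row, one way to filter length outliers before training (assumes a `tokenizer` in scope and a `"text"` field; the percentile cutoff is an assumption):

```python
# Drop token-length outliers that can destabilize the loss
train_ds = ds["train"]
lengths = [len(ids) for ids in tokenizer(train_ds["text"])["input_ids"]]
cutoff = sorted(lengths)[int(0.99 * len(lengths))]  # 99th-percentile token count
keep = [i for i, n in enumerate(lengths) if 4 <= n <= cutoff]
train_ds = train_ds.select(keep)
```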
### LoRA Hyperparameter Guide
```yaml
rank (r):
  4-8: Simple tasks, single domain
  16-32: General instruction tuning
  64-128: Complex adaptation, multiple tasks
  recommendation: Start with 16, adjust based on results
alpha:
  formula: alpha = 2 * rank
  effect: Higher = more LoRA influence
  typical: 32 for r=16
target_modules:
  minimal: [q_proj, v_proj]
  recommended: [q_proj, k_proj, v_proj, o_proj]
  aggressive: recommended + [gate_proj, up_proj, down_proj]
dropout:
  none: 0.0 (most common)
  light: 0.05
  aggressive: 0.1 (prevents overfitting)
```
The corresponding PEFT config:

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,               # Rank
    lora_alpha=32,      # Alpha scaling (2 * r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```
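Applying the config and confirming the parameter budget:

```python
from peft import get_peft_model

model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of total params for r=16
```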
### Dependencies
```yaml
skills:
  - fine-tuning (PRIMARY)
agents:
  - 01-llm-fundamentals (base model knowledge)
  - 05-evaluation-monitoring (model evaluation)
external:
  - transformers >= 4.36.0
  - peft >= 0.7.0
  - trl >= 0.7.0
  - bitsandbytes >= 0.41.0
  - datasets >= 2.14.0
```