From together-pack
Optimizes Together AI costs for inference, fine-tuning, and deployment through model selection, batching, and caching, with Python examples that use the OpenAI-compatible API.
Slash command: /together-pack:together-cost-tuning
Optimize Together AI costs with model selection, batching, and caching.
| Model Category | Approximate Price | Example Models |
|---|---|---|
| Small (< 10B) | $0.10-0.30 per 1M tokens | Llama-3.2-3B, Qwen-2.5-7B |
| Medium (10-70B) | $0.60-1.20 per 1M tokens | Mixtral-8x7B, Llama-3.3-70B-Turbo |
| Large (> 70B) | $2.00-5.00 per 1M tokens | Llama-3.1-405B, DeepSeek-V3 |
| Image generation | $0.003-0.05 per image | FLUX.1-schnell, SDXL |
| Embeddings | $0.008 per 1M tokens | M2-BERT |
| Fine-tuning | ~$5-25 per hour | Depends on model + GPU |
| Batch inference | 50% off standard price | Same models, async |
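The per-token prices in the table translate into a workload budget with simple arithmetic. The sketch below is illustrative only: `estimate_cost` is a hypothetical helper, the $0.90 figure is just a point inside the medium-tier range, and it assumes input and output tokens are billed at the same rate, which may not hold for every model.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    price_per_million: float,
    batch: bool = False,
) -> float:
    """Estimate the USD cost of a workload at a flat per-token price.

    Assumption: input and output tokens share one price. The 50% batch
    discount is the figure quoted in the pricing table.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total_tokens = input_tokens + output_tokens
    cost = total_tokens / 1_000_000 * price_per_million
    if batch:
        cost *= 0.5  # batch inference is listed at 50% off
    return cost


if __name__ == "__main__":
    # Illustrative workload: 6M input + 4M output tokens on a medium-tier model.
    realtime = estimate_cost(6_000_000, 4_000_000, price_per_million=0.90)
    batched = estimate_cost(6_000_000, 4_000_000, price_per_million=0.90, batch=True)
    print(f"real-time: ${realtime:.2f}, batch: ${batched:.2f}")
```

Running the same numbers against the small and large tiers gives a quick sense of how much a tier change is worth before any code is rewritten.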
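# Setup (assumes the official `together` Python SDK, which reads
# TOGETHER_API_KEY from the environment)
from functools import lru_cache
from together import Together

client = Together()

# 1. Use Turbo variants (faster, cheaper, similar quality)
#    meta-llama/Llama-3.3-70B-Instruct-Turbo vs Llama-3.1-70B-Instruct

# 2. Batch inference (50% cost reduction)
#    file_id is the ID returned when the JSONL input file was uploaded.
#    The exact method and parameter names vary by SDK version; check this
#    call against the installed client before relying on it.
batch_response = client.batch.create(
    input_file_id=file_id,
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    completion_window="24h",
)

# 3. Cache responses for identical prompts
#    (only appropriate when a repeated prompt should return the same answer)
@lru_cache(maxsize=1000)
def cached_completion(prompt: str, model: str) -> str:
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# 4. Use the smallest model that works
#    Test with 3B first, upgrade to 70B only if quality is insufficient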
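Strategy 4 can be automated as a cascade: try the cheapest model, and escalate only when the answer fails a quality check. The following is a minimal sketch, not part of the skill itself. `cascade_completion`, `complete`, and `is_good_enough` are hypothetical names, and the quality check is left to the caller because what counts as "good enough" is task-specific.

```python
from typing import Callable, Sequence, Tuple


def cascade_completion(
    prompt: str,
    models: Sequence[str],
    complete: Callable[[str, str], str],
    is_good_enough: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try models from cheapest to most expensive.

    Returns (model, answer) for the first answer that passes the check.
    If none passes, returns the answer from the last (largest) model.
    """
    if not models:
        raise ValueError("models must not be empty")
    answer = ""
    for model in models:
        # complete(prompt, model) is any function that returns the model's text,
        # for example a cached wrapper around the chat completions call.
        answer = complete(prompt, model)
        if is_good_enough(answer):
            return model, answer
    return models[-1], answer
```

A function such as `cached_completion` can be passed as `complete`, with `models` ordered from a 3B model up to `meta-llama/Llama-3.3-70B-Instruct-Turbo`. Note that a failed attempt on the small model still costs tokens, so the cascade pays off only when the small model succeeds often enough.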
| Issue | Cause | Solution |
|---|---|---|
| High costs | Wrong model tier | Downsize model |
| Batch failures | Invalid input format | Validate JSONL |
| Fine-tuning expensive | Too many epochs | Start with 1-2 epochs |
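The "Validate JSONL" fix can be done locally before uploading, which avoids paying for a batch that fails on malformed input. The sketch below assumes each line is a JSON object with a unique `custom_id` and a `body` holding `model` and `messages`; that layout is an assumption here and should be confirmed against the current batch API documentation. `validate_batch_lines` is a hypothetical helper.

```python
import json
from typing import Iterable, List, Set


def validate_batch_lines(lines: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found.

    Assumed line format (verify against the batch API docs):
    {"custom_id": "...", "body": {"model": "...", "messages": [...]}}
    """
    errors: List[str] = []
    seen_ids: Set[str] = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            errors.append(f"line {number}: empty line")
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {number}: expected a JSON object")
            continue
        custom_id = record.get("custom_id")
        if not isinstance(custom_id, str) or not custom_id:
            errors.append(f"line {number}: missing or empty custom_id")
        elif custom_id in seen_ids:
            errors.append(f"line {number}: duplicate custom_id {custom_id!r}")
        else:
            seen_ids.add(custom_id)
        body = record.get("body")
        if not isinstance(body, dict):
            errors.append(f"line {number}: missing body object")
            continue
        if not isinstance(body.get("model"), str):
            errors.append(f"line {number}: body.model must be a string")
        if not isinstance(body.get("messages"), list) or not body["messages"]:
            errors.append(f"line {number}: body.messages must be a non-empty list")
    return errors
```

Typical use is `validate_batch_lines(open("batch_input.jsonl", encoding="utf-8"))`, where the file name is only an example, followed by an upload only if the returned list is empty.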
For architecture patterns, see together-reference-architecture.
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin together-pack