From together-pack
Optimizes Together AI costs for inference, fine-tuning, and deployment through model selection, batch inference, and response caching, with Python examples against the OpenAI-compatible API.
```bash
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin together-pack
```
Guides performance tuning for Together AI inference, fine-tuning, and model deployment via the OpenAI-compatible API. Covers common errors, model selection, batch inference, and related resources.
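Because the endpoint is OpenAI-compatible, the stock OpenAI Python client works once it is pointed at Together's base URL. A minimal setup sketch (the model name and environment variable are illustrative choices, not requirements):

```python
import os

from openai import OpenAI

# Together exposes an OpenAI-compatible API, so the standard OpenAI
# client works when base_url points at Together's endpoint.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],  # illustrative env var name
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct-Turbo",  # start on the small tier
    messages=[{"role": "user", "content": "Summarize Together AI batch pricing."}],
)
print(response.choices[0].message.content)
```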
The biggest cost lever is model tier. Approximate Together AI pricing (check the pricing page for current rates):

| Category | Approximate Price | Example Models |
|---|---|---|
| Small (< 10B) | $0.10-0.30 per 1M tokens | Llama-3.2-3B, Qwen-2.5-7B |
| Medium (10-40B, plus Turbo variants of larger models) | $0.60-1.20 per 1M tokens | Mixtral-8x7B, Llama-3.3-70B-Turbo |
| Large (40B+) | $2.00-5.00 per 1M tokens | Llama-3.1-405B, DeepSeek-V3 |
| Image generation | $0.003-0.05 per image | FLUX.1-schnell, SDXL |
| Embeddings | $0.008 per 1M tokens | M2-BERT |
| Fine-tuning | ~$5-25 per hour | Depends on model + GPU |
| Batch inference | 50% off per-token rates | Same models, async |
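Before locking in a tier, it helps to put numbers on a workload. A back-of-envelope cost helper using mid-band rates from the table above (the rate values are illustrative; check Together's pricing page for current figures):

```python
# Illustrative $ per 1M tokens, taken from the mid-points of the table
# above -- not an official price list.
PRICE_PER_1M = {
    "meta-llama/Llama-3.2-3B-Instruct-Turbo": 0.20,    # small tier
    "meta-llama/Llama-3.3-70B-Instruct-Turbo": 0.90,   # medium band
    "meta-llama/Llama-3.1-405B-Instruct-Turbo": 3.50,  # large tier
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int,
                  batch: bool = False) -> float:
    """Approximate dollar cost of one request; batch jobs run at ~50% off."""
    rate = PRICE_PER_1M[model] / 1_000_000
    cost = (prompt_tokens + completion_tokens) * rate
    return cost * 0.5 if batch else cost

# 10,000 batched requests at ~1,500 tokens each on the 70B Turbo:
total = 10_000 * estimate_cost(
    "meta-llama/Llama-3.3-70B-Instruct-Turbo", 1_000, 500, batch=True
)
print(f"~${total:.2f}")  # ~$6.75
```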
```python
# Assumes `client` from the setup sketch above and `file_id` from a
# previously uploaded JSONL file of requests.
from functools import lru_cache

# 1. Use Turbo variants (faster, cheaper, similar quality), e.g.
#    meta-llama/Llama-3.3-70B-Instruct-Turbo instead of
#    meta-llama/Llama-3.1-70B-Instruct.

# 2. Batch inference (50% cost reduction, asynchronous completion).
#    The call shape below mirrors an OpenAI-style batch API; check
#    Together's batch docs for the exact parameters.
batch_response = client.batch.create(
    input_file_id=file_id,
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    completion_window="24h",
)

# 3. Cache responses for identical prompts.
@lru_cache(maxsize=1000)
def cached_completion(prompt: str, model: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# 4. Use the smallest model that works: test with 3B first and upgrade
#    to 70B only if quality is insufficient.
```
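Tip 4 can be automated: serve from the small model and escalate only when a cheap quality gate fails. A sketch reusing `cached_completion` from above (`looks_complete` is a hypothetical stand-in for a task-specific check):

```python
SMALL = "meta-llama/Llama-3.2-3B-Instruct-Turbo"
LARGE = "meta-llama/Llama-3.3-70B-Instruct-Turbo"

def looks_complete(text: str) -> bool:
    # Hypothetical quality gate -- replace with a task-specific check
    # such as a JSON parse, regex match, or length bound.
    return len(text.strip()) > 20

def tiered_completion(prompt: str) -> str:
    """Try the cheap model first; escalate to the large one only if
    the answer fails the quality gate."""
    answer = cached_completion(prompt, SMALL)
    if looks_complete(answer):
        return answer
    return cached_completion(prompt, LARGE)
```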
Troubleshooting:

| Issue | Cause | Solution |
|---|---|---|
| High costs | Wrong model tier | Downsize model |
| Batch failures | Invalid input format | Validate JSONL before upload (see sketch below) |
| Fine-tuning expensive | Too many epochs | Start with 1-2 epochs |
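A batch job fails as a unit when its input file is malformed, so validating the JSONL before upload is cheap insurance. A minimal validator sketch (the required field names are assumptions modeled on OpenAI-style batch inputs; match them to Together's batch schema):

```python
import json

def validate_batch_jsonl(path: str) -> list[str]:
    """Return per-line problems; an empty list means the file looks OK."""
    problems: list[str] = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                problems.append(f"line {lineno}: blank line")
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc})")
                continue
            # Assumed required fields -- verify against Together's docs.
            for field in ("custom_id", "body"):
                if field not in record:
                    problems.append(f"line {lineno}: missing '{field}'")
    return problems

for problem in validate_batch_jsonl("batch_requests.jsonl"):
    print(problem)
```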
For architecture patterns, see together-reference-architecture.