From togetherai-skills
Runs LLM-as-a-judge evaluations on Together AI to classify, score, and compare model outputs with external judges/targets. Polls results and downloads reports for benchmarks, grading, A/B variants.
Install with npx claudepluginhub togethercomputer/skills. This skill uses the workspace's default tool permissions.
Related skills on the hub cover similar ground:
- Implements LLM-as-a-Judge techniques: direct scoring, pairwise comparison, rubric generation, bias mitigation. For building eval systems, comparing model outputs, setting AI quality standards.
- Evaluates LLM apps using automated metrics (BLEU, ROUGE, BERTScore, MRR), human feedback, and LLM-as-judge. For testing performance, benchmarking, and regressions.
- Implements LLM-as-judge techniques including direct scoring, pairwise comparison, and bias mitigation for evaluating LLM outputs in production pipelines.
Use Together AI evaluations when the user wants a managed LLM-as-a-judge workflow rather than an ad hoc prompt loop.
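The managed loop this replaces (submit an evaluation, poll until it finishes, download the report) can be sketched against the evaluations REST API. Treat the endpoint paths, payload fields, and status values below as assumptions for illustration; the skill's scripts wrap the real API, so verify details against Together's evaluation docs before reusing any of this directly.

```python
# Sketch of the submit -> poll -> download loop the skill automates.
# NOTE: endpoint paths, payload fields, and status values are illustrative
# assumptions, not a verified API surface.
import os
import time

import requests

BASE = "https://api.together.xyz/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

# 1. Submit an LLM-as-a-judge evaluation (a "score" run here).
submit = requests.post(
    f"{BASE}/evaluation",                       # assumed path
    headers=HEADERS,
    json={
        "type": "score",                        # classify | score | compare
        "input_data_file_path": "file-abc123",  # an already-uploaded dataset
        "judge_model": "deepseek-ai/DeepSeek-V3",  # example judge model
        "min_score": 1,
        "max_score": 10,
    },
)
submit.raise_for_status()
eval_id = submit.json()["workflow_id"]          # assumed response field

# 2. Poll until the run reaches a terminal state.
while True:
    status = requests.get(f"{BASE}/evaluation/{eval_id}", headers=HEADERS).json()
    if status.get("status") in ("completed", "failed", "error"):
        break
    time.sleep(15)

# 3. The finished run references a per-row results file to download as JSONL.
print(status)
```

The scripts expose the same three steps behind flags (for example --download-results for the last step), so this is only a mental model of what they automate.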
Core evaluation types: classify, score, and compare.
This skill also covers external providers used as judges or targets when the workflow still runs through Together AI's evaluation system.
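To make the external-judge case concrete, here is a hypothetical payload shape for a compare run judged by a model hosted outside Together. Every field name is illustrative; the scripts assemble the real request from flags such as --judge-model-source external and the column flags listed below.

```python
# Hypothetical request body for a compare-type evaluation judged by an
# external (non-Together-hosted) model. Field names are illustrative only;
# the skill's scripts assemble the real payload from their CLI flags.
compare_eval = {
    "type": "compare",
    "input_data_file_path": "file-abc123",  # uploaded eval dataset
    "model_a_column": "variant_a_output",   # cf. --model-a-column
    "model_b_column": "variant_b_output",   # cf. --model-b-column
    "judge": {
        "model": "gpt-4o",                  # judge hosted outside Together
        "source": "external",               # cf. --judge-model-source external
        "api_key_env": "OPENAI_API_KEY",    # hypothetical credential handoff
    },
    "judge_prompt": "Pick the answer that better follows the instructions.",
}
```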
Use another skill from this pack when the task is not a managed evaluation:
- together-chat-completions for one-off inference or manual judge prompts
- together-batch-inference for bulk offline generation rather than evaluation
- together-fine-tuning when the user wants to improve the model instead of just measure it
- together-dedicated-endpoints only if the evaluation target itself is a dedicated endpoint

Key flags in the scripts:
- --eval-column, --model-a-column, or --model-b-column to select the dataset columns being judged
- --judge-model-source external, --eval-model-source external, or the compare-side source flags for external judges and targets
- --download-results when you want the per-row JSONL locally

Requirements and notes:
- Requires the Together Python SDK v2 (together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0".
- Use check=False for eval uploads because local file validation can misclassify eval datasets (see the sketch after this list).
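A minimal sketch of the last two notes, assuming files.upload accepts the check flag described above and that "eval" is a valid purpose value for evaluation datasets (verify both against the installed SDK):

```python
# Guard the SDK version, then upload the eval dataset without local
# validation (check=False), per the notes above.
from importlib.metadata import version

from together import Together

# Require the v2 SDK before doing anything else.
if int(version("together").split(".")[0]) < 2:
    raise SystemExit('Upgrade first: uv pip install --upgrade "together>=2.0.0"')

client = Together()  # reads TOGETHER_API_KEY from the environment

# check=False skips local file validation, which can misclassify eval datasets.
uploaded = client.files.upload(
    file="eval_dataset.jsonl",
    purpose="eval",   # assumption: the purpose value for evaluation datasets
    check=False,
)
print(uploaded.id)
```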