From togetherai-skills
Runs LLM-as-a-judge evaluations on Together AI to classify, score, and compare model outputs with external judges/targets. Polls results and downloads reports for benchmarks, grading, A/B variants.
Install with npx claudepluginhub togethercomputer/skills. This skill uses the workspace's default tool permissions.
Related skills on the hub cover similar ground:
- Implements LLM-as-a-Judge techniques: direct scoring, pairwise comparison, rubric generation, bias mitigation. For building eval systems, comparing model outputs, setting AI quality standards.
- Evaluates LLM apps using automated metrics (BLEU, ROUGE, BERTScore, MRR), human feedback, and LLM-as-judge. For testing performance, benchmarking, and regressions.
- Implements LLM-as-judge techniques including direct scoring, pairwise comparison, and bias mitigation for evaluating LLM outputs in production pipelines.
Use Together AI evaluations when the user wants a managed LLM-as-a-judge workflow rather than an ad hoc prompt loop.
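The managed loop this replaces (submit an evaluation, poll until it finishes, download the report) can be sketched against the evaluations REST API. Treat the endpoint paths, payload fields, and status values below as assumptions for illustration; the skill's scripts wrap the real API, so verify details against Together's evaluation docs before reusing any of this directly.

```python
# Sketch of the submit -> poll -> download loop the skill automates.
# NOTE: endpoint paths, payload fields, and status values are illustrative
# assumptions, not a verified API surface.
import os
import time

import requests

BASE = "https://api.together.xyz/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

# 1. Submit an LLM-as-a-judge evaluation (a "score" run here).
submit = requests.post(
    f"{BASE}/evaluation",                       # assumed path
    headers=HEADERS,
    json={
        "type": "score",                        # classify | score | compare
        "input_data_file_path": "file-abc123",  # an already-uploaded dataset
        "judge_model": "deepseek-ai/DeepSeek-V3",  # example judge model
        "min_score": 1,
        "max_score": 10,
    },
)
submit.raise_for_status()
eval_id = submit.json()["workflow_id"]          # assumed response field

# 2. Poll until the run reaches a terminal state.
while True:
    status = requests.get(f"{BASE}/evaluation/{eval_id}", headers=HEADERS).json()
    if status.get("status") in ("completed", "failed", "error"):
        break
    time.sleep(15)

# 3. The finished run references a per-row results file to download as JSONL.
print(status)
```

The scripts expose the same three steps behind flags (for example --download-results for the last step), so this is only a mental model of what they automate.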
Core evaluation types: classify, score, and compare.
This skill also covers external providers used as judges or targets when the workflow still runs through Together AI's evaluation system.
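To make the external-judge case concrete, here is a hypothetical payload shape for a compare run judged by a model hosted outside Together. Every field name is illustrative; the scripts assemble the real request from flags such as --judge-model-source external and the column flags listed below.

```python
# Hypothetical request body for a compare-type evaluation judged by an
# external (non-Together-hosted) model. Field names are illustrative only;
# the skill's scripts assemble the real payload from their CLI flags.
compare_eval = {
    "type": "compare",
    "input_data_file_path": "file-abc123",  # uploaded eval dataset
    "model_a_column": "variant_a_output",   # cf. --model-a-column
    "model_b_column": "variant_b_output",   # cf. --model-b-column
    "judge": {
        "model": "gpt-4o",                  # judge hosted outside Together
        "source": "external",               # cf. --judge-model-source external
        "api_key_env": "OPENAI_API_KEY",    # hypothetical credential handoff
    },
    "judge_prompt": "Pick the answer that better follows the instructions.",
}
```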
Use another skill from this pack when the task is not a managed evaluation:
- together-chat-completions for one-off inference or manual judge prompts
- together-batch-inference for bulk offline generation rather than evaluation
- together-fine-tuning when the user wants to improve the model instead of just measure it
- together-dedicated-endpoints only if the evaluation target itself is a dedicated endpoint

Key flags in the scripts:
- --eval-column, --model-a-column, or --model-b-column to select the dataset columns being judged
- --judge-model-source external, --eval-model-source external, or the compare-side source flags for external judges and targets
- --download-results when you want the per-row JSONL locally

Requirements and notes:
- Requires the Together Python SDK v2 (together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0".
- Use check=False for eval uploads because local file validation can misclassify eval datasets (see the sketch after this list).
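A minimal sketch of the last two notes, assuming files.upload accepts the check flag described above and that "eval" is a valid purpose value for evaluation datasets (verify both against the installed SDK):

```python
# Guard the SDK version, then upload the eval dataset without local
# validation (check=False), per the notes above.
from importlib.metadata import version

from together import Together

# Require the v2 SDK before doing anything else.
if int(version("together").split(".")[0]) < 2:
    raise SystemExit('Upgrade first: uv pip install --upgrade "together>=2.0.0"')

client = Together()  # reads TOGETHER_API_KEY from the environment

# check=False skips local file validation, which can misclassify eval datasets.
uploaded = client.files.upload(
    file="eval_dataset.jsonl",
    purpose="eval",   # assumption: the purpose value for evaluation datasets
    check=False,
)
print(uploaded.id)
```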