Runs accuracy (FlagEval) and performance benchmarks (vllm bench serve) on served vLLM models across 5 profiles: short/long prefill/decode + concurrency. Collects throughput, latency, TTFT, TPOT metrics.
Start vLLM serve with the target model, run accuracy benchmarks (when FlagEval is available) and performance benchmarks (vllm bench serve) across multiple profiles.
perf-test/
├── SKILL.md # This file — execution flow
├── scripts/
│ ├── run_benchmark.py # Run single benchmark profile (JSON output)
│ └── run_all_benchmarks.py # Run all 5 profiles, collect + summarize (JSON)
└── references/
└── benchmark-profiles.md # Profile definitions, metrics, vllm bench usage
Reused from env-verify:
env-verify/scripts/test_serve_mode.py — can be used to verify server is healthy
before benchmarking (optional pre-check).

If invoked standalone, ask for the container name, model path, TP size, and stack config (full vs base).
If invoked from /flagrelease, these are passed as context.
Use the stack recommended by model-verify. Read references/benchmark-profiles.md
for the vllm serve command pattern.
docker exec -d <CONTAINER> bash -c '
export USE_FLAGGEMS=<0|1>
export FLAGCX_PATH=<path_or_unset>
export VLLM_PLUGINS=<fl_or_unset>
vllm serve <MODEL_PATH> \
--tensor-parallel-size <TP_SIZE> \
--max-num-batched-tokens 4096 \
--max-num-seqs 256 \
--trust-remote-code \
--port 8000 \
<EXTRA_ARGS>
'
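As a concrete sketch, a full-stack run on 8 GPUs might look like this (container name, model path, and FLAGCX_PATH value are placeholders, not shipped defaults):

```bash
# Example full-stack invocation (all concrete values are illustrative)
docker exec -d flagos-test bash -c '
export USE_FLAGGEMS=1
export FLAGCX_PATH=/usr/local/flagcx
export VLLM_PLUGINS=fl
vllm serve /data/models/Qwen2.5-7B-Instruct \
  --tensor-parallel-size 8 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 256 \
  --trust-remote-code \
  --port 8000
'
```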
Wait for server ready (poll /health, timeout 300s):
docker exec <CONTAINER> bash -c '
# curl -f fails on HTTP errors; /health returns an empty 200 body when ready,
# so check the exit code instead of grepping the (empty) response body
for i in $(seq 1 150); do
  if curl -sf http://localhost:8000/health >/dev/null 2>&1; then
    echo "SERVER_READY"; exit 0
  fi
  sleep 2
done
echo "SERVER_TIMEOUT"
'
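If SERVER_READY never appears, the serve log is the first place to look. A sketch, assuming the serve command above was started with its output redirected (e.g. appending `> /tmp/vllm_serve.log 2>&1` before the closing quote):

```bash
# Tail the server log to diagnose startup failures
# (log path is an assumption; it exists only if the serve command redirected output there)
docker exec <CONTAINER> tail -n 100 /tmp/vllm_serve.log
```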
If the server doesn't start, report the error and exit.

Query the served model name (this becomes <MODEL_NAME> for the benchmark scripts):
docker exec <CONTAINER> bash -c '
curl -s http://localhost:8000/v1/models | python3 -c "
import json, sys; print(json.load(sys.stdin)[\"data\"][0][\"id\"])"
'
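To reuse the served id in later commands, it can be captured host-side (a minimal sketch):

```bash
# Capture the served model id into a shell variable for reuse
MODEL_NAME=$(docker exec <CONTAINER> bash -c '
curl -s http://localhost:8000/v1/models | python3 -c "
import json, sys; print(json.load(sys.stdin)[\"data\"][0][\"id\"])"
')
echo "Serving model: $MODEL_NAME"
```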
STATUS: FlagEval test client not yet available.
Current behavior: report the accuracy test as SKIPPED.
When FlagEval becomes available, update this section accordingly.
Copy scripts into the container and run:
docker cp <SKILL_DIR>/scripts/run_benchmark.py <CONTAINER>:/tmp/
docker cp <SKILL_DIR>/scripts/run_all_benchmarks.py <CONTAINER>:/tmp/
docker exec <CONTAINER> python3 /tmp/run_all_benchmarks.py \
--model <MODEL_NAME> \
--tokenizer <MODEL_PATH> \
--port 8000 \
--output-dir /data/results/perf
The script runs all 5 default profiles (see references/benchmark-profiles.md),
saves per-profile JSON to /data/results/perf/, and outputs a combined JSON report
with a summary table.
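Each profile ultimately maps onto a `vllm bench serve` run; a single short-prefill profile might look roughly like this (lengths, prompt count, and result paths are illustrative; the real definitions live in references/benchmark-profiles.md):

```bash
# Sketch of one profile's underlying benchmark command (values are illustrative)
docker exec <CONTAINER> vllm bench serve \
  --model <MODEL_NAME> \
  --tokenizer <MODEL_PATH> \
  --port 8000 \
  --dataset-name random \
  --random-input-len 128 \
  --random-output-len 128 \
  --num-prompts 100 \
  --save-result \
  --result-dir /data/results/perf
```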
Important: One profile failure does NOT skip remaining profiles.
Stop the vLLM server once benchmarking is done:

docker exec <CONTAINER> bash -c 'pkill -f "vllm serve" || true'

Report the stage result as JSON:
{
"status": "PASS | PARTIAL | FAIL",
"stage": "perf-test",
"model": "<MODEL_PATH>",
"tensor_parallel_size": 8,
"flags": {"USE_FLAGGEMS": "1|0", "FLAGCX_PATH": "..."},
"accuracy": {
"status": "SKIPPED",
"reason": "FlagEval test client not yet available"
},
"performance": {
"status": "PASS | PARTIAL | FAIL",
"profiles_passed": "5/5",
"profiles": [ "...per-profile results..." ],
"summary_table": "...markdown table..."
}
}
Present the summary table to the user:
| Profile | Input | Output | Prompts | Req/s | Tok/s | TTFT(ms) | TPOT(ms) | P99(ms) | Status |
|---------|-------|--------|---------|-------|-------|----------|----------|---------|--------|
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
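The Status column and the top-level status follow the rules listed below; as a sketch, the top-level value could be derived from the combined report with jq (the report filename and the per-profile "status" field are assumptions based on the JSON skeleton above):

```bash
# Derive PASS/PARTIAL/FAIL from per-profile results
# (filename and schema are assumptions, not the shipped report format)
jq -r '
  if (.performance.profiles | length) == 0 then "FAIL"
  elif all(.performance.profiles[]; .status == "PASS") then "PASS"
  elif any(.performance.profiles[]; .status == "PASS") then "PARTIAL"
  else "FAIL" end
' /data/results/perf/combined.json
```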
Status logic:
PASS — all profiles completed
PARTIAL — some passed, some failed
FAIL — server didn't start or all profiles failed

| Failure | Behavior |
|---|---|
| Server fails to start | Report error; exit |
| vllm bench serve not found | Report vllm version issue |
| Single profile fails | Report error, continue remaining profiles |
| Single profile times out | Kill after 600s, report partial, continue |
| Server crashes mid-benchmark | Capture logs, report which profile caused crash |
| OOM during high concurrency | Report, suggest reducing num_prompts |
| Operation | Timeout |
|---|---|
| Server startup | 300s |
| Per profile benchmark | 600s |
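If the runner does not enforce the per-profile budget itself, coreutils timeout can impose it from outside (a sketch; `<PROFILE_ARGS>` is a stand-in for run_benchmark.py's actual flags):

```bash
# Kill a single profile run after 600s (SIGTERM, then SIGKILL 30s later);
# exit code 124 indicates a timeout
docker exec <CONTAINER> timeout -k 30 600 python3 /tmp/run_benchmark.py <PROFILE_ARGS> \
  || echo "PROFILE_FAILED_OR_TIMED_OUT"
```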