From r-package-skills
Use when code loads or uses vitals (`library(vitals)`, `vitals::`), evaluating LLM output quality, scoring AI responses, testing RAG retrieval accuracy, or benchmarking prompt changes in R.
`npx claudepluginhub arthurgailes/r-package-skills --plugin r-package-skills`

This skill uses the workspace's default tool permissions.
**vitals tests LLM output quality.** Create test datasets, define solvers (LLM pipelines), score outputs. Benchmark RAG systems, prompt changes, model performance.
Evaluates LLM apps using automated metrics (BLEU, ROUGE, BERTScore, MRR, perplexity), human feedback, and LLM-as-judge grading. Use it for performance testing, benchmarking, and catching regressions.
Install: `install.packages("vitals")`
Read references/API.md before writing code.
- references/API.md - Complete function reference
- references/package-docs.md - Test suite creation and scoring patterns

```r
library(vitals)
library(ellmer)

# Create a test dataset: one row per case, with input and target columns
test_cases <- tibble::tibble(
  input = c("question 1", "question 2"),
  target = c("answer 1", "answer 2")
)

# Define a solver (your LLM pipeline): generate() turns an ellmer chat
# into a solver that runs each input through the model
chat <- chat_openai()
solver <- generate(chat)

# Run the evaluation
task <- Task$new(
  dataset = test_cases,
  solver = solver,
  scorer = model_graded_qa()  # an LLM judge grades each response
)
task$eval()  # solve, score, log, and open the results viewer

# Test a RAG system: register ragnar's retrieval tool on the chat,
# then evaluate with that chat backing the solver
ragnar_register_tool_retrieve(chat, store)  # store: a ragnar store built earlier
task$eval(solver_chat = chat)
```
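When the target is an exact string, an LLM judge is overkill. A deterministic-scoring sketch, assuming your vitals version ships the `detect_includes()` string-matching scorer (check references/API.md for the scorers actually available):

```r
library(vitals)
library(ellmer)

# Small dataset with unambiguous, checkable answers
arithmetic <- tibble::tibble(
  input = c("What is 2 + 2?", "What is 10 / 5?"),
  target = c("4", "2")
)

tsk <- Task$new(
  dataset = arithmetic,
  solver = generate(chat_openai()),
  scorer = detect_includes()  # passes if the target string appears in the output
)
tsk$eval()
```

String-matching scorers are cheaper and fully reproducible, so prefer them whenever the expected answer can be checked literally.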
| Issue | Solution |
|---|---|
| No test dataset | Create a tibble with `input` and `target` columns |
| Solver is not a function | Use `generate(chat)`, or wrap the chat in a function that returns the response text |
| Using vitals for non-LLM tests | Use testthat for traditional unit testing |
| Forgetting `echo = "none"` | A hand-written solver should return text, not print it |
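The hand-written solver pattern from the table can be sketched as below (following the solver shape this document describes; check references/API.md for the exact signature your vitals version expects):

```r
library(vitals)
library(ellmer)

chat <- chat_openai()

# A solver wraps your LLM pipeline in a function of the input;
# echo = "none" stops chat$chat() from printing so the response
# text is returned instead
solver <- function(input) {
  chat$chat(input, echo = "none")
}
```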
Task Management:
- `Task$new()`: Create an evaluation task
- `task$eval()`: Run the solver and scorer, then log and view results

Scorers:
- `model_graded_qa()`: An LLM grades Q&A quality

See references/ for:
- With ellmer: Test chat quality
- With ragnar: Evaluate RAG accuracy
- Cross-package patterns: See the r-ai meta-skill
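To benchmark a prompt change, evaluate one task per prompt variant and compare the runs. A minimal sketch (the system prompts here are illustrative, and ellmer's `chat_openai(system_prompt = ...)` argument is assumed):

```r
library(vitals)
library(ellmer)

dataset <- tibble::tibble(
  input = c("What is the capital of France?"),
  target = c("Paris")
)

# Baseline prompt
tsk_v1 <- Task$new(
  dataset = dataset,
  solver = generate(chat_openai(system_prompt = "Answer concisely.")),
  scorer = model_graded_qa()
)
tsk_v1$eval()

# Revised prompt: same dataset and scorer, so any score change
# is attributable to the prompt
tsk_v2 <- Task$new(
  dataset = dataset,
  solver = generate(chat_openai(system_prompt = "Think step by step, then answer.")),
  scorer = model_graded_qa()
)
tsk_v2$eval()
```

If your vitals version exports `vitals_bind()`, the evaluated tasks can be combined into a single tibble of per-sample scores for side-by-side analysis.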