LLM testing and evaluation tools, including promptfoo, TruLens, and evals frameworks.
Evaluates LLM prompts and applications using the promptfoo and TruLens frameworks.
This skill provides expertise in testing, evaluating, and monitoring LLM-based systems for robotics agents and AI assistants.
promptfoo is a CLI tool for testing and evaluating LLM prompts systematically.
```bash
# Via npm
npm install -g promptfoo

# Or via npx (no install)
npx promptfoo@latest

# Via Nix
nix profile install nixpkgs#promptfoo
```
```yaml
# promptfooconfig.yaml
description: "Robot command parser evaluation"

providers:
  - openai:gpt-4
  - anthropic:claude-3-sonnet-20240229
  - ollama:llama2

prompts:
  - file://prompts/robot_command.txt
  - |
    Parse the following user command into a structured robot action:
    Command: {{command}}
    Output JSON with: action, target, parameters

tests:
  - vars:
      command: "Move forward 2 meters"
    assert:
      - type: contains-json
      - type: javascript
        value: output.action === 'move'

  - vars:
      command: "Pick up the red box"
    assert:
      - type: contains-json
      - type: javascript
        value: output.action === 'pick' && output.target === 'red box'

  - vars:
      command: "Navigate to the charging station"
    assert:
      - type: llm-rubric
        value: "The output should identify navigation as the action and charging station as the destination"
```
Common assertion types:

```yaml
tests:
  # Exact match
  - vars: { input: "hello" }
    assert:
      - type: equals
        value: "Hello! How can I help?"

  # Contains substring
  - assert:
      - type: contains
        value: "error"

  # JSON validation
  - assert:
      - type: is-json
      - type: contains-json
        value:
          action: "move"

  # JavaScript evaluation
  - assert:
      - type: javascript
        value: |
          const parsed = JSON.parse(output);
          return parsed.confidence > 0.8;

  # LLM-as-judge
  - assert:
      - type: llm-rubric
        value: "Response should be helpful and accurate"

  # Latency check
  - assert:
      - type: latency
        threshold: 2000  # ms

  # Cost check
  - assert:
      - type: cost
        threshold: 0.01  # dollars

  # Similarity
  - assert:
      - type: similar
        value: "expected response"
        threshold: 0.8
```
```bash
# Run evaluation
promptfoo eval

# With specific config
promptfoo eval -c promptfooconfig.yaml

# View results in browser
promptfoo view

# Output formats
promptfoo eval --output results.json
promptfoo eval --output results.csv

# Compare providers
promptfoo eval --table
```
```yaml
# .github/workflows/llm-tests.yml
name: LLM Evaluation

on:
  push:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install promptfoo
        run: npm install -g promptfoo

      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          promptfoo eval --output results.json

      - name: Check for failures
        run: |
          if jq -e '.results.stats.failures > 0' results.json; then
            echo "LLM tests failed!"
            exit 1
          fi

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: llm-eval-results
          path: results.json
```
```yaml
# promptfooconfig.yaml
description: "Robot NLU evaluation suite"

providers:
  - id: openai:gpt-4
    config:
      temperature: 0
  - id: anthropic:claude-3-sonnet-20240229
    config:
      temperature: 0

prompts:
  - |
    You are a robot command parser. Convert natural language commands to JSON.
    Output format:
    {
      "action": "move|pick|place|navigate|stop|rotate",
      "parameters": {...},
      "confidence": 0.0-1.0
    }
    Command: {{command}}

tests:
  # Movement commands
  - vars:
      command: "Go forward 3 meters"
    assert:
      - type: javascript
        value: |
          const r = JSON.parse(output);
          return r.action === 'move' && r.parameters.distance === 3;

  - vars:
      command: "Turn left 90 degrees"
    assert:
      - type: javascript
        value: |
          const r = JSON.parse(output);
          return r.action === 'rotate' && r.parameters.angle === 90;

  # Manipulation commands
  - vars:
      command: "Pick up the blue cube from the table"
    assert:
      - type: javascript
        value: |
          const r = JSON.parse(output);
          return r.action === 'pick' &&
                 r.parameters.object.includes('blue') &&
                 r.parameters.location === 'table';

  # Navigation commands
  - vars:
      command: "Go to the kitchen"
    assert:
      - type: javascript
        value: |
          const r = JSON.parse(output);
          return r.action === 'navigate' &&
                 r.parameters.destination === 'kitchen';

  # Edge cases
  - vars:
      command: "Do a backflip"
    assert:
      - type: javascript
        value: |
          const r = JSON.parse(output);
          return r.confidence < 0.5 || r.action === 'unknown';
```
TruLens provides feedback functions and tracing for LLM applications.
```bash
pixi add trulens-eval
```
```python
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider import OpenAI

# Initialize
tru = Tru()

# `chain` is an existing LangChain chain (e.g. an LLMChain)
from langchain.chains import LLMChain

# Create feedback functions
provider = OpenAI()
f_relevance = Feedback(provider.relevance).on_input_output()
f_coherence = Feedback(provider.coherence).on_output()
f_groundedness = (
    Feedback(
        Groundedness(groundedness_provider=provider).groundedness_measure_with_cot_reasons
    )
    .on(TruChain.select_context(chain))  # retrieved context
    .on_output()                         # response
)

# Wrap your LLM app
tru_chain = TruChain(
    chain,
    app_id="robot_assistant",
    feedbacks=[f_relevance, f_coherence, f_groundedness],
)

# Run with evaluation
with tru_chain as recording:
    response = chain("Navigate to the warehouse")

# View results
tru.run_dashboard()
```
```python
import json

from trulens_eval import Feedback, Select
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# Custom: Check if response is a valid robot command
def is_valid_robot_command(response: str) -> float:
    """Check if the response is a valid robot command JSON."""
    try:
        data = json.loads(response)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    required = ['action', 'parameters', 'confidence']
    if all(k in data for k in required):
        return 1.0
    return 0.5

f_valid_command = Feedback(is_valid_robot_command).on_output()

# Custom: Safety check
def safety_check(response: str) -> float:
    """Check if a command is safe to execute."""
    unsafe_keywords = ['maximum speed', 'override safety', 'ignore obstacle']
    response_lower = response.lower()
    for keyword in unsafe_keywords:
        if keyword in response_lower:
            return 0.0
    return 1.0

f_safety = Feedback(safety_check).on_output()
```
```python
import numpy as np

from trulens_eval import Feedback, Select
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# For RAG systems (assumes the app exposes a `retrieve` method)
groundedness = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(groundedness.groundedness_measure_with_cot_reasons)
    .on(Select.RecordCalls.retrieve.rets.collect())  # Context
    .on_output()                                     # Response
    .aggregate(groundedness.grounded_statements_aggregator)
)

f_context_relevance = (
    Feedback(provider.qs_relevance)
    .on_input()
    .on(Select.RecordCalls.retrieve.rets[:])
    .aggregate(np.mean)
)
```
**Traditional metrics:**

| Metric | Description | Use Case |
|---|---|---|
| Accuracy | Exact match rate | Classification |
| F1 Score | Precision/recall balance | Entity extraction |
| BLEU | N-gram overlap | Translation |
| ROUGE | Recall-based overlap | Summarization |
| BERTScore | Semantic similarity | General text |
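A minimal sketch of computing two of these metrics directly, assuming the `nltk` and `rouge-score` packages are installed (`pip install nltk rouge-score`); the reference and candidate strings are illustrative:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "navigate to the charging station and dock"
candidate = "go to the charging station and dock"

# BLEU: n-gram precision overlap between candidate and reference tokens
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,  # avoids zero scores on short texts
)

# ROUGE-L: longest-common-subsequence overlap, common for summarization
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}, ROUGE-L: {rouge_l:.3f}")
```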
**LLM-as-judge metrics:**

| Metric | Description | Use Case |
|---|---|---|
| Relevance | Response addresses query | All |
| Coherence | Logical flow | Long-form |
| Groundedness | Factual accuracy | RAG |
| Helpfulness | Utility to user | Assistants |
| Harmlessness | Safety check | All |
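promptfoo's `llm-rubric` and TruLens feedback functions cover these out of the box; for a standalone judge, here is a hedged sketch using the `openai` Python SDK (>= 1.0). The model name, rubric wording, and 0-10 scale are illustrative choices, not a fixed API:

```python
from openai import OpenAI

client = OpenAI()

def judge_helpfulness(query: str, response: str) -> float:
    """Score a response 0.0-1.0 for helpfulness using an LLM grader."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Rate how helpful this response is to the query on a 0-10 scale. "
                "Reply with the number only.\n"
                f"Query: {query}\nResponse: {response}"
            ),
        }],
    )
    return float(result.choices[0].message.content.strip()) / 10.0
```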
**Robotics-specific metrics:**

| Metric | Description | Use Case |
|---|---|---|
| Command Validity | Parseable to robot action | NLU |
| Safety Score | No dangerous commands | All |
| Confidence Calibration | Uncertainty estimation | Decision-making |
| Latency | Response time | Real-time |
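Command validity and safety are covered by the custom feedback functions above. Confidence calibration can be spot-checked offline; a minimal sketch of expected calibration error (ECE) over logged `(confidence, was_correct)` pairs, with an illustrative ten-bin scheme:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Mean gap between stated confidence and empirical accuracy, weighted by bin size."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # bin weight * |avg confidence - empirical accuracy| in the bin
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece

print(expected_calibration_error([0.9, 0.8, 0.95, 0.4], [1, 1, 0, 0]))
```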
A suggested layout for organizing prompt tests by capability (a sample test file follows the tree):

```text
prompts/
├── robot_commands/
│   ├── movement.yaml
│   ├── manipulation.yaml
│   └── navigation.yaml
├── safety/
│   ├── boundary_checks.yaml
│   └── adversarial.yaml
└── performance/
    └── latency_tests.yaml
```
```text
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Develop   │────>│  Evaluate   │────>│   Deploy    │
│   Prompts   │     │   (CI/CD)   │     │   (Gated)   │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       v                   v                   v
 Local testing      Automated tests       Production
  (promptfoo)        (promptfoo +         monitoring
                       TruLens)            (TruLens)
```
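For the production-monitoring leg, a minimal sketch under the same trulens_eval setup as above; the `"safety_check"` column name assumes the `Feedback(safety_check)` definition earlier, and the 0.99 threshold is an illustrative choice:

```python
from trulens_eval import Tru

tru = Tru()

# Aggregate mean feedback scores per app from the TruLens database
leaderboard = tru.get_leaderboard(app_ids=["robot_assistant"])
print(leaderboard)

# Alert when the mean safety feedback score drifts below threshold
if leaderboard.loc["robot_assistant", "safety_check"] < 0.99:
    raise RuntimeError("Safety feedback degraded in production")
```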