Rhesis: Collaborative Testing for LLM & Agentic Applications
Website · Docs · Discord · Changelog
More than just evals.
Collaborative agent testing for teams.
Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together.
Core features
Test generation
AI-Powered Synthesis - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts.
Knowledge-Aware - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) for better test generation.
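For code-first use, here is a minimal sketch of plain-language test generation; it assumes the SDK exposes a `PromptSynthesizer` interface, so verify the import path and signatures against the current docs:

```python
# Minimal sketch, assuming the SDK's PromptSynthesizer interface;
# verify names and signatures against the current release.
from rhesis.sdk.synthesizers import PromptSynthesizer

# Plain-language requirements in, generated test scenarios out,
# including edge cases and adversarial prompts.
test_set = PromptSynthesizer(
    prompt="Generate tests for an insurance chatbot that answers claim questions."
).generate(num_tests=50)
```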
Single-turn & conversation simulation
Use single-turn tests for quick Q&A validation and conversation simulation for multi-turn dialogue flows.
Penelope Agent simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions.
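Purely as an illustration (the simulator class and arguments below are hypothetical names, not the documented Penelope API), a multi-turn simulation could be driven like this:

```python
# Hypothetical sketch: ConversationSimulator and its arguments are
# illustrative names only, not the documented Penelope API.
from rhesis.sdk.simulation import ConversationSimulator  # assumed module

simulator = ConversationSimulator(
    persona="frustrated customer disputing a denied claim",
    goal="get a clear explanation of the denial decision",
    max_turns=8,
)
# `my_chatbot` stands in for your application's response function.
transcript = simulator.run(target=my_chatbot)
```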
Adversarial testing (red-teaming)
Polyphemus Agent proactively finds vulnerabilities:
- Jailbreak attempts and prompt injection
- PII leakage and data extraction
- Harmful content generation
- Role violation and instruction bypassing
Garak Integration - Built-in support for garak, the LLM vulnerability scanner, for comprehensive security testing.
60+ pre-built metrics
| Framework | Example Metrics |
|---|---|
| RAGAS | Context relevance, faithfulness, answer accuracy |
| DeepEval | Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention |
| Garak | Jailbreak detection, prompt injection, XSS, malware generation, data leakage |
| Custom | NumericJudge, CategoricalJudge for domain-specific evaluation |
All metrics include LLM-as-Judge reasoning explanations.
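As a sketch of a domain-specific judge (the import path and constructor arguments are assumptions; check the metrics docs for the actual API):

```python
# Assumed interface for a custom judge metric; not the verified SDK API.
from rhesis.sdk.metrics import NumericJudge  # assumed import path

helpfulness = NumericJudge(
    name="helpfulness",
    description="Rate how helpful the answer is for the user's question.",
    min_score=1,
    max_score=10,
)
result = helpfulness.evaluate(
    input="How do I file a claim?",
    output="You can file a claim through the mobile app or our website.",
)
print(result.score, result.reason)  # LLM-as-Judge score plus its reasoning
```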
Traces & observability
Monitor your LLM applications with OpenTelemetry-based tracing:
```python
from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    # Your LLM call here; call_your_llm is a placeholder for your provider call
    response = call_your_llm(prompt)
    return response
```
Track LLM calls, latency, token usage, and link traces to test results for debugging.
Bring your own model
Use any LLM provider for test generation and evaluation:
- Cloud: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI
- Local/Self-hosted: Ollama, vLLM, LiteLLM
See the Model Configuration Docs for setup instructions.
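For instance (hypothetical helper; the real configuration API lives in the Model Configuration Docs), pointing generation and evaluation at a local Ollama model might look like:

```python
# Hypothetical sketch: get_model and its parameters are illustrative;
# see the Model Configuration Docs for the supported configuration API.
from rhesis.sdk.models import get_model  # assumed helper

judge_model = get_model(provider="ollama", model_name="llama3.1")
```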
Why Rhesis?
Platform for teams. SDK for developers.
Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows.
The testing lifecycle