Rhesis: Collaborative Testing for LLM & Agentic Applications
Website · Docs · Discord · Changelog
More than just evals.
Collaborative agent testing for teams.
Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together.
Core features
Test generation
AI-Powered Synthesis - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts.
Knowledge-Aware - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) for better test generation.
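For code-first use, here is a minimal sketch of plain-language test generation; it assumes the SDK exposes a `PromptSynthesizer` interface, so verify the import path and signatures against the current docs:

```python
# Minimal sketch, assuming the SDK's PromptSynthesizer interface;
# verify names and signatures against the current release.
from rhesis.sdk.synthesizers import PromptSynthesizer

# Plain-language requirements in, generated test scenarios out,
# including edge cases and adversarial prompts.
test_set = PromptSynthesizer(
    prompt="Generate tests for an insurance chatbot that answers claim questions."
).generate(num_tests=50)
```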
Single-turn & conversation simulation
Use single-turn tests for quick Q&A validation and conversation simulation for multi-turn dialogue flows.
Penelope Agent simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions.
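Purely as an illustration (the simulator class and arguments below are hypothetical names, not the documented Penelope API), a multi-turn simulation could be driven like this:

```python
# Hypothetical sketch: ConversationSimulator and its arguments are
# illustrative names only, not the documented Penelope API.
from rhesis.sdk.simulation import ConversationSimulator  # assumed module

simulator = ConversationSimulator(
    persona="frustrated customer disputing a denied claim",
    goal="get a clear explanation of the denial decision",
    max_turns=8,
)
# `my_chatbot` stands in for your application's response function.
transcript = simulator.run(target=my_chatbot)
```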
Adversarial testing (red-teaming)
Polyphemus Agent proactively finds vulnerabilities:
- Jailbreak attempts and prompt injection
- PII leakage and data extraction
- Harmful content generation
- Role violation and instruction bypassing
Garak Integration - Built-in support for garak, the LLM vulnerability scanner, for comprehensive security testing.
60+ pre-built metrics
| Framework | Example Metrics |
|---|---|
| RAGAS | Context relevance, faithfulness, answer accuracy |
| DeepEval | Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention |
| Garak | Jailbreak detection, prompt injection, XSS, malware generation, data leakage |
| Custom | NumericJudge, CategoricalJudge for domain-specific evaluation |
All metrics include LLM-as-Judge reasoning explanations.
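As a sketch of a domain-specific judge (the import path and constructor arguments are assumptions; check the metrics docs for the actual API):

```python
# Assumed interface for a custom judge metric; not the verified SDK API.
from rhesis.sdk.metrics import NumericJudge  # assumed import path

helpfulness = NumericJudge(
    name="helpfulness",
    description="Rate how helpful the answer is for the user's question.",
    min_score=1,
    max_score=10,
)
result = helpfulness.evaluate(
    input="How do I file a claim?",
    output="You can file a claim through the mobile app or our website.",
)
print(result.score, result.reason)  # LLM-as-Judge score plus its reasoning
```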
Traces & observability
Monitor your LLM applications with OpenTelemetry-based tracing:
```python
from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    # Your LLM call here; call_your_llm is a placeholder for your provider call
    response = call_your_llm(prompt)
    return response
```
Track LLM calls, latency, token usage, and link traces to test results for debugging.
Bring your own model
Use any LLM provider for test generation and evaluation:
- Cloud: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI
- Local/Self-hosted: Ollama, vLLM, LiteLLM
See the Model Configuration Docs for setup instructions.
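For instance (hypothetical helper; the real configuration API lives in the Model Configuration Docs), pointing generation and evaluation at a local Ollama model might look like:

```python
# Hypothetical sketch: get_model and its parameters are illustrative;
# see the Model Configuration Docs for the supported configuration API.
from rhesis.sdk.models import get_model  # assumed helper

judge_model = get_model(provider="ollama", model_name="llama3.1")
```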
Why Rhesis?
Platform for teams. SDK for developers.
Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows.
The testing lifecycle