By driangle
Create and manage skival eval suites for evaluating and comparing AI agent configurations
A Go CLI for evaluating AI coding skill performance. Measures time to completion, token usage, dollar cost, and correctness across configurable eval suites.
Define a control case and N treatment variations, then compare them head-to-head with statistical rigor.
# Define your eval suite
cat > suite.yaml <<EOF
version: 1
description: "My first eval suite"
evals:
  - id: hello-world
    prompt: "Create a hello world program in Go"
    model: "claude-sonnet-4-6"
    correctness:
      expected_output: ["Hello, world!"]
treatments:
  control:
    name: "baseline"
  variations:
    - name: "with-skill"
      skill: "./skills/my-skill"
EOF
# Run the eval
skival run suite.yaml --samples 3 --results-dir ./results
skival run <suite.yaml>        Run an eval suite
skival validate <suite.yaml>   Validate suite structure without executing
skival report <results-dir>    Generate reports from saved results
| Flag | Description |
|---|---|
| `--samples N` | Number of runs per treatment (default: 1) |
| `--results-dir` | Directory for results output |
| `--treatments` | Filter to specific treatments |
| `--evals` | Filter to specific eval IDs |
| `--format` | Output format: markdown, json (default: markdown) |
| `-v`, `--verbose` | Enable debug-level logging |
See the documentation site for the full configuration schema, verifier reference, and CLI guide.
MIT
npx claudepluginhub driangle/skival --plugin skival