By yzavyas
Write rigorous evals for LLM agents, skills, MCP servers, and prompts. Use when: building test suites, measuring effectiveness, choosing frameworks. Covers: DeepEval, Braintrust, RAGAS, precision/recall, F1.
npx claudepluginhub yzavyas/claude-1337 --plugin eval-1337
A marketplace of cognitive extensions for Claude Code.
📚 Documentation · 🔍 Catalog · 💡 Ethos
/plugin marketplace add yzavyas/claude-1337
/plugin install core-1337@claude-1337
Known issues: #14815, #14061, #15369
Workaround:
~/.claude/plugins/marketplaces/claude-1337/plugins/core-1337/scripts/install-workaround.sh
Development happens on the dev branch. This main branch is for marketplace distribution only.
git checkout dev
See CONTRIBUTING.md or the contributor guide.
License: MIT
Agent and skill evaluation harness with MLflow integration
Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs
Set up evaluation of AI agents with tool call validation, correctness checks, task completion, and tool reliability using Dokimos. Framework-agnostic — works with any agent framework.
Skills for building LLM evaluations: pipeline audit, error analysis, synthetic data generation, LLM-as-Judge design, evaluator validation, RAG evaluation, and annotation interfaces.
Skills for adding DeepEval evaluations, tracing, datasets, Confident AI reports, and iterative improvement loops to AI applications.
Open-source testing and regression detection framework for AI agents. Golden baseline diffing, CI/CD integration, works with LangGraph, CrewAI, OpenAI, Anthropic Claude, HuggingFace, Ollama, and MCP.
AI image and video generation. Use when: Midjourney prompting, choosing image/video models, troubleshooting AI art, reference types, style transfer, text-in-image.
Rust production patterns. Use when: building Rust systems. Covers ownership decisions, async gotchas, crate selection, domain knowledge (networking, embedded, WASM, FFI, proc-macros).
JVM static and runtime analysis. Use when: finding dead code, optimizing Java/Kotlin apps, profiling, debugging memory leaks. Covers SootUp, Scavenger, async-profiler, JFR, ProGuard.
Architectural reasoning with The Guild. 13 specialized agents with orthogonal perspectives for multi-viewpoint architecture review.
Frontend experience engineering. Use when: animations, 3D/WebGL, scrollytelling, data visualization, typography, color systems, design patterns. Covers Motion, GSAP, Three.js, R3F, Threlte, D3, OKLCH, fluid type.