From the omni plugin
Evaluates AI-generated code quality via LLM APIs using ICE Score (functional correctness and usefulness) and Code Judge metrics. Use it to assess code against requirements, compare implementations, score consistency, and list inconsistencies.
npx claudepluginhub zte-aicloud/co-omnispec --plugin omni

This skill uses the workspace's default tool permissions.
Package contents:
- README.md
- evals/evals.json
- prompts/code_judge_no_answer_v1.jinja2
- prompts/code_judge_no_answer_v2.jinja2
- prompts/code_judge_with_answer_v1.jinja2
- prompts/code_judge_with_answer_v2.jinja2
- prompts/ice_score_functional_correctness_no_answer.jinja2
- prompts/ice_score_functional_correctness_with_answer.jinja2
- prompts/ice_score_usefulness_no_answer.jinja2
- prompts/ice_score_usefulness_with_answer.jinja2
- references/config-example.json
- scripts/__pycache__/judge_model_metrics_standalone.cpython-311.pyc
- scripts/evaluate_code.py
- scripts/judge_model_metrics_standalone.py
- scripts/prompts/code_judge_no_answer_v1.jinja2
- scripts/prompts/code_judge_no_answer_v2.jinja2
- scripts/prompts/code_judge_with_answer_v1.jinja2
- scripts/prompts/code_judge_with_answer_v2.jinja2
- scripts/prompts/ice_score_functional_correctness_no_answer.jinja2
- scripts/prompts/ice_score_functional_correctness_with_answer.jinja2
This skill evaluates code generation quality using standardized metrics including ICE Score (Functional Correctness + Usefulness) and Code Judge assessments.
To run an evaluation, provide two inputs:
Requirements/Feature Description - what the code is supposed to do
Generated Code - the code to be evaluated
Create a configuration with your LLM API details:
{
"api": {
"url": "your-api-endpoint",
"key": "Bearer your-api-key",
"model_name": "model-name",
"timeout": 60,
"max_tokens": 16384,
"temperature": 0.1
}
}
Install the Python dependencies:
pip install requests jinja2 loguru
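As a minimal sketch of how the configuration is consumed, the snippet below loads a config in the format above and sends a single chat-style request with requests. The config path and the OpenAI-compatible payload shape are assumptions for illustration; the actual request logic lives in scripts/judge_model_metrics_standalone.py.

```python
import json
import requests

# Load the API configuration created above (path is illustrative).
with open("references/config-example.json", encoding="utf-8") as f:
    api_cfg = json.load(f)["api"]

def call_llm(prompt: str) -> str:
    """Send one prompt to the configured LLM endpoint
    (assumes an OpenAI-compatible chat completions payload)."""
    resp = requests.post(
        api_cfg["url"],
        headers={"Authorization": api_cfg["key"]},  # value already includes "Bearer "
        json={
            "model": api_cfg["model_name"],
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": api_cfg["max_tokens"],
            "temperature": api_cfg["temperature"],
        },
        timeout=api_cfg["timeout"],
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```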
Input: the requirements/feature description plus the generated code (basic mode, no reference answer).
Output:
{
"avg_llm_judge_metric": 0.8333,
"ice_score": {
"functional_correctness": {"score": 0.75, ...},
"usefulness": {"score": 1.0, ...}
},
"code_judge": {
"score": 0.75,
"inconsistencies": [...],
"inconsistencies_count": 1
}
}
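The top-level avg_llm_judge_metric appears to be the plain mean of the three sub-scores: with the example values above, (0.75 + 1.0 + 0.75) / 3 ≈ 0.8333. A one-line sketch, assuming that aggregation:

```python
scores = [0.75, 1.0, 0.75]  # functional_correctness, usefulness, code_judge
avg_llm_judge_metric = round(sum(scores) / len(scores), 4)  # 0.8333
```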
With a reference answer:
Input: the requirements/feature description, the generated code, and a reference implementation.
Benefit: more accurate evaluation through direct comparison against the reference. A sketch of how this might steer template selection is shown below.
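The prompts/ directory ships _with_answer and _no_answer variants of each ICE Score template; the snippet below is an illustrative sketch of choosing between them based on whether a reference answer is supplied (the exact selection logic in scripts/evaluate_code.py may differ).

```python
from typing import Optional

def pick_template(metric: str, reference_answer: Optional[str]) -> str:
    """Choose the ICE Score prompt variant depending on whether a
    reference answer is available (illustrative; names mirror prompts/)."""
    suffix = "with_answer" if reference_answer else "no_answer"
    return f"prompts/ice_score_{metric}_{suffix}.jinja2"

# pick_template("functional_correctness", "def add(a, b): return a + b")
# -> "prompts/ice_score_functional_correctness_with_answer.jinja2"
```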
The skill handles common errors gracefully: all errors are reported in the output with descriptive messages.
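As an illustration of that "report, don't crash" behaviour, here is a hedged sketch of wrapping a judge call so failures surface as a descriptive message in the result (the actual error fields produced by scripts/evaluate_code.py may be named differently).

```python
import requests

def safe_call(call_fn, prompt: str) -> dict:
    """Run one judge call and convert common failures into an error
    entry instead of raising (structure is illustrative)."""
    try:
        return {"raw": call_fn(prompt), "error": None}
    except requests.Timeout:
        return {"raw": None, "error": "LLM API request timed out"}
    except requests.RequestException as exc:
        return {"raw": None, "error": f"LLM API request failed: {exc}"}
    except (KeyError, ValueError) as exc:
        return {"raw": None, "error": f"Unexpected LLM response: {exc}"}
```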
The skill uses optimized prompt templates stored in prompts/, with separate ICE Score and Code Judge variants for the with-answer and no-answer modes.
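As a minimal sketch of how one of these templates would be rendered before being sent to the model (the template variable names "requirements" and "code" are assumptions; see scripts/judge_model_metrics_standalone.py for the actual rendering logic):

```python
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("prompts"))

def render_prompt(template_name: str, requirements: str, code: str) -> str:
    """Fill a judge template with the evaluation inputs
    (variable names are illustrative)."""
    template = env.get_template(template_name)
    return template.render(requirements=requirements, code=code)

prompt = render_prompt(
    "ice_score_functional_correctness_no_answer.jinja2",
    requirements="Implement add(a, b) returning the sum of two integers.",
    code="def add(a, b):\n    return a + b",
)
# The rendered prompt is then sent to the LLM, e.g. with the call_llm sketch above.
```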
The skill presents results to the user in a report of the following form:

Evaluation Results
The code has been evaluated with the xx model; detailed results follow:
📊 Overall Scores
Average LLM judge metric:
Functional correctness:
Usefulness:
Code consistency:
🔍 Detailed Analysis
Functional correctness (0.5/1.0)
Strengths:
xxx
Main issues:
xxx
Usefulness (0.75/1.0)
Strengths:
xxx
Main issues:
xxx
Code consistency (0.0/1.0)
Inconsistencies found:
xxx
💡 Improvement Suggestions
Summary: xxxx