From the omni plugin
Evaluates AI-generated code quality via LLM APIs using ICE Score (functional correctness and usefulness) and Code Judge metrics. Use it to assess code against requirements, compare implementations, score consistency, and list inconsistencies.
npx claudepluginhub zte-aicloud/co-omnispec --plugin omni

This skill uses the workspace's default tool permissions.
Package contents:
- README.md
- evals/evals.json
- prompts/code_judge_no_answer_v1.jinja2
- prompts/code_judge_no_answer_v2.jinja2
- prompts/code_judge_with_answer_v1.jinja2
- prompts/code_judge_with_answer_v2.jinja2
- prompts/ice_score_functional_correctness_no_answer.jinja2
- prompts/ice_score_functional_correctness_with_answer.jinja2
- prompts/ice_score_usefulness_no_answer.jinja2
- prompts/ice_score_usefulness_with_answer.jinja2
- references/config-example.json
- scripts/__pycache__/judge_model_metrics_standalone.cpython-311.pyc
- scripts/evaluate_code.py
- scripts/judge_model_metrics_standalone.py
- scripts/prompts/code_judge_no_answer_v1.jinja2
- scripts/prompts/code_judge_no_answer_v2.jinja2
- scripts/prompts/code_judge_with_answer_v1.jinja2
- scripts/prompts/code_judge_with_answer_v2.jinja2
- scripts/prompts/ice_score_functional_correctness_no_answer.jinja2
- scripts/prompts/ice_score_functional_correctness_with_answer.jinja2
This skill evaluates code generation quality using standardized metrics including ICE Score (Functional Correctness + Usefulness) and Code Judge assessments.
To run an evaluation, provide two inputs:
Requirements/Feature Description - what the code is supposed to do
Generated Code - the code to be evaluated
Create a configuration with your LLM API details:
{
"api": {
"url": "your-api-endpoint",
"key": "Bearer your-api-key",
"model_name": "model-name",
"timeout": 60,
"max_tokens": 16384,
"temperature": 0.1
}
}
Install the Python dependencies:
pip install requests jinja2 loguru
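As a minimal sketch of how the configuration is consumed, the snippet below loads a config in the format above and sends a single chat-style request with requests. The config path and the OpenAI-compatible payload shape are assumptions for illustration; the actual request logic lives in scripts/judge_model_metrics_standalone.py.

```python
import json
import requests

# Load the API configuration created above (path is illustrative).
with open("references/config-example.json", encoding="utf-8") as f:
    api_cfg = json.load(f)["api"]

def call_llm(prompt: str) -> str:
    """Send one prompt to the configured LLM endpoint
    (assumes an OpenAI-compatible chat completions payload)."""
    resp = requests.post(
        api_cfg["url"],
        headers={"Authorization": api_cfg["key"]},  # value already includes "Bearer "
        json={
            "model": api_cfg["model_name"],
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": api_cfg["max_tokens"],
            "temperature": api_cfg["temperature"],
        },
        timeout=api_cfg["timeout"],
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```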
Input: the requirements/feature description plus the generated code (basic mode, no reference answer).
Output:
{
"avg_llm_judge_metric": 0.8333,
"ice_score": {
"functional_correctness": {"score": 0.75, ...},
"usefulness": {"score": 1.0, ...}
},
"code_judge": {
"score": 0.75,
"inconsistencies": [...],
"inconsistencies_count": 1
}
}
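The top-level avg_llm_judge_metric appears to be the plain mean of the three sub-scores: with the example values above, (0.75 + 1.0 + 0.75) / 3 ≈ 0.8333. A one-line sketch, assuming that aggregation:

```python
scores = [0.75, 1.0, 0.75]  # functional_correctness, usefulness, code_judge
avg_llm_judge_metric = round(sum(scores) / len(scores), 4)  # 0.8333
```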
With a reference answer:
Input: the requirements/feature description, the generated code, and a reference implementation.
Benefit: more accurate evaluation through direct comparison against the reference. A sketch of how this might steer template selection is shown below.
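The prompts/ directory ships _with_answer and _no_answer variants of each ICE Score template; the snippet below is an illustrative sketch of choosing between them based on whether a reference answer is supplied (the exact selection logic in scripts/evaluate_code.py may differ).

```python
from typing import Optional

def pick_template(metric: str, reference_answer: Optional[str]) -> str:
    """Choose the ICE Score prompt variant depending on whether a
    reference answer is available (illustrative; names mirror prompts/)."""
    suffix = "with_answer" if reference_answer else "no_answer"
    return f"prompts/ice_score_{metric}_{suffix}.jinja2"

# pick_template("functional_correctness", "def add(a, b): return a + b")
# -> "prompts/ice_score_functional_correctness_with_answer.jinja2"
```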
The skill handles common errors gracefully: all errors are reported in the output with descriptive messages.
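As an illustration of that "report, don't crash" behaviour, here is a hedged sketch of wrapping a judge call so failures surface as a descriptive message in the result (the actual error fields produced by scripts/evaluate_code.py may be named differently).

```python
import requests

def safe_call(call_fn, prompt: str) -> dict:
    """Run one judge call and convert common failures into an error
    entry instead of raising (structure is illustrative)."""
    try:
        return {"raw": call_fn(prompt), "error": None}
    except requests.Timeout:
        return {"raw": None, "error": "LLM API request timed out"}
    except requests.RequestException as exc:
        return {"raw": None, "error": f"LLM API request failed: {exc}"}
    except (KeyError, ValueError) as exc:
        return {"raw": None, "error": f"Unexpected LLM response: {exc}"}
```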
The skill uses optimized prompt templates stored in prompts/, with separate ICE Score and Code Judge variants for the with-answer and no-answer modes.
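As a minimal sketch of how one of these templates would be rendered before being sent to the model (the template variable names "requirements" and "code" are assumptions; see scripts/judge_model_metrics_standalone.py for the actual rendering logic):

```python
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("prompts"))

def render_prompt(template_name: str, requirements: str, code: str) -> str:
    """Fill a judge template with the evaluation inputs
    (variable names are illustrative)."""
    template = env.get_template(template_name)
    return template.render(requirements=requirements, code=code)

prompt = render_prompt(
    "ice_score_functional_correctness_no_answer.jinja2",
    requirements="Implement add(a, b) returning the sum of two integers.",
    code="def add(a, b):\n    return a + b",
)
# The rendered prompt is then sent to the LLM, e.g. with the call_llm sketch above.
```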
The skill presents results to the user in a report of the following form:

Evaluation Results
The code has been evaluated with the xx model; detailed results follow:
📊 Overall Scores
Average LLM judge metric:
Functional correctness:
Usefulness:
Code consistency:
🔍 Detailed Analysis
Functional correctness (0.5/1.0)
Strengths:
xxx
Main issues:
xxx
Usefulness (0.75/1.0)
Strengths:
xxx
Main issues:
xxx
Code consistency (0.0/1.0)
Inconsistencies found:
xxx
💡 Improvement Suggestions
Summary: xxxx