Evaluates AI-generated code quality using ICE Score (functional correctness + usefulness) and Code Judge metrics via LLM APIs. Use to assess against requirements, compare implementations, or score consistency.
npx claudepluginhub zte-aicloud/co-omnispec --plugin omni

This skill uses the workspace's default tool permissions.
This skill evaluates code generation quality using standardized metrics including ICE Score (Functional Correctness + Usefulness) and Code Judge assessments.
Files:
README.md
evals/evals.json
prompts/code_judge_no_answer_v1.jinja2
prompts/code_judge_no_answer_v2.jinja2
prompts/code_judge_with_answer_v1.jinja2
prompts/code_judge_with_answer_v2.jinja2
prompts/ice_score_functional_correctness_no_answer.jinja2
prompts/ice_score_functional_correctness_with_answer.jinja2
prompts/ice_score_usefulness_no_answer.jinja2
prompts/ice_score_usefulness_with_answer.jinja2
references/config-example.json
scripts/evaluate_code.py
scripts/judge_model_metrics_standalone.py
scripts/prompts/code_judge_no_answer_v1.jinja2
scripts/prompts/code_judge_no_answer_v2.jinja2
scripts/prompts/code_judge_with_answer_v1.jinja2
scripts/prompts/code_judge_with_answer_v2.jinja2
scripts/prompts/ice_score_functional_correctness_no_answer.jinja2
scripts/prompts/ice_score_functional_correctness_with_answer.jinja2
scripts/prompts/ice_score_usefulness_no_answer.jinja2
- Requirements/Feature Description - what the code is supposed to do
- Generated Code - the code to be evaluated
Create a configuration file with your LLM API details:

{
  "api": {
    "url": "your-api-endpoint",
    "key": "Bearer your-api-key",
    "model_name": "model-name",
    "timeout": 60,
    "max_tokens": 16384,
    "temperature": 0.1
  }
}
Install the required dependencies:

pip install requests jinja2 loguru
Input:
Output:

{
  "avg_llm_judge_metric": 0.8333,
  "ice_score": {
    "functional_correctness": {"score": 0.75, ...},
    "usefulness": {"score": 1.0, ...}
  },
  "code_judge": {
    "score": 0.75,
    "inconsistencies": [...],
    "inconsistencies_count": 1
  }
}
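Reading the example above, `avg_llm_judge_metric` appears to be the plain mean of the three component scores: (0.75 + 1.0 + 0.75) / 3 = 0.8333. A minimal sketch of that aggregation (an inference from the example output, not confirmed by the skill's source):

```python
def avg_llm_judge_metric(result):
    """Average the three LLM-judge component scores.

    Assumes the top-level metric is the plain mean of functional
    correctness, usefulness, and code-judge scores, rounded to 4 decimals,
    which matches the example output.
    """
    scores = [
        result["ice_score"]["functional_correctness"]["score"],
        result["ice_score"]["usefulness"]["score"],
        result["code_judge"]["score"],
    ]
    return round(sum(scores) / len(scores), 4)
```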
Input: the same requirements and generated code, plus a reference answer (the with_answer template variants).
Benefit: more accurate evaluation through direct comparison with the reference.
The skill handles common errors gracefully:
All errors are reported in the output with descriptive messages.
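A minimal sketch of that pattern, wrapping a judge call so failures become a descriptive message in the output rather than an unhandled exception. The `{"score": None, "error": ...}` shape is an illustrative convention, not the skill's documented format:

```python
def safe_evaluate(judge_fn, *args, **kwargs):
    """Run a judge function, converting any exception into an error record."""
    try:
        return judge_fn(*args, **kwargs)
    except Exception as exc:
        # Report the failure in the result instead of crashing the evaluation.
        return {"score": None, "error": f"{type(exc).__name__}: {exc}"}
```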
The skill uses optimized prompt templates stored in prompts/ (see the file list above).
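As a sketch of how such a Jinja2 template might be rendered, using an inline stand-in for a file like prompts/ice_score_functional_correctness_with_answer.jinja2. The variable names (`requirement`, `code`, `reference_answer`) are guesses for illustration, not taken from the actual templates:

```python
from jinja2 import Template

# Inline stand-in for one of the shipped .jinja2 files; the real templates
# use their own wording and variable names.
TEMPLATE = Template(
    "Requirement:\n{{ requirement }}\n\n"
    "Generated code:\n{{ code }}\n"
    "{% if reference_answer %}\nReference answer:\n{{ reference_answer }}\n{% endif %}"
)

prompt = TEMPLATE.render(
    requirement="Return the sum of a list of integers.",
    code="def total(xs):\n    return sum(xs)",
    reference_answer=None,  # None mimics the no_answer variant
)
```

The `{% if %}` guard illustrates why the skill ships paired no_answer/with_answer templates: the reference block is only emitted when an answer is supplied.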
The skill presents results to the user using the following report template:

Evaluation Results
The code has been evaluated with the xx model; detailed results are below:
📊 Overall Scores
Average LLM judge metric:
Functional correctness:
Usefulness:
Code consistency:
🔍 Detailed Analysis
Functional Correctness (0.5/1.0)
Strengths:
xxx
Main issues:
xxx
Usefulness (0.75/1.0)
Strengths:
xxx
Main issues:
xxx
Code Consistency (0.0/1.0)
Inconsistencies found:
xxx
💡 Improvement Suggestions
Summary: xxxx
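The score header of a report like the one above could be generated from the evaluation JSON with a small formatter. This is an illustrative sketch, not the skill's actual rendering code:

```python
def format_report(result, model_name):
    """Render the score summary of the evaluation report from the result JSON."""
    fc = result["ice_score"]["functional_correctness"]["score"]
    us = result["ice_score"]["usefulness"]["score"]
    cj = result["code_judge"]["score"]
    lines = [
        "Evaluation Results",
        f"The code has been evaluated with the {model_name} model; detailed results are below:",
        "📊 Overall Scores",
        f"Average LLM judge metric: {result['avg_llm_judge_metric']}",
        f"Functional correctness: {fc}",
        f"Usefulness: {us}",
        f"Code consistency: {cj}",
    ]
    return "\n".join(lines)
```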