improve: enhance model-evaluator #566
Merged
Conversation
…nt tooling

- Rewrote description with three `<example>` blocks covering task-specific model selection, benchmark design, and post-deployment regression testing
- Added Standard Frameworks & Tools section with HELM, lm-evaluation-harness, DeepEval, RAGAS, Promptfoo, and Chatbot Arena
- Updated Model Categories to tier-based language (Haiku/Sonnet/Opus) and current model names (GPT-4o, Gemini 1.5/2.0), removed deprecated version numbers
- Added `model: sonnet` frontmatter and expanded tools list (added Edit, Glob, Grep)
- Replaced skeletal Python stub with Statistical Requirements section (sample sizes, CI, effect size, Cohen's kappa, Bonferroni correction)
- Added Integration with Other Agents section mapping handoffs to llm-architect, prompt-engineer, and ai-ethics-advisor
- Added Step 6 Post-Deployment Monitoring (drift detection, re-evaluation triggers, Arize Phoenix, LangSmith, Promptfoo CI)
- Removed emoji from output template header

Automated review cycle | Co-Authored-By: Claude Code <noreply@anthropic.com>
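The commit above mentions paired test guidance and Bonferroni correction for multiple comparisons. As an illustrative sketch of what those requirements mean in practice (pure Python; the function names are hypothetical and not taken from the component):

```python
import math

def paired_sign_test(model_a_pass, model_b_pass):
    """Two-sided exact sign test (McNemar-style) on paired pass/fail results.

    Both lists score the same prompts, so only discordant pairs
    (one model passes where the other fails) carry evidence.
    """
    b = sum(1 for x, y in zip(model_a_pass, model_b_pass) if x and not y)
    c = sum(1 for x, y in zip(model_a_pass, model_b_pass) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bonferroni_alpha(alpha, num_comparisons):
    """Per-comparison significance level under Bonferroni correction."""
    return alpha / num_comparisons
```

For example, comparing two models on 20 shared prompts where one model wins all 10 discordant pairs yields p ≈ 0.002, which still clears a Bonferroni-corrected threshold of 0.05 / 5 = 0.01 when five model pairs are being compared.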
Contributor
👋 Thanks for contributing, @davila7! This is an automated message; no action is required from you right now. A maintainer will review soon.
Contributor
| Metric | Count |
|---|---|
| Total Components | 763 |
| ✅ Passed | 359 |
| ❌ Failed | 404 |
|  | 1005 |
❌ Failed Components (Top 5)
| Component | Errors | Warnings | Score |
|---|---|---|---|
| vercel-edge-function | 3 | 4 | 81/100 |
| prompt-engineer | 2 | 0 | 90/100 |
| neon-expert | 2 | 2 | 88/100 |
| agent-overview | 2 | 1 | 89/100 |
| unused-code-cleaner | 2 | 1 | 89/100 |
...and 399 more failed component(s)
📊 View Full Report for detailed error messages and all components
davila7 added a commit that referenced this pull request on May 5, 2026
Reflects merged improvements to cli-tool/components/agents/ai-specialists/model-evaluator.md. Automated by pr-verification cycle | Co-Authored-By: Claude Code <noreply@anthropic.com>
Automated Component Improvement
Changes
- Rewrote `description` with three `<example>` blocks following `llm-architect` format, covering task-specific model selection, benchmark design from scratch, and post-deployment regression testing, each with `Context:`, `user:`, `assistant:`, and `<commentary>` explaining delegation boundaries vs `llm-architect` and `prompt-engineer`
- Added `model: sonnet` to YAML frontmatter; expanded `tools` to `Read, Write, Edit, Bash, Glob, Grep, WebSearch`
- Replaced the skeletal `evaluate_code_model` Python stub with a Statistical Requirements section: minimum sample sizes, 95% CI reporting, effect size (Cohen's d/kappa), inter-rater reliability threshold (kappa > 0.8), Bonferroni correction for multiple comparisons, paired test guidance
- Added an Integration with Other Agents section mapping handoffs to `llm-architect`, `prompt-engineer`, and `ai-ethics-advisor`
- Removed the 🎯 emoji from the output template header

Research Summary
The component had solid coverage of evaluation dimensions and cost analysis but lacked the `<example>`-block description format used by peer agents, referenced outdated model names, contained a skeletal Python stub with no implementation value, and had no statistical rigor guidance, no framework recommendations, and no post-deployment monitoring step. All seven prioritized improvements from the research report have been applied.

Validation
Automated review cycle by Component Improvement Loop
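For reference, the statistical requirements this PR adds (95% CI reporting and an inter-rater reliability threshold of kappa > 0.8) can be sketched in pure Python. These helpers are illustrative only and are not part of the component:

```python
import math

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n  # observed agreement
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def proportion_ci(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a pass rate (z=1.96 for 95%)."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)
```

A model passing 50 of 100 tasks has a 95% CI of roughly (0.40, 0.60), which is why the minimum-sample-size requirement matters: at n = 100, a ten-point gap between two models is barely outside the noise.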
Summary by cubic

Modernizes the model-evaluator with current model taxonomy, concrete examples, statistical standards, and monitoring to make evaluations reliable end to end. Affects components (`cli-tool/components/`).

- Integration handoffs to `llm-architect`, `prompt-engineer`, and `ai-ethics-advisor`
- Standard frameworks (`HELM`, `lm-evaluation-harness`, `DeepEval`, `RAGAS`, `Promptfoo`, `Chatbot Arena`) and post-deploy monitoring (drift alerts via `Arize Phoenix`, `LangSmith`, `Promptfoo` CI)
- Registry (`docs/components.json`) unchanged; no new env vars or secrets

Written for commit 6bea3e8. Summary will update on new commits.