improve: enhance model-evaluator #566
Merged
Conversation
…nt tooling

- Rewrote description with three `<example>` blocks covering task-specific model selection, benchmark design, and post-deployment regression testing
- Added Standard Frameworks & Tools section with HELM, lm-evaluation-harness, DeepEval, RAGAS, Promptfoo, and Chatbot Arena
- Updated Model Categories to tier-based language (Haiku/Sonnet/Opus) and current model names (GPT-4o, Gemini 1.5/2.0), removed deprecated version numbers
- Added `model: sonnet` frontmatter and expanded tools list (added Edit, Glob, Grep)
- Replaced skeletal Python stub with Statistical Requirements section (sample sizes, CI, effect size, Cohen's kappa, Bonferroni correction)
- Added Integration with Other Agents section mapping handoffs to llm-architect, prompt-engineer, and ai-ethics-advisor
- Added Step 6 Post-Deployment Monitoring (drift detection, re-evaluation triggers, Arize Phoenix, LangSmith, Promptfoo CI)
- Removed emoji from output template header

Automated review cycle | Co-Authored-By: Claude Code <noreply@anthropic.com>
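The commit above mentions paired test guidance and Bonferroni correction for multiple comparisons. As an illustrative sketch of what those requirements mean in practice (pure Python; the function names are hypothetical and not taken from the component):

```python
import math

def paired_sign_test(model_a_pass, model_b_pass):
    """Two-sided exact sign test (McNemar-style) on paired pass/fail results.

    Both lists score the same prompts, so only discordant pairs
    (one model passes where the other fails) carry evidence.
    """
    b = sum(1 for x, y in zip(model_a_pass, model_b_pass) if x and not y)
    c = sum(1 for x, y in zip(model_a_pass, model_b_pass) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bonferroni_alpha(alpha, num_comparisons):
    """Per-comparison significance level under Bonferroni correction."""
    return alpha / num_comparisons
```

For example, comparing two models on 20 shared prompts where one model wins all 10 discordant pairs yields p ≈ 0.002, which still clears a Bonferroni-corrected threshold of 0.05 / 5 = 0.01 when five model pairs are being compared.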
Contributor
👋 Thanks for contributing, @davila7! This is an automated message; no action is required from you right now. A maintainer will review soon.
Contributor
| Metric | Count |
|---|---|
| Total Components | 763 |
| ✅ Passed | 359 |
| ❌ Failed | 404 |
|  | 1005 |
❌ Failed Components (Top 5)
| Component | Errors | Warnings | Score |
|---|---|---|---|
| vercel-edge-function | 3 | 4 | 81/100 |
| prompt-engineer | 2 | 0 | 90/100 |
| neon-expert | 2 | 2 | 88/100 |
| agent-overview | 2 | 1 | 89/100 |
| unused-code-cleaner | 2 | 1 | 89/100 |
...and 399 more failed component(s)
📊 View Full Report for detailed error messages and all components
davila7 added a commit that referenced this pull request on May 5, 2026
Reflects merged improvements to cli-tool/components/agents/ai-specialists/model-evaluator.md. Automated by pr-verification cycle | Co-Authored-By: Claude Code <noreply@anthropic.com>
Automated Component Improvement
Changes
- Rewrote `description` with three `<example>` blocks following `llm-architect` format, covering task-specific model selection, benchmark design from scratch, and post-deployment regression testing, each with `Context:`, `user:`, `assistant:`, and `<commentary>` explaining delegation boundaries vs `llm-architect` and `prompt-engineer`
- Added `model: sonnet` to YAML frontmatter; expanded `tools` to `Read, Write, Edit, Bash, Glob, Grep, WebSearch`
- Replaced the skeletal `evaluate_code_model` Python stub with a Statistical Requirements section: minimum sample sizes, 95% CI reporting, effect size (Cohen's d/kappa), inter-rater reliability threshold (kappa > 0.8), Bonferroni correction for multiple comparisons, paired test guidance
- Added an Integration with Other Agents section mapping handoffs to `llm-architect`, `prompt-engineer`, and `ai-ethics-advisor`
- Removed the 🎯 emoji from the output template header

Research Summary
The component had solid coverage of evaluation dimensions and cost analysis but lacked the `<example>`-block description format used by peer agents, referenced outdated model names, contained a skeletal Python stub with no implementation value, and had no statistical rigor guidance, no framework recommendations, and no post-deployment monitoring step. All seven prioritized improvements from the research report have been applied.

Validation
Automated review cycle by Component Improvement Loop
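For reference, the statistical requirements this PR adds (95% CI reporting and an inter-rater reliability threshold of kappa > 0.8) can be sketched in pure Python. These helpers are illustrative only and are not part of the component:

```python
import math

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n  # observed agreement
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def proportion_ci(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a pass rate (z=1.96 for 95%)."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)
```

A model passing 50 of 100 tasks has a 95% CI of roughly (0.40, 0.60), which is why the minimum-sample-size requirement matters: at n = 100, a ten-point gap between two models is barely outside the noise.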
Summary by cubic

Modernizes the model-evaluator with current model taxonomy, concrete examples, statistical standards, and monitoring to make evaluations reliable end to end. Affects components (`cli-tool/components/`).

- Integration handoffs to `llm-architect`, `prompt-engineer`, and `ai-ethics-advisor`
- Standard frameworks (`HELM`, `lm-evaluation-harness`, `DeepEval`, `RAGAS`, `Promptfoo`, `Chatbot Arena`) and post-deploy monitoring (drift alerts via `Arize Phoenix`, `LangSmith`, `Promptfoo` CI)
- Registry (`docs/components.json`) unchanged; no new env vars or secrets

Written for commit 6bea3e8. Summary will update on new commits.