feat(eval): add CI workflow for A2UI evaluations#1350

Closed
gspencergoog wants to merge 0 commits into google:main from gspencergoog:evals_ci

@gspencergoog
Collaborator

  • Create run_ci_evals.py to run evals with daily shuffle and limit.
  • Incorporate pass_percentage logic into run_ci_evals.py to remove jq dependency.
  • Delete pass_percentage script.
  • Add .github/workflows/run_evals.yml to run evals on PR and push to main with an 85% threshold.
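The sampling and threshold logic described in the bullets above could look roughly like the following. This is a minimal sketch, not the contents of run_ci_evals.py: the function names (`daily_sample`, `pass_percentage`, `meets_threshold`) are hypothetical, and deriving the shuffle seed from the current UTC date is an assumption about what "daily shuffle" means. Only the 85% threshold comes from the PR description.

```python
import datetime
import random


def daily_sample(items, limit, today=None):
    """Return up to `limit` items, shuffled in an order that is stable for one day.

    Assumption: the seed is derived from the UTC calendar date, so every run
    on the same day picks the same subset.
    """
    if today is None:
        today = datetime.datetime.now(datetime.timezone.utc).date()
    seed = int(today.strftime("%Y%m%d"))
    rng = random.Random(seed)  # local RNG, does not touch global random state
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:limit]


def pass_percentage(passed, total):
    """Percentage of passing samples; 0.0 when nothing was run."""
    if total == 0:
        return 0.0
    return 100.0 * passed / total


def meets_threshold(passed, total, threshold=85.0):
    """True when the pass percentage reaches the threshold (85% in this PR)."""
    return pass_percentage(passed, total) >= threshold
```

Computing the percentage in Python rather than in a shell pipeline is what allows the jq dependency to be dropped; a CI wrapper would then exit non-zero when `meets_threshold` returns False.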

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request replaces a bash script with a Python-based CI evaluation script, run_ci_evals.py. The new script automates running evaluations via inspect, calculates accuracy from the resulting logs, and enforces a pass threshold. Feedback from the review includes fixing a potential NoneType error in argument parsing, ensuring consistent seed generation by using UTC time, allowing the model name to be configured via environment variables, and using modification time for more reliable log file discovery.
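Three of the review suggestions summarized above (a UTC-based seed, a model name read from the environment, and log discovery by modification time) can be illustrated as follows. This is a sketch under stated assumptions rather than the code under review: the environment variable `EVAL_MODEL`, the placeholder default model string, the `*.json` log pattern, and all function names are hypothetical.

```python
import datetime
import os
from pathlib import Path

# Hypothetical placeholder; the real default model is not stated in the review.
DEFAULT_MODEL = "example/default-model"


def utc_daily_seed(now=None):
    """Seed derived from the UTC date, so runners in any time zone agree."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return int(now.astimezone(datetime.timezone.utc).strftime("%Y%m%d"))


def resolve_model(env=None):
    """Read the model name from the environment, falling back to a default.

    EVAL_MODEL is an assumed variable name used only for illustration.
    """
    env = os.environ if env is None else env
    return env.get("EVAL_MODEL") or DEFAULT_MODEL


def newest_log(log_dir, pattern="*.json"):
    """Return the most recently modified log file, or None if there is none.

    Sorting by st_mtime avoids relying on file names sorting chronologically.
    """
    candidates = [p for p in Path(log_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Returning None from `newest_log` instead of raising lets the caller decide how to report a run that produced no log.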

4 comment threads on eval/bin/run_ci_evals.py (all outdated)
@gspencergoog gspencergoog force-pushed the evals_ci branch 3 times, most recently from bc75b24 to c04e552 Compare May 7, 2026 16:22
@github-project-automation github-project-automation Bot moved this from Todo to Done in A2UI May 7, 2026