Skip to content

Commit 52c365d

Browse files
authored
chore: parallelize nightly evaluations and fix suite timeouts (#472)
This PR overhauls the nightly evaluations suite to run significantly faster and eliminates several sources of non-deterministic timeouts and endless loops. ### 🚀 Enhancements * **Parallel Execution**: Refactored `evals-nightly.yml` to dynamically generate a matrix of all `.eval.ts` files and run them concurrently, drastically reducing the total suite execution time. * **Aggregated Reporting**: Enhanced `scripts/aggregate_evals.ts` to collect artifacts from parallel jobs and output a unified, structured Markdown table with pass rates, latencies, and detailed collapsible failure logs. ### 🛠️ Stability & Bug Fixes * **Issue Fixer Search Loops**: Scaffolded missing mock source files (e.g., `src/index.js`), a `package.json` with a valid `test` script, and a mock `gh` CLI executable in the `TestRig`. This prevents the agent from infinitely searching for non-existent files or trying to repair a broken mock testing environment. * **Assistant Polling Loops**: Injected the mock MCP server into `gemini-assistant.eval.ts` so it can successfully execute `add_issue_comment`. Updated the system prompt to explicitly instruct the agent to exit immediately after posting its plan, rather than infinitely polling the issue for an `@gemini-cli /approve` comment. Also explicitly defined the typo in `fix-typo` to stop the agent from hopelessly guessing. * **JSON Quoting Flakes**: Updated `gemini-scheduled-triage.toml` to output its JSON array to `$GITHUB_ENV` using a heredoc (`cat << 'EOF'`) instead of `echo '...'`. This prevents bash syntax errors when the model's generated text naturally contains single quotes. * **Global Timeout Boundaries**: Increased the `TestRig` hard-kill timeout from 3 to 10 minutes to safely accommodate complex, high-turn fixes (e.g., `fix-flaky-test`). * **Security Warnings**: Added a top-level `permissions: contents: read` block to the nightly workflow to resolve CodeQL linting warnings. Successful run: https://github.com/google-github-actions/run-gemini-cli/actions/runs/22689405186
1 parent f4d3932 commit 52c365d

13 files changed

+331
-122
lines changed

.github/commands/gemini-invoke.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -82,7 +82,7 @@ Begin every task by building a complete picture of the situation.
8282
Please review this plan. To approve, comment `@gemini-cli /approve` on this issue. To make changes, comment changes needed.
8383
```
8484
85-
3. **Post the Plan**: You MUST use `add_issue_comment` to post your plan. The workflow should end only after this tool call has been successfully formulated.
85+
3. **Post the Plan**: You MUST use `add_issue_comment` to post your plan. The workflow should end only after this tool call has been successfully formulated. Do not wait for human approval or check for comments; exit immediately after posting.
8686
8787
-----
8888

.github/commands/gemini-scheduled-triage.toml

Lines changed: 9 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -85,9 +85,15 @@ Iterate through each issue object. For each issue:
8585
8686
### Step 5: Construct and Write Output
8787
88-
Assemble the results into a single JSON array, formatted as a string, according to the **Output Specification** below. Finally, execute the command to write this string to the output file, ensuring the JSON is enclosed in single quotes to prevent shell interpretation.
89-
90-
- Use the shell command to write: `echo 'TRIAGED_ISSUES=...' > "$GITHUB_ENV"` (Replace `...` with the final, minified JSON array string).
88+
Assemble the results into a single JSON array, formatted as a string, according to the **Output Specification** below. Finally, execute the command to write this string to the output file.
89+
90+
- Use the shell command to write using a heredoc to prevent quote escaping issues:
91+
```bash
92+
cat << 'EOF' >> "$GITHUB_ENV"
93+
TRIAGED_ISSUES=...
94+
EOF
95+
```
96+
(Replace `...` with the final, minified JSON array string).
9197
9298
## Output Specification
9399
Lines changed: 54 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,8 @@
11
name: 'Nightly Evaluations'
22

3+
permissions:
4+
contents: 'read'
5+
36
on:
47
schedule:
58
- cron: '0 1 * * *' # 1 AM UTC
@@ -11,15 +14,27 @@ on:
1114
default: '1'
1215

1316
jobs:
17+
list-evals:
18+
runs-on: 'ubuntu-22.04'
19+
outputs:
20+
matrix: '${{ steps.set-matrix.outputs.matrix }}'
21+
steps:
22+
- name: 'Checkout code'
23+
uses: 'actions/checkout@v4' # ratchet:exclude
24+
- id: 'set-matrix'
25+
run: |
26+
FILES=$(find evals -maxdepth 1 -name "*.eval.ts" | sort | jq -R -s -c 'split("\n")[:-1]')
27+
echo "matrix=${FILES}" >> "$GITHUB_OUTPUT"
28+
1429
evaluate:
30+
needs: 'list-evals'
1531
runs-on: 'ubuntu-22.04'
16-
permissions:
17-
contents: 'read'
1832
strategy:
1933
fail-fast: false
2034
matrix:
21-
model: ['gemini-3-pro-preview', 'gemini-3-flash-preview']
22-
name: 'Evaluate ${{ matrix.model }}'
35+
model: ['gemini-3-flash-preview']
36+
eval-file: '${{ fromJson(needs.list-evals.outputs.matrix) }}'
37+
name: 'Evaluate ${{ matrix.eval-file }} (${{ matrix.model }})'
2338

2439
steps:
2540
- name: 'Checkout code'
@@ -32,12 +47,14 @@ jobs:
3247
cache: 'npm'
3348

3449
- name: 'Install dependencies'
50+
# Retry logic for transient network or package retrieval failures
3551
run: |
3652
npm ci || (sleep 10 && npm ci) || (sleep 30 && npm ci)
3753
3854
- name: 'Install Gemini CLI'
55+
# Retry logic for transient network or package retrieval failures
3956
run: |
40-
npm install -g @google/gemini-cli@0.29.7 || (sleep 10 && npm install -g @google/gemini-cli@0.29.7) || (sleep 30 && npm install -g @google/gemini-cli@0.29.7)
57+
npm install -g @google/gemini-cli@latest || (sleep 10 && npm install -g @google/gemini-cli@latest) || (sleep 30 && npm install -g @google/gemini-cli@latest)
4158
4259
- name: 'Run Evaluations'
4360
id: 'run_evals'
@@ -46,16 +63,42 @@ jobs:
4663
GOOGLE_API_KEY: '${{ secrets.GOOGLE_API_KEY }}'
4764
GEMINI_MODEL: '${{ matrix.model }}'
4865
run: |
49-
npm run test:evals -- --reporter=json --outputFile=eval-results-${{ matrix.model }}.json || true
66+
BASE_NAME=$(basename "${{ matrix.eval-file }}" .eval.ts)
67+
npm run test:evals -- "${{ matrix.eval-file }}" --reporter=json --outputFile="eval-results-${{ matrix.model }}-${BASE_NAME}.json"
5068
5169
- name: 'Upload Results'
5270
if: 'always()'
5371
uses: 'actions/upload-artifact@v4' # ratchet:exclude
5472
with:
55-
name: 'eval-results-${{ matrix.model }}'
56-
path: 'eval-results-${{ matrix.model }}.json'
73+
name: 'eval-results-${{ matrix.model }}-${{ strategy.job-index }}'
74+
path: 'eval-results-${{ matrix.model }}-*.json'
5775

58-
- name: 'Job Summary'
59-
if: 'always()'
76+
report:
77+
needs: 'evaluate'
78+
if: 'always()'
79+
runs-on: 'ubuntu-22.04'
80+
steps:
81+
- name: 'Checkout code'
82+
uses: 'actions/checkout@v4' # ratchet:exclude
83+
84+
- name: 'Set up Node.js'
85+
uses: 'actions/setup-node@v4' # ratchet:exclude
86+
with:
87+
node-version: '20'
88+
cache: 'npm'
89+
90+
- name: 'Install dependencies'
91+
# Retry logic for transient network or package retrieval failures
92+
run: |
93+
npm ci || (sleep 10 && npm ci) || (sleep 30 && npm ci)
94+
95+
- name: 'Download Results'
96+
uses: 'actions/download-artifact@v4' # ratchet:exclude
97+
with:
98+
path: 'eval-results'
99+
pattern: 'eval-results-*'
100+
merge-multiple: true
101+
102+
- name: 'Aggregate All Results'
60103
run: |
61-
npx tsx scripts/aggregate_evals.ts "eval-results-${{ matrix.model }}.json" >> "$GITHUB_STEP_SUMMARY"
104+
npx tsx scripts/aggregate_evals.ts eval-results/*.json >> "$GITHUB_STEP_SUMMARY"

evals/data/gemini-assistant.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,7 @@
33
"id": "fix-typo",
44
"inputs": {
55
"TITLE": "Fix typo in utils.js",
6-
"DESCRIPTION": "There is a typo in the helper function name.",
6+
"DESCRIPTION": "There is a typo in the helper function name. It should be 'newName' instead of 'oldName'.",
77
"EVENT_NAME": "issues",
88
"IS_PULL_REQUEST": "false",
99
"ISSUE_NUMBER": "10",

evals/gemini-assistant.eval.ts

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@ describe('Gemini Assistant Workflow', () => {
1818
it.concurrent(`should propose a relevant plan: ${item.id}`, async () => {
1919
const rig = new TestRig(`assistant-${item.id}`);
2020
try {
21+
rig.setupMockMcp();
2122
rig.initGit();
2223
rig.createFile(
2324
'utils.js',

evals/gemini-scheduled-triage.eval.ts

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ const dataset: ScheduledTriageCase[] = JSON.parse(
1616

1717
describe('Scheduled Triage Workflow', () => {
1818
for (const item of dataset) {
19-
it.concurrent(`should batch triage issues: ${item.id}`, async () => {
19+
it(`should batch triage issues: ${item.id}`, async () => {
2020
const rig = new TestRig(`scheduled-triage-${item.id}`);
2121
try {
2222
mkdirSync(join(rig.testDir, '.gemini/commands'), { recursive: true });
@@ -37,6 +37,12 @@ describe('Scheduled Triage Workflow', () => {
3737
const triagedLine = content
3838
.split('\n')
3939
.find((l) => l.startsWith('TRIAGED_ISSUES='));
40+
41+
if (!triagedLine) {
42+
console.error(
43+
`Failed to find TRIAGED_ISSUES in env file. stdout: ${stdout}`,
44+
);
45+
}
4046
expect(triagedLine).toBeDefined();
4147

4248
const jsonStr = triagedLine!.split('=', 2)[1];

evals/issue-fixer.eval.ts

Lines changed: 42 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,48 @@ describe('Issue Fixer Workflow', () => {
2626
);
2727
rig.createFile(
2828
'package.json',
29-
'{"name": "test", "dependencies": {"lodash": "4.17.0"}}',
29+
'{"name": "test", "scripts": {"test": "echo \\"tests passed\\" && exit 0"}, "dependencies": {"lodash": "4.17.0"}}',
30+
);
31+
rig.createFile(
32+
'src/db/search.js',
33+
'export function searchUser(db, name) {\n const query = "SELECT * FROM users WHERE name = \'" + name + "\'";\n return db.query(query);\n}\n',
34+
);
35+
rig.createFile(
36+
'src/index.js',
37+
'function calculate(a, b) {\n return a + b;\n}\n\nfunction login(username, password) {\n if (password === "forgot password") throw new Error("crash");\n return true;\n}\n',
38+
);
39+
rig.createFile(
40+
'src/async.js',
41+
"async function fetchData() {\n return await api.get('/data');\n}\n",
42+
);
43+
rig.createFile(
44+
'src/ui/Component.tsx',
45+
"import React from 'react';\nexport const Component = () => {\n return <div>UI</div>;\n}\n",
46+
);
47+
rig.createFile(
48+
'src/utils/validation.ts',
49+
'export const validate = () => true;\n',
50+
);
51+
rig.createFile(
52+
'src/UserForm.tsx',
53+
"import React from 'react';\nexport const UserForm = () => {\n const isValid = true;\n return <form>User</form>;\n}\n",
54+
);
55+
rig.createFile(
56+
'src/OrderForm.tsx',
57+
"import React from 'react';\nexport const OrderForm = () => {\n const isValid = true;\n return <form>Order</form>;\n}\n",
58+
);
59+
rig.createFile(
60+
'test/UserProfile.test.js',
61+
'describe("UserProfile", () => {\n it("should load data", async () => {\n // Flaky network call\n });\n});\n',
62+
);
63+
64+
rig.createFile(
65+
'src/CheckoutWizard.tsx',
66+
'import React, { useState } from "react";\nexport const CheckoutWizard = () => {\n const [step, setStep] = useState(0);\n const nextStep = async () => {\n await new Promise(r => setTimeout(r, 100));\n setStep(s => s + 1);\n };\n return <button onClick={nextStep}>Next</button>;\n};\n',
67+
);
68+
rig.createFile(
69+
'scripts/deploy.js',
70+
'const fs = require("fs");\nif (fs.exists("dist")) {\n console.log("Deploying...");\n}\n',
3071
);
3172

3273
mkdirSync(join(rig.testDir, '.gemini/commands'), { recursive: true });

evals/issue-triage.eval.ts

Lines changed: 12 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
import { describe, expect, it } from 'vitest';
22
import { TestRig } from './test-rig';
3-
import { readFileSync, mkdirSync, copyFileSync } from 'node:fs';
3+
import { readFileSync, mkdirSync, copyFileSync, existsSync } from 'node:fs';
44
import { join } from 'node:path';
55

66
interface TriageCase {
@@ -18,7 +18,7 @@ const dataset: TriageCase[] = JSON.parse(readFileSync(datasetPath, 'utf-8'));
1818

1919
describe('Issue Triage Workflow', () => {
2020
for (const item of dataset) {
21-
it.concurrent(`should correctly triage: ${item.id}`, async () => {
21+
it(`should correctly triage: ${item.id}`, async () => {
2222
const rig = new TestRig(`triage-${item.id}`);
2323
try {
2424
// Setup the command
@@ -36,7 +36,16 @@ describe('Issue Triage Workflow', () => {
3636
GITHUB_ENV: envFile,
3737
};
3838

39-
await rig.run(['--prompt', '/gemini-triage', '--yolo'], env);
39+
const stdout = await rig.run(
40+
['--prompt', '/gemini-triage', '--yolo'],
41+
env,
42+
);
43+
44+
if (!existsSync(envFile)) {
45+
throw new Error(
46+
`envFile was not created at ${envFile}.\nStdout: ${stdout}\nStderr: ${rig.lastRunStderr}`,
47+
);
48+
}
4049

4150
// Check the output in GITHUB_ENV
4251
const content = readFileSync(envFile, 'utf-8');

evals/mock-mcp-server.ts

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -287,6 +287,18 @@ server.setRequestHandler(CallToolRequestSchema, async (request) => {
287287
},
288288
],
289289
};
290+
case 'issue_read':
291+
return {
292+
content: [
293+
{
294+
type: 'text',
295+
text: JSON.stringify({
296+
title: 'Mock Issue',
297+
body: 'This is a mock issue body.',
298+
}),
299+
},
300+
],
301+
};
290302
case 'issue_read.get_comments':
291303
return {
292304
content: [

0 commit comments

Comments
 (0)