fix: Stabilize nightly evaluation suite failures #491

@cocosheng-g

Description

The nightly evaluation suite has been experiencing several failures across different evaluations. This issue tracks the fixes required to stabilize them.

Failures and Solutions

  1. gemini-plan-execute: Tool name assertions failed due to missing prefix handling, and the "plan with approval" test case consistently timed out.

    • Fix: Handled mcp_github_ prefixes in assertions and removed the problematic test case from the dataset.
  2. gemini-scheduled-triage: A "ReferenceError: stdout is not defined" error was thrown while parsing output, and environment file parsing was too strict.

    • Fix: Fixed the reference error by capturing stdout properly, and loosened env file parsing.
  3. issue-fixer: Tool name assertions failed due to missing prefix handling, and the model got lost while searching for flaky tests, which led to timeouts.

    • Fix: Handled prefixes in assertions and added a prompt hint to guide finding test files.
  4. pr-review: "Connection closed" errors occurred during mock server discovery, and skill activation failed because the skill was not found in the clean test environment.

    • Fix: Used a pure JS (.mjs) mock server to avoid tsx startup overhead and implemented folder-based mocking of skills.
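
The prefix handling mentioned for gemini-plan-execute and issue-fixer could look roughly like the sketch below. It assumes the assertions compare plain tool names and that the only prefix to tolerate is mcp_github_; the helper names (normalizeToolName, toolWasCalled) and the example tool names are hypothetical and are not taken from the branch.

```typescript
// Hypothetical helpers; the actual assertion code in the branch may differ.
const MCP_PREFIX = "mcp_github_"; // the prefix named in this issue

// Strip the MCP prefix if present so prefixed and bare names compare equal.
function normalizeToolName(name: string): string {
  return name.startsWith(MCP_PREFIX) ? name.slice(MCP_PREFIX.length) : name;
}

// True if any recorded tool call matches the expected tool, ignoring the prefix.
function toolWasCalled(calledTools: string[], expected: string): boolean {
  const want = normalizeToolName(expected);
  return calledTools.some((tool) => normalizeToolName(tool) === want);
}

// Illustrative tool names only (assumed, not from the evaluation dataset).
console.log(toolWasCalled(["mcp_github_add_issue_comment"], "add_issue_comment"));
```

Normalizing both sides keeps the assertion valid whether the expected name or the recorded call carries the prefix.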

These changes have been implemented in the branch fix/nightly-eval-failures-new.
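
For the gemini-scheduled-triage fix, the issue does not say what "loosened" env file parsing means in practice. The sketch below is one possible reading, assuming KEY=VALUE lines where blank lines, comments, and malformed lines are skipped instead of raising an error. The function name parseEnvFile and these rules are assumptions, not the code in the branch.

```typescript
// Hypothetical lenient parser; not the repository's actual implementation.
function parseEnvFile(content: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const rawLine of content.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Skip blank lines and comments instead of failing on them.
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    // Skip lines with no "=" or an empty key rather than throwing.
    if (eq <= 0) continue;
    // Tolerate an optional leading "export " and whitespace around the key.
    const key = line.slice(0, eq).trim().replace(/^export\s+/, "");
    let value = line.slice(eq + 1).trim();
    // Remove one pair of matching surrounding quotes, if present.
    const quoted =
      value.length >= 2 &&
      ((value.startsWith('"') && value.endsWith('"')) ||
        (value.startsWith("'") && value.endsWith("'")));
    if (quoted) value = value.slice(1, -1);
    result[key] = value;
  }
  return result;
}
```

Skipping unparseable lines trades strict validation for robustness, which suits an evaluation harness that only needs to read a few known keys.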
