Stabilize flaky e2e shard failures #580

Closed
cursor[bot] wants to merge 1 commit into master from cursor/ci-pipeline-failure-a849

Contributor

cursor[bot] commented Mar 30, 2026

Summary

This PR fixes the two remaining CI shard failures from commit f31c29a72fa33fd3862c1f6ed2ee82f27ded7f3b after checking for existing open PRs.

  • Confirmed that e2e-tool-config-real-apis is already covered by open draft PR #576 ("Fix flaky tool-config real API auto-run test"), so this PR does not duplicate that work.
  • Hardened e2e/tests/tool-config/mock-api/tool-call-policies.spec.ts so the manual-approval + auto-run assertion waits for the rendered Read Channel tool card instead of assuming the latest bot post is the follow-up tool call.
  • Hardened system-console login handling: tests that immediately redirect to plugin config can now stop once the authenticated navigation completes. Updated the flaky "should disable reasoning configuration" spec to use that mode.
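
The tool-card change above can be illustrated with a minimal, Playwright-independent sketch. Everything in it is hypothetical (the `BotPost` shape and the helper names `latestPost`, `findToolCard`, and `waitForToolCard` are not taken from the spec); the actual test presumably waits on rendered DOM through Playwright locators. The sketch only contrasts "take the newest bot post" with "poll until the specific tool card appears".

```typescript
// Hypothetical model of a bot post. The real spec inspects rendered DOM instead.
interface BotPost {
  id: string;
  toolName?: string; // present only when the post renders a tool card
  text: string;
}

// Fragile approach: assume the newest bot post is the follow-up tool call.
function latestPost(posts: BotPost[]): BotPost | undefined {
  return posts[posts.length - 1];
}

// Sturdier approach: look for the specific tool card, wherever it sits in the thread.
function findToolCard(posts: BotPost[], toolName: string): BotPost | undefined {
  return posts.find((p) => p.toolName === toolName);
}

// Poll until the named tool card shows up or the timeout elapses.
// getPosts is a caller-supplied snapshot function (an assumption of this sketch).
async function waitForToolCard(
  getPosts: () => BotPost[],
  toolName: string,
  timeoutMs = 5000,
  intervalMs = 50,
): Promise<BotPost> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const card = findToolCard(getPosts(), toolName);
    if (card) return card;
    if (Date.now() >= deadline) {
      throw new Error(`Timed out waiting for "${toolName}" tool card`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

If a plain-text bot reply lands after the Read Channel card, `latestPost` returns the reply while `findToolCard` still returns the card, which is the ordering assumption the hardened assertion no longer depends on.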

QA / verification:

  • Checked existing open PRs with gh pr list / gh pr view before proceeding.
  • Reviewed failed job logs with:
    • gh run view --job 69224831269 --log-failed --repo mattermost/mattermost-plugin-agents
    • gh run view --job 69224831156 --log-failed --repo mattermost/mattermost-plugin-agents
    • gh run view --job 69224831148 --log-failed --repo mattermost/mattermost-plugin-agents
  • Confirmed that Playwright discovers the affected specs (listing only, no execution) with: cd e2e && npx playwright test tests/tool-config/mock-api/tool-call-policies.spec.ts tests/system-console/bot-reasoning-config.spec.ts --project=chromium --list
  • Attempted targeted Playwright execution for the affected specs, but this automation runner cannot start testcontainers locally (Could not find a working container runtime strategy), so runtime verification is deferred to CI.

Ticket Link

None

Screenshots

None

Release Note

NONE

Co-authored-by: Christopher Speller <crspeller@users.noreply.github.com>
@github-actions

🤖 LLM Evaluation Results

OpenAI

⚠️ Overall: 18/19 tests passed (94.7%)

| Provider | Total | Passed | Failed | Pass Rate |
|---|---|---|---|---|
| ⚠️ OPENAI | 19 | 18 | 1 | 94.7% |

❌ Failed Evaluations

OPENAI

1. TestReactEval/[openai]_react_cat_message

  • Score: 0.00
  • Rubric: The word/emoji is a cat emoji or a heart/love emoji
  • Reason: The output is the text "smiley_cat", not an actual cat emoji (e.g., 🐱/😺) or a heart/love emoji (e.g., ❤️/💕).

Anthropic

⚠️ Overall: 18/19 tests passed (94.7%)

| Provider | Total | Passed | Failed | Pass Rate |
|---|---|---|---|---|
| ⚠️ ANTHROPIC | 19 | 18 | 1 | 94.7% |

❌ Failed Evaluations

ANTHROPIC

1. TestReactEval/[anthropic]_react_cat_message

  • Score: 0.00
  • Rubric: The word/emoji is a cat emoji or a heart/love emoji
  • Reason: The output is the text string "heart_eyes_cat", not an actual cat emoji (e.g., 😺/🐱) or a heart/love emoji (e.g., ❤️/😍).

This comment was automatically generated by the eval CI pipeline.

@mattermost-build
Collaborator

This PR has been automatically labelled "stale" because it hasn't had recent activity.
A core team member will check in on the status of the PR to help with questions.
Thank you for your contribution!

@crspeller closed this Apr 16, 2026

3 participants