Draft: stop handling remote DKGs in summary #9901

Draft
eichhorl wants to merge 14 commits into eichhorl/restore-create-configs-early from eichhorl/rm-summary-remote-dkg

@eichhorl
Contributor

No description provided.

@eichhorl eichhorl changed the base branch from master to eichhorl/create-configs-early April 16, 2026 11:46
@eichhorl eichhorl added the CI_ALL_BAZEL_TARGETS label (Runs all bazel targets) Apr 16, 2026
github-merge-queue Bot pushed a commit that referenced this pull request Apr 20, 2026
…st_head_nns (#9936)

## Root cause

The test
`//rs/tests/consensus/orchestrator:ssh_access_to_nodes_test_head_nns`
runs 16 sub-tests sequentially (through `SystemTestGroup::add_test`)
after a ~2-minute `setup` phase. End-to-end, the run takes roughly 14–15
minutes, which lands right at Bazel's `"long"` timeout (15 minutes / 900
seconds).

I inspected the two flaky runs from the past week:

* `2026-04-16T14:50:04` (master, commit `4548464`)
* `2026-04-16T16:29:48` (branch `eichhorl/rm-summary-remote-dkg`, PR
#9901)

In **both** cases every single sub-test completed with exit code 0 and
the timeout was hit right after the last sub-test
(`node_keeps_keys_until_it_completely_leaves_its_subnet`) finished,
while `assert_no_metrics_errors` was just starting:

```
2026-04-16 14:50:01.401 ... Task 'node_keeps_keys_until_it_completely_leaves_its_subnet' finished with exit code: Ok(ExitStatus(unix_wait_status(0)))
2026-04-16 14:50:01.409 ... >>> assert_no_metrics_errors_fn
-- Test timed out at 2026-04-16 14:50:03 UTC --
```

So the test logic itself is not broken; the suite just occasionally
runs close enough to 15 minutes to trip the Bazel timeout.

## Fix

Bump the Bazel `test_timeout` of the `ssh_access_to_nodes_test` target
from the default `"long"` (15 min) to `"eternal"` (60 min). This gives
enough headroom for natural runtime variability without changing any
test semantics.
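
As a rough sanity check of the headroom argument, the Python sketch below compares the estimated 14–15 minute runtime against Bazel's default durations for its four test timeout categories (short 60 s, moderate 300 s, long 900 s, eternal 3600 s). The helper names `headroom_s` and `smallest_category` and the 20% safety margin are illustrative assumptions, not anything defined in the repository or in Bazel.

```python
# Bazel's default durations for its test timeout categories, in seconds.
BAZEL_TIMEOUTS_S = {
    "short": 60,
    "moderate": 300,
    "long": 900,
    "eternal": 3600,
}


def headroom_s(runtime_s: float, category: str) -> float:
    """Seconds remaining before a test in this timeout category is killed."""
    return BAZEL_TIMEOUTS_S[category] - runtime_s


def smallest_category(runtime_s: float, margin: float) -> str:
    """Return the shortest category covering runtime_s plus a relative margin.

    `margin` is a hypothetical safety factor (0.2 means 20%), not a Bazel setting.
    """
    needed_s = runtime_s * (1 + margin)
    # Dicts preserve insertion order, so categories are checked shortest first.
    for category, limit_s in BAZEL_TIMEOUTS_S.items():
        if limit_s >= needed_s:
            return category
    raise ValueError(f"runtime {runtime_s} s exceeds every Bazel timeout category")


# End-to-end runtime estimated in the root-cause analysis: roughly 14-15 minutes.
for runtime_min in (14, 15):
    runtime_s = runtime_min * 60
    for category in ("long", "eternal"):
        print(f"{runtime_min} min under {category!r}: "
              f"{headroom_s(runtime_s, category):.0f} s headroom")
    print(f"{runtime_min} min with a 20% margin needs "
          f"{smallest_category(runtime_s, 0.2)!r}")
```

Because Bazel has no category between `long` (900 s) and `eternal` (3600 s), any safety margin on top of a 14–15 minute runtime pushes the target straight to `eternal`.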

## Verification

Ran the test 3 times in parallel:

```
bazel test --test_output=errors --runs_per_test=3 --jobs=3 //rs/tests/consensus/orchestrator:ssh_access_to_nodes_test_head_nns
```

Result: all 3 runs passed (max 536.8s, min 503.2s, avg 514.9s).

---

This PR was created following the steps in
`.claude/skills/fix-flaky-tests/SKILL.md`.
Base automatically changed from eichhorl/create-configs-early to eichhorl/early-remote-nidkg-2 April 23, 2026 08:20
@eichhorl eichhorl changed the base branch from eichhorl/early-remote-nidkg-2 to eichhorl/restore-create-configs-early April 23, 2026 08:40