Skip to content

Eliminate get_committed_snapshots from join path.#7832

Draft
cjen1-msft wants to merge 18 commits intomicrosoft:mainfrom
cjen1-msft:join-with-fetch
Draft

Eliminate get_committed_snapshots from join path.#7832
cjen1-msft wants to merge 18 commits intomicrosoft:mainfrom
cjen1-msft:join-with-fetch

Conversation

@cjen1-msft
Copy link
Copy Markdown
Contributor

@cjen1-msft cjen1-msft commented Apr 21, 2026

Currently most of our code uses get_committed_snapshots implicitly via the _add_node or join_node calls.
This PR has two changes:

  • All startup and recovery runs now uses fetch_recent_snapshot (except tests which manually want that)
  • All join_node calls either have 'from_snapshot=False' or 'snapshot_dirs=..., from_snapshot=True', and asserts that no tests have just from_snapshot=True without 'snapshot_dirs=...'

This allows us to fully remove get_committed_snapshots from the startup path, pushing that out to the caller via snapshots_dir, this then paves the way for turning get_committed_snapshots into 'copy_snapshots_to_node_dir', bringing our testing closer to ACL's runtime.

I've left from_snapshot in in this PR to make it reviewable.
I'll submit another PR after this to remove it everywhere and instead just use snapshots_dir.

@cjen1-msft cjen1-msft added the run-long-test Run Long Test job label Apr 21, 2026
@cjen1-msft cjen1-msft marked this pull request as ready for review April 22, 2026 15:24
@cjen1-msft cjen1-msft requested a review from a team as a code owner April 22, 2026 15:24
Copilot AI review requested due to automatic review settings April 22, 2026 15:24
cjen1-msft and others added 2 commits April 22, 2026 16:31
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Python test infrastructure and multiple tests to stop implicitly calling get_committed_snapshots() as part of the default node join flow, instead making snapshot usage explicit via from_snapshot, snapshots_dir, and fetch_recent_snapshot parameters.

Changes:

  • Update many test call sites to pass from_snapshot=False explicitly when joining nodes, and pass snapshots_dir when joining from a copied snapshot.
  • Refactor some tests (notably tests/reconfiguration.py) to separate “copy snapshot locally” vs “fetch snapshot on startup” behaviors.
  • Adjust tests/infra/network.py snapshot-join logic and make get_committed_snapshots() accept node=None (defaulting to the primary).

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/infra/network.py Changes snapshot-join handling and get_committed_snapshots() defaulting behavior.
tests/reconfiguration.py Makes snapshot usage explicit and refactors test_add_node parameters.
tests/recovery.py Updates join calls to pass from_snapshot explicitly (and snapshots_dir where needed).
tests/lts_compatibility.py Branches join behavior by version, using copied snapshots for older versions.
tests/e2e_operations.py Updates several join paths and snapshot-related tests to be explicit about snapshot usage.
tests/redirects.py Updates joins to pass from_snapshot=False.
tests/partitions_test.py Updates join to be explicit about snapshot fetching behavior.
tests/limits.py Updates join to pass from_snapshot=False.
tests/governance.py Updates joins to pass from_snapshot=False.
tests/code_update.py Updates multiple joins to pass from_snapshot=False.
tests/e2e_logging.py Updates join to pass from_snapshot=False.
tests/e2e_common_endpoints.py Updates joins to pass from_snapshot=False.
tests/js-custom-authorization/custom_authorization.py Updates join to pass from_snapshot=False.
tests/infra/basicperf.py Ensures snapshot-join setup explicitly sets from_snapshot=True.

Custom instructions used:

  • .github/copilot-instructions.md
  • .github/instructions/reviewing.instructions.md

Comment thread tests/infra/network.py Outdated
Comment thread tests/infra/network.py Outdated
@achamayou
Copy link
Copy Markdown
Member

@cjen1-msft it feels like turning from_snapshot into an enum with No|FromFile|FromNode would be best for clarity. Apart from one residual test to confirm FromFile still works, and one with No to confirm retransmission also works, I would expect every other test to use the default FromNode.

@cjen1-msft
Copy link
Copy Markdown
Contributor Author

@achamayou I don't quite agree. If we propagated that all the way to the config, then yes.
But I think keeping our infra somewhat matching the underlying config is useful.
What I now want to do is to get rid of from_snapshot, and use just the snapshot_dir and fetch_recent_snapshot to express all of the options, as that matches directly what we have in the config => less magic.

Comment thread tests/infra/basicperf.py Outdated
Comment thread tests/infra/network.py
Comment thread tests/e2e_operations.py
new_node,
args.package,
args,
# Tmp fix until the primary expresses an opinion
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused about why the primary does not express an opinion here. Do we not have backup download enabled?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the 'we need the primary to have an opinion over what snapshot it likes that is more precise than startup_seqno.

The reason being that if startup_seqno is 0, as it is here, then we replicate the whole ledger via Raft.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is #7844, right? I suggest we mention that inline in the comment here ("Tmp" is just a way to avoid our "TODO" check, the way we actually prevent this shim living forever is "link to the issue/PR that fixes it), and probably aim to merge that before this, so this shim never hits main.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eddyashton Yep I'm going to delay merging this for a little bit until we merge #7844 and then this can be clean.

network.join_node(
new_node, args.package, args, from_snapshot=from_snapshot
)
if fetch_recent_snapshot:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make a kwargs dict, and pass it with **kwargs to avoid this type of duplication.

@cjen1-msft cjen1-msft marked this pull request as draft May 1, 2026 09:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-long-test Run Long Test job

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants