
e2e-tests: Fix bulletin_fetch flake by staging the relay snapshot per-validator #3249

Draft
BigTava wants to merge 1 commit into main from tiago-fix-flaky-bulletin-e2e-test

@BigTava BigTava commented May 8, 2026

Problem

bulletin_fetch flakes intermittently with a panic in zombienet-provider's archive.unpack().unwrap():

called `Result::unwrap()` on an `Err` value: Custom { kind: UnexpectedEof,
  error: TarError { desc: "failed to unpack `data/chains/.../db/full/000008.log`", ... } }

Reproduced 3/5 in a rerun loop on a single commit. Failure path was inside zombienet's local copy/read/extract pipeline.

Cause

alice and bob both passed the same relay tarball path to with_db_snapshot(...). zombienet-provider keys its internal extraction cache by sha256(path_string), so the two validators landed in the same cache slot and raced writing/reading the same intermediate <hash>.tgz in the namespace dir. When the read won the race against an in-progress write, tar parsing hit UnexpectedEof mid-entry.
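The collision can be illustrated with a minimal sketch. It assumes only what the description above states, namely that the cache key depends on the path string alone. `cache_slot` is a hypothetical helper, and the standard library's `DefaultHasher` stands in for sha256 to keep the example dependency-free; this is not zombienet-provider's actual code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the provider's cache key. The real key is
/// described as sha256(path_string); DefaultHasher is used here only so the
/// sketch needs no external crate. The property that matters is the same:
/// the key is a pure function of the path string.
fn cache_slot(path: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    hasher.finish()
}
```

With a key like this, two validators that pass the same path always get the same slot, while per-validator paths such as `/tmp/alice/relay.tgz` and `/tmp/bob/relay.tgz` (example paths, not taken from the test) get different slots.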

…lake

Alice and bob both passed the same relay tarball path to with_db_snapshot. zombienet-provider keys its extraction cache by sha256(path), so they raced on the same intermediate file and one validator panicked mid-extract with UnexpectedEof. Per-validator copies hash to distinct cache slots.
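
A rough sketch of what staging per-validator copies could look like, using only the standard library. The function name, the file naming scheme, and the staging directory are assumptions made for illustration; the PR's actual helper is not shown in this conversation.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Copies one snapshot tarball to a separate file per validator and returns
/// the new paths in the same order as `validators`. Each returned path is
/// distinct, so a cache keyed by the path string puts each validator in its
/// own slot. (Hypothetical helper; names are illustrative.)
fn stage_snapshot_per_validator(
    snapshot: &Path,
    staging_dir: &Path,
    validators: &[&str],
) -> io::Result<Vec<PathBuf>> {
    fs::create_dir_all(staging_dir)?;
    let mut staged = Vec::with_capacity(validators.len());
    for name in validators {
        // One copy per validator, e.g. "alice-relay-snapshot.tgz".
        let dest = staging_dir.join(format!("{name}-relay-snapshot.tgz"));
        fs::copy(snapshot, &dest)?;
        staged.push(dest);
    }
    Ok(staged)
}
```

Each validator would then pass its own staged path to with_db_snapshot instead of the shared one. The cost is one extra copy of the tarball per validator.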

skunert commented May 12, 2026

Thanks! Can you maybe unify this with the other prefetchers we have in the tests?

I think the proper solution would be for zombienet-sdk to handle this gracefully; it seems totally normal to have multiple nodes reference the same snapshot. I opened an issue on zombienet-sdk. In the meantime we should keep the workarounds.

cc @pepoviola

@pepoviola

Hi @BigTava / @skunert, thanks for reporting this issue. It looks like in the native provider we decided not to copy the snapshot per node, so the described race condition can occur when multiple nodes use the same snapshot. An easy workaround, until I draft a new release with this fixed, is to set spawn_concurrency = 1 in the global settings. This makes the spawning logic sequential and should work as expected.
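
To illustrate why a concurrency limit of 1 removes the race, here is a small std-only sketch. It does not use zombienet's API: `max_parallel_spawns` is a hypothetical function that runs placeholder tasks in batches of at most `spawn_concurrency` threads and reports how many were ever active at the same time. How zombienet schedules spawns internally is not described in this thread, so the batching here is an assumption.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::Duration;

/// Runs `jobs` placeholder spawn tasks in batches of at most
/// `spawn_concurrency` threads and returns the highest number of tasks that
/// were active at the same moment. (Illustrative only, not zombienet code.)
fn max_parallel_spawns(jobs: usize, spawn_concurrency: usize) -> usize {
    assert!(spawn_concurrency > 0, "concurrency limit must be at least 1");
    let active = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);
    let ids: Vec<usize> = (0..jobs).collect();
    for batch in ids.chunks(spawn_concurrency) {
        // The scope joins every thread of this batch before the next starts.
        thread::scope(|s| {
            for _ in batch {
                s.spawn(|| {
                    let now = active.fetch_add(1, Ordering::SeqCst) + 1;
                    peak.fetch_max(now, Ordering::SeqCst);
                    // Stand-in for the copy/extract work done during spawn.
                    thread::sleep(Duration::from_millis(5));
                    active.fetch_sub(1, Ordering::SeqCst);
                });
            }
        });
    }
    peak.load(Ordering::SeqCst)
}
```

With a limit of 1 the peak is always 1, so no task can read an intermediate file while another is still writing it. The trade-off is that spawning takes longer because nothing overlaps.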

I will work on the issue and ping you to bump the zombienet version when it is ready.

Thx!

skunert commented May 13, 2026

Okay, then let's use spawn_concurrency = 1 for now, thanks!
