e2e-tests: Fix bulletin_fetch flake by staging the relay snapshot per-validator #3249
BigTava wants to merge 1 commit into
…lake Alice and bob both passed the same relay tarball path to with_db_snapshot. zombienet-provider keys its extraction cache by sha256(path), so they raced on the same intermediate file and one validator panicked mid-extract with UnexpectedEof. Per-validator copies hash to distinct cache slots.
Thanks! Can you maybe unify this with the other prefetchers we have in the tests? I think the proper solution would be for zombienet-sdk to handle this gracefully; it seems totally normal to have multiple nodes reference the same snapshot. I opened an issue on zombienet-sdk. In the meantime we should keep the workarounds. cc @pepoviola
Hi @BigTava / @skunert, thanks for reporting this issue. It looks like in the native provider we decided not to copy the snapshot per node, so the described race condition can occur when multiple nodes use the same snapshot. An easy workaround to use until I draft a new release with this fixed is to set in the global settings I will work on the issue and ping you to bump the zombienet version when it is ready. Thx!
Okay, then let's use spawn_concurrency = 1 for now, thanks!
Problem
`bulletin_fetch` flakes intermittently with a panic in `zombienet-provider`'s `archive.unpack().unwrap()`.
Reproduced 3/5 in a rerun loop on a single commit. The failure path was inside zombienet's local copy/read/extract pipeline.
Cause
`alice` and `bob` both passed the same relay tarball path to `with_db_snapshot(...)`. `zombienet-provider` keys its internal extraction cache by `sha256(path_string)`, so the two validators landed in the same cache slot and raced writing/reading the same intermediate `<hash>.tgz` in the namespace dir. When the read won the race against an in-progress write, tar parsing hit `UnexpectedEof` mid-entry.
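A minimal, std-only sketch of why per-validator copies avoid the collision. This is not the code from this PR or from zombienet: `cache_key` and `stage_snapshot` are hypothetical names, the file names are made up, and `DefaultHasher` stands in for `sha256` purely to avoid an external crate. The point it illustrates is that a key derived from the path string is identical for identical paths and differs once each validator gets its own copy.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

/// Stand-in for a cache key derived from the path string.
/// ASSUMPTION: the real provider is described as using sha256(path_string);
/// DefaultHasher is used here only to keep the sketch dependency-free.
fn cache_key(path: &Path) -> u64 {
    let mut hasher = DefaultHasher::new();
    path.to_string_lossy().hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical helper: copy the shared snapshot to a file that is unique
/// to one validator, and return the path of that copy.
fn stage_snapshot(shared: &Path, staging_dir: &Path, validator: &str) -> io::Result<PathBuf> {
    fs::create_dir_all(staging_dir)?;
    let staged = staging_dir.join(format!("relay-snapshot-{validator}.tgz"));
    fs::copy(shared, &staged)?;
    Ok(staged)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("snapshot-staging-sketch");
    fs::create_dir_all(&dir)?;

    // Placeholder bytes; a real run would point at the relay tarball.
    let shared = dir.join("relay.tgz");
    fs::write(&shared, b"placeholder bytes, not a real tarball")?;

    // Before: both validators pass the same path, so the keys are equal.
    assert_eq!(cache_key(&shared), cache_key(&shared));

    // After: each validator passes its own copy, so the keys differ.
    let alice = stage_snapshot(&shared, &dir, "alice")?;
    let bob = stage_snapshot(&shared, &dir, "bob")?;
    assert_ne!(cache_key(&alice), cache_key(&bob));

    println!("alice: {}", alice.display());
    println!("bob:   {}", bob.display());
    Ok(())
}
```

The trade-off is one extra copy of the tarball per validator, in exchange for not depending on how the provider serializes access to a shared cache entry.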