fix(harness): make pfn_chain_stress robust on cross-FS / flaky-net hosts #720

Open
keanji-x wants to merge 1 commit into Galxe:main from keanji-x:kj/fix-pfn-stress-harness

keanji-x commented May 18, 2026

Three small defensive fixes for failures hit while reproducing a mempool issue with the pfn_chain_stress harness on a host where /tmp is tmpfs and the repo lives on a separate device, and where GitHub fetches occasionally fail. None of these change the runtime behavior of the binary under test; they only make the harness more resilient to differences between developer machines.

1. cluster/deploy.sh — cross-device hardlink

gravity_node is hardlinked into each node's /tmp/gravity-cluster-pfn-*/<node>/bin/. When /tmp and target/ are on different filesystems the ln -f fails with "Invalid cross-device link" and tears down the whole setup.

Fall back to cp -f when the hardlink fails. This matches what the gravity_cli path a few lines up already does, and what the inline comment already claims.
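For reviewers, the fallback pattern can be sketched as below. This is a minimal illustration rather than the actual diff: `link_or_copy` is a hypothetical helper name, and the throwaway paths stand in for the real target/ binary and per-node bin/ directory.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not taken from deploy.sh): try a hardlink first and
# copy the file if the link fails, e.g. with "Invalid cross-device link"
# when source and destination are on different filesystems.
link_or_copy() {
    if ! ln -f "$1" "$2" 2>/dev/null; then
        cp -f "$1" "$2"
    fi
}

# Demo with throwaway paths standing in for the real binary and node dir.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf 'fake binary\n' > "$workdir/gravity_node"
mkdir -p "$workdir/node1/bin"
link_or_copy "$workdir/gravity_node" "$workdir/node1/bin/gravity_node"
```

Testing `ln` inside `if !` keeps `set -e` from aborting on the failed link, so the copy only runs when it is needed and same-filesystem hosts keep the cheaper hardlink.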

2. cluster/genesis.sh — flaky-network git fetch

Every run does git fetch origin + git pull origin <ref> on external/gravity_chain_core_contracts. A transient network error there kills the whole stress run.

Demote both to warnings — the local working copy is usually already at the right ref, and a stale local copy is strictly better than a hard failure for this workflow.
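A rough sketch of the demotion, assuming a small wrapper is acceptable. `warn_on_failure` is a hypothetical name, the warning text is illustrative, and `false` stands in for a fetch that hits a network error.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical wrapper (not taken from genesis.sh): run a command and log a
# warning instead of aborting when it fails. It returns 0 either way, so
# `set -e` in the caller does not trip.
warn_on_failure() {
    if ! "$@"; then
        echo "WARN: '$*' failed; continuing with the local working copy" >&2
    fi
}

# In the real script the calls would look roughly like:
#   warn_on_failure git fetch origin
#   warn_on_failure git pull origin "$ref"
# Here `false` simulates the transient network failure.
warn_on_failure false
echo "still running after the failed fetch"
```

The trade-off is that a genuinely stale checkout now only shows up as a warning in the log, which is the behavior this PR chooses deliberately for the stress workflow.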

3. regression/pfn_chain_stress/run.sh — wait-for-chain set -e brittleness

The wait-for-chain loop does bn=$(curl ... | sed ...) under set -euo pipefail. The first probe typically lands while node1's RPC is still starting; curl exits 7, pipefail propagates it through the command substitution, and set -e kills the script before the chain is even up.

Wrap curl in { ... || true; } so the wait loop actually waits.
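The failure mode and the fix can be reproduced without a node. In the sketch below, `probe` is a hypothetical stand-in for the curl call: it fails with exit code 7 on the first two attempts and then prints a JSON-RPC-style reply. The reply shape and the sed expression are assumptions for illustration, not the harness's actual ones.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for the curl probe. The counter lives in a file because the
# probe runs inside a command-substitution subshell.
counter=$(mktemp)
echo 0 > "$counter"
probe() {
    local n
    n=$(( $(cat "$counter") + 1 ))
    echo "$n" > "$counter"
    if [ "$n" -le 2 ]; then
        return 7   # same exit code curl uses when it cannot connect
    fi
    echo '{"jsonrpc":"2.0","id":1,"result":"0x2a"}'
}

bn=""
for attempt in 1 2 3 4 5; do
    # Without `|| true`, pipefail makes the pipeline return 7, the
    # assignment fails, and `set -e` ends the script on the first probe.
    bn=$({ probe || true; } | sed -n 's/.*"result":"\(0x[0-9a-fA-F]*\)".*/\1/p')
    if [ -n "$bn" ]; then
        break
    fi
    sleep 1
done
rm -f "$counter"
echo "block number: ${bn:-unavailable} (after $attempt attempts)"
```

Removing the `|| true` from the pipeline makes the script exit during the first iteration, which is the behavior described above.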

Test plan

  • Ran ./run.sh pfn3 --clean end-to-end on a host where all three failure modes were hit; harness now completes cleanly through bench + TPS analysis.
  • Verify no regression on a 'pristine' host where /tmp and target/ are on the same FS (no functional change there).

🤖 Generated with Claude Code

Three small defensive fixes for failures hit while reproducing a
mempool issue with the pfn_chain_stress harness on a host where /tmp is
tmpfs and the repo lives on a separate device, and where GitHub fetches
occasionally hiccup.

1. cluster/deploy.sh: gravity_node binary is hardlinked into each node's
   /tmp/gravity-cluster-pfn-*/<node>/bin/. When /tmp and target/ are on
   different filesystems the `ln -f` fails with "Invalid cross-device
   link" and tears down the whole setup. Fall back to `cp -f` — matches
   what the gravity_cli path already does a few lines up.

2. cluster/genesis.sh: every run does `git fetch origin` + `git pull
   origin <ref>` on the external contracts repo. A transient network
   error there kills the whole stress run. Demote both to warnings —
   the local working copy is usually already at the right ref and a
   stale local copy is strictly better than a hard failure for this
   workflow.

3. regression/pfn_chain_stress/run.sh: the wait-for-chain loop does
   `bn=$(curl ... | sed ...)` under `set -euo pipefail`. The first probe
   typically lands while node1's RPC is still starting; curl exits 7,
   pipefail propagates it through the command substitution, and set -e
   kills the script before the chain is even up. Wrap curl in `{ ...
   || true; }` so the wait loop actually waits.

None of these change runtime behavior of the binary under test — they
only affect the harness's resilience on developer-machine variations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>