[smoke-test] Decrease retries from 16 to 4 #19301

Open
gregnazario wants to merge 4 commits into aptos-labs:main from gregnazario:greg/cleanup-testing

@gregnazario (Contributor)

Description

Reduces test retries, both to surface flaky tests and to spend less time rerunning them.

How Has This Been Tested?

Key Areas to Review

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Move Compiler
  • Other (specify)

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

@gregnazario requested a review from a team as a code owner April 1, 2026 18:15
@gregnazario enabled auto-merge (rebase) April 1, 2026 18:19

Comment thread .config/nextest.toml
# Reduced from 3 to 1: combined with SWARM_BUILD_NUM_RETRIES=1, gives 2×2=4 total
# attempts per test (was 4×4=16). Fewer retries reduce resource contention when
# multiple tests compete for ports on the same CI instance.
retries = 1

@JoshLind (Contributor) Apr 1, 2026

Silly question: if we reduce this, we won't allow a single smoke test flake anymore? Am I reading it correctly? 🤔

@gregnazario (Contributor Author)

It's 1 retry, so that's 2 tries

Contributor

I see... 🤔 This may be too aggressive without fixing the flakes first 😄 (I worry we'll block folks and make them unhappy).

Will unblock this for you now, and take a look at the latest set of partitioned smoke test runs and try to fix any that I understand 🙏
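
For reference, the arithmetic in the nextest.toml comment can be sketched as below. This is an illustrative sketch, not code from the PR: the function names are hypothetical, and it assumes the two retry layers multiply as the config comment describes (each layer runs its retries plus one initial try). The second helper additionally assumes that attempts fail independently with the same probability, which real flaky tests may not satisfy.

```python
def total_attempts(nextest_retries: int, swarm_build_retries: int) -> int:
    """Worst-case attempts per test when both retry layers are exhausted.

    Each layer makes (retries + 1) tries, and the layers nest, so the
    totals multiply. Names are hypothetical, chosen for illustration.
    """
    return (nextest_retries + 1) * (swarm_build_retries + 1)


def all_attempts_fail_probability(per_attempt_failure: float, attempts: int) -> float:
    """Chance that every attempt fails, ASSUMING independent attempts."""
    return per_attempt_failure ** attempts


if __name__ == "__main__":
    print(total_attempts(1, 1))  # new settings: 2 x 2 = 4
    print(total_attempts(3, 3))  # old settings: 4 x 4 = 16
    # A test that flakes half the time, under the independence assumption:
    print(all_attempts_fail_probability(0.5, total_attempts(1, 1)))
```

Under these assumptions a test still tolerates a single flake (nextest `retries = 1` means two tries), but a frequently flaky test is far more likely to exhaust 4 attempts than 16, which is the trade-off raised in the thread.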

Remove tests that were explicitly marked as subsets of their
epoch_changes counterparts. These tests were permanently #[ignore]'d
and provided no additional coverage over the active variants.

Removed from state_sync.rs:
- test_fullnode_output_sync_no_epoch_changes
- test_fullnode_execution_sync_no_epoch_changes
- test_validator_sync_and_participate_no_epoch_changes
- test_validator_fast_sync_and_participate_no_epoch_changes
- test_validator_fast_sync_exponential_backoff_no_epoch_changes

Removed from consensus_observer.rs:
- test_consensus_observer_fast_sync_no_epoch_changes

The Move Prover z3 version check was silently failing when an existing
z3 binary at $INSTALL_DIR couldn't report its version (e.g., wrong
glibc, corrupt download, incompatible binary from CI runner image).
This caused ~30% of unit test failures with:
  "cannot extract version from /home/runner/bin/z3"

Changes:
- Explicitly check if existing z3 binary can run and report version
- Delete and reinstall broken binaries instead of skipping install
- Add post-install verification to warn early on download corruption
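
The three changes listed above follow a probe, reinstall, verify pattern. The sketch below illustrates that pattern in Python; it is not the script changed in this PR (which is not shown here). The names `probe_version` and `ensure_tool`, the `install` callback, and the version regex are all hypothetical, and the sketch assumes the binary prints its version when invoked with `--version`.

```python
import os
import re
import subprocess
from typing import Callable, Optional

# Hypothetical pattern: first dotted version number in the tool's output.
VERSION_RE = re.compile(r"(\d+\.\d+\.\d+)")


def probe_version(binary: str, timeout_s: float = 10.0) -> Optional[str]:
    """Return the version the binary reports, or None if it cannot run."""
    try:
        result = subprocess.run(
            [binary, "--version"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing file, wrong architecture, corrupt download, hang, etc.
        return None
    if result.returncode != 0:
        return None
    match = VERSION_RE.search(result.stdout)
    return match.group(1) if match else None


def ensure_tool(binary: str, expected: str, install: Callable[[str], None]) -> bool:
    """Reinstall the binary unless it already reports the expected version.

    `install` is a caller-supplied (hypothetical) function that places a
    fresh binary at the given path.
    """
    if probe_version(binary) == expected:
        return True  # existing binary runs and matches; nothing to do
    if os.path.lexists(binary):
        os.remove(binary)  # delete the broken or mismatched binary
    install(binary)
    # Post-install verification: report failure early instead of later.
    return probe_version(binary) == expected
```

The key design point is that "file exists" is never treated as "tool is usable": a binary that cannot report its version is handled the same way as a missing one.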
@gregnazario force-pushed the greg/cleanup-testing branch from 9b2ae23 to ab2312c April 2, 2026 22:18

@github-actions (Contributor)

This issue is stale because it has been open 45 days with no activity. Remove the stale label, comment or push a commit - otherwise this will be closed in 15 days.

@github-actions (bot) added the Stale label May 20, 2026

@github-actions (Contributor)

✅ Forge suite compat success on 78a19e2458c9cded7eaf06d352b62bc18f295dff ==> 284e199b3e4c7868bfc822ae13045980e45eb6f4

Compatibility test results for 78a19e2458c9cded7eaf06d352b62bc18f295dff ==> 284e199b3e4c7868bfc822ae13045980e45eb6f4 (PR)
1. Check liveness of validators at old version: 78a19e2458c9cded7eaf06d352b62bc18f295dff
compatibility::simple-validator-upgrade::liveness-check : committed: 15081.30 txn/s, latency: 2299.89 ms, (p50: 2400 ms, p70: 2500, p90: 2900 ms, p99: 3400 ms), latency samples: 489000
2. Upgrading first Validator to new version: 284e199b3e4c7868bfc822ae13045980e45eb6f4
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6201.68 txn/s, latency: 5478.19 ms, (p50: 6000 ms, p70: 6100, p90: 6200 ms, p99: 6500 ms), latency samples: 212800
3. Upgrading rest of first batch to new version: 284e199b3e4c7868bfc822ae13045980e45eb6f4
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6267.40 txn/s, latency: 5375.17 ms, (p50: 5800 ms, p70: 6100, p90: 6200 ms, p99: 6300 ms), latency samples: 217620
4. upgrading second batch to new version: 284e199b3e4c7868bfc822ae13045980e45eb6f4
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9944.53 txn/s, latency: 3366.91 ms, (p50: 3700 ms, p70: 3800, p90: 4000 ms, p99: 4200 ms), latency samples: 324500
5. check swarm health
Compatibility test for 78a19e2458c9cded7eaf06d352b62bc18f295dff ==> 284e199b3e4c7868bfc822ae13045980e45eb6f4 passed
Test Ok

@github-actions (Contributor)

✅ Forge suite realistic_env_max_load success on 284e199b3e4c7868bfc822ae13045980e45eb6f4

two traffics test: inner traffic : committed: 14202.67 txn/s, latency: 1257.38 ms, (p50: 1200 ms, p70: 1300, p90: 1500 ms, p99: 1900 ms), latency samples: 5305120
two traffics test : committed: 100.01 txn/s, latency: 731.19 ms, (p50: 600 ms, p70: 700, p90: 900 ms, p99: 1600 ms), latency samples: 1740
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 0.484, avg: 0.434", "ConsensusProposalToOrdered: max: 0.121, avg: 0.114", "ConsensusOrderedToCommit: max: 0.199, avg: 0.174", "ConsensusProposalToCommit: max: 0.307, avg: 0.288"]
Max non-epoch-change gap was: 2 rounds at version 74097 (avg 0.00) [limit 4], 2.10s no progress at version 74097 (avg 0.06s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.27s no progress at version 2692692 (avg 0.27s) [limit 16].
Test Ok

@github-actions (Contributor)

✅ Forge suite framework_upgrade success on 78a19e2458c9cded7eaf06d352b62bc18f295dff ==> 284e199b3e4c7868bfc822ae13045980e45eb6f4

Compatibility test results for 78a19e2458c9cded7eaf06d352b62bc18f295dff ==> 284e199b3e4c7868bfc822ae13045980e45eb6f4 (PR)
Upgrade the nodes to version: 284e199b3e4c7868bfc822ae13045980e45eb6f4
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1624.05 txn/s, submitted: 1627.82 txn/s, failed submission: 3.77 txn/s, expired: 3.77 txn/s, latency: 1786.45 ms, (p50: 1200 ms, p70: 1200, p90: 2100 ms, p99: 12000 ms), latency samples: 146500
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2222.44 txn/s, submitted: 2232.26 txn/s, failed submission: 9.82 txn/s, expired: 9.82 txn/s, latency: 1272.51 ms, (p50: 1200 ms, p70: 1200, p90: 2100 ms, p99: 3000 ms), latency samples: 199142
5. check swarm health
Compatibility test for 78a19e2458c9cded7eaf06d352b62bc18f295dff ==> 284e199b3e4c7868bfc822ae13045980e45eb6f4 passed
Upgrade the remaining nodes to version: 284e199b3e4c7868bfc822ae13045980e45eb6f4
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2168.49 txn/s, submitted: 2175.07 txn/s, failed submission: 6.58 txn/s, expired: 6.58 txn/s, latency: 1342.37 ms, (p50: 1200 ms, p70: 1500, p90: 2300 ms, p99: 2900 ms), latency samples: 197781
Test Ok

auto-merge was automatically disabled May 21, 2026 00:29

Rebase failed

@github-actions (bot) removed the Stale label May 21, 2026