Commit d2f0ca8
mempool: per-entry TTL + bucket-sharded snapshot + silent-blackhole recovery (#722)
## Summary
Adds silent black-hole recovery for the mempool broadcast path, plus an
e2e regression test that **deterministically** exercises the recovery
code path.
Three commits:
1. **`feat(mempool):`** Replace global-wipe `TxnCache` with per-entry
TTL + bucket-sharded snapshot + four-state filter (`Dispatch /
WaitForPrimary / SuppressInTtl / SuppressSameSlot`). A tx whose first
dispatch landed on a stuck Primary slot is now auto-routed via the
Failover slot one TTL (default 5s) later — no gaptos changes, no global
re-broadcast, no extra tokio tasks.
2. **`test(e2e):`** Add Phase 3 to `pfn_chain` e2e. Restarts pfn1 with
`GRAVITY_BLACKHOLE_BROADCAST=1` (Mempool-side debug env knob, ~10 LOC in
`bin/gravity_node/src/mempool.rs`) so pfn1 stays a healthy member of
`sync_states` but silently drops mempool broadcasts; then drives 30s of
multi-account load via pfn3 and asserts impl-d's slot-flip catches every
in-flight tx within ~1 TTL + commit.
3. **`test(e2e):`** Refactor Phase 3 into **two back-to-back halves** —
Half A blackholes pfn1, Half B blackholes pfn2. pfn3 is never restarted
between halves so `priority.rs`'s `RandomState`-seeded Primary stays
fixed → exactly one half has `Primary == blackhole target` and exercises
slot-flip, the other half hits direct delivery latency. The cross-half
p95 split is a **deterministic** assertion — no more `--force-init`
coverage lottery.
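
To make the four-state filter from commit 1 concrete, here is a minimal Python model of one plausible reading of it. Only the four state names and the 5s default TTL come from the description above. The `Entry` fields, the `decide` and `record_dispatch` helpers, and the exact transition rules are assumptions for illustration, not the Rust code in `core_mempool`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

TTL_SECS = 5.0  # default TTL stated in the summary


class Slot(Enum):
    PRIMARY = "primary"
    FAILOVER = "failover"


class Decision(Enum):
    DISPATCH = "Dispatch"
    WAIT_FOR_PRIMARY = "WaitForPrimary"
    SUPPRESS_IN_TTL = "SuppressInTtl"
    SUPPRESS_SAME_SLOT = "SuppressSameSlot"


@dataclass
class Entry:
    """Hypothetical per-tx record; field names are illustrative."""
    last_slot: Optional[Slot] = None
    last_sent_at: Optional[float] = None


def decide(entry: Entry, slot: Slot, now: float, ttl: float = TTL_SECS) -> Decision:
    """Decide whether to broadcast this tx on `slot` at time `now` (assumed rules)."""
    if entry.last_sent_at is None:
        # Never dispatched: assume the Primary slot gets the first attempt.
        if slot is Slot.PRIMARY:
            return Decision.DISPATCH
        return Decision.WAIT_FOR_PRIMARY
    if now - entry.last_sent_at < ttl:
        # Still inside the per-entry TTL window: do not re-send anywhere.
        return Decision.SUPPRESS_IN_TTL
    if slot is entry.last_slot:
        # TTL expired, but retrying the same (possibly stuck) slot is pointless.
        return Decision.SUPPRESS_SAME_SLOT
    # TTL expired and this is the other slot: flip to it.
    return Decision.DISPATCH


def record_dispatch(entry: Entry, slot: Slot, now: float) -> None:
    """Remember where and when the tx was last sent."""
    entry.last_slot = slot
    entry.last_sent_at = now
```

Under these assumed rules, a tx first sent to a stuck Primary at `t=0` is suppressed everywhere until `t=5`, then dispatched once via Failover, which matches the "one TTL later" behaviour described in commit 1.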
## Phase 3 assertions
Per-half:
- `sent ≥ 150`, `timeout == 0`, `failed == 0`, `p99 ≤ 24s` (sanity
ceiling)
Cross-half (deterministic bimodal):
- `min(p95) ≤ 3.0s` — one half must run at direct delivery latency
- `max(p95) ≥ 6.0s` — other half must hit slot-flip (TTL=5s + commit
~1s)
- `gap ≥ 4.0s` — separation between the two modes
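
The thresholds above can be sketched as plain Python checks. This is a minimal illustration, not the code in `gravity_e2e`: `HalfStats`, `check_half`, and `check_bimodal` are hypothetical names, and `gap` is assumed to mean `max(p95) - min(p95)`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HalfStats:
    """Hypothetical per-half summary; latencies are in seconds."""
    sent: int
    timeout: int
    failed: int
    p95: float
    p99: float


def check_half(s: HalfStats) -> List[str]:
    """Return the list of violated per-half assertions (empty means pass)."""
    errors = []
    if s.sent < 150:
        errors.append(f"sent={s.sent} < 150")
    if s.timeout != 0:
        errors.append(f"timeout={s.timeout} != 0")
    if s.failed != 0:
        errors.append(f"failed={s.failed} != 0")
    if s.p99 > 24.0:
        errors.append(f"p99={s.p99} > 24s sanity ceiling")
    return errors


def check_bimodal(a: HalfStats, b: HalfStats) -> List[str]:
    """Return the list of violated cross-half assertions (empty means pass)."""
    fast, slow = min(a.p95, b.p95), max(a.p95, b.p95)
    errors = []
    if fast > 3.0:
        errors.append(f"min(p95)={fast} > 3.0s: no half ran at direct latency")
    if slow < 6.0:
        errors.append(f"max(p95)={slow} < 6.0s: no half hit slot-flip")
    if slow - fast < 4.0:  # assumed definition of `gap`
        errors.append(f"gap={slow - fast:.2f} < 4.0s: modes not separated")
    return errors
```

Because the cross-half check only looks at `min` and `max`, it passes regardless of which half draws the blackholed Primary.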
## Test plan
- [x] First clean run on `mempool-impl-d` worktree (no `--force-init`):
```
Half A (pfn1 blackhole): 210/0/0 p50=6.25 p95=8.61 p99=8.93
Half B (pfn2 blackhole): 233/0/0 p50=1.18 p95=2.05 p99=2.48
bimodal: fast=2.05 slow=8.61 gap=6.57 PASS
Total wall time: 6:16
```
Half A p50 = 6.25s ≈ TTL(5s) + commit(~1s) — slot-flip exactly at the
predicted floor.
- [ ] Reviewer: run `SKIP_CONTRACTS_FETCH=1 python3
gravity_e2e/runner.py pfn_chain` on a fresh worktree; expect either Half
A or Half B to land in the slow mode (Primary assignment is
RandomState-dependent across machines).
- [ ] Phase 1 / Phase 2 still pass unmodified.
- [ ] Verify `GRAVITY_BLACKHOLE_BROADCAST` env knob has no effect when
unset (default node behaviour unchanged).

1 parent 2ccbee2
5 files changed
Lines changed: 841 additions & 88 deletions
File tree
- aptos-core/mempool/src/core_mempool
- bin/gravity_node/src
- cluster
- gravity_e2e
- cluster_test_cases/pfn_chain
- gravity_e2e/cluster