Skip to content

fix(mempool): honor sender_bucket / timeline_id / count / before in read_timeline#719

Open
keanji-x wants to merge 1 commit into
Galxe:mainfrom
keanji-x:kj/fix-mempool-read-timeline
Open

fix(mempool): honor sender_bucket / timeline_id / count / before in read_timeline#719
keanji-x wants to merge 1 commit into
Galxe:mainfrom
keanji-x:kj/fix-mempool-read-timeline

Conversation

@keanji-x
Copy link
Copy Markdown
Contributor

@keanji-x keanji-x commented May 17, 2026

Summary

CoreMempool::read_timeline was ignoring every routing parameter (sender_bucket, timeline_id, count, before, priority_of_receiver) and using a single global TxnCache to dedup across consecutive ticks. The result: the first call of any tick drained every broadcastable txn from the reth pool into one batch, and every subsequent call (other buckets, Failover priority, even other peers) returned empty until the cache's TTL/capacity expired ~60s later.

Observable bugs

  • sender_bucket fan-out across PFNs collapsed. Whichever peer the broadcast loop iterated first this tick swallowed the entire pool; the other PFN got nothing for ~60s.
  • Failover delay (shared_mempool_failover_delay_ms) bypassed. before was ignored, so Failover peers received txns with the same timing as Primary.
  • shared_mempool_batch_size cap bypassed. A single broadcast batch could contain the entire pool.
  • Peer-side resume state broken. The returned MultiBucketTimelineIndexIds.id_per_bucket had length equal to the txn count rather than the number of fee buckets, so PeerSyncState::timelines never advanced meaningfully.

Fix

Replace TxnCache with a BroadcastIndex that lazily mirrors the reth pool's broadcastable set, assigning each txn a monotonic timeline_id within its sender_bucket on first observation. read_timeline now:

  • Filters by sender_bucket (last-byte-mod-N, matching transaction_store::sender_bucket in upstream Aptos).
  • Resumes after the input timeline_id (max across positions; this PR does not subdivide by fee bucket internally — see follow-ups).
  • Caps at count.
  • Skips txns first observed after before and breaks the scan on the first too-fresh entry (timeline_id is monotonic with insertion time, so the tail is also too fresh).
  • Returns a MultiBucketTimelineIndexIds whose length matches the caller's input so PeerSyncState stays well-formed.

GC: entries are dropped when their txn no longer appears in the snapshot from the upstream pool (committed, expired, or evicted).

Follow-ups (intentionally out of scope)

  • Per-(sender_bucket, fee_bucket) subdivision. Requires exposing gas price from TxPool::get_broadcast_txns. Current PR collapses all fee buckets into one timeline, which loses high-fee-first ordering within a broadcast batch but matches the reth pool's natural ordering.
  • Accurate ready_time emission. Needs SystemTime tracking in addition to Instant. Left at 0 to match prior behavior.
  • timeline_range retransmit path. Still returns empty (unchanged from before); orthogonal to this fix.

Test plan

  • cargo check -p aptos-mempool — passes
  • cargo test -p aptos-mempool --lib — 4 new unit tests pass:
    • sender_bucket_takes_last_byte_mod_n (sender_bucket arithmetic)
    • broadcast_index_partitions_by_sender_bucket (bucket partitioning)
    • range_query_respects_resume_id (BTreeMap::range resume semantics)
    • failover_before_filter_breaks_scan (before-driven scan break)
  • Validator/PFN broadcast smoke: confirm both PFNs receive a balanced share of new txns under load
  • Confirm shared_mempool_failover_delay_ms is honored on Failover peers (Failover broadcasts strictly lag Primary by configured delay)
  • Watch for regressions in the per-peer distribution of the broadcast metric

🤖 Generated with Claude Code

…ead_timeline

The previous CoreMempool::read_timeline ignored every routing parameter
and used a single global TxnCache to dedup across consecutive ticks.
That meant the first read_timeline call of a tick drained every
broadcastable txn from the reth pool, populated the cache, and every
subsequent call (other buckets, Failover priority, even other peers)
saw an empty result until TTL/capacity cleared the cache. Effects:

- The sender_bucket-based fan-out across PFNs collapsed: whichever peer
  iterated first this tick received the entire pool, the others got
  nothing for ~60s.
- The Failover delay (`shared_mempool_failover_delay_ms`) was bypassed,
  since `before` was ignored.
- The `shared_mempool_batch_size` cap was bypassed: a tick could push
  the whole pool into a single broadcast batch.
- The returned MultiBucketTimelineIndexIds used `id_per_bucket` length
  equal to the txn count rather than the number of fee buckets, so the
  peer-side resume state never advanced meaningfully.

Replace TxnCache with a BroadcastIndex that lazily mirrors the reth
pool's broadcastable set and assigns each txn a monotonic timeline_id
within its sender_bucket on first observation. read_timeline now:

- Returns only txns whose sender hashes into the requested
  sender_bucket (last-byte-mod-N, matching upstream Aptos's
  `transaction_store::sender_bucket`).
- Resumes after the input timeline_id (max position; we don't subdivide
  by fee bucket).
- Caps the batch at `count`.
- Skips txns first observed after `before` (Failover delay gate), and
  breaks on the first too-fresh entry since timeline_id is monotonic
  with insertion time.
- Returns a MultiBucketTimelineIndexIds whose length matches the input
  so PeerSyncState stays well-formed.

GC: entries are dropped when their txn no longer appears in the
snapshot from the upstream pool (committed, expired, or evicted).

Per-(sender_bucket, fee_bucket) subdivision and accurate ready_time
emission are intentional follow-ups: both require richer txn metadata
from the TxPool trait (gas price for fee buckets, SystemTime for
ready_time), which is out of scope for this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nekomoto911
Copy link
Copy Markdown
Contributor

Tracing through the implementation with the broadcast design notes — the cursor restoration itself (sender_bucket sharding + per-(peer, bucket) cursor + before filter) is a faithful upstream restore. A few gaps worth surfacing:

1. No retransmit path → fault-tolerance regression vs main

With the strict cursor honored, a single un-committing txn X is broadcast at most twice (Primary at t≈0, Failover at t≈500ms) and never again in steady state, because:

  • timeline_range_of_message still returns vec![] (mempool.rs:94-102) → sent_messages is wiped at the top of determine_broadcast_batch every tick → L1 ack-timeout retransmit never fires.
  • gc() / gc_by_expiration_time() still empty (mempool.rs:133-135, 168-170) → L4 system_ttl=600s doesn't apply; reth's max_tx_lifetime only scans queued, not pending (maintain.rs:284-296).
  • The previous TxnCache 60s wipe — which at least gave silent-black-hole a periodic retry window — is removed.

Net effect: silent-black-hole (Primary's TCP/ACK/ping healthy but downstream broken) becomes non-recoverable from the mempool side; X stays in reth pending until capacity eviction (hours–days). Worse than main on this axis.

Suggested follow-up: implement timeline_range_of_message so L1 works; ideally a finite gc() for the L4 backstop.

2. No snapshot amortization across same-tick calls

pool.get_broadcast_txns(None).collect() plus full HashMap rebuild runs on every read_timeline call (mempool.rs:165-174). With num_sender_buckets=4 × default_failovers=1 × 2-peer, that's ~8 full pool scans per 10ms tick. Sub-ms at ≤10k; bottleneck at 100k. Easy follow-up: cache the snapshot on Mempool for ~tick interval and share across same-tick calls.

3. insertion_time is "first-observed", not pool insertion

mempool.rs:215 — Failover's before filter uses this. On mempool restart, every in-pool txn gets insertion_time = now, so Failover loses its catch-up role for the first 500ms after restart. Minor in production; worth adding to the follow-ups list.

4. Nit: saturating_add should be checked_add

next_ids[sb_idx].saturating_add(1) (mempool.rs:213) saturates at u64::MAX, after which every new txn shares the same id and the BTreeMap overwrites silently. Practically unreachable, but checked_add(1).expect("timeline_id overflow") makes the invariant explicit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants