Prefer ConnectionOrigin::Outbound when picking peer for SyncInfoRequest #711

@nekomoto911

Problem

EpochManager::request_sync_info (aptos-core/consensus/src/epoch_manager.rs:1825) picks the peer that receives a SyncInfoRequest by sampling uniformly from all currently connected peers, filtered only by network_id. ConnectionOrigin (inbound vs outbound) is not considered, so a fullnode can end up pulling from peers below it in the deployment tree:

| Role | Network | Candidate composition | Effect |
| --- | --- | --- | --- |
| VFN | Vfn | outbound to upstream validator only | OK |
| PFN | Public | outbound seed + inbound downstream PFN | picks downstream PFN at rate inbound / (inbound + outbound) |
| Ex-validator (NodeType::Validator, is_current_epoch_validator == false) | Vfn | only inbound from downstream VFNs | always picks downstream VFN |

The ex-validator case is the most damaging: the downstream VFN's state was originally synced from this very node, so the ex-validator ends up syncing against its own stale view. The result is a closed loop in which the node can never catch up.
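To make the rates in the table concrete, here is a minimal sketch of the arithmetic under uniform sampling. The function name downstream_pick_rate is hypothetical and does not exist in aptos-core; it only restates inbound / (inbound + outbound).

```rust
/// Probability that uniform sampling over the candidate pool lands on an
/// inbound (downstream) peer. Hypothetical helper for illustration only.
/// Returns None when there are no candidates at all.
pub fn downstream_pick_rate(inbound: usize, outbound: usize) -> Option<f64> {
    let total = inbound + outbound;
    if total == 0 {
        return None;
    }
    Some(inbound as f64 / total as f64)
}
```

For example, a PFN with 2 outbound seeds and 6 inbound downstream PFNs would pick a downstream peer 75% of the time, while an ex-validator with only inbound VFN connections would do so every time.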

request_sync_info ticks every ~200 ms (GRAVITY_REQUEST_SYNC_INFO_INTERVAL_MS), so this is normal-operation behavior, not an attack-only scenario. In-set validators are unaffected because the function is gated by !is_current_epoch_validator.

Proposed change

Add an origin == Outbound filter to the candidate pool. If the outbound subset is non-empty, sample from it; otherwise fall back to the full pool (current behavior).

pool = peers
  .filter(network_id == fullnode_side_network_id(node_type))

candidates = pool
  .filter(origin == Outbound)             // new

if candidates.is_empty():
  candidates = pool                       // fallback preserves liveness

Why the fallback: ConnectionOrigin is a reliable proxy for upstream/downstream only under a tree-shaped topology. In a generic p2p mesh, origin only reflects who dialed first, so an outbound-only policy would needlessly hurt liveness. Falling back gives outbound peers strict priority whenever any exist, without sacrificing progress when they don't.
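The filter-then-fallback logic could look roughly like the sketch below. The Peer struct, its id field, and the function sync_info_candidates are simplified stand-ins invented for illustration; the real code would work with PeerNetworkId and connection metadata, and the enums here only mirror the variant names of the aptos-core types.

```rust
/// Simplified stand-in for the connection direction; not the real type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ConnectionOrigin {
    Inbound,
    Outbound,
}

/// Simplified stand-in for the network identifier; not the real type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NetworkId {
    Validator,
    Vfn,
    Public,
}

/// Hypothetical flattened view of a connected peer.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Peer {
    pub id: u64,
    pub network_id: NetworkId,
    pub origin: ConnectionOrigin,
}

/// Returns the pool to sample from: outbound peers on the given network if
/// any exist, otherwise every peer on that network (current behavior).
pub fn sync_info_candidates(peers: &[Peer], network_id: NetworkId) -> Vec<Peer> {
    // Existing filter: keep only peers on the fullnode-side network.
    let pool: Vec<Peer> = peers
        .iter()
        .filter(|p| p.network_id == network_id)
        .cloned()
        .collect();

    // New filter: prefer peers this node dialed itself.
    let outbound: Vec<Peer> = pool
        .iter()
        .filter(|p| p.origin == ConnectionOrigin::Outbound)
        .cloned()
        .collect();

    // The fallback preserves liveness when no outbound peer is connected.
    if outbound.is_empty() {
        pool
    } else {
        outbound
    }
}
```

The caller would then sample uniformly from the returned pool exactly as it does today, so the only behavioral change is which peers are eligible.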

Scope

In: EpochManager::request_sync_info (one call site).

Out (separate issue suggested): RoundManager::create_block_retriever non-validator branch shares the same network-id-only filter for BlockRetriever's candidate pool, but is driven by inbound consensus messages and has a preferred_peer concept that warrants its own analysis.

Open questions

  1. Where to surface ConnectionOrigin. get_available_peers() currently returns Vec<PeerNetworkId>; needs a small extension to expose origin from PeersAndMetadata / ConnectionMetadata.
  2. Feature flag. Recommend gating behind a NodeConfig boolean (default true) for easy rollback.
  3. Sanity-check origin semantics in NetworkBuilder before relying on them: validators only listen on Vfn (never dial out on it); PFN dials seeds outbound and accepts downstream PFNs inbound.
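For the feature-flag question, one possible shape is sketched below. SyncInfoPeerConfig, its prefer_outbound field, and choose_pool are hypothetical names; where the boolean would actually live inside NodeConfig is left open here.

```rust
/// Hypothetical config fragment; the real flag would sit somewhere in NodeConfig.
#[derive(Clone, Debug)]
pub struct SyncInfoPeerConfig {
    /// When false, peer selection reverts to the current origin-agnostic behavior.
    pub prefer_outbound: bool,
}

impl Default for SyncInfoPeerConfig {
    fn default() -> Self {
        // Default true, as recommended in the issue.
        Self { prefer_outbound: true }
    }
}

/// Picks between the full pool and its outbound subset according to the flag.
pub fn choose_pool<T>(cfg: &SyncInfoPeerConfig, full_pool: Vec<T>, outbound_only: Vec<T>) -> Vec<T> {
    if cfg.prefer_outbound && !outbound_only.is_empty() {
        outbound_only
    } else {
        full_pool
    }
}
```

Setting the flag to false makes the function return the full pool unconditionally, which gives a rollback path without a code change.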

Metrics

Add counters outbound_peer_picked and fallback_to_inbound. A non-zero fallback_to_inbound is an early signal of ex-validator state or upstream outage.
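As a rough illustration of the two counters, the sketch below uses plain standard-library atomics. This is an assumption made to keep the example self-contained; the actual change would presumably register them through the codebase's existing metrics facilities. The helper names record_peer_pick, outbound_picked, and fallback_count are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-ins for the proposed counters, implemented as process-wide atomics.
pub static OUTBOUND_PEER_PICKED: AtomicU64 = AtomicU64::new(0);
pub static FALLBACK_TO_INBOUND: AtomicU64 = AtomicU64::new(0);

/// Records one peer selection. `used_fallback` is true when the outbound
/// subset was empty and the full pool was sampled instead.
pub fn record_peer_pick(used_fallback: bool) {
    let counter = if used_fallback {
        &FALLBACK_TO_INBOUND
    } else {
        &OUTBOUND_PEER_PICKED
    };
    counter.fetch_add(1, Ordering::Relaxed);
}

/// Current value of outbound_peer_picked.
pub fn outbound_picked() -> u64 {
    OUTBOUND_PEER_PICKED.load(Ordering::Relaxed)
}

/// Current value of fallback_to_inbound.
pub fn fallback_count() -> u64 {
    FALLBACK_TO_INBOUND.load(Ordering::Relaxed)
}
```

Calling record_peer_pick once per request_sync_info tick would let a dashboard alert on any sustained growth of the fallback counter.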
