Problem
EpochManager::request_sync_info (aptos-core/consensus/src/epoch_manager.rs:1825) picks the peer that receives a SyncInfoRequest by sampling uniformly from all currently connected peers, filtered only by network_id. ConnectionOrigin (inbound vs. outbound) is not considered, so a fullnode can end up pulling from peers below it in the deployment tree:
| Role | Network | Candidate composition | Effect |
| --- | --- | --- | --- |
| VFN | Vfn | outbound to upstream validator only | OK |
| PFN | Public | outbound seed + inbound downstream PFN | picks downstream PFN at rate inbound / (inbound + outbound) |
| Ex-validator (NodeType::Validator, is_current_epoch_validator == false) | Vfn | only inbound from downstream VFNs | always picks downstream VFN |
The ex-validator case is the most damaging: the downstream VFN's state was originally synced from this very node, so the ex-validator ends up syncing against its own stale view — a closed loop that cannot catch up.
request_sync_info ticks every ~200 ms (GRAVITY_REQUEST_SYNC_INFO_INTERVAL_MS), so this is normal-operation behavior, not an attack-only scenario. In-set validators are unaffected — the function is gated by !is_current_epoch_validator.
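To make the rates in the table concrete, here is a minimal sketch of the arithmetic under uniform sampling. The function name downstream_pick_rate and the peer counts in the usage note are illustrative assumptions, not code or measurements from aptos-core.

```rust
/// Hypothetical helper: probability that a uniform sample over all connected
/// peers lands on an inbound (downstream) peer.
/// Returns None when there are no candidates at all.
fn downstream_pick_rate(inbound: u32, outbound: u32) -> Option<f64> {
    // Sum in f64 so that large counts cannot overflow an integer type.
    let total = f64::from(inbound) + f64::from(outbound);
    if total == 0.0 {
        return None;
    }
    Some(f64::from(inbound) / total)
}
```

For example, a PFN with one outbound seed and three inbound downstream PFNs would pick a downstream peer with rate 3 / (3 + 1) = 0.75, while an ex-validator with zero outbound peers gets a rate of 1.0 regardless of how many VFNs are connected.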
Proposed change
Add an origin == Outbound filter to the candidate pool. If the outbound subset is non-empty, sample from it; otherwise fall back to the full pool (current behavior).
    candidates = peers
        .filter(network_id == fullnode_side_network_id(node_type))
        .filter(origin == Outbound)  // new
    if candidates.is_empty():
        candidates = <full candidate pool>  // fallback preserves liveness
Why the fallback: ConnectionOrigin is a reliable proxy for upstream/downstream only under a tree-shaped topology. In a generic p2p mesh, origin only reflects who dialed first, so an outbound-only policy would needlessly hurt liveness. Falling back gives outbound peers strict priority whenever any exist, without sacrificing progress when they don't.
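The pseudocode above could be sketched in Rust roughly as follows. This is a simplified illustration, not the actual patch: Origin, NetId, Peer, and sync_candidates are hypothetical stand-ins for ConnectionOrigin, NetworkId, PeerNetworkId plus its connection metadata, and the filtering step inside request_sync_info. Sampling from the returned pool is left to the caller, as in the current code.

```rust
/// Hypothetical stand-in for the network layer's connection origin.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Origin {
    Inbound,
    Outbound,
}

/// Hypothetical stand-in for the network id a peer is connected on.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NetId {
    Vfn,
    Public,
}

/// Hypothetical, simplified peer record (assumption: origin is available
/// alongside the peer id, which is one of the open questions).
#[derive(Clone, Debug, PartialEq, Eq)]
struct Peer {
    id: u32,
    network: NetId,
    origin: Origin,
}

/// Returns the pool to sample from: the outbound peers on `network` if any
/// exist, otherwise every peer on `network` (the current behavior).
fn sync_candidates(peers: &[Peer], network: NetId) -> Vec<Peer> {
    // Existing filter: keep only peers on the fullnode-side network.
    let on_network: Vec<Peer> = peers
        .iter()
        .filter(|p| p.network == network)
        .cloned()
        .collect();

    // New filter: prefer peers this node dialed itself.
    let outbound: Vec<Peer> = on_network
        .iter()
        .filter(|p| p.origin == Origin::Outbound)
        .cloned()
        .collect();

    // Fallback preserves liveness when no outbound peer is connected.
    if outbound.is_empty() {
        on_network
    } else {
        outbound
    }
}
```

With this shape, an ex-validator that has only inbound Vfn peers still gets a non-empty pool, so it behaves exactly as today; the change only takes effect when at least one outbound peer exists.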
Scope
In: EpochManager::request_sync_info (one call site).
Out (separate issue suggested): RoundManager::create_block_retriever non-validator branch shares the same network-id-only filter for BlockRetriever's candidate pool, but is driven by inbound consensus messages and has a preferred_peer concept that warrants its own analysis.
Open questions
- Where to surface ConnectionOrigin. get_available_peers() currently returns Vec<PeerNetworkId>; needs a small extension to expose origin from PeersAndMetadata / ConnectionMetadata.
- Feature flag. Recommend gating behind a NodeConfig boolean (default true) for easy rollback.
- Sanity-check origin semantics in NetworkBuilder before relying on them: validators only listen on Vfn (never dial out on it); PFN dials seeds outbound and accepts downstream PFNs inbound.
Metrics
Add counters outbound_peer_picked and fallback_to_inbound. A non-zero fallback_to_inbound is an early signal of ex-validator state or upstream outage.
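A minimal sketch of how the two counters could be updated at the selection point. It uses plain std atomics purely for illustration; the real implementation would presumably register them with the node's existing metrics framework. SyncPeerMetrics, record_pick, and fallback_count are hypothetical names; only the counter names come from this proposal.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical container for the two proposed counters.
#[derive(Default)]
struct SyncPeerMetrics {
    outbound_peer_picked: AtomicU64,
    fallback_to_inbound: AtomicU64,
}

impl SyncPeerMetrics {
    /// Call once per request_sync_info tick, after the candidate pool is chosen.
    /// `outbound_available` is true when the outbound subset was non-empty.
    fn record_pick(&self, outbound_available: bool) {
        let counter = if outbound_available {
            &self.outbound_peer_picked
        } else {
            &self.fallback_to_inbound
        };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    /// Number of ticks that had to fall back to the full pool.
    fn fallback_count(&self) -> u64 {
        self.fallback_to_inbound.load(Ordering::Relaxed)
    }
}
```

An alert on fallback_count() rising above zero would then flag the ex-validator or upstream-outage cases described above.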