feat: devp2p peer jailing#2283
Conversation
…eer jailing) Replace the blanket peer-drop on sync failures with a graded, locally enforced response. Core grading: - Drop peers that serve objectively invalid or malformed data (invalid chain, bad-peer, invalid ancestor). - Soft-backoff peers that are merely slow, time out, or stall, or where malice cannot be proven; classification unwraps the error chain so a timeout wrapped in a bad-data sentinel is treated as transient, while a proven-bad-data sentinel is never downgraded by a stray deadline string. - Treat whitelist (checkpoint/milestone) mismatches as a distinct graded class: short backoff on first occurrence, local jail on repeat, and drop only after persistent mismatch within a 30-minute window. - Escalate repeated soft failures (4 within a 10-minute window) to a local jail, and persist backoffs across reconnects so the penalty cannot be reset by reconnecting under the same id. - Bench stalling peers past a grace period instead of dropping them, and keep sync live when every peer is stalled by unblocking the fetch loop via errPeerBackedOff. Robustness and hardening: - A pruned-sidechain ghost-state attack escalates (jail first, drop on a second within 30m) instead of jailing forever. - concurrentFetch timeouts route through the grading funnel exactly once (2-minute grace, no double strike); a master timeout aborts the cycle. - Every terminal drop records a 30-minute durable bench that survives reconnect, including the skeleton beacon-sync invalid-headers drop, and even when the offending peer has already departed (recordJailByID). - Per-peer strike and jail maps are hard-capped with oldest-strike / soonest-expiry eviction to bound peer-keyed allocation. - Non-peer-fault conditions never penalize the peer: node-shutdown cancellations (errCanceled / errCancelContentProcessing / errTerminated / errCancelStateFetch / snap.ErrCancelled) return early, benign swarm sentinels (errPeersUnavailable, errNoPeers) classify as no-action, and a peer that simply disconnects mid-sync (errDisconnected) is graded as a soft backoff rather than mis-classified as a bad-peer drop. - Cross-queue soft strikes are deduped via an atomic backoffForClaim, so a simultaneous multi-queue stall records one strike; skeleton header timeouts grade through the same backoff and mark the soft-backoff and jail meters. A peer benched by a skeleton timeout is requeued into the idle set so the scheduler arms its wake-up timer, and a timeout whose request was already reverted by a peer-leave is skipped via req.stale so a departed peer's same-id reconnect is never benched by a late timer. - peerWithHighestTD iterates the peer set under RLock to avoid a hot-path allocation; nextSyncOp arms the retry timer when no eligible peer is available and for the backoff expiry of benched higher-TD peers while in sync with a lower-TD one, and PeerBackoff consults the persisted jail map first to keep the peerSet/peerConnection lock order explicit. An errPeersUnavailable sync result (e.g. snap sync with only eth/68 peers) arms a short chainSyncer cooldown, cleared on any peer join/leave, so the loop does not spin in continuous sync setup/teardown without penalizing an otherwise-fine peer. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
@claude review |
There was a problem hiding this comment.
Pull request overview
This PR introduces a graded peer-penalty mechanism for devp2p sync failures (soft backoff, local jail, and durable benching across reconnects) to reduce churn on transient issues while still escalating for persistent whitelist mismatches and pruned-sidechain “ghost-state” behavior. The changes span downloader peer response logic, backoff-aware peer selection/scheduling, and a large test suite (unit + p2p-level).
Changes:
- Add a peer failure classifier + response decision engine (backoff/jail/drop with strike windows and persistent benching across reconnects).
- Make sync scheduling and fetchers backoff-aware (skip benched peers and arm wake timers instead of busy retrying / disconnecting).
- Add extensive unit/integration-style coverage (including an end-to-end p2p fixture for the ghost-state case) plus additional logging/metrics.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
eth/sync.go |
Chain sync loop now supports retry waits and peers-unavailable cooldown handling. |
eth/sync_test.go |
Adds chainSyncer state/cooldown/backoff scheduling tests and helper setup. |
eth/protocols/eth/peer.go |
Initializes td to avoid nil TD usage. |
eth/protocols/eth/dispatcher.go |
Exports ErrDisconnected for downstream classification/backoff decisions. |
eth/peerset.go |
Peer selection can skip backed-off peers and returns a retry hint. |
eth/peerset_test.go |
Tests for backoff-aware highest-TD selection. |
eth/peer_jail_p2p_test.go |
End-to-end p2p test fixture validating ghost-state jailing behavior. |
eth/fetcher/tx_fetcher.go |
Enriches warn logs with origin/direct metadata. |
eth/downloader/skeleton.go |
Removes random request IDs, adds backoff wake timers, and backs off (vs hard-drop) on timeouts; benches invalid skeleton header senders. |
eth/downloader/skeleton_test.go |
Adds coverage for backoff timers, timeout backoff behavior, and invalid-header benching. |
eth/downloader/peer.go |
Introduces peer backoff state + persistent jail/strike tracking with bounded maps. |
eth/downloader/peer_response.go |
New failure classification + graded response routing (soft strikes, mismatch ladder, ghost-state ladder, durable benching). |
eth/downloader/peer_response_test.go |
Comprehensive tests for classification/decision/escalation, decay windows, bounded maps, and reconnect persistence. |
eth/downloader/metrics.go |
Adds meters for backoff/jail/drop/mismatch responses. |
eth/downloader/bor_fetchers.go |
Ensures header fetch respects downloader cancellation. |
eth/downloader/bor_fetchers_concurrent.go |
Reworks concurrent fetch to grade stalling peers via backoff/jail and to wake on backoff expiry. |
eth/downloader/bor_fetchers_concurrent_test.go |
Updates/adds tests for backed-off peers, departed stale grading, and master timeout abort behavior. |
eth/downloader/bor_fetch_helpers_test.go |
Adds focused tests for queue acceptance, idle peer collection, wake timers, and stall grading helpers. |
eth/downloader/bor_downloader.go |
Integrates peer-response handling into LegacySync, adds PeerBackoff helpers, improves error wrapping and bad-block reporting. |
eth/downloader/bor_downloader_test.go |
Updates attacker-drop expectations and adds tests for backoff helpers and ErrPeerBackedOff behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Large, design-level change to peer/sync behavior across ~20 files — worth a human reviewer eyeballing the tuning constants, the new error surface (ErrPeerBackedOff, ErrPeersUnavailable, ErrDisconnected exported), and the cross-component state machine, even though no bugs were flagged.
Extended reasoning...
Overview
This PR replaces the existing "drop peer on any classified sync failure" behavior with a graded response system spread across the downloader, skeleton syncer, concurrent fetcher, chain syncer, and peerset. It touches ~20 files, exports three previously-unexported error sentinels (ErrPeersUnavailable, ErrPeerBackedOff, ErrDisconnected), adds four new per-peer maps in peerSet (soft strikes, mismatch strikes, ghost strikes, jailed) with windowed counters and eviction, and introduces a peer-jail mechanism that persists across reconnects.
Security risks
No direct auth/crypto/permission surface. The main risk surface is denial-of-service resilience: the change makes the node less aggressive about dropping peers in some cases (transient failures get a 30s soft backoff instead of a disconnect) and more lenient on a first ghost-state mismatch (5 min jail rather than immediate drop). The threshold tuning (softFailureJailThreshold=4, whitelistMismatchDropThreshold=4, prunedSidechainDropThreshold=2, maxStrikeEntries=4096) determines whether a malicious peer can keep re-offering invalid chains within a window before being terminally dropped. A reviewer with networking context should sanity-check these against expected mainnet behavior.
Level of scrutiny
High. This is production-critical sync/networking logic that runs on every node, controls how the node reacts to its peers, and changes long-standing default behavior. It is not consensus-affecting per the description, but it changes how the local node interacts with the devp2p layer, and a tuning miscalibration could have operational consequences (e.g. peer churn under partial network conditions, or stickier jails than intended).
Other factors
The PR ships extensive test coverage — classifier, decision, routing, windowed counter decay/isolation/boundary tests, jail persistence, bounded-map eviction, backoff-aware peer selection, skeleton/concurrent-fetch grading, and an end-to-end TestFakeEthPeerPrunedSidechainIsLocallyJailed. The description claims go test -race and go vet are clean. That gives reasonable confidence in mechanical correctness, but the design-level decisions (threshold choices, the choice to grade ghost-state rather than drop on sight, the unified 30-minute drop bench) are the kind of judgment calls that benefit from human sign-off.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
This is a significant security-adjacent change that introduces a new peer-grading state machine across the downloader, peerset, skeleton, and chain syncer — I'm deferring to a human reviewer rather than auto-approving.
Extended reasoning...
Overview
This PR replaces the downloader's blanket peer-drop on classified sync failures with a graded response: classifySyncFailure → responseDecision → respondToPeer routes each failure to one of soft backoff, local jail, whitelist-mismatch escalation, ghost-state escalation, or terminal drop. New per-peer state (jailed, softStrikes, mismatchStrikes, ghostStrikes) is maintained in peerSet with windowed counters, oldest-entry eviction, and persistence across reconnect via Register. Backoff-aware peer selection is added to peerWithHighestTD, the skeleton syncer, and the concurrent fetcher, each with wake timers to avoid busy-polling. The cancellation path is also reworked so node shutdown no longer penalizes the master peer, and a new resettableTimer is added to the chain syncer loop. The diff touches 20 files spanning ~2,500 lines.
Security risks
The PR is not consensus-affecting and does not change validation or wire formats, but it materially changes the node's DoS posture toward its sync peers. Risks worth a careful look: (a) the strike-counter and jail maps are keyed by peer id and capped at maxStrikeEntries (4096) with oldest-entry eviction — under adversarial connection churn an attacker could potentially evict legitimate strikes; (b) the ghost-state classifier matches on a substring of the error string (isPrunedSidechainMismatch), which is brittle to upstream wording changes; (c) dropPeerForResponse persists a 30-minute bench across reconnects, meaning a misclassification has long-lived blast radius; (d) the master-peer timeout path now aborts the cycle with errTimeout and is graded as a transient backoff at the top level, which is a behavioral shift from the prior immediate drop.
Level of scrutiny
High. This is production-critical sync code that controls when peers are evicted, jailed, or trusted across reconnects. The constants (peerSoftBackoff, peerJailBackoff, peerDropBackoff, the strike thresholds, the 30-minute windows) are tuning decisions that warrant a human reviewer who understands the network's typical churn patterns on Polygon. The classifier ordering in classifySyncFailure (e.g., pruned-sidechain wins over wrapped deadline; whitelist-mismatch wins over invalid-chain) embeds non-obvious precedence assumptions that should get human eyes.
Other factors
Test coverage is extensive (the description lists ~50 new tests including an end-to-end TestFakeEthPeerPrunedSidechainIsLocallyJailed) and go test -race is reported clean. Copilot raised three substantive issues (stale comment in bor_downloader.go, error-attribution in importBlockResultsStateless, and zero-Sent shim requests in gradeStalePeer) and the author has addressed all three with follow-up commits. SonarQube passed. The bug-hunting system found nothing, but the size and criticality of this PR are exactly the case where the guidance says to defer — a thoughtful human reviewer is the right path here.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## develop #2283 +/- ##
===========================================
+ Coverage 53.44% 53.69% +0.24%
===========================================
Files 896 897 +1
Lines 159770 160369 +599
===========================================
+ Hits 85397 86108 +711
+ Misses 69054 68920 -134
- Partials 5319 5341 +22
... and 31 files with indirect coverage changes
🚀 New features to boost your workflow:
|
Restore a WARN for no-op peer responses so peers-unavailable and whitelist no-remote stay visible instead of only logging at debug. Treat whitelist ErrNoRemote as a no-op and arm the retry cooldown instead of benching an honest peer that is merely lagging behind the latest milestone. Add peerGhostStateMeter for parity with the mismatch meter, and stop double-counting whitelist mismatches in the soft-backoff meter.
There was a problem hiding this comment.
Two findings from local AI review:
-
Blocker:
nextSyncOp()can deadlock on peer selection.peerWithHighestTD()holdsps.lock.RLock(), but the callback passed frometh/sync.gocalls back intoDownloader.PeerBackoff(), which takesps.lock.Lock()viapersistentBackoff(). That is the same mutex in one call chain, so sync can hang as soon as backed-off peers are considered. -
Medium:
isTransientFailure()is too broad. The fallbackstrings.Contains(err.Error(), context.DeadlineExceeded.Error())can classify wrapped hard failures as timeouts just because their text mentionscontext deadline exceeded, which can downgrade invalid-chain / bad-peer / invalid-ancestor cases into soft backoffs.
pratikspatil024
left a comment
There was a problem hiding this comment.
Looks good overall, just the 2 local AI points
Have provided answers to the comments here. I believe they are both false positives
|
|



Summary
On develop the downloader drops the offending peer on every classified sync failure and then resynchronises. That is right for genuinely bad data, but it also disconnects peers for transient problems (timeouts, stalls, a peer that is simply behind), which causes needless churn, and it does not actually help against a peer serving a pruned-sidechain "ghost-state" chain: once dropped it reconnects and immediately re-offers the same invalid chain, so the node churns on it. It also drops a peer for a checkpoint/milestone (whitelist) mismatch, even though that disagreement is frequently transient and the peer is otherwise useful.
This PR replaces the blanket drop with a graded response.
classifySyncFailuremaps each failure sentinel to apeerFailureReason,responseDecisionturns that reason into an action (none, soft backoff, local jail, drop, a whitelist-mismatch escalation, or a ghost-state escalation), andrespondToPeerapplies it throughjailPeer/backoffPeer/escalateMismatch/escalateGhostState/dropPeerForResponse. Genuinely bad data is still dropped. Transient failures get a short local backoff instead of a disconnect, and a peer that accumulates four soft strikes inside a ten minute window is escalated to the local jail. A whitelist mismatch is treated as its own graded class: a short backoff on the first strike, a local jail on the next, and a drop only after the mismatch persists within a thirty minute window, so a node briefly behind on checkpoints/milestones does not evict good peers while a genuinely divergent peer is still removed.The pruned-sidechain ghost-state case is also graded rather than dropped on sight: the first occurrence is locally jailed for five minutes, and only a second occurrence inside a thirty minute window escalates to a drop. This gives an honest peer caught in a legitimate pruned-sidechain reorg the benefit of the doubt while still removing a peer that keeps replaying the attack. Every terminal drop (invalid chain, bad peer, invalid ancestor, the fourth whitelist mismatch, the second ghost-state) now also records a thirty minute local bench, so a dropped peer that reconnects with the same node id stays benched instead of immediately looping back into sync. Conversely, a sync that fails only because the node itself cancelled or is shutting down is never charged against the peer.
errInvalidChainwrapping"sidechain ghost-state attack"errInvalidChainerrBadPeererrInvalidAncestorwhitelist.ErrMismatcherrTimeouterrStallingPeererrUnsyncedPeererrEmptyHeaderSeterrTooOldgradeStalePeer(errStallingPeer)errTimeout(master peer)errCanceled/errCancelContentProcessing/errTerminated(bare or wrapped)errPeersUnavailableerrPeerBackedOfferrInvalidBody,errInvalidReceipt, …)Soft backoff, jail, the whitelist-mismatch ladder, and the ghost-state ladder share one mechanism: a per-connection backoff plus a
jailedmap keyed by peer id (benchPeer/recordJail), withdropPeerForResponsefor the terminal drop. The four windowed strike counters are independent:recordSoftFailure,recordMismatch, andrecordGhostStateeach keep their own per-peer count and window, so the classes escalate independently. The differences are duration and thresholds: four soft strikes inside the ten minute window escalate to a five minute jail; a whitelist mismatch jails on the second strike and drops on the fourth inside a thirty minute window; a ghost-state jails on the first occurrence and drops on the second inside a thirty minute window.dropPeerForResponsebenches the peer for thirty minutes (peerDropBackoff) before disconnecting, so a terminal drop survives a same-id reconnect rather than resetting. All four per-peer maps (softStrikes,mismatchStrikes,ghostStrikes,jailed) are hard-capped atmaxStrikeEntrieswith oldest-entry eviction, so peer-keyed memory stays bounded under connection churn.Peer selection in the chain syncer (
peerWithHighestTD/nextSyncOp), the skeleton header syncer, and the concurrent fetcher skips a backed-off peer and arms a wake timer for when the backoff expires, so there is no busy-poll and no missed retry. The backoff is re-applied to a reconnecting peer inRegisterand its map entry is pruned once expired, so a peer can neither dodge an active jail by reconnecting nor stay jailed forever. In the concurrent fetcher a timed-out request is parked as stale and graded exactly once bygradeStalePeerafter a two minute grace; a peer that disconnects while a stale request is outstanding is graded on the way out so it cannot evade the strike, and a master-peer timeout aborts the cycle (and is then graded as a transient backoff at the top level).errPeerBackedOffis returned when a sync is attempted against an already backed-off peer and is treated as a clean early return, and a sync that ends in a cancellation sentinel (errCanceled,errCancelContentProcessing,errTerminated, bare or wrapped) is likewise returned without charging the peer.Executed tests
Classifier, decision, routing, and the cancellation guard:
TestClassifySyncFailureReasons,TestPeerResponseDecisionActions,TestRespondToPeerDropRoutesToDropPeer,TestIsSyncCancellationTestHandleSyncFailureClassified,TestHandleSyncFailureUnclassified,TestHandleSyncFailureSkipsUnknownPeer,TestHandleSyncFailureTimeoutBacksOff,TestDropPeerForResponseWithoutDropper,TestLiveOrCapturedPrefersRegisteredPeerTestClassifyPeerStallerBacksOff,TestBackoffEscalatesToJailAfterRepeatStrikes,TestRecordSoftFailureDecays,TestRecordSoftFailureIsolatesPeers,TestRecordSoftFailureWindowBoundaryPruned-sidechain ghost-state (jail then drop):
TestPrunedSidechainJailsPeer,TestPrunedSidechainEscalatesToDropTestRecordGhostStateDecays,TestRecordGhostStateIsolatesPeers,TestRecordGhostStateWindowBoundaryWhitelist mismatch escalation:
TestRecordMismatchDecays,TestRecordMismatchIsolatesPeers,TestRecordMismatchWindowBoundaryTestWhitelistMismatchDoesNotDropOnFirstStrike,TestWhitelistMismatchEscalatesBackoffJailDropJail/backoff persistence, durable drop, and bounded maps:
TestJailSurvivesReconnect,TestRecordJailPrunesExpiredEntries,TestRecordJailPrunesExpiredOnInterval,TestRecordJailPropagatesToLivePeer,TestRecordJailKeepsLongerExpiry,TestPersistentBackoffPrunesExpiredEntries,TestPeerBackoffStateTransitionsTestDropForResponseBenchesPeer,TestStrikeMapsAreBounded,TestStrikeEvictionRemovesOldest,TestJailEvictionRemovesSoonestExpiryBackoff-aware selection, wake timers, and concurrent-fetch grading:
TestQueueAcceptsPeer,TestCollectIdlePeersSurfacesBackoff,TestCollectIdlePeersAllStalePastGraceUnblocksExit,TestCollectIdlePeersWithinGraceKeepsWaiting,TestArmBackoffTimer,TestClassifyPeerBusyPeerNotIdle,TestFetchStallErrorTestGradeStalePeerClearsEntryAndStrikesOnce,TestConcurrentFetchReceipts_BackedOffPeer,TestConcurrentFetchGradesDepartedStalePeer,TestConcurrentFetchMasterTimeoutAbortsTestEarlierBackoff,TestSkeletonAssignTasksReportsBackoff,TestSkeletonSyncWakesAfterBackoff,TestSkeletonSyncBacksOffOnTimeoutTestChainSyncerNextSyncOpStates,TestChainSyncerNextSyncOpSkipsBackedOffPeer,TestChainSyncerLoopRetriesBackedOffPeer,TestResettableTimer,TestChainSyncerOnSyncDone,TestChainSyncerShutdownReturnsWithoutPendingSync,TestPeerWithHighestTDSkipsBackedOffPeersEnd-to-end:
TestFakeEthPeerPrunedSidechainIsLocallyJailedjails a fake eth peer on a real ghost-state mismatch and asserts it stays out of selection on reconnect.go build ./eth/...andgo vet ./eth/ ./eth/downloader/are clean.go test -race ./eth/ ./eth/downloader/is clean.Rollout notes
"Downloader: backing off peer","Downloader: locally jailing peer","Downloader: escalating repeated soft failures to local jail"; whitelist path:"Downloader: backing off peer after whitelist mismatch","Downloader: escalating repeated whitelist mismatch to local jail","Downloader: dropping peer after persistent whitelist mismatch"; ghost-state path:"Downloader: jailing peer for sidechain ghost-state attack","Downloader: dropping peer after repeated sidechain ghost-state attacks"; and the existing"Synchronisation failed, dropping peer"now carries areasonfield. Four new Prometheus meters:eth/downloader/peer/response/backoff,eth/downloader/peer/response/jail,eth/downloader/peer/response/drop, andeth/downloader/peer/response/mismatch. Behaviourally: a transient failure benches a peer for 30s instead of disconnecting it; a checkpoint/milestone mismatch is backed off and only dropped after persistent disagreement; the ghost-state case is jailed for five minutes and dropped only on a repeat; every drop now keeps the peer benched for thirty minutes across reconnects; and a sync cancelled by node shutdown no longer penalizes the master peer.peerSoftBackoff(30s),peerJailBackoff(5 min),peerDropBackoff(30 min),softFailureJailThreshold(4),softFailureWindow(10 min),whitelistMismatchWindow(30 min),whitelistMismatchJailThreshold(2),whitelistMismatchDropThreshold(4),prunedSidechainWindow(30 min),prunedSidechainDropThreshold(2), andmaxStrikeEntries(per-peer map cap).