[consensus] Add separate failure window for leader reputation #19323
Closed
danielxiangzl wants to merge 1 commit into
Add failure_window_num_validators_multiplier to ProposerAndVoterConfig, allowing the failure counting window to be configured independently from the proposer window. This prevents the oscillation problem where flaky validators cycle through failed→inactive→active states because failures age out of the short proposer window (~56s) before the validator comes back online.

When set to 0 (default), falls back to proposer_window_num_validators_multiplier for backward compatibility. When set to a larger value (e.g., 50), failures are remembered for ~4.7 min, keeping chronically failing validators penalized even when they briefly come back online and vote.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Add `failure_window_num_validators_multiplier` to `ProposerAndVoterConfig`, allowing the failure counting window to be configured independently from the proposer window
- When set to 0 (default), fall back to `proposer_window_num_validators_multiplier` for full backward compatibility

Motivation
The Oscillation Problem
The current leader reputation system uses a single `proposer_window` (~1,120 blocks, ~56s) for counting both successful and failed proposals. This creates an oscillation cycle for chronically unreliable validators, which move through failed → inactive → active states.

The recovery to full active weight 1000 (not just inactive weight 10) means a flaky validator that is briefly online gets the same selection probability as a perfectly healthy one. The failure check (`failure_rate > 10%`) runs first and would block this, but the ~56s window is too short: failures age out while the validator is offline.

Mainnet Evidence (24h analysis, April 2026)
Three distinct failure patterns were confirmed via Grafana (`aptos_failed_proposals_in_window`, `aptos_committed_proposals_in_window`, `aptos_committed_votes_in_window`):

Pattern 1, chronic offline oscillators (hashport, GCP, DSRV, sirouk-delegation): These validators are mostly offline, come online intermittently, fail proposals, and repeat. They account for the vast majority of mainnet timeouts. The reputation system cannot keep them penalized because the ~56s window is shorter than their offline-online cycle.

Pattern 2, online but flaky (Stakely, amnis-delegation): Mostly online (avg votes 14-18), occasional failures. The current system handles them reasonably.

Pattern 3, healthy with rare failures (artifact, qyeah, bware-delegation): Always online (avg votes 25-80), very low failure rate. No oscillation.
Overall network impact: ~0.15% of rounds time out (~0.027 timeouts/s at ~18 committed rounds/s). Each timeout wastes ~1s.
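
As a rough check on these figures, the arithmetic can be sketched in Rust. The function names are hypothetical, and the per-day total is derived here from the quoted rates rather than measured.

```rust
const SECS_PER_DAY: f64 = 86_400.0;

/// Timeouts per second, given the committed round rate and the
/// fraction of rounds that time out.
fn timeouts_per_sec(rounds_per_sec: f64, timeout_fraction: f64) -> f64 {
    rounds_per_sec * timeout_fraction
}

/// Seconds lost to timeouts over one day, assuming each timeout
/// wastes `secs_per_timeout` seconds.
fn wasted_secs_per_day(timeouts_per_sec: f64, secs_per_timeout: f64) -> f64 {
    timeouts_per_sec * secs_per_timeout * SECS_PER_DAY
}
```

With 18 rounds/s and a 0.15% timeout fraction, `timeouts_per_sec(18.0, 0.0015)` gives roughly 0.027, and at ~1s per timeout that adds up to roughly 2,333 seconds (about 39 minutes) of lost round time per day.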
Why a Separate Failure Window
The root cause is that `count_failed_proposals()` and `count_proposals()` share the same window. Failures need to be remembered much longer than successes.

By decoupling these, the failure check (`failure_rate > 10%`) correctly fires even when a flaky validator comes back online and votes, because old failures are still in the longer failure window.

Note: this creates an asymmetry where `cur_failed_proposals` is counted over the longer failure window while `cur_proposals` is counted over the shorter proposer window. This makes the system more aggressive at penalizing, since old failures are compared against only recent successes. This is the desired behavior for catching oscillating validators, but it is a subtle semantic difference from using a single window.

Alternative Approach: No Code Change
An alternative that achieves a similar effect with zero code changes (governance-only):
This increases the shared window to ~4.7 min (remembering failures longer) and lowers the threshold to compensate for the larger denominator (more successes dilute the failure rate). Tradeoffs vs the separate failure window:
The separate failure window approach avoids the false positive issue because the proposer window stays short (10 proposals per validator), keeping the effective bar at "1 failure in ~10 recent proposals."
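
The window figures quoted in this description can be reproduced with a small Rust sketch. The validator count (112) and block rate (20 blocks/s) are assumptions back-derived from "~1,120 blocks, ~56s" and "10 proposals per validator", and the helper names are hypothetical, not taken from the diff.

```rust
/// Window length in blocks: the multiplier is the number of proposal
/// slots per validator that the window should cover.
fn window_blocks(num_validators: usize, multiplier: usize) -> usize {
    num_validators * multiplier
}

/// Approximate wall-clock span of a window at a given block rate.
fn window_secs(blocks: usize, blocks_per_sec: f64) -> f64 {
    blocks as f64 / blocks_per_sec
}

/// Fallback described in this PR: a failure multiplier of 0 means
/// "reuse the proposer multiplier".
fn effective_failure_multiplier(failure_mult: usize, proposer_mult: usize) -> usize {
    if failure_mult == 0 {
        proposer_mult
    } else {
        failure_mult
    }
}

/// History the backend must fetch: the largest of the three windows.
fn backend_window(proposer: usize, voter: usize, failure: usize) -> usize {
    proposer.max(voter).max(failure)
}
```

Under these assumptions, a proposer multiplier of 10 gives 1,120 blocks (~56s), and a failure multiplier of 50 gives 5,600 blocks (~280s, or ~4.7 min), while the proposer window itself stays at 10 proposals per validator.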
Limitation
While penalized (weight 1 vs 1000), a validator is ~1000x less likely to be selected, so it accumulates ~0 proposals. When failures eventually age out of even the longer window, it recovers to active (1000) through votes. The oscillation is slowed from ~56s to ~4.7 min but not fully eliminated. Complementary changes (e.g., lowering `inactive_weight` via governance) could further reduce this.

Design
- `count_failed_proposals()` now uses a separate `failure_window_size` instead of sharing `proposer_window_size`
- `count_proposals()` and `count_votes()` are unchanged, so there is no impact on success/activity tracking
- `AptosDBBackend` window size is `max(proposer_window, voter_window, failure_window)` to fetch enough history
- `#[serde(default)]` on the new field ensures backward compatibility with existing on-chain config

Weight Assignment Logic (unchanged)
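
A minimal sketch of how the weight decision interacts with the two windows, written from the prose in this description rather than from the diff. The flat history layout, the `Outcome` enum, and the function names are hypothetical simplifications, and the failure rate is assumed to be failed / (failed + successful) proposals. The weights (1000, 10, 1) and the 10% threshold are the values quoted above.

```rust
const ACTIVE_WEIGHT: u64 = 1000;
const INACTIVE_WEIGHT: u64 = 10;
const FAILED_WEIGHT: u64 = 1;
const FAILURE_THRESHOLD_PERCENT: u64 = 10;

#[derive(Clone, Copy, PartialEq)]
enum Outcome {
    Proposed,
    Failed,
}

/// Count `what` outcomes for validator `who` among the `window` newest rounds.
/// `history` is ordered newest first (a simplification of block metadata).
fn count_in_window(history: &[(usize, Outcome)], window: usize, who: usize, what: Outcome) -> u64 {
    history
        .iter()
        .take(window)
        .filter(|(v, o)| *v == who && *o == what)
        .count() as u64
}

/// Failure check runs first, then the activity check, then the inactive fallback.
fn weight(
    history: &[(usize, Outcome)],
    who: usize,
    proposer_window: usize,
    failure_window: usize,
    votes_in_window: u64,
) -> u64 {
    // Successes use the short proposer window; failures use the longer one.
    let proposals = count_in_window(history, proposer_window, who, Outcome::Proposed);
    let failed = count_in_window(history, failure_window, who, Outcome::Failed);
    if failed * 100 > (proposals + failed) * FAILURE_THRESHOLD_PERCENT {
        FAILED_WEIGHT
    } else if proposals > 0 || votes_in_window > 0 {
        ACTIVE_WEIGHT
    } else {
        INACTIVE_WEIGHT
    }
}
```

With equal window sizes this reduces to the single-window behavior: failures older than the proposer window are invisible, so a validator that has just started voting again is treated as active. With a longer `failure_window`, those same failures still trip the failure check.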
Test plan
- `leader_reputation` tests pass
- `cargo check -p aptos-consensus -p aptos-types` passes
- Test with `failure_window_num_validators_multiplier: 50` to validate behavior

🤖 Generated with Claude Code