[consensus] Add separate failure window for leader reputation#19323

Closed
danielxiangzl wants to merge 1 commit into main from daniel/strict-leader

@danielxiangzl danielxiangzl commented Apr 3, 2026

Summary

  • Adds failure_window_num_validators_multiplier to ProposerAndVoterConfig, allowing the failure counting window to be configured independently from the proposer window
  • When set to 0 (default), falls back to proposer_window_num_validators_multiplier for full backward compatibility
  • When set to a larger value (e.g., 50), failures are remembered for ~4.7 min instead of ~56s, keeping chronically failing validators penalized
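
The fallback rule above can be sketched as follows. This is a minimal illustration, not the PR's code: the helper names `effective_failure_multiplier` and `window_size` are hypothetical, and the assumption that a window is the validator count times the multiplier is inferred from the field name.

```rust
/// Hypothetical helper (not from the PR): pick the multiplier for the failure window.
/// A value of 0 means "not configured", so the proposer multiplier is reused.
fn effective_failure_multiplier(proposer_multiplier: usize, failure_multiplier: usize) -> usize {
    if failure_multiplier == 0 {
        proposer_multiplier
    } else {
        failure_multiplier
    }
}

/// Window size in blocks, assuming it is the validator count times the multiplier.
fn window_size(num_validators: usize, multiplier: usize) -> usize {
    num_validators * multiplier
}
```

With an assumed 112 validators, `window_size(112, effective_failure_multiplier(10, 0))` gives 1,120 blocks, and a failure multiplier of 50 gives 5,600 blocks.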

Motivation

The Oscillation Problem

The current leader reputation system uses a single proposer_window (~1,120 blocks, ~56s) for counting both successful and failed proposals. This creates an oscillation cycle for chronically unreliable validators:

  1. Validator comes online briefly → votes → becomes active (weight 1000)
  2. Gets selected as proposer → fails → weight drops to 1 (1000x penalty)
  3. Goes offline → no votes, no proposals → failures age out of ~56s window → inactive (weight 10)
  4. Comes back online → votes again → active (weight 1000) → gets selected → fails → repeat

The recovery to full active weight 1000 (not just inactive weight 10) means a flaky validator that's briefly online gets the same selection probability as a perfectly healthy one. The failure check (failure_rate > 10%) runs first and would block this, but the ~56s window is too short — failures age out while the validator is offline.

Mainnet Evidence (24h analysis, April 2026)

Three distinct failure patterns confirmed via Grafana (aptos_failed_proposals_in_window, aptos_committed_proposals_in_window, aptos_committed_votes_in_window):

Pattern 1 — Chronic offline oscillators (hashport, GCP, DSRV, sirouk-delegation):

| Validator | Avg votes in window | Avg committed proposals | Avg failed proposals | % offline (votes=0) | Total failures/24h |
|---|---|---|---|---|---|
| hashport | 0.46 | 1.47 | 0.36 | ~76% | 2,065 |
| DSRV | 3.82 | 1.24 | 0.25 | ~41% | 1,427 |
| sirouk-delegation | 2.76 | 6.54 | 0.22 | | 1,136 |
| GCP | 0.46 | 1.11 | 0.04 | ~76% | 255 |

These validators are mostly offline, come online intermittently, fail proposals, and repeat. They account for the vast majority of mainnet timeouts. The reputation system cannot keep them penalized because the ~56s window is shorter than their offline-online cycle.

Pattern 2 — Online but flaky (Stakely, amnis-delegation): Mostly online (avg votes 14-18), occasional failures. Current system handles them reasonably.

Pattern 3 — Healthy with rare failures (artifact, qyeah, bware-delegation): Always online (avg votes 25-80), very low failure rate. No oscillation.

Overall network impact: ~0.15% of rounds time out (~0.027 timeouts/s at ~18 committed rounds/s). Each timeout wastes ~1s.

Why a Separate Failure Window

The root cause is that count_failed_proposals() and count_proposals() share the same window. Failures need to be remembered much longer than successes:

  • Success window (proposer_window, ~56s): Should stay short to answer "is this validator performing well RIGHT NOW?" — needs only ~10 proposal samples per validator
  • Failure window (new, ~4.7 min): Should be long to answer "has this validator failed RECENTLY?" — needs to survive the offline-online cycle

By decoupling these, the failure check (failure_rate > 10%) correctly fires even when a flaky validator comes back online and votes, because old failures are still in the longer failure window.
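
The two durations quoted above are related by linear scaling. As a rough check, assuming the block rate stays constant and the window length is proportional to the multiplier (10 for the proposer window, 50 as the example failure multiplier):

```latex
\[
T_{\text{failure}} \approx T_{\text{proposer}} \times \frac{50}{10}
= 56\,\text{s} \times 5 = 280\,\text{s} \approx 4.7\,\text{min}
\]
```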

Note: this creates an asymmetry where cur_failed_proposals is counted over the longer failure window while cur_proposals is counted over the shorter proposer window. This makes the system more aggressive at penalizing — old failures are compared against only recent successes. This is the desired behavior for catching oscillating validators, but it's a subtle semantic difference from using a single window.
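
A toy model of the asymmetry, for illustration only. The `Outcome` type, the flat `history` slice, and `count_in_window` are hypothetical simplifications; the real counters operate on committed block metadata, which is not shown in this PR description.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Outcome {
    Committed,
    Failed,
}

/// Count how many of the most recent `window` entries match `validator` and `kind`.
/// `history` is ordered oldest first, so we scan it from the back.
fn count_in_window(history: &[(u32, Outcome)], validator: u32, kind: Outcome, window: usize) -> u32 {
    history
        .iter()
        .rev()
        .take(window)
        .filter(|(v, o)| *v == validator && *o == kind)
        .count() as u32
}

/// Toy history: validator 7 fails twice, goes offline while others propose,
/// then comes back and commits one proposal.
fn sample_history() -> Vec<(u32, Outcome)> {
    vec![
        (7, Outcome::Failed),
        (7, Outcome::Failed),
        (1, Outcome::Committed),
        (2, Outcome::Committed),
        (3, Outcome::Committed),
        (1, Outcome::Committed),
        (2, Outcome::Committed),
        (7, Outcome::Committed),
    ]
}
```

With a short window of 4 entries, validator 7 shows 0 failures and 1 committed proposal, so a single shared window would treat it as healthy. Counting failures over all 8 entries instead yields 2 failures against that same 1 recent proposal, which exceeds a 10% threshold.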

Alternative Approach: No Code Change

An alternative that achieves a similar effect with zero code changes (governance-only):

proposer_window_num_validators_multiplier: 10 → 50
failure_threshold_percent: 10 → 2

This increases the shared window to ~4.7 min (remembering failures longer) and lowers the threshold to compensate for the larger denominator (more successes dilute the failure rate). Tradeoffs vs the separate failure window:

  • Pro: Simpler, no code change, no window asymmetry — both successes and failures use the same time range
  • Con: The lower threshold (2%) makes the system more sensitive to transient failures. Because the check is strict (>), a healthy validator with 1 unlucky failure in fewer than 50 proposals (just over 2%) would be penalized, whereas today it takes 1 failure in fewer than 10 (just over 10%). This could cause false positives for Pattern 3 (healthy) validators.
  • Con: Larger DB fetch is always on — can't independently tune success vs failure windows

The separate failure window approach avoids the false positive issue because the proposer window stays short (10 proposals per validator), keeping the effective bar at "1 failure in ~10 recent proposals."
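
To compare the two thresholds concretely, here is a small sketch of the strict integer check in the same form as the weight logic quoted in this PR. The function name `exceeds_threshold` and the `u32` types are assumptions for illustration.

```rust
/// Strict threshold check in the same integer form as the PR's weight logic:
/// failed * 100 > (committed + failed) * threshold_percent.
fn exceeds_threshold(failed: u32, committed: u32, threshold_percent: u32) -> bool {
    failed * 100 > (committed + failed) * threshold_percent
}
```

For example, 1 failure alongside 39 committed proposals (2.5%) trips a 2% threshold but not a 10% one, while 1 failure alongside 49 committed proposals (exactly 2%) does not trip either, because the comparison is strict.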

Limitation

While penalized (weight 1 vs 1000), a validator is ~1000x less likely to be selected, so it accumulates ~0 proposals. When failures eventually age out of even the longer window, it recovers to active(1000) through votes. The oscillation is slowed from ~56s to ~4.7 min but not fully eliminated. Complementary changes (e.g., lowering inactive_weight via governance) could further reduce this.

Design

  • count_failed_proposals() now uses a separate failure_window_size instead of sharing proposer_window_size
  • count_proposals() and count_votes() are unchanged — no impact on success/activity tracking
  • AptosDBBackend window size is max(proposer_window, voter_window, failure_window) to fetch enough history
  • Fully stateless and deterministic — no caching, all validators compute identical results from committed block history regardless of restart timing or processing order
  • #[serde(default)] on the new field ensures backward compatibility with existing on-chain config
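
The backend window rule in the third bullet amounts to taking a maximum. A minimal sketch, with a hypothetical helper name and illustrative window sizes (the voter window value in the usage note is an assumption, not a figure from this PR):

```rust
/// Hypothetical helper: number of blocks the backend must fetch so that
/// every counter (proposals, votes, failures) sees its full window.
fn backend_window_size(proposer_window: usize, voter_window: usize, failure_window: usize) -> usize {
    proposer_window.max(voter_window).max(failure_window)
}
```

With a proposer window of 1,120 blocks, an assumed voter window of 112 blocks, and a failure window of 5,600 blocks, the backend would fetch 5,600 blocks. When the failure window falls back to the proposer window, the result stays at 1,120.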

Weight Assignment Logic (unchanged)

if cur_failed_proposals * 100 > (cur_proposals + cur_failed_proposals) * failure_threshold_percent {
    failed_weight      // 1   — failure check fires FIRST, uses longer failure window
} else if cur_proposals > 0 || cur_votes > 0 {
    active_weight      // 1000
} else {
    inactive_weight    // 10
}
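
For experimenting with the branch order, the excerpt above can be wrapped in a standalone function. This is a sketch under assumptions: the constant names, the `u32`/`u64` types, and the function name `reputation_weight` are chosen for illustration, while the weights (1, 1000, 10) and the 10% threshold come from this description.

```rust
const FAILED_WEIGHT: u64 = 1;
const ACTIVE_WEIGHT: u64 = 1000;
const INACTIVE_WEIGHT: u64 = 10;
const FAILURE_THRESHOLD_PERCENT: u32 = 10;

/// Standalone version of the weight logic. `cur_failed_proposals` would be
/// counted over the failure window, the other two over their own windows.
fn reputation_weight(cur_failed_proposals: u32, cur_proposals: u32, cur_votes: u32) -> u64 {
    if cur_failed_proposals * 100
        > (cur_proposals + cur_failed_proposals) * FAILURE_THRESHOLD_PERCENT
    {
        FAILED_WEIGHT // failure check runs first
    } else if cur_proposals > 0 || cur_votes > 0 {
        ACTIVE_WEIGHT
    } else {
        INACTIVE_WEIGHT
    }
}
```

A validator with 2 remembered failures, no recent proposals, and 3 recent votes gets `FAILED_WEIGHT`, whereas the same validator with its failures aged out gets `ACTIVE_WEIGHT` from votes alone.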

Test plan

  • All 11 leader_reputation tests pass
  • cargo check -p aptos-consensus -p aptos-types passes
  • Forge test with failure_window_num_validators_multiplier: 50 to validate behavior
  • Governance proposal to enable on mainnet with tuned multiplier

🤖 Generated with Claude Code

Add failure_window_num_validators_multiplier to ProposerAndVoterConfig,
allowing the failure counting window to be configured independently from
the proposer window. This prevents the oscillation problem where flaky
validators cycle through failed→inactive→active states because failures
age out of the short proposer window (~56s) before the validator comes
back online.

When set to 0 (default), falls back to proposer_window_num_validators_multiplier
for backward compatibility. When set to a larger value (e.g., 50),
failures are remembered for ~4.7 min, keeping chronically failing
validators penalized even when they briefly come back online and vote.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>