[consensus] Add separate failure window for leader reputation by danielxiangzl · Pull Request #19323 · aptos-labs/aptos-core

danielxiangzl · 2026-04-03T20:54:55Z

Summary

Adds failure_window_num_validators_multiplier to ProposerAndVoterConfig, allowing the failure counting window to be configured independently from the proposer window
When set to 0 (default), falls back to proposer_window_num_validators_multiplier for full backward compatibility
When set to a larger value (e.g., 50), failures are remembered for ~4.7 min instead of ~56s, keeping chronically failing validators penalized
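
The fallback rule above can be sketched as follows. This is an illustrative sketch rather than the PR's code: the function names are hypothetical, the "window = multiplier × validator count" rule is inferred from the field name, and the validator count (112) and block rate (20 blocks/s) are assumptions back-derived from the ~1,120 blocks / ~56s figures quoted in this description.

```rust
/// Hypothetical helper: resolve the failure-window multiplier,
/// treating 0 as "not set" and falling back to the proposer multiplier.
fn effective_failure_multiplier(failure_multiplier: usize, proposer_multiplier: usize) -> usize {
    if failure_multiplier == 0 {
        proposer_multiplier
    } else {
        failure_multiplier
    }
}

/// Window length in blocks, assuming it is the multiplier times the
/// number of validators (inferred from the field name, not confirmed here).
fn window_blocks(multiplier: usize, num_validators: usize) -> usize {
    multiplier * num_validators
}

/// Approximate wall-clock span of a window, assuming a steady block rate.
fn window_seconds(blocks: usize, blocks_per_second: f64) -> f64 {
    blocks as f64 / blocks_per_second
}
```

Under these assumed inputs, a multiplier of 10 gives 1,120 blocks (about 56s) and a multiplier of 50 gives 5,600 blocks (280s, about 4.7 min), which matches the figures in the summary.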

Motivation

The Oscillation Problem

The current leader reputation system uses a single proposer_window (~1,120 blocks, ~56s) for counting both successful and failed proposals. This creates an oscillation cycle for chronically unreliable validators:

Validator comes online briefly → votes → becomes active (weight 1000)
Gets selected as proposer → fails → weight drops to 1 (1000x penalty)
Goes offline → no votes, no proposals → failures age out of ~56s window → inactive (weight 10)
Comes back online → votes again → active (weight 1000) → gets selected → fails → repeat

The recovery to full active weight 1000 (not just inactive weight 10) means a flaky validator that's briefly online gets the same selection probability as a perfectly healthy one. The failure check (failure_rate > 10%) runs first and would block this, but the ~56s window is too short — failures age out while the validator is offline.
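
The cycle can be traced by applying the weight rule to the counts that remain inside the single shared window at each stage. This is a minimal sketch: the struct, the function names, and the specific vote counts are hypothetical, and only the weights (1000 / 10 / 1) and the 10% threshold come from this description.

```rust
/// Weights and threshold quoted in this description.
const ACTIVE_WEIGHT: u64 = 1000;
const INACTIVE_WEIGHT: u64 = 10;
const FAILED_WEIGHT: u64 = 1;
const FAILURE_THRESHOLD_PERCENT: u64 = 10;

/// Counts observed for one validator inside a single shared window
/// (hypothetical container, for illustration only).
struct WindowCounts {
    proposals: u64,        // successful proposals
    failed_proposals: u64, // failed proposals
    votes: u64,
}

/// Weight rule: the failure check is evaluated first, then activity.
fn weight(c: &WindowCounts) -> u64 {
    let total = c.proposals + c.failed_proposals;
    if c.failed_proposals * 100 > total * FAILURE_THRESHOLD_PERCENT {
        FAILED_WEIGHT
    } else if c.proposals > 0 || c.votes > 0 {
        ACTIVE_WEIGHT
    } else {
        INACTIVE_WEIGHT
    }
}

/// The four stages of the cycle, with illustrative counts for what is
/// still inside the window at each stage.
fn oscillation_stages() -> Vec<(&'static str, WindowCounts)> {
    vec![
        ("online, voting", WindowCounts { proposals: 0, failed_proposals: 0, votes: 3 }),
        ("proposal failed", WindowCounts { proposals: 0, failed_proposals: 1, votes: 3 }),
        ("offline, window emptied", WindowCounts { proposals: 0, failed_proposals: 0, votes: 0 }),
        ("back online, voting", WindowCounts { proposals: 0, failed_proposals: 0, votes: 2 }),
    ]
}
```

Mapping `weight` over the stages yields 1000, 1, 10, 1000: once the failure has aged out, a handful of votes is enough to restore full active weight.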

Mainnet Evidence (24h analysis, April 2026)

Three distinct failure patterns confirmed via Grafana (aptos_failed_proposals_in_window, aptos_committed_proposals_in_window, aptos_committed_votes_in_window):

Pattern 1 — Chronic offline oscillators (hashport, GCP, DSRV, sirouk-delegation):

Validator	Avg votes in window	Avg committed proposals	Avg failed proposals	% offline (votes=0)	Total failures/24h
hashport	0.46	1.47	0.36	~76%	2,065
DSRV	3.82	1.24	0.25	~41%	1,427
sirouk-delegation	2.76	6.54	0.22	—	1,136
GCP	0.46	1.11	0.04	~76%	255

These validators are mostly offline, come online intermittently, fail proposals, and repeat. They account for the vast majority of mainnet timeouts. The reputation system cannot keep them penalized because the ~56s window is shorter than their offline-online cycle.

Pattern 2 — Online but flaky (Stakely, amnis-delegation): Mostly online (avg votes 14-18), occasional failures. Current system handles them reasonably.

Pattern 3 — Healthy with rare failures (artifact, qyeah, bware-delegation): Always online (avg votes 25-80), very low failure rate. No oscillation.

Overall network impact: ~0.15% of rounds time out (~0.027 timeouts/s at ~18 committed rounds/s). Each timeout wastes ~1s.

Why a Separate Failure Window

The root cause is that count_failed_proposals() and count_proposals() share the same window. Failures need to be remembered much longer than successes:

Success window (proposer_window, ~56s): Should stay short to answer "is this validator performing well RIGHT NOW?" — needs only ~10 proposal samples per validator
Failure window (new, ~4.7 min): Should be long to answer "has this validator failed RECENTLY?" — needs to survive the offline-online cycle

By decoupling these, the failure check (failure_rate > 10%) correctly fires even when a flaky validator comes back online and votes, because old failures are still in the longer failure window.

Note: this creates an asymmetry where cur_failed_proposals is counted over the longer failure window while cur_proposals is counted over the shorter proposer window. This makes the system more aggressive at penalizing — old failures are compared against only recent successes. This is the desired behavior for catching oscillating validators, but it's a subtle semantic difference from using a single window.
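
The asymmetry can be illustrated with a simplified per-validator history. This is a sketch under stated assumptions: the `Outcome` enum, `count_recent`, and `failure_check_fires` are hypothetical names for illustration (they are not the PR's count_failed_proposals() or count_proposals()), and the window sizes of 1,120 and 5,600 rounds correspond to multipliers of 10 and 50 with an assumed 112 validators.

```rust
/// Outcome of one round from a single validator's point of view
/// (hypothetical, simplified model of committed history).
#[derive(Clone, Copy, PartialEq)]
enum Outcome {
    Proposed, // this validator's proposal was committed
    Failed,   // this validator was leader but the round timed out
    Other,    // some other validator was leader
}

/// Count entries equal to `target` among the most recent `window` rounds.
fn count_recent(history: &[Outcome], window: usize, target: Outcome) -> u64 {
    let start = history.len().saturating_sub(window);
    history[start..].iter().filter(|&&o| o == target).count() as u64
}

/// Failure check with failures counted over `failure_window` and
/// successes counted over `proposer_window`.
fn failure_check_fires(
    history: &[Outcome],
    proposer_window: usize,
    failure_window: usize,
    threshold_percent: u64,
) -> bool {
    let failed = count_recent(history, failure_window, Outcome::Failed);
    let proposals = count_recent(history, proposer_window, Outcome::Proposed);
    failed * 100 > (proposals + failed) * threshold_percent
}

/// One failure, then `gap` rounds led by others, then one fresh success.
fn sample_history(gap: usize) -> Vec<Outcome> {
    let mut h = vec![Outcome::Failed];
    h.extend(std::iter::repeat(Outcome::Other).take(gap));
    h.push(Outcome::Proposed);
    h
}
```

With `sample_history(2000)`, a shared 1,120-round window no longer sees the failure and the check does not fire, while a 5,600-round failure window still sees it and compares it against the single recent success, so the check fires.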

Alternative Approach: No Code Change

An alternative that achieves a similar effect with zero code changes (governance-only):

proposer_window_num_validators_multiplier: 10 → 50
failure_threshold_percent: 10 → 2

This increases the shared window to ~4.7 min (remembering failures longer) and lowers the threshold to compensate for the larger denominator (more successes dilute the failure rate). Tradeoffs vs the separate failure window:

Pro: Simpler, no code change, no window asymmetry — both successes and failures use the same time range
Con: The lower threshold (2%) makes the system more sensitive to transient failures. A healthy validator is penalized once its failure rate exceeds 2% (for example, 1 unlucky failure in fewer than 50 proposals), whereas today the rate must exceed 10% (1 failure in fewer than 10 proposals). This could cause false positives for Pattern 3 (healthy) validators.
Con: Larger DB fetch is always on — can't independently tune success vs failure windows

The separate failure window approach avoids the false positive issue because the proposer window stays short (10 proposals per validator), keeping the effective bar at "1 failure in ~10 recent proposals."
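
To make the threshold arithmetic concrete, the sketch below computes how many successes a validator needs inside the success window to stay unpenalized for a given number of counted failures. The function names are hypothetical. The comparison itself mirrors the strict inequality in the weight assignment logic, so a rate exactly at the threshold is not penalized.

```rust
/// Strict comparison from the weight logic: penalized only when failures
/// make up more than `threshold_percent` of (successes + failures).
fn is_penalized(successes: u64, failures: u64, threshold_percent: u64) -> bool {
    failures * 100 > (successes + failures) * threshold_percent
}

/// Smallest number of successes that keeps a validator with `failures`
/// counted failures from being penalized. Assumes `threshold_percent > 0`.
fn min_successes_to_avoid_penalty(failures: u64, threshold_percent: u64) -> u64 {
    // Need failures * 100 <= (s + failures) * threshold_percent,
    // i.e. s + failures >= ceil(failures * 100 / threshold_percent).
    let needed_total = (failures * 100 + threshold_percent - 1) / threshold_percent;
    needed_total.saturating_sub(failures)
}
```

For a single counted failure, a 10% threshold requires at least 9 successes and a 2% threshold requires at least 49, which is why widening the shared window fivefold is paired with cutting the threshold fivefold.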

Limitation

While penalized (weight 1 vs 1000), a validator is ~1000x less likely to be selected, so it accumulates ~0 proposals. When failures eventually age out of even the longer window, it recovers to active(1000) through votes. The oscillation is slowed from ~56s to ~4.7 min but not fully eliminated. Complementary changes (e.g., lowering inactive_weight via governance) could further reduce this.
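
A rough calculation shows why a penalized validator accumulates almost no proposals. This sketch assumes plain weighted random selection over reputation weights only (ignoring any stake weighting), 112 validators, and all other validators at active weight. These are illustrative assumptions, and the function names are hypothetical.

```rust
/// Probability that a validator with weight `own` is picked in one round
/// of weighted random selection, given the weights of all other validators.
fn selection_probability(own: u64, others: &[u64]) -> f64 {
    let total: u64 = own + others.iter().sum::<u64>();
    own as f64 / total as f64
}

/// Expected number of times the validator is selected within a window
/// of `window_blocks` rounds, at a fixed per-round probability `p`.
fn expected_selections(window_blocks: usize, p: f64) -> f64 {
    window_blocks as f64 * p
}
```

With 111 other validators at weight 1000, a weight-1 validator is selected with probability 1/111,001 per round, roughly 990x less often than a healthy one at 1/112. Over an assumed 5,600-block failure window that is about 0.05 expected selections, consistent with the "~0 proposals" estimate.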

Design

count_failed_proposals() now uses a separate failure_window_size instead of sharing proposer_window_size
count_proposals() and count_votes() are unchanged — no impact on success/activity tracking
AptosDBBackend window size is max(proposer_window, voter_window, failure_window) to fetch enough history
Fully stateless and deterministic — no caching, all validators compute identical results from committed block history regardless of restart timing or processing order
#[serde(default)] on the new field ensures backward compatibility with existing on-chain config
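
The backend sizing rule can be sketched as below: fetch the largest of the three windows once, then let each counter read only the tail that belongs to its own window. The function names and the example window sizes are hypothetical, and this is not the AptosDBBackend implementation.

```rust
/// Number of most-recent blocks to fetch so that every counter
/// can see its full window.
fn fetch_window(proposer_window: usize, voter_window: usize, failure_window: usize) -> usize {
    proposer_window.max(voter_window).max(failure_window)
}

/// Each counter reads only the tail of the fetched history that
/// corresponds to its own window (or everything, if fewer blocks exist).
fn tail<T>(fetched: &[T], window: usize) -> &[T] {
    &fetched[fetched.len().saturating_sub(window)..]
}
```

For example, with illustrative windows of 1,120 (proposer), 112 (voter), and 5,600 (failure) blocks, 5,600 blocks are fetched and the proposal counter still looks at only the newest 1,120 of them.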

Weight Assignment Logic (unchanged)

if cur_failed_proposals * 100 > (cur_proposals + cur_failed_proposals) * failure_threshold_percent {
    failed_weight      // 1   — failure check fires FIRST, uses longer failure window
} else if cur_proposals > 0 || cur_votes > 0 {
    active_weight      // 1000
} else {
    inactive_weight    // 10
}

Test plan

All 11 leader_reputation tests pass
cargo check -p aptos-consensus -p aptos-types passes
Forge test with failure_window_num_validators_multiplier: 50 to validate behavior
Governance proposal to enable on mainnet with tuned multiplier

🤖 Generated with Claude Code

Add failure_window_num_validators_multiplier to ProposerAndVoterConfig, allowing the failure counting window to be configured independently from the proposer window. This prevents the oscillation problem where flaky validators cycle through failed→inactive→active states because failures age out of the short proposer window (~56s) before the validator comes back online.

When set to 0 (default), falls back to proposer_window_num_validators_multiplier for backward compatibility. When set to a larger value (e.g., 50), failures are remembered for ~4.7 min, keeping chronically failing validators penalized even when they briefly come back online and vote.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

danielxiangzl closed this Apr 6, 2026

danielxiangzl wants to merge 1 commit into main from daniel/strict-leader
