
[consensus] Latency-weighted leader reputation + tighter classifier (combined) #19341

Draft
danielxiangzl wants to merge 10 commits into daniel/latency-baseline-by-skip from daniel/latency-weighted-leader-by-skip
@danielxiangzl danielxiangzl commented Apr 6, 2026

Summary

Combines two complementary improvements to leader reputation:

  1. Latency-weighted heuristic: continuous, per-validator weight scaling that prefers validators with a lower historical commit-to-commit interval as proposers
  2. Tighter binary classifier: failed_weight=0, failure_threshold_percent=5, so a slow validator is reliably banned

Gated behind a new on-chain config variant so all validators deterministically agree on whether to enable it (no rollout fork).

Heuristic

Splits each successful pair interval 50/50 between the newer and older proposer; attributes timeout-spanning gaps to the failed proposer(s) via failed_proposer_indices. Aggregates with the mean rather than the median, because the median discarded the failure tail. Healthy adjacent proposers are no longer wrongly penalized for absorbing others' timeouts.
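The attribution rule above can be sketched roughly as follows. This is a minimal illustration, not the PR's `compute_round_times`: the `BlockRecord` type, the `attribute_round_times` and `mean` helpers, and the even split of a gap among several failed proposers are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified view of one committed block (not the real Aptos type).
#[derive(Debug, Clone)]
pub struct BlockRecord {
    pub proposer: usize,
    pub timestamp_ms: u64,
    /// Validators whose rounds timed out between the previous block and this one.
    pub failed_proposer_indices: Vec<usize>,
}

/// Returns per-validator round-time observations in milliseconds.
/// `history` is assumed to be ordered oldest to newest.
pub fn attribute_round_times(history: &[BlockRecord]) -> HashMap<usize, Vec<f64>> {
    let mut obs: HashMap<usize, Vec<f64>> = HashMap::new();
    for pair in history.windows(2) {
        let (older, newer) = (&pair[0], &pair[1]);
        let gap = newer.timestamp_ms.saturating_sub(older.timestamp_ms) as f64;
        if newer.failed_proposer_indices.is_empty() {
            // Successful pair: each proposer is charged half of the interval.
            obs.entry(older.proposer).or_default().push(gap / 2.0);
            obs.entry(newer.proposer).or_default().push(gap / 2.0);
        } else {
            // Timeout-spanning gap: charge only the failed proposers,
            // split evenly among them (an assumption of this sketch).
            let share = gap / newer.failed_proposer_indices.len() as f64;
            for &failed in &newer.failed_proposer_indices {
                obs.entry(failed).or_default().push(share);
            }
        }
    }
    obs
}

/// Arithmetic mean, or `None` for an empty slice.
pub fn mean(xs: &[f64]) -> Option<f64> {
    if xs.is_empty() {
        None
    } else {
        Some(xs.iter().sum::<f64>() / xs.len() as f64)
    }
}
```

In this sketch, a healthy proposer that commits right after someone else's timeout receives no observation for that gap, which is the property the paragraph above describes.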

Guards:

  • Per-validator fallback below MIN_OBSERVATIONS=2
  • Scaling ratio clamped at MAX_LATENCY_RATIO=10×
  • Empty / zero-mean → falls back to base weights

On-chain gate

New LeaderReputationType::ProposerAndVoterV3(ProposerAndVoterConfigV3) with use_latency_weighted: bool and latency_weight_multiplier_milli: u32 (BCS-friendly milli-units; 1000 = 1.0×). V1/V2 default the toggle to false → no behavior change for existing payloads.
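A rough sketch of how such a variant and a version-agnostic accessor might look is given below. The `BaseConfig` struct is a cut-down, hypothetical stand-in for the existing proposer-and-voter config, and `params_sketch` and `decode_multiplier` are illustrative names, not the PR's actual API.

```rust
/// Hypothetical, cut-down stand-in for the existing proposer-and-voter config.
#[derive(Debug, Clone, PartialEq)]
pub struct BaseConfig {
    pub failed_weight: u64,
    pub failure_threshold_percent: u32,
    pub proposer_window_num_validators_multiplier: usize,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ProposerAndVoterConfigV3 {
    pub base: BaseConfig,
    pub use_latency_weighted: bool,
    /// Integer milli-units: 1000 means a multiplier of 1.0.
    pub latency_weight_multiplier_milli: u32,
}

#[derive(Debug, Clone, PartialEq)]
pub enum LeaderReputationType {
    ProposerAndVoter(BaseConfig),
    ProposerAndVoterV2(BaseConfig),
    ProposerAndVoterV3(ProposerAndVoterConfigV3),
}

/// Decodes the integer milli-unit multiplier into a float.
pub fn decode_multiplier(milli: u32) -> f64 {
    f64::from(milli) / 1000.0
}

impl LeaderReputationType {
    /// Returns (base config, latency toggle, decoded multiplier).
    /// V1 and V2 report the toggle as off, so existing payloads behave as before.
    pub fn params_sketch(&self) -> (&BaseConfig, bool, f64) {
        match self {
            LeaderReputationType::ProposerAndVoter(base)
            | LeaderReputationType::ProposerAndVoterV2(base) => (base, false, 1.0),
            LeaderReputationType::ProposerAndVoterV3(v3) => (
                &v3.base,
                v3.use_latency_weighted,
                decode_multiplier(v3.latency_weight_multiplier_milli),
            ),
        }
    }
}
```

Storing the multiplier as an integer keeps floating-point values out of the serialized config; the conversion to `f64` happens only after every validator has read the same integer.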

Forge config

ProposerAndVoterV3:
  base:
    failed_weight: 0                                   # lowered from 1 (failed validators are banned)
    failure_threshold_percent: 5                       # lowered from 10
    proposer_window_num_validators_multiplier: 100     # raised from 10
    ...
  use_latency_weighted: true
  latency_weight_multiplier_milli: 2000                # 2.0× suppression

Tests

7 new unit tests for the heuristic (50/50 split, failure attribution, multi-failure split, per-validator fallback, empty history, ratio clamp, cross-epoch skip). All 18 leader-reputation lib tests pass.

Experiment ladder

| PR | window | threshold | failed_weight | heuristic |
|---|---|---|---|---|
| #19330 baseline | 10× | 10% | 1 | off |
| #19574 window-only | 100× | 10% | 1 | off |
| #19566 classifier | 100× | 5% | 0 | off |
| #19567 heuristic-only | 100× | 10% | 1 | on (2×) |
| this PR combined | 100× | 5% | 0 | on (2×) |

Forge results (run 2026-04-28 21:12-21:26 UTC, 14 min, 4k TPS, 1 slow validator)

Commit-accepted latency vs baseline #19330:

| PR | p50 | p75 | p90 | p99 |
|---|---|---|---|---|
| #19330 baseline | 0.209 | 0.298 | 0.658 | 1.268 |
| #19574 window-only | 0.220 | 0.310 | 0.685 | 1.281 |
| #19566 classifier | 0.191 | 0.255 | 0.337 | 1.207 |
| #19567 heuristic-only | 0.179 | 0.229 | 0.299 | 1.064 |
| this PR combined | 0.176 | 0.225 | 0.294 | 0.901 |
| Δ vs baseline | −16% | −24% | −55% | −29% |

The combined configuration has the lowest latency at every percentile. The classifier provides a strong p90 floor; the heuristic flattens the p99 tail. Together they yield −29% p99 (1.27→0.90) vs. baseline.
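The Δ row can be reproduced from the baseline and combined rows with a plain relative-change calculation. The helper name `percent_change` below is only illustrative.

```rust
/// Relative change of `candidate` versus `baseline`, in percent.
/// Negative values mean the candidate is lower (faster) than the baseline.
pub fn percent_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}
```

For example, the p99 entry is (0.901 − 1.268) / 1.268 ≈ −28.9%, which rounds to the −29% reported.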

Conclusions

Test plan

⚠ Prototype/experiment code — not for merge to main as-is. Canonical merge requires governance migration of the on-chain config.

🤖 Generated with Claude Code

@danielxiangzl danielxiangzl added the CICD:run-forge-e2e-perf Run the e2e perf forge only label Apr 6, 2026
@danielxiangzl danielxiangzl changed the base branch from main to daniel/latency-baseline-by-skip April 6, 2026 18:03
@danielxiangzl danielxiangzl force-pushed the daniel/latency-weighted-leader-by-skip branch from 490aaf3 to 31cee9e Compare April 6, 2026 20:17
@danielxiangzl danielxiangzl force-pushed the daniel/latency-weighted-leader-by-skip branch from 31cee9e to 94ff966 Compare April 6, 2026 22:17
@danielxiangzl danielxiangzl force-pushed the daniel/latency-baseline-by-skip branch from 4485f9d to 30a41f4 Compare April 6, 2026 23:50
@danielxiangzl danielxiangzl force-pushed the daniel/latency-weighted-leader-by-skip branch from 94ff966 to 742552e Compare April 6, 2026 23:52
@danielxiangzl danielxiangzl force-pushed the daniel/latency-weighted-leader-by-skip branch from 742552e to be75836 Compare April 14, 2026 22:23
@zeropath-aptos

No security or compliance issues detected. Reviewed everything up to be75836.

Security Overview

Detected Code Changes

Configuration changes:

  • .github/workflows/docker-build-test.yaml: Reduce FORGE_RUNNER_DURATION_SECS from 1800 to 900
  • config/src/config/consensus_config.rs: Add use_latency_weighted_leader and latency_weight_multiplier to ConsensusConfig
  • testsuite/forge-cli/src/suites/land_blocking.rs: Remove duration override for realistic_env_max_load test
  • testsuite/forge-cli/src/suites/realistic_environment.rs: Enable latency-weighted leader selection and set multiplier in realistic_env_max_load_test
  • testsuite/forge/src/config.rs: Remove duration_override field from ForgeConfig

Enhancement:

  • consensus/src/epoch_manager.rs: Implement conditional use of LatencyWeightedHeuristic based on configuration
  • consensus/src/liveness/leader_reputation.rs: Implement LatencyWeightedHeuristic for weighted leader selection based on round time performance
  • testsuite/forge/src/runner.rs: Remove duration_override from network test context

@danielxiangzl danielxiangzl force-pushed the daniel/latency-weighted-leader-by-skip branch from cfc4529 to 677ccd7 Compare April 22, 2026 18:38
@danielxiangzl danielxiangzl reopened this Apr 27, 2026

…onfig

Adds a continuous, per-validator weight scaling to LeaderReputation that
prefers validators with lower historical commit-to-commit interval as
proposers, gated behind a new on-chain config variant.

Heuristic (LatencyWeightedHeuristic in consensus/liveness/leader_reputation.rs):
- compute_round_times: split successful pairs 50/50 between newer and
  older proposer; attribute timeout-spanning gaps in full to the failed
  proposer(s) via failed_proposer_indices. Healthy adjacent proposers
  no longer absorb others' timeouts.
- get_weights: aggregate per-validator round-time observations using
  *mean* (not median, which discarded the failure tail). Scale active
  validators by (max_mean / val_mean)^multiplier, with a per-validator
  fallback when fewer than MIN_OBSERVATIONS=2 entries exist, a
  MAX_LATENCY_RATIO=10 ceiling on the boost, and degenerate-case guards
  (empty means / zero max_mean -> base weights).
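
A minimal sketch of the weight computation described in the get_weights item, under stated assumptions: every validator starts from the same `active_weight` (the real base weights come from the proposer/voter classifier), the ceiling is applied to the boost after exponentiation, and `boost_weights` is a hypothetical name.

```rust
use std::collections::HashMap;

pub const MIN_OBSERVATIONS: usize = 2;
pub const MAX_LATENCY_RATIO: f64 = 10.0;

/// Scales each validator's base weight by (max_mean / val_mean)^multiplier.
/// `observations` maps validator index to its round-time samples.
pub fn boost_weights(
    num_validators: usize,
    active_weight: f64,
    observations: &HashMap<usize, Vec<f64>>,
    multiplier: f64,
) -> Vec<f64> {
    // Mean round time for each validator that has enough observations.
    let means: HashMap<usize, f64> = observations
        .iter()
        .filter(|(_, xs)| xs.len() >= MIN_OBSERVATIONS)
        .map(|(&v, xs)| (v, xs.iter().sum::<f64>() / xs.len() as f64))
        .collect();
    // Slowest observed mean; stays 0.0 when there are no usable means.
    let max_mean = means.values().cloned().fold(0.0_f64, f64::max);

    (0..num_validators)
        .map(|v| match means.get(&v) {
            Some(&m) if max_mean > 0.0 && m > 0.0 => {
                // Faster validators get a larger boost, capped at the ceiling.
                let boost = (max_mean / m).powf(multiplier).min(MAX_LATENCY_RATIO);
                active_weight * boost
            }
            // Too few observations or degenerate means: keep the base weight.
            _ => active_weight,
        })
        .collect()
}
```

Note that under this formula the slowest validator has a ratio of 1 and therefore keeps exactly the base weight, while everyone faster is scaled up.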

On-chain gating (types/on_chain_config/consensus_config.rs):
- New ProposerAndVoterV3 variant of LeaderReputationType carrying
  ProposerAndVoterConfigV3 { base, use_latency_weighted,
  latency_weight_multiplier_milli } so all validators deterministically
  agree on whether to enable latency weighting and with what exponent
  (BCS-friendly integer milli-units; 1000 = 1.0x). Without on-chain
  gating, partial rollout would fork the chain.
- Version-agnostic proposer_and_voter_params() accessor returns the
  base config plus the latency-weighted toggle for V1/V2 (toggle=false)
  and V3.

Wiring:
- consensus/epoch_manager.rs: read base config + toggle through
  proposer_and_voter_params(); decode multiplier-milli to f64 at
  construction.
- consensus/dag/bootstrap.rs: handle V3 by using its base config (DAG
  anchor election does not yet wire LatencyWeighted; TODO marker added).
- testsuite/smoke-test/{state_sync,aptos_cli/validator}.rs: cover V3 in
  match arms.
- testsuite/forge-cli/realistic_environment.rs: use V3 in genesis with
  use_latency_weighted=true, latency_weight_multiplier_milli=1000, and
  proposer_window_num_validators_multiplier=50 (bumped from 10 so the
  heuristic gets ~350 blocks of history -- enough samples for the mean
  to stabilize on a 10%-failure validator).

Tests: adds 7 unit tests for the heuristic (50/50 split, failure
attribution, multi-failure split, per-validator fallback, empty
history, ratio clamp, cross-epoch skip). All 18 leader-reputation lib
tests pass.

Prototype/experiment code -- not for merge to main; the canonical merge
PR will need governance migration of the on-chain config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@danielxiangzl danielxiangzl force-pushed the daniel/latency-weighted-leader-by-skip branch from 14fcae1 to 5a7416a Compare April 28, 2026 19:18
danielxiangzl and others added 4 commits April 28, 2026 13:23
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…=5%, window=100x

Makes #19566 vs #19341 a clean isolation of the latency heuristic's contribution:
both branches now share the same binary classifier config, differing only in
use_latency_weighted and latency_weight_multiplier_milli.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@danielxiangzl danielxiangzl changed the title [consensus] Latency-weighted leader reputation gated by on-chain V3 config [consensus] Latency-weighted leader reputation + tighter classifier (combined) Apr 28, 2026
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
danielxiangzl and others added 3 commits April 28, 2026 17:56
…alty + carry-forward

Root cause of 3.0× instability (analyzed via forge runs at multipliers 2.0× / 3.0×):

The previous formula was `weight = active_weight * (max_mean / val_mean)^multiplier`.
This BOOSTED fast validators rather than PENALIZING slow ones — the slowest validator
(V6 in our test) received the BASE active_weight, not the lowest weight. The
exponentiation amplified small variations among healthy validators: a transient
20ms blip on a fast validator cliff-dropped its weight, redistributing load and
triggering cascading instability. At multiplier=2.0× the system was barely stable;
at 3.0× p99 exploded from ~1s to 8-41s with multi-minute oscillation cycles.

Compounding this, the MIN_OBSERVATIONS=2 fallback created a step-function:
when V6 was suppressed enough to drop below 2 observations in the window, V6 fell
back to base active_weight, became selectable, failed, accumulated observations,
got re-suppressed — a textbook oscillation.

This commit redesigns the heuristic with two changes:

1. **Median-reference asymmetric penalty.** Use the median of observed per-validator
   means as the reference. Validators at or below median: factor = 1.0 (no change).
   Validators above median: factor = 1 / (val_mean / median)^multiplier, clamped at
   MAX_LATENCY_RATIO. The slowest validator now gets the LARGEST penalty, healthy
   validators are not destabilized by small noise, and higher multipliers no longer
   amplify variance among the good band.

2. **Carry-forward state for unobserved validators.** A `Mutex<HashMap<Author,
   f64>>` tracks the last computed weight factor per author. When a validator has
   too few fresh observations (because it was suppressed enough to drop out of
   selection), we apply the previously-computed factor instead of falling back to
   base active_weight. Newly-rotated-in validators (no prior factor) still default
   to 1.0 → base active_weight. This breaks the suppress→starve→reset oscillation.
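
The two changes can be sketched together as below. This is an illustration under assumptions, not the commit's code: validators are identified by index rather than `Author`, the upper median is used for even counts, the clamp is applied as a floor of 1 / MAX_LATENCY_RATIO on the final factor, and `LatencyPenalty` is a hypothetical name.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

pub const MIN_OBSERVATIONS: usize = 2;
pub const MAX_LATENCY_RATIO: f64 = 10.0;

pub struct LatencyPenalty {
    multiplier: f64,
    /// Last factor computed per validator (the carry-forward state).
    last_factor: Mutex<HashMap<usize, f64>>,
}

impl LatencyPenalty {
    pub fn new(multiplier: f64) -> Self {
        Self { multiplier, last_factor: Mutex::new(HashMap::new()) }
    }

    pub fn weights(
        &self,
        num_validators: usize,
        active_weight: f64,
        observations: &HashMap<usize, Vec<f64>>,
    ) -> Vec<f64> {
        let means: HashMap<usize, f64> = observations
            .iter()
            .filter(|(_, xs)| xs.len() >= MIN_OBSERVATIONS)
            .map(|(&v, xs)| (v, xs.iter().sum::<f64>() / xs.len() as f64))
            .collect();
        // Reference point: median of the per-validator means (upper median here).
        let mut sorted: Vec<f64> = means.values().cloned().collect();
        sorted.sort_by(|a, b| a.total_cmp(b));
        let median = sorted.get(sorted.len() / 2).copied().unwrap_or(0.0);

        let mut last = self.last_factor.lock().unwrap();
        (0..num_validators)
            .map(|v| {
                let factor = match means.get(&v) {
                    Some(&m) if median > 0.0 => {
                        // At or below median: untouched. Above: penalized, with a floor.
                        let f = if m <= median {
                            1.0
                        } else {
                            (1.0 / (m / median).powf(self.multiplier))
                                .max(1.0 / MAX_LATENCY_RATIO)
                        };
                        last.insert(v, f);
                        f
                    }
                    // Too few fresh observations: reuse the previous factor,
                    // or 1.0 for a validator that has never been scored.
                    _ => last.get(&v).copied().unwrap_or(1.0),
                };
                active_weight * factor
            })
            .collect()
    }
}
```

With a multiplier of 2.0 and a validator at twice the median, the factor is 1 / 2² = 0.25; if that validator then drops below MIN_OBSERVATIONS on the next call, it keeps 0.25 rather than snapping back to 1.0.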

Tests:
- All 7 existing latency-weighted tests updated to match new formula.
- New `test_latency_weighted_carry_forward_for_unobserved_validator`: verifies
  V1's penalty is preserved across calls when V1 has too few fresh observations.
- `test_latency_weighted_max_ratio_clamp` updated to test the penalty floor (V1
  weight = active_weight / 10) rather than the boost ceiling (gone).

Forge config restored to multiplier=2.0× to validate the redesigned heuristic at
the previously-known-stable setting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

If suppressing V6 too aggressively shifts the structural cut-off onto V5
(geographic asymmetry hypothesis), a milder multiplier should keep V6
reasonably suppressed without making V5 the new bottleneck. This is combined
with #19341's strict classifier (failed_weight=0, threshold=5%), which still
hard-bans V6 from leadership entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

✅ Forge suite realistic_env_max_load success on 371d6a0f46b9138bcdc4140bad6589c071f005fa

two traffics test: inner traffic : committed: 4000.00 txn/s, latency: 353.84 ms, (p50: 300 ms, p70: 300, p90: 500 ms, p99: 1200 ms), latency samples: 55280
two traffics test : committed: 99.99 txn/s, latency: 257.76 ms, (p50: 200 ms, p70: 300, p90: 300 ms, p99: 600 ms), latency samples: 3200
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 0.141, avg: 0.109", "ConsensusProposalToOrdered: max: 0.086, avg: 0.078", "ConsensusOrderedToCommit: max: 0.015, avg: 0.014", "ConsensusProposalToCommit: max: 0.099, avg: 0.092"]
Max non-epoch-change gap was: 1 rounds at version 7678 (avg 0.00) [limit 4], 1.13s no progress at version 427905 (avg 0.04s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.00s no progress at version 0 (avg 0.00s) [limit 16].
Test Ok
