fix(consensus): pipeline / block-store fixes from Coacker audit triage#723
Open
keanji-x wants to merge 4 commits into
Open
fix(consensus): pipeline / block-store fixes from Coacker audit triage#723keanji-x wants to merge 4 commits into
keanji-x wants to merge 4 commits into
Conversation
In generate_executed_item_from_ordered, the new BlockInfo was constructed with block.timestamp_usecs() (the raw block timestamp) while every other field came from commit_info. At epoch boundaries, commit_info has already been adjusted via change_timestamp() to the epoch_end_timestamp; mirroring the block's original timestamp here caused the later try_advance_to_aggregated_with_ledger_info assert to panic on timestamp divergence when CommitDecision arrived for a reconfiguration suffix. Use commit_info.timestamp_usecs() to stay consistent with the rest of the BlockInfo construction (matching the equivalent reconstruction in the "voting power sufficient" path above). Refs: Galxe/gravity-audit#53 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The persisting_phase result arm previously logged the error and continued, leaving the pipeline silently stalled — committed blocks had already been popped from the buffer with no way to retry without a full reset, so the node would appear alive but stop making progress. Use std::process::abort() to terminate the node so the supervisor restarts it. We deliberately use abort() rather than panic!() because buffer_manager is launched via tokio::spawn (execution_client.rs:295), and a panic in a spawned task is caught by the runtime — the task exits but the process keeps running, reproducing the same silent stall. abort() bypasses unwinding and SIGABRTs the process, which any supervisor sees. The full in-process pipeline reset (so we can recover without crashing) remains a TODO; aborting is a stopgap until that exists. Crashing is strictly better than silent stall: noticeable, recoverable via restart, and preserves the invariant "buffer_manager is alive => pipeline advances". Refs: Galxe/gravity-audit#54 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When init_block_number encountered an ordered block whose parent was reachable from neither the in-memory tree nor persistent storage, it panicked. This is reachable from externally-sourced ordered blocks (e.g. malformed sync data on startup) — a recoverable error, not an invariant violation. Convert to anyhow::bail! and propagate from the single caller (send_for_execution, which already returns Result). The error message includes the parent id, epoch and block id to aid post-mortem. The other three panic points in this file flagged by the audit (line 518 block_number mismatch, line 731 hash mismatch, line 896 Arc::try_unwrap) are kept as-is — those are genuine invariant violations where panicking is the correct behavior. Refs: Galxe/gravity-audit#517 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The early is_epoch_change() check in get_ordered_blocks read buffer_state outside the mutex. If buffer_state flipped to EpochChange between the unlocked check and the subsequent mutex acquire, the in-lock recheck (at the start of the loop) would catch it and return the same error — so the pre-check was redundant and only widened the TOCTOU window for analysis. Removing it doesn't change behavior (caller still hits the in-lock check before any state read) and reduces the surface area of audit-flagged synchronization concerns around the dual AtomicU8 + Mutex design. Refs: Galxe/gravity-audit#522 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b08c94e to
151b8a8
Compare
keanji-x
commented
May 20, 2026
| if self.is_epoch_change() { | ||
| return Err(anyhow::anyhow!("Buffer is in epoch change")); | ||
| } | ||
| // Note: the previous early `is_epoch_change()` check here was redundant |
| // This callback is invoked synchronously with and could be used for multiple batches of | ||
| // blocks. | ||
| self.init_block_number(&blocks_to_commit); | ||
| self.init_block_number(&blocks_to_commit)?; |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes Galxe/gravity-audit#53
Closes Galxe/gravity-audit#54
Closes Galxe/gravity-audit#517
Closes Galxe/gravity-audit#522
Summary
Four targeted fixes from triaging the Coacker audit backlog in
Galxe/gravity-audit. Scope is intentionally narrow — only the pipeline / block-store findings whose fix didn't require a design discussion. The remaining audit findings (consensus protocol layer, leader-election randomness, upstream backports of aptos-labs PR #14637 / #14644) are tracked separately.Fixes
1fd21afbuffer_item.rs: usecommit_info.timestamp_usecs()instead ofblock.timestamp_usecs()ingenerate_executed_item_from_orderedso the BlockInfo stays consistent across epoch boundaries (wherechange_timestamp()has been applied). Without this, the suffix-block path can deterministically panic at the later commit-proof assert.7559cc1buffer_manager.rs: persisting-phase error arm previously only logged and continued, letting the pipeline silently stall after committed blocks had been popped. Convert to an explicit panic so the supervisor restarts. (Full in-process reset is still a TODO, but loud crash > silent stall.)f059b24block_store.rs: convert the "cannot find parent_block id" panic ininit_block_numbertoanyhow::bail!and propagate fromsend_for_execution. The other three panic points in this file flagged by the audit (line 518 / 731 / 896) are intentionally kept — those are real invariant violations where panicking is correct.b08c94eblock_buffer_manager.rs: remove the redundant unlockedis_epoch_change()early check inget_ordered_blocks. The in-lock recheck at the start of the wait loop already handles the same condition, so the pre-check only widened the TOCTOU window without changing semantics.Verification
RUSTFLAGS="--cfg tokio_unstable" cargo check --manifest-path aptos-core/consensus/Cargo.toml✓RUSTFLAGS="--cfg tokio_unstable" cargo check --manifest-path crates/block-buffer-manager/Cargo.toml✓Out of scope (tracked separately)
IncrementalProofState::take→aggregate_and_verifyreturning Result)initiate_new_epochshutdown-before-sync ordering (upstream has the same bug; needs design)leader_reputationroot_hash = HashValue::zero(); needs design discussion on randomness source (VRF vs. Reth-derived hash)Test plan
cargo testsentinelbinary for log file monitoring #522 fix doesn't regress epoch-change response latency (semantically same; only one fewer unlocked atomic read on the fast path)🤖 Generated with Claude Code