Skip to content

fix(consensus): pipeline / block-store fixes from Coacker audit triage#723

Open
keanji-x wants to merge 4 commits into
Galxe:mainfrom
keanji-x:kj/fix-audit-pipeline-bugs
Open

fix(consensus): pipeline / block-store fixes from Coacker audit triage#723
keanji-x wants to merge 4 commits into
Galxe:mainfrom
keanji-x:kj/fix-audit-pipeline-bugs

Conversation

@keanji-x
Copy link
Copy Markdown
Contributor

@keanji-x keanji-x commented May 19, 2026

Closes Galxe/gravity-audit#53
Closes Galxe/gravity-audit#54
Closes Galxe/gravity-audit#517
Closes Galxe/gravity-audit#522

Summary

Four targeted fixes from triaging the Coacker audit backlog in Galxe/gravity-audit. Scope is intentionally narrow — only the pipeline / block-store findings whose fix didn't require a design discussion. The remaining audit findings (consensus protocol layer, leader-election randomness, upstream backports of aptos-labs PR #14637 / #14644) are tracked separately.

Note on #517: this PR fixes only the externally-reachable panic (line 515 — missing parent block). The other three panic points cited in #517 (lines 518 / 731 / 896) are internal invariants where halting is the correct behavior; analysis posted as a comment on the issue. The Closes #517 keyword applies because the issue is fully resolved by code-fix + analysis-comment combined.

Fixes

Audit issue Commit Severity Change
Galxe/gravity-audit#53 1fd21af Critical buffer_item.rs: use commit_info.timestamp_usecs() instead of block.timestamp_usecs() in generate_executed_item_from_ordered so the BlockInfo stays consistent across epoch boundaries (where change_timestamp() has been applied). Without this, the suffix-block path can deterministically panic at the later commit-proof assert.
Galxe/gravity-audit#54 7559cc1 Critical buffer_manager.rs: persisting-phase error arm previously only logged and continued, letting the pipeline silently stall after committed blocks had been popped. Convert to an explicit panic so the supervisor restarts. (Full in-process reset is still a TODO, but loud crash > silent stall.)
Galxe/gravity-audit#517 f059b24 Critical block_store.rs: convert the "cannot find parent_block id" panic in init_block_number to anyhow::bail! and propagate from send_for_execution. The other three panic points in this file flagged by the audit (line 518 / 731 / 896) are intentionally kept — those are real invariant violations where panicking is correct.
Galxe/gravity-audit#522 b08c94e High block_buffer_manager.rs: remove the redundant unlocked is_epoch_change() early check in get_ordered_blocks. The in-lock recheck at the start of the wait loop already handles the same condition, so the pre-check only widened the TOCTOU window without changing semantics.

Verification

  • RUSTFLAGS="--cfg tokio_unstable" cargo check --manifest-path aptos-core/consensus/Cargo.toml
  • RUSTFLAGS="--cfg tokio_unstable" cargo check --manifest-path crates/block-buffer-manager/Cargo.toml

Out of scope (tracked separately)

Test plan

🤖 Generated with Claude Code

keanji-x and others added 4 commits May 19, 2026 14:53
In generate_executed_item_from_ordered, the new BlockInfo was constructed
with block.timestamp_usecs() (the raw block timestamp) while every other
field came from commit_info. At epoch boundaries, commit_info has already
been adjusted via change_timestamp() to the epoch_end_timestamp; mirroring
the block's original timestamp here caused the later
try_advance_to_aggregated_with_ledger_info assert to panic on timestamp
divergence when CommitDecision arrived for a reconfiguration suffix.

Use commit_info.timestamp_usecs() to stay consistent with the rest of the
BlockInfo construction (matching the equivalent reconstruction in the
"voting power sufficient" path above).

Refs: Galxe/gravity-audit#53

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The persisting_phase result arm previously logged the error and continued,
leaving the pipeline silently stalled — committed blocks had already been
popped from the buffer with no way to retry without a full reset, so the
node would appear alive but stop making progress.

Use std::process::abort() to terminate the node so the supervisor
restarts it. We deliberately use abort() rather than panic!() because
buffer_manager is launched via tokio::spawn (execution_client.rs:295),
and a panic in a spawned task is caught by the runtime — the task exits
but the process keeps running, reproducing the same silent stall. abort()
bypasses unwinding and SIGABRTs the process, which any supervisor sees.

The full in-process pipeline reset (so we can recover without crashing)
remains a TODO; aborting is a stopgap until that exists. Crashing is
strictly better than silent stall: noticeable, recoverable via restart,
and preserves the invariant "buffer_manager is alive => pipeline advances".

Refs: Galxe/gravity-audit#54

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When init_block_number encountered an ordered block whose parent was
reachable from neither the in-memory tree nor persistent storage, it
panicked. This is reachable from externally-sourced ordered blocks (e.g.
malformed sync data on startup) — a recoverable error, not an invariant
violation.

Convert to anyhow::bail! and propagate from the single caller
(send_for_execution, which already returns Result). The error message
includes the parent id, epoch and block id to aid post-mortem.

The other three panic points in this file flagged by the audit (line 518
block_number mismatch, line 731 hash mismatch, line 896 Arc::try_unwrap)
are kept as-is — those are genuine invariant violations where panicking
is the correct behavior.

Refs: Galxe/gravity-audit#517

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The early is_epoch_change() check in get_ordered_blocks read buffer_state
outside the mutex. If buffer_state flipped to EpochChange between the
unlocked check and the subsequent mutex acquire, the in-lock recheck (at
the start of the loop) would catch it and return the same error — so the
pre-check was redundant and only widened the TOCTOU window for analysis.

Removing it doesn't change behavior (caller still hits the in-lock check
before any state read) and reduces the surface area of audit-flagged
synchronization concerns around the dual AtomicU8 + Mutex design.

Refs: Galxe/gravity-audit#522

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@keanji-x keanji-x force-pushed the kj/fix-audit-pipeline-bugs branch from b08c94e to 151b8a8 Compare May 19, 2026 08:50
if self.is_epoch_change() {
return Err(anyhow::anyhow!("Buffer is in epoch change"));
}
// Note: the previous early `is_epoch_change()` check here was redundant
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why

// This callback is invoked synchronously with and could be used for multiple batches of
// blocks.
self.init_block_number(&blocks_to_commit);
self.init_block_number(&blocks_to_commit)?;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant