Skip to content

Fill gaps to prevent missing payload or VID common#4434

Draft
bfish713 wants to merge 6 commits into
mainfrom
bf/data
Draft

Fill gaps to prevent missing payload or VID common#4434
bfish713 wants to merge 6 commits into
mainfrom
bf/data

Conversation

@bfish713
Copy link
Copy Markdown
Contributor

@bfish713 bfish713 commented Jun 5, 2026

Closes #<ISSUE_NUMBER>

This PR:

This PR does not:

Key places to review:

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

@bfish713 bfish713 marked this pull request as ready for review June 5, 2026 16:47
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Jun 5, 2026

Claude finished @bfish713's task in 4m 3s —— View job


Code Review: Fill gaps to prevent missing payload or VID common

  • Read repository guidelines (CLAUDE.md)
  • Gather context and read all changed files
  • Review code quality, correctness, and error handling
  • Review security considerations
  • Review performance implications
  • Review test coverage
  • Post detailed review feedback

Summary

This PR implements a multi-layered recovery strategy for block payloads and VID data in the decide pipeline. The approach is well-architected: in-memory data → grace deferral → peer recovery → late back-fill → query service fetching. Each layer catches what the previous one misses, and the whole system degrades gracefully.

Overall assessment: Approve with minor suggestions. The architecture is sound, the changes are consistent across both persistence backends (fs/sql), and the test coverage is thorough.


Architecture & Design

Strengths:

  • The DecideEventData struct cleanly captures in-memory payload/VID/cert2 data by view, eliminating the need for immediate disk reads after decide
  • DecideDataDeferral with grace periods is a pragmatic solution for the async storage write race: rather than adding complex synchronization, it simply waits a configurable duration before giving up on missing data
  • The RECONSTRUCT_KEEP_HORIZON in the VID reconstructor prevents a subtle multi-leaf-decide bug where GC would abort reconstructions for views that were just decided
  • Bounded retries in storage.rs (replacing infinite loops) with MAX_APPEND_ATTEMPTS=100 and GC_ABORT_HORIZON=100 is a significant reliability improvement

DA/VID retention change: Payloads and VID shares are no longer deleted at decide time — they're only pruned by the retention-based garbage collector. This is consistent across both fs and sql backends and enables peers to recover payloads from each other.


Findings

1. Height gap log level may be too aggressive (sql.rs:991)

Height gaps logged at error! level could be noisy in production during catchup or after restarts. Consider warn!error typically implies immediate operator action is needed. [Inline comment posted]

2. Recovery loop re-queries full chain (sql.rs:1120-1133)

When cut == 0 and recovery produces results, the continue re-reads the entire leaf chain from the DB (including vid_share2 and da_proposal2 fetch_all queries). This is correct behavior (recovered data needs to be visible), but could be expensive for large chains. A brief comment documenting this design choice would help future readers. [Inline comment posted]

3. Recovery attempt counting includes setup failures (payload_recovery.rs:108-118)

If stake_table_for_epoch fails (e.g. epoch not yet in cache), one of the MAX_PAYLOAD_RECOVERY_ATTEMPTS (3) gets burned on a transient lookup failure rather than a genuine peer-recovery failure. Consider only incrementing the counter on actual network attempts. Not a blocker — 3 attempts with 30s intervals gives sufficient time. [Inline comment posted]

4. std::sync::Mutex in async context — correctly used

DecideDataDeferral uses std::sync::Mutex rather than tokio::sync::Mutex. This is the right choice here: all lock acquisitions are extremely short-lived (insert/remove from a BTreeMap, no .await while holding the lock), and std::Mutex has lower overhead.

5. Asymmetric Arc wrapping between backends

In sql.rs, missing_decide_data is Arc<DecideDataDeferral> (because Persistence derives Clone), while in fs.rs it's a plain DecideDataDeferral inside Arc<RwLock<Inner>>. Both are correct for their respective ownership models — just noting the difference for future maintainers. [Inline comment posted]


Security

No security concerns. The payload recovery path correctly verifies recovered payloads by recomputing the VID commitment against the epoch's stake table parameters before accepting them (payload_recovery.rs). The bounded retry loops prevent resource exhaustion from persistently failing storage writes.


Performance

  • In-memory fast path: When DecideEventData covers all views in the decide batch, the sql backend skips VID/DA database queries entirely — good optimization for the common case
  • Grace deferral: The configurable ESPRESSO_NODE_DECIDE_PAYLOAD_GRACE (default 10s) avoids unnecessary peer recovery attempts for data that's simply still being written to disk
  • Batched recovery: PAYLOAD_RECOVERY_BATCH=3 limits concurrent recovery requests, preventing network flooding
  • Recovery horizon: PAYLOAD_RECOVERY_HORIZON=130000 (~3 days of views) bounds how far back recovery will attempt, preventing unbounded historical scanning

Test Coverage

Test coverage is thorough:

  • 4 unit tests for DecideDataDeferral (deferral/readiness timing, attempt tracking, batch limiting, horizon enforcement)
  • 4 integration tests for the decide pipeline (test_decide_from_memory, test_decide_from_memory_partial_coverage, test_decide_payload_deferral, test_decide_payload_recovery_after_grace)
  • 1 new VID test (test_gc_keeps_recent_reconstructions) verifying the keep-horizon behavior
  • Existing test assertions correctly updated from .is_none() to .is_some() for DA/VID after decide (reflecting the retention change)
  • Test helpers (mock_chain, mock_chain_with_txns, live_decide_data) are well-structured and reusable

Reviewed with Claude Code

@@ -956,34 +991,51 @@
if let Some(parent) = parent
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This was upgraded from debug to error — is this intentional? A height gap could be a transient state during catchup or after a restart, and error level is quite noisy in production logs. warn might be a better fit; error usually implies "operator action needed now".

Comment on lines +1120 to +1133
.missing_decide_data
.recovery_viable(leaf.view_number().u64())
};
let now = Instant::now();
let cut = leaves
.iter()
.position(|(leaf, _)| {
!data_complete(leaf)
&& (self.missing_decide_data.should_defer(
leaf.view_number().u64(),
self.payload_grace,
now,
) || recovery_viable(leaf))
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When cut == 0 and recovery produces results, this continues back to the top of the outer loop, which re-reads the entire leaf chain from the DB (including the fetch_all queries for vid_share2 and da_proposal2). This is correct but could be expensive if the chain is large. A brief comment noting this design choice (re-query after recovery so the recovered data is visible) would help future readers.

Comment on lines 759 to 766
db: SqlStorage,
gc_opt: ConsensusPruningOptions,
/// Grace period to wait for missing payload/VID data before emitting a decide event
/// without it.
payload_grace: Duration,
/// Tracks views with missing payload/VID data at decide time, for the grace period.
missing_decide_data: Arc<DecideDataDeferral>,
/// A reference to the internal metrics
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SQL Persistence derives Clone, and missing_decide_data is behind an Arc. But it also derives Debug, and DecideDataDeferral contains Mutex<BTreeMap<..>> — if the mutex is poisoned, .fmt() on a debug print could panic. This is a pre-existing pattern throughout the codebase so not a blocker, just flagging it.

More substantively: since missing_decide_data is Arc-shared across clones, the deferral bookkeeping is correctly shared across all handles, which is the right behavior for the watch-channel-driven decide processor.

Comment on lines +108 to +118
let epoch = leaf.epoch(self.epoch_height);
let total_weight = vid_total_weight::<SeqTypes, _>(
self.membership
.stake_table_for_epoch(epoch)
.map_err(|err| {
anyhow::anyhow!("failed to get stake table for epoch {epoch:?}: {err:#}")
})?
.stake_table(),
epoch,
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If stake_table_for_epoch fails (e.g. the epoch is not yet in the membership coordinator's cache), this propagates the error and the recovery attempt is counted as failed. Since the decide processor retries, this is fine in practice, but it means one of the MAX_PAYLOAD_RECOVERY_ATTEMPTS (3) gets "burned" on a potentially transient epoch-lookup failure rather than a genuine peer-recovery failure.

Consider whether the attempt counter should only be incremented on actual network attempts (timeout, bad response), not on setup failures. Not a blocker — 3 attempts with 30s retry intervals gives plenty of time for the epoch to arrive.

@bfish713 bfish713 marked this pull request as draft June 5, 2026 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant