Fill gaps to prevent missing payload or VID common by bfish713 · Pull Request #4434 · EspressoSystems/espresso-network

bfish713 · 2026-06-05T01:44:24Z

Closes #<ISSUE_NUMBER>

This PR:

This PR does not:

Key places to review:

gemini-code-assist · 2026-06-05T01:44:27Z

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

claude · 2026-06-05T16:47:38Z

Claude finished @bfish713's task in 4m 3s —— View job

Code Review: Fill gaps to prevent missing payload or VID common

Read repository guidelines (CLAUDE.md)
Gather context and read all changed files
Review code quality, correctness, and error handling
Review security considerations
Review performance implications
Review test coverage
Post detailed review feedback

Summary

This PR implements a multi-layered recovery strategy for block payloads and VID data in the decide pipeline. The approach is well-architected: in-memory data → grace deferral → peer recovery → late back-fill → query service fetching. Each layer catches what the previous one misses, and the whole system degrades gracefully.

Overall assessment: Approve with minor suggestions. The architecture is sound, the changes are consistent across both persistence backends (fs/sql), and the test coverage is thorough.

Architecture & Design

Strengths:

The DecideEventData struct cleanly captures in-memory payload/VID/cert2 data by view, eliminating the need for immediate disk reads after decide
DecideDataDeferral with grace periods is a pragmatic solution for the async storage write race: rather than adding complex synchronization, it simply waits a configurable duration before giving up on missing data
The RECONSTRUCT_KEEP_HORIZON in the VID reconstructor prevents a subtle multi-leaf-decide bug where GC would abort reconstructions for views that were just decided
Bounded retries in storage.rs (replacing infinite loops) with MAX_APPEND_ATTEMPTS=100 and GC_ABORT_HORIZON=100 is a significant reliability improvement

DA/VID retention change: Payloads and VID shares are no longer deleted at decide time — they're only pruned by the retention-based garbage collector. This is consistent across both fs and sql backends and enables peers to recover payloads from each other.

Findings

1. Height gap log level may be too aggressive (sql.rs:991)

Height gaps logged at error! level could be noisy in production during catchup or after restarts. Consider warn! — error typically implies immediate operator action is needed. [Inline comment posted]

2. Recovery loop re-queries full chain (sql.rs:1120-1133)

When cut == 0 and recovery produces results, the continue re-reads the entire leaf chain from the DB (including vid_share2 and da_proposal2 fetch_all queries). This is correct behavior (recovered data needs to be visible), but could be expensive for large chains. A brief comment documenting this design choice would help future readers. [Inline comment posted]

3. Recovery attempt counting includes setup failures (payload_recovery.rs:108-118)

If stake_table_for_epoch fails (e.g. epoch not yet in cache), one of the MAX_PAYLOAD_RECOVERY_ATTEMPTS (3) gets burned on a transient lookup failure rather than a genuine peer-recovery failure. Consider only incrementing the counter on actual network attempts. Not a blocker — 3 attempts with 30s intervals gives sufficient time. [Inline comment posted]

4. `std::sync::Mutex` in async context — correctly used

DecideDataDeferral uses std::sync::Mutex rather than tokio::sync::Mutex. This is the right choice here: all lock acquisitions are extremely short-lived (insert/remove from a BTreeMap, no .await while holding the lock), and std::Mutex has lower overhead.

5. Asymmetric `Arc` wrapping between backends

In sql.rs, missing_decide_data is Arc<DecideDataDeferral> (because Persistence derives Clone), while in fs.rs it's a plain DecideDataDeferral inside Arc<RwLock<Inner>>. Both are correct for their respective ownership models — just noting the difference for future maintainers. [Inline comment posted]

Security

No security concerns. The payload recovery path correctly verifies recovered payloads by recomputing the VID commitment against the epoch's stake table parameters before accepting them (payload_recovery.rs). The bounded retry loops prevent resource exhaustion from persistently failing storage writes.

Performance

In-memory fast path: When DecideEventData covers all views in the decide batch, the sql backend skips VID/DA database queries entirely — good optimization for the common case
Grace deferral: The configurable ESPRESSO_NODE_DECIDE_PAYLOAD_GRACE (default 10s) avoids unnecessary peer recovery attempts for data that's simply still being written to disk
Batched recovery: PAYLOAD_RECOVERY_BATCH=3 limits concurrent recovery requests, preventing network flooding
Recovery horizon: PAYLOAD_RECOVERY_HORIZON=130000 (~3 days of views) bounds how far back recovery will attempt, preventing unbounded historical scanning

Test Coverage

Test coverage is thorough:

4 unit tests for DecideDataDeferral (deferral/readiness timing, attempt tracking, batch limiting, horizon enforcement)
4 integration tests for the decide pipeline (test_decide_from_memory, test_decide_from_memory_partial_coverage, test_decide_payload_deferral, test_decide_payload_recovery_after_grace)
1 new VID test (test_gc_keeps_recent_reconstructions) verifying the keep-horizon behavior
Existing test assertions correctly updated from .is_none() to .is_some() for DA/VID after decide (reflecting the retention change)
Test helpers (mock_chain, mock_chain_with_txns, live_decide_data) are well-structured and reusable

Reviewed with Claude Code

claude · 2026-06-05T16:48:58Z

@@ -956,34 +991,51 @@
                if let Some(parent) = parent


nit: This was upgraded from debug to error — is this intentional? A height gap could be a transient state during catchup or after a restart, and error level is quite noisy in production logs. warn might be a better fit; error usually implies "operator action needed now".

claude · 2026-06-05T16:49:01Z

+                        .missing_decide_data
+                        .recovery_viable(leaf.view_number().u64())
+            };
+            let now = Instant::now();
+            let cut = leaves
+                .iter()
+                .position(|(leaf, _)| {
+                    !data_complete(leaf)
+                        && (self.missing_decide_data.should_defer(
+                            leaf.view_number().u64(),
+                            self.payload_grace,
+                            now,
+                        ) || recovery_viable(leaf))
+                })


When cut == 0 and recovery produces results, this continues back to the top of the outer loop, which re-reads the entire leaf chain from the DB (including the fetch_all queries for vid_share2 and da_proposal2). This is correct but could be expensive if the chain is large. A brief comment noting this design choice (re-query after recovery so the recovered data is visible) would help future readers.

claude · 2026-06-05T16:49:07Z

    db: SqlStorage,
    gc_opt: ConsensusPruningOptions,
+    /// Grace period to wait for missing payload/VID data before emitting a decide event
+    /// without it.
+    payload_grace: Duration,
+    /// Tracks views with missing payload/VID data at decide time, for the grace period.
+    missing_decide_data: Arc<DecideDataDeferral>,
    /// A reference to the internal metrics


The SQL Persistence derives Clone, and missing_decide_data is behind an Arc. But it also derives Debug, and DecideDataDeferral contains Mutex<BTreeMap<..>> — if the mutex is poisoned, .fmt() on a debug print could panic. This is a pre-existing pattern throughout the codebase so not a blocker, just flagging it.

More substantively: since missing_decide_data is Arc-shared across clones, the deferral bookkeeping is correctly shared across all handles, which is the right behavior for the watch-channel-driven decide processor.

claude · 2026-06-05T16:49:13Z

+        let epoch = leaf.epoch(self.epoch_height);
+        let total_weight = vid_total_weight::<SeqTypes, _>(
+            self.membership
+                .stake_table_for_epoch(epoch)
+                .map_err(|err| {
+                    anyhow::anyhow!("failed to get stake table for epoch {epoch:?}: {err:#}")
+                })?
+                .stake_table(),
+            epoch,
+        );
+


If stake_table_for_epoch fails (e.g. the epoch is not yet in the membership coordinator's cache), this propagates the error and the recovery attempt is counted as failed. Since the decide processor retries, this is fine in practice, but it means one of the MAX_PAYLOAD_RECOVERY_ATTEMPTS (3) gets "burned" on a potentially transient epoch-lookup failure rather than a genuine peer-recovery failure.

Consider whether the attempt counter should only be incremented on actual network attempts (timeout, bad response), not on setup failures. Not a blocker — 3 attempts with 30s retry intervals gives plenty of time for the epoch to arrive.

Fill gaps to prevent missing payload or VID common

2b810d3

bfish713 added 4 commits June 5, 2026 08:47

use in memory data for decide event generation

e6b846f

Merge branch 'main' into bf/data

3a3de87

Merge branch 'main' into bf/data

bd64244

fix tests

c2bb57f

bfish713 marked this pull request as ready for review June 5, 2026 16:47

bfish713 requested review from imabdulbasit, jbearer, lukaszrzasik, ss-es and sveitser as code owners June 5, 2026 16:47

claude Bot reviewed Jun 5, 2026

View reviewed changes

bfish713 marked this pull request as draft June 5, 2026 18:18

remove grace period

80bd401

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fill gaps to prevent missing payload or VID common#4434

Fill gaps to prevent missing payload or VID common#4434
bfish713 wants to merge 6 commits into
mainfrom
bf/data

bfish713 commented Jun 5, 2026

Uh oh!

gemini-code-assist Bot commented Jun 5, 2026

Uh oh!

claude Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

claude Bot Jun 5, 2026

Uh oh!

claude Bot Jun 5, 2026

Uh oh!

claude Bot Jun 5, 2026

Uh oh!

claude Bot Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bfish713 commented Jun 5, 2026

This PR:

This PR does not:

Key places to review:

Uh oh!

gemini-code-assist Bot commented Jun 5, 2026

Uh oh!

claude Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: Fill gaps to prevent missing payload or VID common

Summary

Architecture & Design

Findings

1. Height gap log level may be too aggressive (sql.rs:991)

2. Recovery loop re-queries full chain (sql.rs:1120-1133)

3. Recovery attempt counting includes setup failures (payload_recovery.rs:108-118)

4. std::sync::Mutex in async context — correctly used

5. Asymmetric Arc wrapping between backends

Security

Performance

Test Coverage

Uh oh!

claude Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Jun 5, 2026 •

edited

Loading

4. `std::sync::Mutex` in async context — correctly used

5. Asymmetric `Arc` wrapping between backends