Conversation
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. |
|
Claude finished @bfish713's task in 4m 3s —— View job Code Review: Fill gaps to prevent missing payload or VID common
SummaryThis PR implements a multi-layered recovery strategy for block payloads and VID data in the decide pipeline. The approach is well-architected: in-memory data → grace deferral → peer recovery → late back-fill → query service fetching. Each layer catches what the previous one misses, and the whole system degrades gracefully. Overall assessment: Approve with minor suggestions. The architecture is sound, the changes are consistent across both persistence backends (fs/sql), and the test coverage is thorough. Architecture & DesignStrengths:
DA/VID retention change: Payloads and VID shares are no longer deleted at decide time — they're only pruned by the retention-based garbage collector. This is consistent across both fs and sql backends and enables peers to recover payloads from each other. Findings1. Height gap log level may be too aggressive (sql.rs:991)Height gaps logged at 2. Recovery loop re-queries full chain (sql.rs:1120-1133)When 3. Recovery attempt counting includes setup failures (payload_recovery.rs:108-118)If 4.
|
| @@ -956,34 +991,51 @@ | |||
| if let Some(parent) = parent | |||
There was a problem hiding this comment.
nit: This was upgraded from debug to error — is this intentional? A height gap could be a transient state during catchup or after a restart, and error level is quite noisy in production logs. warn might be a better fit; error usually implies "operator action needed now".
| .missing_decide_data | ||
| .recovery_viable(leaf.view_number().u64()) | ||
| }; | ||
| let now = Instant::now(); | ||
| let cut = leaves | ||
| .iter() | ||
| .position(|(leaf, _)| { | ||
| !data_complete(leaf) | ||
| && (self.missing_decide_data.should_defer( | ||
| leaf.view_number().u64(), | ||
| self.payload_grace, | ||
| now, | ||
| ) || recovery_viable(leaf)) | ||
| }) |
There was a problem hiding this comment.
When cut == 0 and recovery produces results, this continues back to the top of the outer loop, which re-reads the entire leaf chain from the DB (including the fetch_all queries for vid_share2 and da_proposal2). This is correct but could be expensive if the chain is large. A brief comment noting this design choice (re-query after recovery so the recovered data is visible) would help future readers.
| db: SqlStorage, | ||
| gc_opt: ConsensusPruningOptions, | ||
| /// Grace period to wait for missing payload/VID data before emitting a decide event | ||
| /// without it. | ||
| payload_grace: Duration, | ||
| /// Tracks views with missing payload/VID data at decide time, for the grace period. | ||
| missing_decide_data: Arc<DecideDataDeferral>, | ||
| /// A reference to the internal metrics |
There was a problem hiding this comment.
The SQL Persistence derives Clone, and missing_decide_data is behind an Arc. But it also derives Debug, and DecideDataDeferral contains Mutex<BTreeMap<..>> — if the mutex is poisoned, .fmt() on a debug print could panic. This is a pre-existing pattern throughout the codebase so not a blocker, just flagging it.
More substantively: since missing_decide_data is Arc-shared across clones, the deferral bookkeeping is correctly shared across all handles, which is the right behavior for the watch-channel-driven decide processor.
| let epoch = leaf.epoch(self.epoch_height); | ||
| let total_weight = vid_total_weight::<SeqTypes, _>( | ||
| self.membership | ||
| .stake_table_for_epoch(epoch) | ||
| .map_err(|err| { | ||
| anyhow::anyhow!("failed to get stake table for epoch {epoch:?}: {err:#}") | ||
| })? | ||
| .stake_table(), | ||
| epoch, | ||
| ); | ||
|
|
There was a problem hiding this comment.
If stake_table_for_epoch fails (e.g. the epoch is not yet in the membership coordinator's cache), this propagates the error and the recovery attempt is counted as failed. Since the decide processor retries, this is fine in practice, but it means one of the MAX_PAYLOAD_RECOVERY_ATTEMPTS (3) gets "burned" on a potentially transient epoch-lookup failure rather than a genuine peer-recovery failure.
Consider whether the attempt counter should only be incremented on actual network attempts (timeout, bad response), not on setup failures. Not a blocker — 3 attempts with 30s retry intervals gives plenty of time for the epoch to arrive.
Closes #<ISSUE_NUMBER>
This PR:
This PR does not:
Key places to review: