Skip to content

[new-protocol] garbage collect consensus data for views less than decided view#4415

Open
imabdulbasit wants to merge 9 commits into
mainfrom
ab/gc-below-decided
Open

[new-protocol] garbage collect consensus data for views less than decided view#4415
imabdulbasit wants to merge 9 commits into
mainfrom
ab/gc-below-decided

Conversation

@imabdulbasit
Copy link
Copy Markdown
Contributor

No description provided.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Jun 1, 2026

Claude finished @imabdulbasit's task in 2m 20s —— View job


Code Review: PR #4415

  • Gather context and read changed files
  • Review changes in detail
  • Post review feedback

Summary

Clean simplification that removes the configurable garbage_collection_interval and instead triggers GC on every decide. The config option, its CLI flag (ESPRESSO_NODE_NEW_PROTOCOL_CONSENSUS_GC_INTERVAL), and all plumbing through NetworkParams, Options, PublicNodeConfig, and Consensus::new are fully removed — no orphaned references remain.

Observations

Correctness: The change looks correct. Previously, GC was triggered only when block_number % gc_interval == 0 (searching backward through the decided chain for the first matching block). Now it fires unconditionally on every decide at the decided view. The gc() method uses BTreeMap::split_off(&view) which retains entries >= view, so the decided view's data is preserved — consistent with the "strictly below" comment.

Behavioral change — checkpoint votes on every decide: Before this PR, SendCheckpointVote was emitted only at GC intervals (default every 100 blocks). Now it's emitted on every decide. This means more checkpoint votes are broadcast over the network and accumulated by the checkpoint_collector. If this is intentional (faster checkpoint certificate formation), great — but worth confirming this won't create undue network chatter in production.

Minor cleanup opportunity: Posted an inline comment — gc is now always Some(...), so the Option and if let Some(gc) guard are dead code that could be inlined for clarity.

Testing: The test files are updated correctly (all 100 interval args removed from Consensus::new calls). The existing test coverage should continue to exercise GC behavior since it now fires on every decide rather than every Nth block.

Overall this is a clean, well-scoped removal. LGTM pending the minor nit.

Comment thread crates/hotshot/new-protocol/src/consensus.rs Outdated
@imabdulbasit imabdulbasit force-pushed the ab/gc-below-decided branch from 2c61646 to d4c17ed Compare June 1, 2026 12:37
Copy link
Copy Markdown
Contributor

@bfish713 bfish713 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should completely remove the garbage collection certificate if we are doing this. I think this will trigger voting for a checkpoint every decide, which is an extra cert per view we don't actually need.

@imabdulbasit imabdulbasit requested a review from bfish713 June 2, 2026 09:52
imabdulbasit and others added 2 commits June 2, 2026 14:58
# Conflicts:
#	crates/espresso/node/src/context.rs
#	crates/hotshot/new-protocol/src/coordinator.rs
@imabdulbasit imabdulbasit requested a review from twittner June 2, 2026 13:03
@imabdulbasit imabdulbasit marked this pull request as draft June 2, 2026 21:53
@imabdulbasit imabdulbasit marked this pull request as ready for review June 3, 2026 15:54
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Jun 3, 2026

Claude finished @imabdulbasit's task in 3m 43s —— View job


Code Review: PR #4415

  • Gather context (previous comments, changed files)
  • Review changes in detail
  • Resolve or respond to previous review threads
  • Post review feedback

Summary

This PR addresses @bfish713's feedback from the first round: the checkpoint certificate mechanism (CheckpointData, CheckpointVote, CheckpointCertificate, checkpoint_collector) has been completely removed, and GC is now triggered directly in the coordinator's LeafDecided handler rather than via a quorum-collected checkpoint cert. This is a clean simplification — the indirection through a checkpoint certificate was unnecessary once GC fires on every decide.

What Changed

  1. Checkpoint certificate removal: CheckpointData, CheckpointVote, CheckpointCertificate, ConsensusMessage::Checkpoint, ConsensusOutput::SendCheckpointVote, and the checkpoint_collector are all gone. The garbage_collection_interval config parameter and its CLI flag (ESPRESSO_NODE_NEW_PROTOCOL_CONSENSUS_GC_INTERVAL) are removed from Options, PublicNodeConfig, NetworkParams, and Consensus::new.

  2. GC on every decide: In process_consensus_output for LeafDecided, the coordinator now calls self.gc(gc_view, gc_epoch) directly using the newest decided leaf's view. Previously this required a quorum of checkpoint votes to form a certificate before GC could fire.

  3. External message view workaround: MessageKind::External now returns ViewNumber::new(u64::MAX) instead of ViewNumber::new(1), preventing external messages from being garbage collected by the more frequent GC passes. Acknowledged as a temporary workaround via TODO comment.

  4. Test fix: The test_leader_proposes_with_computed_drb_in_epoch3 test was updated to wait for HeaderCreated for view 28 and assert on the specific view, fixing a race where GC on every decide could clear state before the assertion ran.

Review Findings

Correctness — GC on every decide is sound: The gc() method uses BTreeMap::split_off(&view) which retains entries >= view, so the decided view's data is preserved. Since GC previously required waiting for a checkpoint certificate (quorum round-trip), the system was already designed to tolerate GC at arbitrary decided views. Triggering it on every decide just makes it more prompt.

Behavioral change — more aggressive memory cleanup: GC now fires every view instead of every ~100 blocks. This is good for memory usage but means state is retained for a shorter window. If any component outside the coordinator's gc() sweep relies on accessing data from views older than the last decided view, it would break. The gc() method looks comprehensive (consensus maps, vote collectors, network, storage, VID, blocks, pending fetches, caches).

Posted 2 inline comments:

  • coordinator.rs:580unwrap_or_default() on the GC epoch silently substitutes EpochNumber(0) if the epoch is missing. Worth verifying this branch is unreachable, or adding a warning.
  • message.rs:169-170 — The u64::MAX sentinel is fragile to arithmetic overflow elsewhere, and the existing ViewNumber::new(1) for NotFound/Denied responses will be GC'd almost immediately under the new regime.

Previous review threads: My prior inline comment about removing the Option wrapper on gc is resolved — the entire checkpoint mechanism was removed, making it moot.

Overall this is a well-executed follow-up that addresses the reviewer's request cleanly. The removal is thorough with no orphaned references.

if let Some(newest) = leaves.first() {
let gc_view = newest.view_number();
let gc_epoch = newest.justify_qc().epoch().unwrap_or_default();
self.gc(gc_view, gc_epoch);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unwrap_or_default() silently substitutes EpochNumber(0) when the epoch is missing. Since this controls the GC sweep across vote collectors and the epoch manager (self.epoch_manager.gc(epoch)), an unexpected EpochNumber(0) could either be a no-op or prematurely purge epoch-0 data.

Consider whether this branch can actually fire in practice (e.g. a leaf whose justify_qc has no epoch). If it can't, a warn! + early-return would make the impossibility explicit rather than silently using a sentinel value.

Comment on lines +169 to +170
// TODO: change this once cliquenet has a Slot::MAX for external messages that is not garbage collected
MessageKind::External(_) => ViewNumber::new(u64::MAX),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentinel works for the split_off-based GC (which retains keys >= the GC view), but u64::MAX is brittle — any arithmetic on this view number (view + 1, comparisons that assume a valid range) elsewhere in the stack would silently overflow or behave unexpectedly. As the TODO says, a proper Slot::MAX or an Option<ViewNumber> return would be safer. Worth tracking this.

Also: other MessageKind::Data variants return ViewNumber::new(1) for NotFound/Denied — those will be GC'd immediately once the decided view passes 1. Is that intentional, or should they get a similar "never GC" treatment?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants