[backport release-mainnet-1.2.0-rc] Hash id migration by ss-es · Pull Request #4384 · EspressoSystems/espresso-network

ss-es · 2026-05-25T14:41:57Z

* test(persistence): regression test for hash_bigint backfill FK race - Drive UpdateStateData::insert_merkle_nodes_batch (the real write path) for a FeeMerkleTree proof, move the resulting rows to the legacy hash/fee_merkle_tree tables, then issue a second live write at a new block height - The two writes share almost every hash value, so the live writes win the value-UNIQUE constraint in hash_bigint and BackfillHash silently drops the corresponding legacy rows under ON CONFLICT DO NOTHING - Assert that after BackfillHash + BackfillFeeMerkleTree the legacy table is empty and no fee_merkle_tree_bigint row has a dangling hash_id - Fails today with a FK violation, exercising the race that motivated the hash_id translation fix * fix(persistence): translate hash_ids by value in merkle backfill - Replace the verbatim copy of hash_id / children in the merkle tree backfill with an INSERT ... SELECT that joins legacy hash and hash_bigint on `value`, so the destination row references the hash_bigint id regardless of whether BackfillHash preserved the original id - Translate every element of the children JSONB array the same way via jsonb_array_elements_text + jsonb_agg, preserving order and empty arrays - Add a window-existence probe so the batch loop terminates on empty windows without needing to materialize legacy rows in Rust - Extend the regression test with a children-FK assertion so a future regression in child translation surfaces * make query more readable * fix batch termination condition --------- Co-authored-by: ss-es <155648797+ss-es@users.noreply.github.com>

gemini-code-assist · 2026-05-25T14:42:02Z

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

claude · 2026-05-25T14:42:16Z

Claude finished @ss-es's task in 4m 31s —— View job

Code Review: Hash ID Migration Backport

Read repository guidelines and existing review comments
Analyze PR diff and understand the changes
Review migration SQL files
Review Rust implementation changes
Review test changes
Post detailed review feedback

Summary

This PR backports the hash ID migration from INT to BIGINT for Merkle tree storage tables. The approach is well-designed: an expand-migrate-cleanup strategy that maintains read availability throughout the migration via legacy table fallbacks. The core architecture is sound.

What works well

Translation by value, not ID — The merkle tree backfill joins on hash.value = hash_bigint.value instead of preserving legacy IDs. This cleanly avoids FK races when live writes claim the same hash value with a new auto-increment ID. The regression test (backfill_preserves_fk_when_live_write_shares_hash_value) is excellent and directly covers this subtle race.
Move-not-copy strategy — Deleting legacy rows in the same transaction as the insert keeps storage roughly flat during migration. Good for production databases with large merkle trees.
Read fallback with dual-table lookup — The get_path fallback to legacy tables (with hash lookup falling back from hash_bigint → hash) ensures queries work correctly at every point during the migration window.
Configurable batch delay via ESPRESSO_NODE_BACKFILL_BATCH_DELAY_MS gives operators a knob to tune migration pressure.
Well-structured MigrationRegistry with dependency validation, checkpoint/resume, and good test coverage (unit tests for validation + integration tests against real postgres).

Issues flagged (inline comments posted)

Severity	File	Issue
Medium	`persistence/sql.rs:680`	Dropped `JoinHandle` silently swallows panics from the migration task
Medium	`migration.rs:186-199`	No max retry / circuit breaker — permanently failing migrations loop every 5 min forever
Low	`V1403__deferred_migrations.sql`	`error` column is defined in schema but never written to by the migration runner
Low	`migration.rs:233`	`is_multiple_of()` requires Rust 1.87+ (very recent stabilization)
Info	`queries/state.rs:86-121`	Legacy fallback correctness depends on backfill ordering invariant — worth documenting
Info	`V1302__hash_bigint_expand.sql:7-8`	`DROP TABLE` without `IF EXISTS` for reward tables

Observations (no action needed)

The removal of migrate_reward_merkle_tree_v2 from SequencerPersistence and the associated test cleanup is clean. The reward merkle tree tables (reward_merkle_tree, reward_merkle_tree_v2) are dropped in V1302 since they're unused.
The SQLite migration path (V1302__hash_bigint_rename.sql) correctly does a rename-only operation since SQLite integers are already 64-bit — nice touch.
The POS_V3 test case removal from test_state_reconstruction and test_reward_proof_endpoint is consistent with the reward merkle tree v1 write path being removed from store_state_update.
The new migration-status API endpoint is a helpful operational tool for monitoring backfill progress.

claude · 2026-05-25T14:46:11Z

+        {
+            let registry = super::migrations::hash_bigint_migrations();
+            let db = persistence.db.clone();
+            tokio::spawn(registry.run_all_migrations(db));


Potential issue: Dropped JoinHandle silently swallows panics.

The JoinHandle returned by tokio::spawn is discarded. If the migration task panics (e.g., due to an unexpected unwrap() deep in a query path), the panic is silently lost and the migration will never complete — with no log evidence of why.

Consider capturing the handle and spawning a small monitor, or at minimum logging panics:

Suggested change

tokio::spawn(registry.run_all_migrations(db));

let handle = tokio::spawn(registry.run_all_migrations(db));

tokio::spawn(async move {

if let Err(e) = handle.await {

tracing::error!("deferred migration task failed: {e:#}");

}

});

claude · 2026-05-25T14:46:14Z

+            let mut tx = match db.write().await {
+                Ok(tx) => tx,
+                Err(e) => {
+                    tracing::error!(name, offset, "failed to open write transaction: {e:#}");
+                    return false;
+                },
+            };
+
+            let next = match m.run_batch(&mut tx, offset).await {
+                Ok(next) => next,
+                Err(e) => {
+                    tracing::error!(name, offset, "migration batch failed: {e:#}");
+                    return false;
+                },


No retry limit — permanently failing migrations loop forever.

If a migration consistently fails (e.g., data corruption, schema mismatch), run_all_migrations will retry every 5 minutes indefinitely with no circuit breaker. For a production node, this generates noise in logs with no operator escalation.

Consider adding a max consecutive failure count per migration, after which the error is logged at a higher severity or the migration is marked as failed in the deferred_migrations.error column (which currently exists in the schema but is never written to).

claude · 2026-05-25T14:46:17Z

+                tokio::time::sleep(delay).await;
+            }
+
+            if batch_count.is_multiple_of(m.log_frequency()) {


Nit: usize::is_multiple_of() requires Rust 1.87+.

This was stabilized very recently (May 2025). The repo uses channel = "stable" so it should be fine on up-to-date toolchains, but worth being aware of if CI or any build targets lag behind. A simple batch_count % m.log_frequency() == 0 is equivalent and universally supported.

claude · 2026-05-25T14:46:23Z

+                nodes
+            } else {
+                let legacy_table = state_type.strip_suffix("_bigint").unwrap_or(state_type);
+                let mut query = QueryBuilder::default();
+                query.bind(created)?;
+                let mut sub_queries = Vec::with_capacity(missing.len());
+                for path in missing {
+                    let param = query.bind(path)?;
+                    // Parentheses are required: UNION operands cannot contain ORDER BY/LIMIT
+                    // without them.
+                    sub_queries.push(format!(
+                        "(SELECT path, created, hash_id::BIGINT AS hash_id, children, \
+                         children_bitvec, idx, entry FROM {legacy_table} WHERE path = {param} AND \
+                         created <= $1 ORDER BY created DESC LIMIT 1)"
+                    ));
+                }
+                let sql = format!(
+                    "SELECT * FROM ({}) AS t ORDER BY t.path DESC",
+                    sub_queries.join(" UNION ")
+                );
+                let legacy_rows =
+                    query
+                        .query(&sql)
+                        .fetch_all(self.as_mut())
+                        .await
+                        .map_err(|e| QueryError::Error {
+                            message: format!("merkle path fallback lookup failed: {e}"),
+                        })?;
+                let mut nodes = nodes;
+                nodes.extend(legacy_rows.into_iter().map(Node::from));
+                // Sort leaf-first (longer paths first).
+                nodes.sort_by_key(|n| {
+                    std::cmp::Reverse(n.path.as_array().map(|v| v.len()).unwrap_or(0))
+                });
+                nodes
+            }


Correctness observation: Legacy fallback children contain legacy hash.id values, not hash_bigint.id values.

When reading from the legacy table, children JSONB still holds old INT hash IDs. The hash lookup below correctly falls back from hash_bigint to hash for missing IDs (lines 156-173), so this works.

However, note that after CleanupLegacyHashTable completes and deletes all rows from hash, any legacy merkle rows that somehow still exist would have unresolvable children hash IDs. This is safe only because the merkle backfill deletes legacy rows before the hash cleanup runs. Worth a brief comment noting this invariant, since the correctness of this fallback depends on the backfill ordering.

claude · 2026-05-25T14:46:27Z

+-- Drop reward merkle tree tables — unused and always empty across all deployments.
+DROP TABLE reward_merkle_tree;


Minor: DROP TABLE without IF EXISTS could fail on fresh or partially-migrated databases.

If for any reason a node's database doesn't have these tables (e.g., a schema was manually modified or an earlier migration was partially applied), this migration will fail hard. Using DROP TABLE IF EXISTS would be more defensive. That said, since these tables are created by earlier numbered migrations that must have run before V1302, the practical risk is low.

claude · 2026-05-25T14:46:32Z

+    started_at   TIMESTAMPTZ NOT NULL,
+    completed_at TIMESTAMPTZ,
+    error        TEXT,


The error column is defined here but never written to by the migration runner (MigrationRegistry).

The run_migration method logs errors via tracing::error! but never persists the error message to this column. Either populate it on failure (useful for debugging on nodes without log access) or remove the column to avoid confusion.

If keeping it: writing the error in run_migration when returning false would be a natural place:

// After logging the error, persist it: sqlx::query("UPDATE deferred_migrations SET error = $1 WHERE name = $2") .bind(format!("{e:#}")) .bind(name) .execute(db.pool()) // or open a new transaction ...

This reverts commit f0d0519.

…ration-1.2.0

ss-es added 30 commits May 25, 2026 10:39

add migration + trait

9132801

fix

b0ed7c8

fix

e99ee1c

fix

dca3635

run migrations on node startup

dfbbc86

fix

89ede6d

remove unused reward_merkle_trees

2c2934e

refactor?

5c02df3

fix

63c1e6e

fix

038de43

remove v3 tests

910f7fe

cleanup

2bea604

update

ac322b1

remove more dead tests

945f808

cleanup gating

402ad04

clippy

784cf19

more cleanup

4049af6

reorganize

bbcd29b

use unnest for batching

6d794d7

page backfill

a8f43c8

simplify query fallback

f3db054

cleanup

27e5cad

more simplification

ef8dc5d

add tests

a823f07

fix?

2cda1a3

add 50ms sleep between batches

9d6763c

make delay configurable

5bfb3d7

mitigate storage increase during migration

5ef0bf9

add migration progress endpoint

6a54800

fix

424e78f

ss-es requested review from bfish713, imabdulbasit, jbearer, lukaszrzasik and sveitser as code owners May 25, 2026 14:41

claude Bot reviewed May 25, 2026

View reviewed changes

ss-es and others added 9 commits May 25, 2026 11:42

fix migrations

8647739

update genesis (fix?)

9461a0f

try fixing slow query?

68adf63

fix attempt 2

f0d0519

Revert "fix attempt 2"

2c1f93a

This reverts commit f0d0519.

Merge branch 'release-mainnet-1.2.0-rc' into release-test-hash-id-mig…

dc85dab

…ration-1.2.0

update to LATERAL instead of UNION

6df71ba

Merge branch 'release-mainnet-1.2.0-rc' into release-test-hash-id-mig…

d4974cf

…ration-1.2.0

add retries + sqlite pragmas for light client database

63c74fe

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[backport release-mainnet-1.2.0-rc] Hash id migration#4384

[backport release-mainnet-1.2.0-rc] Hash id migration#4384
ss-es wants to merge 40 commits into
release-mainnet-1.2.0-rcfrom
release-test-hash-id-migration-1.2.0

ss-es commented May 25, 2026

Uh oh!

gemini-code-assist Bot commented May 25, 2026

Uh oh!

claude Bot commented May 25, 2026 •

edited

Loading

Uh oh!

claude Bot May 25, 2026

Uh oh!

claude Bot May 25, 2026

Uh oh!

claude Bot May 25, 2026

Uh oh!

claude Bot May 25, 2026

Uh oh!

claude Bot May 25, 2026

Uh oh!

claude Bot May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-            tokio::spawn(registry.run_all_migrations(db));
+            let handle = tokio::spawn(registry.run_all_migrations(db));
+            tokio::spawn(async move {
+                if let Err(e) = handle.await {
+                    tracing::error!("deferred migration task failed: {e:#}");
+                }
+            });

		-- Drop reward merkle tree tables — unused and always empty across all deployments.
		DROP TABLE reward_merkle_tree;

Conversation

ss-es commented May 25, 2026

Uh oh!

gemini-code-assist Bot commented May 25, 2026

Uh oh!

claude Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: Hash ID Migration Backport

Summary

What works well

Issues flagged (inline comments posted)

Observations (no action needed)

Uh oh!

claude Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

claude Bot commented May 25, 2026 •

edited

Loading