[backport release-mainnet-1.2.0-rc] Hash id migration#4384
Conversation
* test(persistence): regression test for hash_bigint backfill FK race - Drive UpdateStateData::insert_merkle_nodes_batch (the real write path) for a FeeMerkleTree proof, move the resulting rows to the legacy hash/fee_merkle_tree tables, then issue a second live write at a new block height - The two writes share almost every hash value, so the live writes win the value-UNIQUE constraint in hash_bigint and BackfillHash silently drops the corresponding legacy rows under ON CONFLICT DO NOTHING - Assert that after BackfillHash + BackfillFeeMerkleTree the legacy table is empty and no fee_merkle_tree_bigint row has a dangling hash_id - Fails today with a FK violation, exercising the race that motivated the hash_id translation fix * fix(persistence): translate hash_ids by value in merkle backfill - Replace the verbatim copy of hash_id / children in the merkle tree backfill with an INSERT ... SELECT that joins legacy hash and hash_bigint on `value`, so the destination row references the hash_bigint id regardless of whether BackfillHash preserved the original id - Translate every element of the children JSONB array the same way via jsonb_array_elements_text + jsonb_agg, preserving order and empty arrays - Add a window-existence probe so the batch loop terminates on empty windows without needing to materialize legacy rows in Rust - Extend the regression test with a children-FK assertion so a future regression in child translation surfaces * make query more readable * fix batch termination condition --------- Co-authored-by: ss-es <155648797+ss-es@users.noreply.github.com>
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. |
|
Claude finished @ss-es's task in 4m 31s —— View job Code Review: Hash ID Migration Backport
SummaryThis PR backports the hash ID migration from What works well
Issues flagged (inline comments posted)
Observations (no action needed)
|
| { | ||
| let registry = super::migrations::hash_bigint_migrations(); | ||
| let db = persistence.db.clone(); | ||
| tokio::spawn(registry.run_all_migrations(db)); |
There was a problem hiding this comment.
Potential issue: Dropped JoinHandle silently swallows panics.
The JoinHandle returned by tokio::spawn is discarded. If the migration task panics (e.g., due to an unexpected unwrap() deep in a query path), the panic is silently lost and the migration will never complete — with no log evidence of why.
Consider capturing the handle and spawning a small monitor, or at minimum logging panics:
| tokio::spawn(registry.run_all_migrations(db)); | |
| let handle = tokio::spawn(registry.run_all_migrations(db)); | |
| tokio::spawn(async move { | |
| if let Err(e) = handle.await { | |
| tracing::error!("deferred migration task failed: {e:#}"); | |
| } | |
| }); |
| let mut tx = match db.write().await { | ||
| Ok(tx) => tx, | ||
| Err(e) => { | ||
| tracing::error!(name, offset, "failed to open write transaction: {e:#}"); | ||
| return false; | ||
| }, | ||
| }; | ||
|
|
||
| let next = match m.run_batch(&mut tx, offset).await { | ||
| Ok(next) => next, | ||
| Err(e) => { | ||
| tracing::error!(name, offset, "migration batch failed: {e:#}"); | ||
| return false; | ||
| }, |
There was a problem hiding this comment.
No retry limit — permanently failing migrations loop forever.
If a migration consistently fails (e.g., data corruption, schema mismatch), run_all_migrations will retry every 5 minutes indefinitely with no circuit breaker. For a production node, this generates noise in logs with no operator escalation.
Consider adding a max consecutive failure count per migration, after which the error is logged at a higher severity or the migration is marked as failed in the deferred_migrations.error column (which currently exists in the schema but is never written to).
| tokio::time::sleep(delay).await; | ||
| } | ||
|
|
||
| if batch_count.is_multiple_of(m.log_frequency()) { |
There was a problem hiding this comment.
Nit: usize::is_multiple_of() requires Rust 1.87+.
This was stabilized very recently (May 2025). The repo uses channel = "stable" so it should be fine on up-to-date toolchains, but worth being aware of if CI or any build targets lag behind. A simple batch_count % m.log_frequency() == 0 is equivalent and universally supported.
| nodes | ||
| } else { | ||
| let legacy_table = state_type.strip_suffix("_bigint").unwrap_or(state_type); | ||
| let mut query = QueryBuilder::default(); | ||
| query.bind(created)?; | ||
| let mut sub_queries = Vec::with_capacity(missing.len()); | ||
| for path in missing { | ||
| let param = query.bind(path)?; | ||
| // Parentheses are required: UNION operands cannot contain ORDER BY/LIMIT | ||
| // without them. | ||
| sub_queries.push(format!( | ||
| "(SELECT path, created, hash_id::BIGINT AS hash_id, children, \ | ||
| children_bitvec, idx, entry FROM {legacy_table} WHERE path = {param} AND \ | ||
| created <= $1 ORDER BY created DESC LIMIT 1)" | ||
| )); | ||
| } | ||
| let sql = format!( | ||
| "SELECT * FROM ({}) AS t ORDER BY t.path DESC", | ||
| sub_queries.join(" UNION ") | ||
| ); | ||
| let legacy_rows = | ||
| query | ||
| .query(&sql) | ||
| .fetch_all(self.as_mut()) | ||
| .await | ||
| .map_err(|e| QueryError::Error { | ||
| message: format!("merkle path fallback lookup failed: {e}"), | ||
| })?; | ||
| let mut nodes = nodes; | ||
| nodes.extend(legacy_rows.into_iter().map(Node::from)); | ||
| // Sort leaf-first (longer paths first). | ||
| nodes.sort_by_key(|n| { | ||
| std::cmp::Reverse(n.path.as_array().map(|v| v.len()).unwrap_or(0)) | ||
| }); | ||
| nodes | ||
| } |
There was a problem hiding this comment.
Correctness observation: Legacy fallback children contain legacy hash.id values, not hash_bigint.id values.
When reading from the legacy table, children JSONB still holds old INT hash IDs. The hash lookup below correctly falls back from hash_bigint to hash for missing IDs (lines 156-173), so this works.
However, note that after CleanupLegacyHashTable completes and deletes all rows from hash, any legacy merkle rows that somehow still exist would have unresolvable children hash IDs. This is safe only because the merkle backfill deletes legacy rows before the hash cleanup runs. Worth a brief comment noting this invariant, since the correctness of this fallback depends on the backfill ordering.
| -- Drop reward merkle tree tables — unused and always empty across all deployments. | ||
| DROP TABLE reward_merkle_tree; |
There was a problem hiding this comment.
Minor: DROP TABLE without IF EXISTS could fail on fresh or partially-migrated databases.
If for any reason a node's database doesn't have these tables (e.g., a schema was manually modified or an earlier migration was partially applied), this migration will fail hard. Using DROP TABLE IF EXISTS would be more defensive. That said, since these tables are created by earlier numbered migrations that must have run before V1302, the practical risk is low.
| started_at TIMESTAMPTZ NOT NULL, | ||
| completed_at TIMESTAMPTZ, | ||
| error TEXT, |
There was a problem hiding this comment.
The error column is defined here but never written to by the migration runner (MigrationRegistry).
The run_migration method logs errors via tracing::error! but never persists the error message to this column. Either populate it on failure (useful for debugging on nodes without log access) or remove the column to avoid confusion.
If keeping it: writing the error in run_migration when returning false would be a natural place:
// After logging the error, persist it:
sqlx::query("UPDATE deferred_migrations SET error = $1 WHERE name = $2")
.bind(format!("{e:#}"))
.bind(name)
.execute(db.pool()) // or open a new transaction
...This reverts commit f0d0519.
See #4325