Skip to content

bridge: implement additional methods for self_heal#584

Merged
marcello33 merged 4 commits into
developfrom
mardizzone/self-heal_improv
May 19, 2026
Merged

bridge: implement additional methods for self_heal#584
marcello33 merged 4 commits into
developfrom
mardizzone/self-heal_improv

Conversation

@marcello33
Copy link
Copy Markdown
Collaborator

@marcello33 marcello33 commented May 5, 2026

Summary

Extends the bridge self-heal loop to cover the full set of nonce-gated stake events emitted by the L1 StakingInfo contract: StakeUpdate (already covered), plus SignerChange and UnstakeInit (new). The three events share a single per-validator nonce counter, so recovery is unified into one combined GraphQL query that returns the max L1 nonce across all three entities, then fetches and replays the missing event when Heimdall lags.

Also, this PR fixes the buggy bridge related metrics which were not reporting correctly.

Operator note: metric rename

The three self-heal Prometheus metrics have been renamed and moved into the heimdallv2 namespace to match the rest of heimdall's metrics and follow Prometheus naming convention (snake_case + _total suffix). The previous mixed-case names were silently filtered out by Datadog/Grafana queries.

Before After
self_healing_<chain>_StakeUpdate heimdallv2_self_healing_stake_events_processed_total
self_healing_<chain>_StateSynced heimdallv2_self_healing_state_syncs_processed_total
self_healing_<chain>_NewHeaderBlock heimdallv2_self_healing_checkpoint_acks_processed_total

Update dashboards / alerting rules accordingly at deploy time.

Subgraph dependency

The new query path requires signerChanges and unstakeInits entities in the deployed Polygon PoS subgraph, added here

Cross-chain validation hardening

getStakeEventLogByNonce now validates the L1 receipt before trusting the log:

  • receipt.Status == ReceiptStatusSuccessful (rejects reverted txs)
  • log.Address == StakingInfo address from ChainManager params (rejects logs from unexpected contracts)

Rollout

  • pos-ops sub_graph_url is already switched to v0.0.2, so this PR is deployment-ready.

Test plan

  • Deploy + monitor on Amoy first; confirm Self-healing: retrieved nonces for validator lines in logs and no subgraph returned errors.
  • Verify Prometheus metric heimdallv2_self_healing_stake_events_processed_total exposed and incrementing on a synthetic stuck nonce.
  • Mainnet deploy after Amoy soak.

Copilot AI review requested due to automatic review settings May 5, 2026 16:49
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@marcello33
Copy link
Copy Markdown
Collaborator Author

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 5, 2026

Claude finished @marcello33's task in 2m 29s —— View job


Code Review: bridge self_heal improvements

  • Gather PR diff and context
  • Review bridge/listener/rootchain_selfheal.go
  • Review bridge/listener/rootchain_selfheal_graph.go
  • Review bridge/listener/rootchain_selfheal_test.go
  • Review bridge/processor/stake.go and bridge/processor/stake_test.go
  • Review bridge/util/common.go
  • Review documentation changes
  • Post comprehensive review

Summary

Good refactor that generalizes the self-heal loop from handling only StakeUpdate events to all three nonce-gated stake event types (StakeUpdate, SignerChange, UnstakeInit). The combined subgraph query is efficient (single round-trip), the per-cycle bound (maxStakeNoncesPerCycle = 10) prevents runaway work, and the new GraphQL error handling is a welcome hardening. Tests are thorough for the utility/subgraph layer.

I found one logic bug, one misleading log, a documentation inaccuracy, and a few minor items.


Issues

1. Bug: misleading "remaining" log when loop breaks early on failure

rootchain_selfheal.go:227

if remaining := l1MaxNonce - heimdallNonce - queued; remaining > 0 {
    rl.Logger.Warn("Self-healing: stake event backlog exceeds per-cycle limit; ...")
}

The loop (line 217-225) breaks early when replayStakeEvent returns false — which happens on subgraph fetch errors, processEvent errors, or events too recent for SHMaxDepthDuration. In those cases queued will be less than maxStakeNoncesPerCycle, but the log message claims "backlog exceeds per-cycle limit", which is false. The operator would misinterpret a subgraph outage as a nonce backlog problem.

Suggestion: Distinguish the two exit reasons:

hitLimit := queued >= maxStakeNoncesPerCycle
total := l1MaxNonce - heimdallNonce
if hitLimit && total > queued {
    rl.Logger.Warn("Self-healing: stake event backlog exceeds per-cycle limit; will continue next cycle", ...)
}

No log is needed on the failure path — replayStakeEvent already logged the error.

Fix this →


2. Documentation: incorrect event names in x/stake/README.md

x/stake/README.md:975

missed StakeUpdate, SignerUpdate, and ValidatorExit events are replayed automatically

The L1 events are StakeUpdate, SignerChange, and UnstakeInit. "SignerUpdate" and "ValidatorExit" are the Heimdall message names (MsgSignerUpdate, MsgValidatorExit), not the event names. Since this section describes what the self-heal loop queries from the L1 subgraph, using the L1 event names would be more accurate and consistent with the code (signerChanges, unstakeInits).

Fix this →


3. Breaking change: Prometheus metric rename

rootchain_selfheal.go:34

The metric name changed from StakeUpdate to StakeEvent. Any existing Grafana dashboards or alerting rules that reference the old self_healing_<chain>_StakeUpdate metric will silently stop matching. This is fine if intentional, but should be called out in the PR description so operators can update their dashboards at deploy time.


4. Minor: findLogByIndex could deduplicate existing code

The new findLogByIndex helper (rootchain_selfheal_graph.go:369-376) is only used in getStakeEventLogByNonce, but the same inline pattern appears in:

  • processCheckpointAck (rootchain_selfheal.go:156-163)
  • getStateSynced (rootchain_selfheal_graph.go:213-218)

Not required for this PR, but these could be refactored to use the new helper for consistency.


5. Minor: querySubGraph doesn't check HTTP status code

rootchain_selfheal_graph.go:95-105

Pre-existing, but worth noting since this PR adds new callers: querySubGraph reads the body regardless of the HTTP status code. A 500 response body may not be JSON, which would surface as an unmarshal error rather than a clear "subgraph returned HTTP 500". Consider adding a status check:

if response.StatusCode != http.StatusOK {
    body, _ := io.ReadAll(response.Body)
    return nil, fmt.Errorf("self-healing: subgraph returned HTTP %d: %s", response.StatusCode, body)
}

Positive observations

  • Combined GraphQL query (getMaxL1NonceForValidator) fetches all three event types in a single round-trip — efficient and correct given the shared nonce counter.
  • GraphQL error handling added consistently across all subgraph query paths. The joinGraphQLErrors helper and fail-closed semantics (return error even when data is present alongside errors) are the right call, especially for detecting schema drift if the subgraph doesn't yet index signerChanges/unstakeInits.
  • strconv.FormatUint used correctly for uint64 values in the new code, avoiding the strconv.Itoa(int(...)) pattern that could theoretically overflow on 32-bit systems (pre-existing in older code).
  • maxStakeNoncesPerCycle bound with pauseBetweenReplays pacing is well-designed — prevents flooding the bridge processor while still making forward progress.
  • Test coverage is solid for the pure functions (maxNonceFromResponse, pickStakeEventHit, findLogByIndex, joinGraphQLErrors) and the subgraph interaction layer. The httptest-based approach for getMaxL1NonceForValidator and getStakeEventLogByNonce is clean and covers error, empty, and GraphQL-error paths.
  • Centralizing StakeNonceRetryDelay in bridge/util/common.go is the right move — keeps the self-heal pacer and the processor retry delay in sync.

|

@codecov
Copy link
Copy Markdown

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 65.19337% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.0%. Comparing base (fcf72a4) to head (62336c5).

Files with missing lines Patch % Lines
bridge/listener/rootchain_selfheal.go 53.3% 32 Missing and 3 partials ⚠️
bridge/listener/rootchain_selfheal_graph.go 76.4% 23 Missing and 1 partial ⚠️
bridge/processor/stake.go 0.0% 4 Missing ⚠️

❌ Your patch check has failed because the patch coverage (65.1%) is below the target coverage (90.0%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##           develop    #584     +/-   ##
=========================================
+ Coverage     42.1%   43.0%   +0.8%     
=========================================
  Files          172     172             
  Lines        17722   17823    +101     
=========================================
+ Hits          7476    7678    +202     
+ Misses        9167    9044    -123     
- Partials      1079    1101     +22     
Files with missing lines Coverage Δ
bridge/util/common.go 25.1% <ø> (+3.8%) ⬆️
bridge/processor/stake.go 0.9% <0.0%> (ø)
bridge/listener/rootchain_selfheal_graph.go 66.3% <76.4%> (+66.3%) ⬆️
bridge/listener/rootchain_selfheal.go 20.6% <53.3%> (+20.6%) ⬆️

... and 2 files with indirect coverage changes

Files with missing lines Coverage Δ
bridge/util/common.go 25.1% <ø> (+3.8%) ⬆️
bridge/processor/stake.go 0.9% <0.0%> (ø)
bridge/listener/rootchain_selfheal_graph.go 66.3% <76.4%> (+66.3%) ⬆️
bridge/listener/rootchain_selfheal.go 20.6% <53.3%> (+20.6%) ⬆️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread bridge/listener/rootchain_selfheal_test.go
Comment thread bridge/listener/rootchain_selfheal.go Outdated
@claude
Copy link
Copy Markdown

claude Bot commented May 5, 2026

Code Review

Two cross-chain security issues found in the newly added getStakeEventLogByNonce() function in bridge/listener/rootchain_selfheal_graph.go (L336-L349). Both relate to L1 receipt validation rules from .claude/rules/cross-chain.md.


Issue 1: Missing receipt.Status check before trusting event logs

File: bridge/listener/rootchain_selfheal_graph.go L337-L342 (

receipt, err := rl.contractCaller.MainChainClient.TransactionReceipt(ctx, common.HexToHash(hit.TransactionHash))
if err != nil {
return nil, err
}
log := findLogByIndex(receipt.Logs, hit.LogIndex)
)

After fetching the L1 transaction receipt, the code extracts logs without verifying the transaction succeeded. A reverted transaction can still carry logs, and a compromised subgraph endpoint could return a tx hash for a failed transaction.

Rule violated (cross-chain.md, L1 Receipt Validation): Check receipt.Status == 1 (success) before trusting event logs -- reverted txs still emit events in some edge cases

Red Flag (cross-chain.md): Trusting event data without receipt status check

Suggested fix: Add a receipt.Status != 1 check after fetching the receipt, returning an error if the transaction reverted.


Issue 2: Missing log.Address verification against expected StakingInfo contract

File: bridge/listener/rootchain_selfheal_graph.go L342-L348 (

log := findLogByIndex(receipt.Logs, hit.LogIndex)
if log == nil {
return nil, fmt.Errorf("self-healing: no log found for log index %s, validator %d and nonce %d", hit.LogIndex, validatorId, nonce)
}
rl.Logger.Info("Self-healing: retrieved stake event log from Ethereum", "validatorId", validatorId, "nonce", nonce, "txHash", log.TxHash.Hex())
return log, nil
)

The log is selected solely by its numeric index from the receipt. It is never verified that log.Address matches the expected StakingInfo contract address. A single transaction can emit events from multiple contracts, and the subgraph (an external data source) supplies both the tx hash and log index.

The normal (non-self-heal) listener path uses ethereum.FilterQuery with explicit contract addresses which filters at the RPC level. The self-heal path bypasses that by fetching receipts directly.

Rules violated:

  • cross-chain.md (Bridge Event Processing): Listeners must filter events by exact contract address
  • cross-chain.md (L1 Receipt Validation): Verify log indices -- a single transaction can emit multiple events
  • p2p-and-networking.md (Bridge Listeners): Self-heal mechanisms must validate recovered events with the same strictness as live events -- weaker validation in self-heal is a backdoor

Suggested fix: After finding the log with findLogByIndex, verify log.Address matches the expected StakingInfo contract address from ChainManager params.


Note: The pre-existing getStateSynced() function in the same file has the same gaps, but these are new-code violations that should be addressed in this PR.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@marcello33
Copy link
Copy Markdown
Collaborator Author

@claude addressed all your points. Review again.

@claude
Copy link
Copy Markdown

claude Bot commented May 6, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

Comment thread bridge/listener/rootchain_selfheal_test.go
Comment thread bridge/listener/rootchain_selfheal.go
Copilot AI review requested due to automatic review settings May 6, 2026 13:29
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comment thread metrics/selfheal.go
Comment thread bridge/listener/rootchain_selfheal_test.go
Comment thread x/stake/README.md
@claude
Copy link
Copy Markdown

claude Bot commented May 6, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Comment thread bridge/listener/rootchain_selfheal.go
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 6, 2026

Comment thread x/stake/README.md
@marcello33 marcello33 changed the title [WIP] bridge: implement additional methods for self_heal bridge: implement additional methods for self_heal May 7, 2026
@marcello33 marcello33 merged commit c47e394 into develop May 19, 2026
26 of 27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants