Skip to content

feat!: per-part chain decoupling (chain_id required per trigger/node)#636

Merged
chrisli30 merged 14 commits into
mainfrom
staging
Jun 26, 2026
Merged

feat!: per-part chain decoupling (chain_id required per trigger/node)#636
chrisli30 merged 14 commits into
mainfrom
staging

Conversation

@chrisli30

Copy link
Copy Markdown
Member

HELD — do not merge until the SDK/studio client update ships. Strict prod rejects chain-less createWorkflow/runNode calls; merging ahead of the client would break in-flight requests. Draft until lockstep.

Release bundle (staging → main)

Headline — chain decoupling (BREAKING): a workflow no longer belongs to a chain. Chain lives only on each chain-aware trigger/node (event/block triggers; contractRead/contractWrite/ethTransfer nodes), where chain_id is now required. (#630#634)

  • Per-part chain flow + operator routing by the trigger's own chain
  • Task.chain_id / CreateWorkflowRequest.chainId removed; task storage is chain-agnostic (t:<status>:<id>)
  • chain_id <= 0 rejected at create and run (no aggregator-default fallback)
  • OpenAPI renovated: workflow-level chainId removed, per-part chainId required, list/count/executions are chain-agnostic

Also bundled:

Storage migration (required, automatic)

The storage key template changed (t:<chain>:… / u:<chain>:… / history:<chain>:… → chain-agnostic). make storage-check intentionally exits 1 flagging the removed u:%d:%s templates — that is expected; the migration is the handler:

  • 20260626-wipe-chain-bucketed-task-keys (registered in migrations/migrations.go) runs at boot via agg.migrator.Run() after a full DB backup, deleting orphaned old-schema rows (any t: / u: / history: key with a numeric chain segment). Idempotent; recorded so it runs once. Tested.
  • No backfill — the few deployed chain-bucketed tasks are cleared; owners re-create with explicit per-part chains.

Pre-merge checklist

  • make storage-check reviewed (breaking flag expected; migration covers it)
  • Wipe migration registered + runs at boot after backup + tested
  • SDK (ava-sdk-js) shipped sending per-part chains (regen off this spec)
  • studio Phase 3 shipped (per-part chains; drop workflow-level chainId)
  • Merge once the above land, in lockstep

BREAKING CHANGE: chain_id is now required on every chain-aware trigger/node; the workflow-level chainId (Task.chain_id, CreateWorkflowRequest.chainId) is removed and task storage is chain-agnostic. Clients must send explicit per-part chains. A boot migration deletes pre-existing chain-bucketed tasks.

🤖 Generated with Claude Code

chrisli30 and others added 13 commits June 23, 2026 16:59
…n-state fix & RPC E2E proof) (#624)

Co-authored-by: Antriksh Gwal <antrikshgwal@gmail.com>
Co-authored-by: Will Zimmerman <will@avaprotocol.org>
…dler URLs resolve

The renamed config/test.yaml references the bundler endpoints via
${SEPOLIA_BUNDLER_URL} / ${BASE_SEPOLIA_BUNDLER_URL}, which the config
loader expands from the process environment. testutil only loaded .env,
and both dotenv files are gitignored — so in a fresh git worktree the vars
were unset, expanded to empty, and bundler-dependent tests panicked with
"BundlerURL is empty in test.yaml config".

LoadDotEnv now loads .env.local then .env (local overrides, real env wins
over both), logs when neither is present so a missing dotenv in a worktree
is no longer a silent mystery, and factors single-file parsing into
loadDotEnvFile.
…th B) (#626)

Co-authored-by: Will Zimmerman <will@avaprotocol.org>
SendGrid handling was removed in #626; the empty-body case now only applies
to the Studio /api/notify provider. Per Copilot review on #627.
…th (Path B only) (#628)

Co-authored-by: Will Zimmerman <will@avaprotocol.org>
#628) (#629)

Co-authored-by: Will Zimmerman <will@avaprotocol.org>
…er/node chains (#630)

Co-authored-by: Will Zimmerman <will@avaprotocol.org>
…required per part (#631)

Co-authored-by: Will Zimmerman <will@avaprotocol.org>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Chris Li <chris@avaprotocol.org>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#633)

Co-authored-by: Chris Li <chris@avaprotocol.org>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…unblocks SDK regen) (#634)

Co-authored-by: Chris Li <chris@avaprotocol.org>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Backend chain decoupling is fully merged (#630#634): per-part chain flow + G5
(task chain removed, chain-agnostic storage, wipe migration), strict reject-0,
create-path verification, and the OpenAPI renovation (workflow-level chainId
removed, per-part chainId required). Updated the checklist: SDK Track-B is handed
off to the SDK project (the renovated staging spec unblocks their regen), and the
only remaining backend item is the staging→main promotion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…) before promotion

main carried an unsynced release (#627 Path B /api/notify + erc20_overrides, plus
version bumps v3.11.0 / v3.12.0) that never came back to staging, so the
staging→main PR (#636) conflicted. Back-merge main into staging to realign.

Conflict resolutions:
- summarizer_deterministic.go, summarizer_run_node_test.go: kept DELETED — staging
  removed the deterministic summarizer in #628; main only modified them.
- vm_runner_rest.go, summarizer_context_memory.go: kept STAGING's Path-B-only
  notify rewrite (#628/#629) — it supersedes main's older #627 notify code
  (removed the duplicate BuildNotifyPayload, legacy telegram branch, summaryForClient).
- run_node_immediately_rpc_test.go: kept staging's explicit per-node ChainId (G5 strict).

Net effect on staging: brings in main's version bump (3.10.4 → 3.12.0); everything
else from #627 was already present (#624/#626) or superseded. Build + affected
suites green.
@chrisli30

Copy link
Copy Markdown
Member Author

@claude, eye on this PR.

@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown

Claude finished @chrisli30's task in 5m 3s —— View job


PR Review — feat!: per-part chain decoupling

Todo list:

  • Read PR context and understand scope
  • Review storage migration (migrations/wipe_chain_bucketed_task_keys.go)
  • Review core engine changes (core/taskengine/engine.go)
  • Review VM changes (core/taskengine/vm.go)
  • Review proto/API changes
  • Review operator routing changes
  • Review summarizer removals
  • Review test coverage
  • Final synthesis and findings

Overall Assessment

This is a large, well-planned breaking change. The architecture is sound, the migration is correct, and the test coverage for the new behaviors is good. A few concrete issues worth addressing before merge, plus some observability notes.


Migration (migrations/wipe_chain_bucketed_task_keys.go)

Correct. The SplitN(..., 3) approach cleanly separates old t:11155111:a:<id> (part[1] = integer) from new t:a:<id> (part[1] = status letter). All three namespaces (t:, u:, history:) are handled, copy-before-retain is correct, and the test covers old rows, new rows, unrelated namespaces, and idempotency.


Bug: Executor wallet validation uses defaultChainID() — asymmetry with create-time path

executor.go:453

isValid, err := x.validateWalletOwnership(ctx, x.engine.defaultChainID(), user, smartWalletAddr)

At create time, CreateWorkflow correctly calls userOwnsWalletOnAnyChain (which scans every configured chain). At execution time, validateWalletOwnership is called with a single chain — the gateway default. validateWalletOwnership then derives the wallet address via that chain's factory (executor.go:900) and checks DB/on-chain ownership for that chainID only.

For a multi-chain deployment where the smart wallet factory differs between chains, a wallet legitimately registered on chain 8453 but with defaultChainID() = 1 would pass create-time validation and then fail execution-time validation → task auto-disabled after consecutive failures. The comment at executor.go:898 even warns about this:

"Must use the task's chain config — deriving against the gateway's default chain factory produces a different wallet than the task's chain factory and the comparison will spuriously miss."

The comment pre-dates G5 and now describes exactly the failure mode introduced by using defaultChainID().

Same issue applies at:

  • executor.go:475 — wallet salt lookup: GetWallet(x.db, x.engine.defaultChainID(), owner, ...) reads w:<defaultChain>:<owner>:<wallet> which won't exist if the wallet was registered on a different chain → silent miss → smart wallet auto-deployed without the stored salt
  • executor.go:536 — fee credit: x.feeLedger.CheckCreditLimit(x.engine.defaultChainID(), taskOwner, ...) — ledger is chain-scoped; wrong chain → wrong credit balance checked

Suggestion: Mirror the create-time pattern — check against the trigger's monitoring chain, or add an executorWalletOwnerOnAnyChain helper analogous to userOwnsWalletOnAnyChain. Fix this →


Minor: stampNodeChainIfUnset doesn't cover Loop node inner runners

run_node_immediately.go:33-48

func stampNodeChainIfUnset(node *avsproto.TaskNode, chainID int64) {
    if cw := node.GetContractWrite(); ...   // ✓
    if cr := node.GetContractRead(); ...    // ✓
    if et := node.GetEthTransfer(); ...     // ✓
    // Loop node's inner runner is not stamped
}

If RunNodeImmediately is called with a Loop node whose inner runner is a contractWrite/contractRead/ethTransfer with chain_id=0, the stamp won't reach the runner → resolveSmartWalletForNode returns an error at execution time. In practice ExtractNodeConfiguration now emits chainId for loop runners, so the chain should already be set if the node came from storage. But for directly-constructed test or SDK nodes it's a gap.


Minor: Stale comment in triggerMonitoringChainID

engine.go:694

// A 0 on the trigger means "inherit", so we fall back to the task chain (legacy, until G5 removes the task chain).

Post-G5 the task chain is removed; the fallback is always 0 from callers (triggerMonitoringChainID(task.Trigger, 0)). The "inherit" language is vestigial. Worth cleaning up so it doesn't mislead future readers.


Observation: checkNodeChain validates only one level of Loop nesting

engine.go:checkNodeChain walks ContractWrite / ContractRead / EthTransfer at the top level and one level inside LoopNode. A Branch node whose branches contain chain-aware nodes is not validated at create time — those nodes would get a runtime error instead of a create-time rejection. This is likely an accepted limitation, but worth a comment to that effect so future contributors know not to add Branch handling elsewhere.


Positive highlights

  • Migration detection logic is unambiguous: int64-parseable part[1] == old schema. Clean.
  • resolveSmartWalletForNode now returns error instead of silently falling back to the wrong chain — eliminates the Sentry EIGENLAYER-AVS-1N/1M footgun.
  • Proto field reservation (reserved 16; reserved 10;) correctly blocks field reuse — good proto discipline.
  • ExtractNodeConfigurationCreateNodeFromType round-trip now preserves chainId for all three chain-aware node types. The TestExtractNodeConfiguration_CarriesChainId test locks this regression-free.
  • collectOrphans correctly guards with chainNeedsOperatorMonitoring before touching the chain map — cron/manual/fixedtime tasks are not wrongly flagged as orphaned.
  • userOwnsWalletOnAnyChain at create time correctly scans all configured chains.
  • Summarizer cleanup removes ~5000+ lines of dead legacy code (deterministic summarizer, Telegram direct path, per-format renderers). Good.
  • PLAN_CHAIN_DECOUPLING.md — 498-line planning doc committed to the repo. Worth a follow-up to move to wiki/docs, but not blocking.

Summary

Hold the merge (as stated in PR description — SDK/studio lockstep required). The wallet-ownership asymmetry between create and execution paths is the one finding that could cause production workflows to get auto-disabled on first execution in a multi-chain gateway deployment. The salt and fee-ledger chain uses are lower-severity but share the same root cause. Everything else is clean.

#637)

Co-authored-by: Chris Li <chris@avaprotocol.org>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chrisli30 chrisli30 merged commit a8ae047 into main Jun 26, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants