fix: replace PARTIAL_SUCCESS with SUCCESS/FAILED/ERROR execution status and harden auth#522
Conversation
chrisli30
commented
Apr 13, 2026
- docs: add wallet salt index migration runbook with verified Base apply
- docs: move wallet salt index runbook to avs-infra repo
- fix: refuse zero-address subject in API key auth flow
- refactor: replace PARTIAL_SUCCESS with clear SUCCESS/FAILED/ERROR execution status (refactor: replace PARTIAL_SUCCESS with clear SUCCESS/FAILED/ERROR execution status #521)
Track the operational runbook for the /ava backfill-wallet-salt-index migration in version control. Captures the production rollout pattern that worked for Base on ap-prod1 (2026-04-11): - The container's config dir is /app/config (read-only bind from /home/ava/ap-aggregator-setup/<chain>/config), filename is aggregator.yaml regardless of chain. - The aggregator container's launch command is /ava aggregator with no explicit --config flag, so the runtime picks up the default. The backfill subcommand does NOT inherit that default and requires --config /app/config/aggregator.yaml passed explicitly. - Use docker run --rm --volumes-from aggregator-base to inherit the stopped container's mounts without needing to know host paths. Includes verified Base numbers (702 rows / 665 stale / 34 canonical / 3 negative-salt skipped), notes on the Skipped — RPC derive error counter actually catching legacy negative-salt rows from a past bug, and a chain-by-chain table with the current migration status.
The migration runbook is operational documentation that lives more naturally in the avs-infra repo (which holds Terraform configs, deploy scripts, and other ops-flavored docs like Chain_Endpoint_Key_Rotation.md and EIP-7702-Smart-Wallet-Migration.md) than in the application repo. It also was never actually tracked under .github/workflows/ correctly — that path is for GitHub Actions workflow YAML files, not standalone markdown runbooks. Moved to: avs-infra/Wallet_Salt_Index_Migration.md
Defense-in-depth against the dummy-target-address bypass we hit while trying to recover funds from a stale Base smart wallet. The two-step auth flow lets a caller request a JWT with sub = whatever wallet address they passed to GetSignatureFormat, and a buggy SDK was passing 0x0000000000000000000000000000000000000000 as a dummy. The IsHexAddress check at verifyAuth's last-line defense passed (zero IS a valid hex address), the user's identity silently degraded to the zero address, and every w:<owner>:<wallet> ownership lookup failed with a confusing "smart wallet not found for owner 0x0000..." error. The SDK side has been fixed in ava-sdk-js commit 2dc8027 (which refuses the API-key path when targetAddress is missing). This commit adds the matching server-side defense at all three points where the zero address could enter the auth flow: 1. GetSignatureFormat — the earliest point. Refuses to issue a signature template with the zero address as Wallet:. 2. GetKey — the mint point. Refuses to mint a JWT bound to the zero address even if a caller manages to construct a valid request. 3. verifyAuth — the last-line defense. Refuses any inbound JWT whose sub claim is the zero address, regardless of how it was minted. Three unit tests cover each defense layer.
…cution status (#521) Co-authored-by: Wei Lin <wei@avaprotocol.org> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Code ReviewThis PR bundles two independent fixes — a security hardening in the auth flow and a semantic cleanup of execution statuses. Both are well-reasoned. Some observations below.
|
| Area | Verdict |
|---|---|
| Zero-address auth fix | ✅ Correct and well-tested |
ExecutionError dead code in result path |
|
Silent ignore of success=false + empty error |
|
| Branch-skip → SUCCESS semantic change | ✅ Correct, breaking change documented |
| Proto reservation of value 4 | ✅ Correct |
Integration test else branch |
🔹 Minor: use t.Skip instead of just logging |
The core logic is sound. The two items worth a follow-up before merging to main are the ExecutionError dead-code confusion and the step-failure filter condition.
There was a problem hiding this comment.
Pull request overview
Updates execution outcome semantics across the AVS proto + taskengine to remove ambiguous “partial success”, and hardens the aggregator auth flow against zero-address subjects.
Changes:
- Redesign execution status to SUCCESS / FAILED / ERROR, reserving the old PARTIAL_SUCCESS wire value.
- Refactor taskengine execution analysis + tests to treat branch/conditional skips as SUCCESS and step failures as FAILED.
- Add defense-in-depth validation to reject the zero address throughout the API key auth flow, plus tests.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| protobuf/avs.proto | Replaces PARTIAL_SUCCESS with reserved value + adds ERROR. |
| protobuf/avs.pb.go | Regenerated Go bindings reflecting the new enum semantics. |
| core/taskengine/vm.go | Refactors execution result analysis and status conversion. |
| core/taskengine/executor.go | Uses new AnalyzeExecutionResult signature; attempts to set ERROR on VM-level failure. |
| core/taskengine/engine.go | Same as executor.go for SimulateTask. |
| core/taskengine/vm_runner_rest.go | Updates UI/email status text + icons to remove partial success. |
| core/taskengine/vm_scheduler_fix_test.go | Updates branch-workflow expectations to SUCCESS with empty error. |
| core/taskengine/executor_test.go | Updates branch-workflow expectations to SUCCESS with empty error. |
| core/taskengine/partial_success_test.go | Renames/updates tests to FAILED semantics and new return signature. |
| docs/changes/0001-execution-status-redesign.md | Adds migration notes and rationale for the status redesign. |
| aggregator/auth.go | Rejects zero-address wallet/subject in GetSignatureFormat, GetKey, and verifyAuth. |
| aggregator/auth_test.go | Adds tests asserting zero-address rejection at all three auth steps. |
| pkg/erc4337/preset/builder_execution_success_test.go | Makes withdrawal conditional when balance is already below reserve. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR | ||
| } |
There was a problem hiding this comment.
When vm.Run() returns a non-nil runTaskErr (system-level failure), the execution should be marked as EXECUTION_STATUS_ERROR regardless of whether AnalyzeExecutionResult already populated execution.Error. As written, the ERROR status is only set when execution.Error == "", so VM crashes/compilation errors that also produce an AnalyzeExecutionResult message (e.g., "no execution steps found") will incorrectly remain FAILED and may never surface as ERROR.
| execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR | |
| } | |
| } | |
| execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR |
| if runErr != nil { | ||
| // This should not happen if AnalyzeExecutionResult is working correctly, | ||
| // but handle it as a fallback for VM-level errors | ||
| n.logger.Error("workflow simulation had VM-level error", "vm_error", runErr, "task_id", task.Id, "simulation_id", simulationID) | ||
| if execution.Error == "" { | ||
| execution.Error = fmt.Sprintf("VM execution error: %s", runErr.Error()) | ||
| execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_FAILED | ||
| execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR | ||
| } |
There was a problem hiding this comment.
The system-level runErr path should set the execution status to EXECUTION_STATUS_ERROR even if execution.Error is already populated by AnalyzeExecutionResult. With the current guard (only setting ERROR when execution.Error == ""), VM-level failures can be misreported as FAILED (e.g., "no execution steps found"), which conflicts with the new SUCCESS/FAILED/ERROR semantics.
| // If balance is already below the reserve, the precondition (insufficient | ||
| // funds for reimbursement) is already satisfied — skip the withdrawal. | ||
| if withdrawalAmount.Sign() > 0 { | ||
| calldata, err := aa.PackExecute(secondaryWallet, withdrawalAmount, []byte{}) | ||
| require.NoError(t, err, "Failed to pack execute calldata") | ||
|
|
||
| paymasterRequest := GetVerifyingPaymasterRequestForDuration( | ||
| smartWalletConfig.PaymasterAddress, | ||
| 15*time.Minute, | ||
| ) | ||
| paymasterRequest := GetVerifyingPaymasterRequestForDuration( | ||
| smartWalletConfig.PaymasterAddress, | ||
| 15*time.Minute, | ||
| ) | ||
|
|
||
| // Withdrawal should succeed — system skips reimbursement when balance is insufficient | ||
| userOp, receipt, err := SendUserOp( | ||
| smartWalletConfig, | ||
| owner, | ||
| calldata, | ||
| paymasterRequest, | ||
| &primaryWallet, | ||
| nil, | ||
| nil, // executionFeeWei | ||
| nil, | ||
| ) | ||
| require.NoError(t, err, "Withdrawal should succeed even without reimbursement") | ||
| require.NotNil(t, userOp, "UserOp should be built") | ||
| if receipt == nil { | ||
| t.Skip("UserOp sent but receipt not available (confirmation timeout)") | ||
| // Withdrawal should succeed — system skips reimbursement when balance is insufficient | ||
| userOp, receipt, err := SendUserOp( | ||
| smartWalletConfig, | ||
| owner, | ||
| calldata, | ||
| paymasterRequest, | ||
| &primaryWallet, | ||
| nil, | ||
| nil, // executionFeeWei | ||
| nil, | ||
| ) | ||
| require.NoError(t, err, "Withdrawal should succeed even without reimbursement") | ||
| require.NotNil(t, userOp, "UserOp should be built") | ||
| if receipt == nil { | ||
| t.Skip("UserOp sent but receipt not available (confirmation timeout)") | ||
| } | ||
| t.Logf("Withdrawal succeeded. TX Hash: %s Gas used: %d", receipt.TxHash.Hex(), receipt.GasUsed) | ||
| } else { | ||
| t.Logf("Balance already below reserve (%s < %s), skipping withdrawal", balance.String(), reserve.String()) | ||
| } |
There was a problem hiding this comment.
This test can now become a no-op success: if primary balance is already below the reserve, it logs and skips the withdrawal entirely, but still passes without exercising the reimbursement-skip behavior described by the test name. Consider using t.Skip in the withdrawalAmount <= 0 branch (or explicitly funding the wallet / adjusting the reserve) so CI doesn’t report a false positive when the precondition isn’t met.
| Three execution statuses, orthogonal to step count: | ||
|
|
||
| | Scenario | Status | `steps.length` vs task node count | `execution.error` | | ||
| |---------------------------------|-----------|-----------------------------------|------------------------------------| | ||
| | All nodes ran, all succeeded | `SUCCESS` | equal | empty | | ||
| | Branch skipped nodes, all OK | `SUCCESS` | less than total | empty | | ||
| | Some nodes failed | `FAILED` | any | `"N of M steps failed: node1, …"` | | ||
| | All nodes failed | `FAILED` | equal | `"N of N steps failed: node1, …"` | | ||
| | No steps executed | `FAILED` | zero | `"no execution steps found"` | | ||
| | System-level failure (VM crash) | `ERROR` | zero (or partial if crash mid-run) | `"VM execution error: …"` | |
There was a problem hiding this comment.
The markdown table under “Decision” uses a double leading pipe (|| ...) on the header and rows, which renders as an extra empty first column in many markdown renderers. Use a standard table format with a single leading pipe (| Scenario | ... |) so the table displays correctly in GitHub.
…low-balance test VM-level errors (runTaskErr/runErr) now always override execution status to ERROR, even when AnalyzeExecutionResult already populated the error field. Previously, the ERROR status was skipped if execution.Error was non-empty, causing VM crashes to be misreported as FAILED. Also changed the reimbursement test's low-balance branch from t.Logf to t.Skipf so CI doesn't report a false positive when the precondition is already met.