Skip to content

fix: replace PARTIAL_SUCCESS with SUCCESS/FAILED/ERROR execution status and harden auth#522

Merged
chrisli30 merged 5 commits into
mainfrom
staging
Apr 13, 2026
Merged

fix: replace PARTIAL_SUCCESS with SUCCESS/FAILED/ERROR execution status and harden auth#522
chrisli30 merged 5 commits into
mainfrom
staging

Conversation

@chrisli30
Copy link
Copy Markdown
Member

weilicious and others added 4 commits April 11, 2026 01:45
Track the operational runbook for the /ava backfill-wallet-salt-index
migration in version control. Captures the production rollout pattern
that worked for Base on ap-prod1 (2026-04-11):

- The container's config dir is /app/config (read-only bind from
  /home/ava/ap-aggregator-setup/<chain>/config), filename is
  aggregator.yaml regardless of chain.
- The aggregator container's launch command is /ava aggregator with no
  explicit --config flag, so the runtime picks up the default. The
  backfill subcommand does NOT inherit that default and requires
  --config /app/config/aggregator.yaml passed explicitly.
- Use docker run --rm --volumes-from aggregator-base to inherit the
  stopped container's mounts without needing to know host paths.

Includes verified Base numbers (702 rows / 665 stale / 34 canonical /
3 negative-salt skipped), notes on the Skipped — RPC derive error
counter actually catching legacy negative-salt rows from a past bug,
and a chain-by-chain table with the current migration status.
The migration runbook is operational documentation that lives more
naturally in the avs-infra repo (which holds Terraform configs, deploy
scripts, and other ops-flavored docs like Chain_Endpoint_Key_Rotation.md
and EIP-7702-Smart-Wallet-Migration.md) than in the application repo.

It also was never actually tracked under .github/workflows/ correctly —
that path is for GitHub Actions workflow YAML files, not standalone
markdown runbooks.

Moved to: avs-infra/Wallet_Salt_Index_Migration.md
Defense-in-depth against the dummy-target-address bypass we hit while
trying to recover funds from a stale Base smart wallet. The two-step
auth flow lets a caller request a JWT with sub = whatever wallet
address they passed to GetSignatureFormat, and a buggy SDK was passing
0x0000000000000000000000000000000000000000 as a dummy. The IsHexAddress
check at verifyAuth's last-line defense passed (zero IS a valid hex
address), the user's identity silently degraded to the zero address,
and every w:<owner>:<wallet> ownership lookup failed with a confusing
"smart wallet not found for owner 0x0000..." error.

The SDK side has been fixed in ava-sdk-js commit 2dc8027 (which
refuses the API-key path when targetAddress is missing). This commit
adds the matching server-side defense at all three points where the
zero address could enter the auth flow:

1. GetSignatureFormat — the earliest point. Refuses to issue a
   signature template with the zero address as Wallet:.
2. GetKey — the mint point. Refuses to mint a JWT bound to the zero
   address even if a caller manages to construct a valid request.
3. verifyAuth — the last-line defense. Refuses any inbound JWT whose
   sub claim is the zero address, regardless of how it was minted.

Three unit tests cover each defense layer.
…cution status (#521)

Co-authored-by: Wei Lin <wei@avaprotocol.org>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@claude
Copy link
Copy Markdown

claude Bot commented Apr 13, 2026

Code Review

This PR bundles two independent fixes — a security hardening in the auth flow and a semantic cleanup of execution statuses. Both are well-reasoned. Some observations below.


aggregator/auth.go — zero-address defense

The three-layer guard (GetSignatureFormat → GetKey → verifyAuth) is the right defense-in-depth approach. A few notes:

Minor: redundant conversion in GetSignatureFormat

// auth.go:332 area
if common.HexToAddress(walletAddress) == (common.Address{}) {

walletAddress was already validated by common.IsHexAddress just above and is never used again in this function. Consider saving to a named variable (matching the style of GetKey's ownerAddress) so the intent is clearer, though this is cosmetic.

Tests are solidTestVerifyAuth_RejectsZeroAddressSubject correctly simulates the real bypass path by minting a JWT with authkey metadata, which matches how the actual auth path works. Good coverage of all three injection points.


core/taskengine/vm.go — PARTIAL_SUCCESS removal

The conceptual change is correct: branch skips are expected behavior, not warnings. Collapsing "some failed" and "all failed" into a single FAILED status is cleaner. A few things worth flagging:

1. ExecutionError is dead code in the result analysis path

AnalyzeExecutionResult never returns ExecutionError — the constant is defined in the iota but only appears in convertToExecutionStatus. The actual ERROR status is set directly on execution.Status in executor.go:668 and engine.go:3096, bypassing convertToExecutionStatus entirely:

// executor.go:668
execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR

So case ExecutionError in convertToExecutionStatus (vm.go:3330) is unreachable. This isn't a runtime bug, but it creates a misleading API — a caller reading AnalyzeExecutionResult's return type would expect to handle ExecutionError but never actually receives it. Options:

  • Remove ExecutionError from the ExecutionResultStatus iota and document that ERROR is set only on the proto status directly, OR
  • Have AnalyzeExecutionResult return ExecutionError when len(v.ExecutionLogs) == 0 (currently returns ExecutionFailed), reserving that path for the system-level case

2. Silently ignored failed steps with empty error messages

// vm.go:3241
if !step.Success && step.Error != "" {
    failedStepNames = append(failedStepNames, getStepDisplayName(step))
}

A step with success=false and step.Error == "" is neither counted as failed nor as successful — it's invisible to AnalyzeExecutionResult. If any node can produce success=false with an empty error string (e.g., a soft failure with no message), the overall status will be SUCCESS even though a step didn't succeed. Worth verifying this can't happen, or changing the condition to !step.Success only.

3. switch resultStatus blocks have no ExecutionError case

In engine.go (~line 3073) and executor.go (~line 652), the switch only handles ExecutionSuccess and ExecutionFailed. If ExecutionError were ever returned by AnalyzeExecutionResult in the future, it would silently fall through without logging. A default case with a warning log would make this more robust.


core/taskengine/vm_runner_rest.go — email notification

The switch resultStatus blocks at lines ~923 and ~941 have no default or ExecutionError case. Since resultStatus in this function is set only from the local failed bool (only ever ExecutionSuccess or ExecutionFailed), ExecutionError is unreachable here. No bug, but a default guard would make that explicit.


pkg/erc4337/preset/builder_execution_success_test.go

The withdrawalAmount.Sign() > 0 guard is reasonable for a flaky integration test, but the else branch only logs — it doesn't call t.Skip(). This means the test passes silently if the precondition isn't met, giving false confidence in CI. Consider:

} else {
    t.Skip("balance already below reserve; skipping withdrawal test")
}

Proto changes

Reserving value 4 with both reserved 4; and reserved "EXECUTION_STATUS_PARTIAL_SUCCESS"; is correct practice. The generated avs.pb.go correctly reflects this. The migration doc (docs/changes/0001-execution-status-redesign.md) is clear and well-structured — the table format is a nice touch.


Summary

Area Verdict
Zero-address auth fix ✅ Correct and well-tested
ExecutionError dead code in result path ⚠️ Confusing, low risk
Silent ignore of success=false + empty error ⚠️ Latent bug risk, worth verifying
Branch-skip → SUCCESS semantic change ✅ Correct, breaking change documented
Proto reservation of value 4 ✅ Correct
Integration test else branch 🔹 Minor: use t.Skip instead of just logging

The core logic is sound. The two items worth a follow-up before merging to main are the ExecutionError dead-code confusion and the step-failure filter condition.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates execution outcome semantics across the AVS proto + taskengine to remove ambiguous “partial success”, and hardens the aggregator auth flow against zero-address subjects.

Changes:

  • Redesign execution status to SUCCESS / FAILED / ERROR, reserving the old PARTIAL_SUCCESS wire value.
  • Refactor taskengine execution analysis + tests to treat branch/conditional skips as SUCCESS and step failures as FAILED.
  • Add defense-in-depth validation to reject the zero address throughout the API key auth flow, plus tests.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
protobuf/avs.proto Replaces PARTIAL_SUCCESS with reserved value + adds ERROR.
protobuf/avs.pb.go Regenerated Go bindings reflecting the new enum semantics.
core/taskengine/vm.go Refactors execution result analysis and status conversion.
core/taskengine/executor.go Uses new AnalyzeExecutionResult signature; attempts to set ERROR on VM-level failure.
core/taskengine/engine.go Same as executor.go for SimulateTask.
core/taskengine/vm_runner_rest.go Updates UI/email status text + icons to remove partial success.
core/taskengine/vm_scheduler_fix_test.go Updates branch-workflow expectations to SUCCESS with empty error.
core/taskengine/executor_test.go Updates branch-workflow expectations to SUCCESS with empty error.
core/taskengine/partial_success_test.go Renames/updates tests to FAILED semantics and new return signature.
docs/changes/0001-execution-status-redesign.md Adds migration notes and rationale for the status redesign.
aggregator/auth.go Rejects zero-address wallet/subject in GetSignatureFormat, GetKey, and verifyAuth.
aggregator/auth_test.go Adds tests asserting zero-address rejection at all three auth steps.
pkg/erc4337/preset/builder_execution_success_test.go Makes withdrawal conditional when balance is already below reserve.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread core/taskengine/executor.go Outdated
Comment on lines 668 to 669
execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When vm.Run() returns a non-nil runTaskErr (system-level failure), the execution should be marked as EXECUTION_STATUS_ERROR regardless of whether AnalyzeExecutionResult already populated execution.Error. As written, the ERROR status is only set when execution.Error == "", so VM crashes/compilation errors that also produce an AnalyzeExecutionResult message (e.g., "no execution steps found") will incorrectly remain FAILED and may never surface as ERROR.

Suggested change
execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR
}
}
execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR

Copilot uses AI. Check for mistakes.
Comment thread core/taskengine/engine.go
Comment on lines 3092 to 3097
if runErr != nil {
// This should not happen if AnalyzeExecutionResult is working correctly,
// but handle it as a fallback for VM-level errors
n.logger.Error("workflow simulation had VM-level error", "vm_error", runErr, "task_id", task.Id, "simulation_id", simulationID)
if execution.Error == "" {
execution.Error = fmt.Sprintf("VM execution error: %s", runErr.Error())
execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_FAILED
execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The system-level runErr path should set the execution status to EXECUTION_STATUS_ERROR even if execution.Error is already populated by AnalyzeExecutionResult. With the current guard (only setting ERROR when execution.Error == ""), VM-level failures can be misreported as FAILED (e.g., "no execution steps found"), which conflicts with the new SUCCESS/FAILED/ERROR semantics.

Copilot uses AI. Check for mistakes.
Comment on lines +87 to 117
// If balance is already below the reserve, the precondition (insufficient
// funds for reimbursement) is already satisfied — skip the withdrawal.
if withdrawalAmount.Sign() > 0 {
calldata, err := aa.PackExecute(secondaryWallet, withdrawalAmount, []byte{})
require.NoError(t, err, "Failed to pack execute calldata")

paymasterRequest := GetVerifyingPaymasterRequestForDuration(
smartWalletConfig.PaymasterAddress,
15*time.Minute,
)
paymasterRequest := GetVerifyingPaymasterRequestForDuration(
smartWalletConfig.PaymasterAddress,
15*time.Minute,
)

// Withdrawal should succeed — system skips reimbursement when balance is insufficient
userOp, receipt, err := SendUserOp(
smartWalletConfig,
owner,
calldata,
paymasterRequest,
&primaryWallet,
nil,
nil, // executionFeeWei
nil,
)
require.NoError(t, err, "Withdrawal should succeed even without reimbursement")
require.NotNil(t, userOp, "UserOp should be built")
if receipt == nil {
t.Skip("UserOp sent but receipt not available (confirmation timeout)")
// Withdrawal should succeed — system skips reimbursement when balance is insufficient
userOp, receipt, err := SendUserOp(
smartWalletConfig,
owner,
calldata,
paymasterRequest,
&primaryWallet,
nil,
nil, // executionFeeWei
nil,
)
require.NoError(t, err, "Withdrawal should succeed even without reimbursement")
require.NotNil(t, userOp, "UserOp should be built")
if receipt == nil {
t.Skip("UserOp sent but receipt not available (confirmation timeout)")
}
t.Logf("Withdrawal succeeded. TX Hash: %s Gas used: %d", receipt.TxHash.Hex(), receipt.GasUsed)
} else {
t.Logf("Balance already below reserve (%s < %s), skipping withdrawal", balance.String(), reserve.String())
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test can now become a no-op success: if primary balance is already below the reserve, it logs and skips the withdrawal entirely, but still passes without exercising the reimbursement-skip behavior described by the test name. Consider using t.Skip in the withdrawalAmount <= 0 branch (or explicitly funding the wallet / adjusting the reserve) so CI doesn’t report a false positive when the precondition isn’t met.

Copilot uses AI. Check for mistakes.
Comment on lines +28 to +37
Three execution statuses, orthogonal to step count:

| Scenario | Status | `steps.length` vs task node count | `execution.error` |
|---------------------------------|-----------|-----------------------------------|------------------------------------|
| All nodes ran, all succeeded | `SUCCESS` | equal | empty |
| Branch skipped nodes, all OK | `SUCCESS` | less than total | empty |
| Some nodes failed | `FAILED` | any | `"N of M steps failed: node1, …"` |
| All nodes failed | `FAILED` | equal | `"N of N steps failed: node1, …"` |
| No steps executed | `FAILED` | zero | `"no execution steps found"` |
| System-level failure (VM crash) | `ERROR` | zero (or partial if crash mid-run) | `"VM execution error: …"` |
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The markdown table under “Decision” uses a double leading pipe (|| ...) on the header and rows, which renders as an extra empty first column in many markdown renderers. Use a standard table format with a single leading pipe (| Scenario | ... |) so the table displays correctly in GitHub.

Copilot uses AI. Check for mistakes.
…low-balance test

VM-level errors (runTaskErr/runErr) now always override execution status
to ERROR, even when AnalyzeExecutionResult already populated the error
field. Previously, the ERROR status was skipped if execution.Error was
non-empty, causing VM crashes to be misreported as FAILED.

Also changed the reimbursement test's low-balance branch from t.Logf to
t.Skipf so CI doesn't report a false positive when the precondition is
already met.
@chrisli30 chrisli30 changed the title staging feat: replace PARTIAL_SUCCESS with SUCCESS/FAILED/ERROR execution status and harden auth Apr 13, 2026
@chrisli30 chrisli30 changed the title feat: replace PARTIAL_SUCCESS with SUCCESS/FAILED/ERROR execution status and harden auth fix: replace PARTIAL_SUCCESS with SUCCESS/FAILED/ERROR execution status and harden auth Apr 13, 2026
@chrisli30 chrisli30 merged commit edf1621 into main Apr 13, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants