fix: replace PARTIAL_SUCCESS with SUCCESS/FAILED/ERROR execution status and harden auth by chrisli30 · Pull Request #522 · AvaProtocol/EigenLayer-AVS

chrisli30 · 2026-04-13T22:56:52Z

docs: add wallet salt index migration runbook with verified Base apply
docs: move wallet salt index runbook to avs-infra repo
fix: refuse zero-address subject in API key auth flow
refactor: replace PARTIAL_SUCCESS with clear SUCCESS/FAILED/ERROR execution status (refactor: replace PARTIAL_SUCCESS with clear SUCCESS/FAILED/ERROR execution status #521)

Track the operational runbook for the /ava backfill-wallet-salt-index migration in version control. Captures the production rollout pattern that worked for Base on ap-prod1 (2026-04-11): - The container's config dir is /app/config (read-only bind from /home/ava/ap-aggregator-setup/<chain>/config), filename is aggregator.yaml regardless of chain. - The aggregator container's launch command is /ava aggregator with no explicit --config flag, so the runtime picks up the default. The backfill subcommand does NOT inherit that default and requires --config /app/config/aggregator.yaml passed explicitly. - Use docker run --rm --volumes-from aggregator-base to inherit the stopped container's mounts without needing to know host paths. Includes verified Base numbers (702 rows / 665 stale / 34 canonical / 3 negative-salt skipped), notes on the Skipped — RPC derive error counter actually catching legacy negative-salt rows from a past bug, and a chain-by-chain table with the current migration status.

The migration runbook is operational documentation that lives more naturally in the avs-infra repo (which holds Terraform configs, deploy scripts, and other ops-flavored docs like Chain_Endpoint_Key_Rotation.md and EIP-7702-Smart-Wallet-Migration.md) than in the application repo. It also was never actually tracked under .github/workflows/ correctly — that path is for GitHub Actions workflow YAML files, not standalone markdown runbooks. Moved to: avs-infra/Wallet_Salt_Index_Migration.md

Defense-in-depth against the dummy-target-address bypass we hit while trying to recover funds from a stale Base smart wallet. The two-step auth flow lets a caller request a JWT with sub = whatever wallet address they passed to GetSignatureFormat, and a buggy SDK was passing 0x0000000000000000000000000000000000000000 as a dummy. The IsHexAddress check at verifyAuth's last-line defense passed (zero IS a valid hex address), the user's identity silently degraded to the zero address, and every w:<owner>:<wallet> ownership lookup failed with a confusing "smart wallet not found for owner 0x0000..." error. The SDK side has been fixed in ava-sdk-js commit 2dc8027 (which refuses the API-key path when targetAddress is missing). This commit adds the matching server-side defense at all three points where the zero address could enter the auth flow: 1. GetSignatureFormat — the earliest point. Refuses to issue a signature template with the zero address as Wallet:. 2. GetKey — the mint point. Refuses to mint a JWT bound to the zero address even if a caller manages to construct a valid request. 3. verifyAuth — the last-line defense. Refuses any inbound JWT whose sub claim is the zero address, regardless of how it was minted. Three unit tests cover each defense layer.

…cution status (#521) Co-authored-by: Wei Lin <wei@avaprotocol.org> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

claude · 2026-04-13T22:59:36Z

Code Review

This PR bundles two independent fixes — a security hardening in the auth flow and a semantic cleanup of execution statuses. Both are well-reasoned. Some observations below.

`aggregator/auth.go` — zero-address defense

The three-layer guard (GetSignatureFormat → GetKey → verifyAuth) is the right defense-in-depth approach. A few notes:

Minor: redundant conversion in GetSignatureFormat

// auth.go:332 area
if common.HexToAddress(walletAddress) == (common.Address{}) {

walletAddress was already validated by common.IsHexAddress just above and is never used again in this function. Consider saving to a named variable (matching the style of GetKey's ownerAddress) so the intent is clearer, though this is cosmetic.

Tests are solid — TestVerifyAuth_RejectsZeroAddressSubject correctly simulates the real bypass path by minting a JWT with authkey metadata, which matches how the actual auth path works. Good coverage of all three injection points.

`core/taskengine/vm.go` — PARTIAL_SUCCESS removal

The conceptual change is correct: branch skips are expected behavior, not warnings. Collapsing "some failed" and "all failed" into a single FAILED status is cleaner. A few things worth flagging:

1. ExecutionError is dead code in the result analysis path

AnalyzeExecutionResult never returns ExecutionError — the constant is defined in the iota but only appears in convertToExecutionStatus. The actual ERROR status is set directly on execution.Status in executor.go:668 and engine.go:3096, bypassing convertToExecutionStatus entirely:

// executor.go:668
execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR

So case ExecutionError in convertToExecutionStatus (vm.go:3330) is unreachable. This isn't a runtime bug, but it creates a misleading API — a caller reading AnalyzeExecutionResult's return type would expect to handle ExecutionError but never actually receives it. Options:

Remove ExecutionError from the ExecutionResultStatus iota and document that ERROR is set only on the proto status directly, OR
Have AnalyzeExecutionResult return ExecutionError when len(v.ExecutionLogs) == 0 (currently returns ExecutionFailed), reserving that path for the system-level case

2. Silently ignored failed steps with empty error messages

// vm.go:3241
if !step.Success && step.Error != "" {
    failedStepNames = append(failedStepNames, getStepDisplayName(step))
}

A step with success=false and step.Error == "" is neither counted as failed nor as successful — it's invisible to AnalyzeExecutionResult. If any node can produce success=false with an empty error string (e.g., a soft failure with no message), the overall status will be SUCCESS even though a step didn't succeed. Worth verifying this can't happen, or changing the condition to !step.Success only.

3. switch resultStatus blocks have no ExecutionError case

In engine.go (~line 3073) and executor.go (~line 652), the switch only handles ExecutionSuccess and ExecutionFailed. If ExecutionError were ever returned by AnalyzeExecutionResult in the future, it would silently fall through without logging. A default case with a warning log would make this more robust.

`core/taskengine/vm_runner_rest.go` — email notification

The switch resultStatus blocks at lines ~923 and ~941 have no default or ExecutionError case. Since resultStatus in this function is set only from the local failed bool (only ever ExecutionSuccess or ExecutionFailed), ExecutionError is unreachable here. No bug, but a default guard would make that explicit.

`pkg/erc4337/preset/builder_execution_success_test.go`

The withdrawalAmount.Sign() > 0 guard is reasonable for a flaky integration test, but the else branch only logs — it doesn't call t.Skip(). This means the test passes silently if the precondition isn't met, giving false confidence in CI. Consider:

} else {
    t.Skip("balance already below reserve; skipping withdrawal test")
}

Proto changes

Reserving value 4 with both reserved 4; and reserved "EXECUTION_STATUS_PARTIAL_SUCCESS"; is correct practice. The generated avs.pb.go correctly reflects this. The migration doc (docs/changes/0001-execution-status-redesign.md) is clear and well-structured — the table format is a nice touch.

Summary

Area	Verdict
Zero-address auth fix	✅ Correct and well-tested
`ExecutionError` dead code in result path	⚠️ Confusing, low risk
Silent ignore of `success=false` + empty error	⚠️ Latent bug risk, worth verifying
Branch-skip → SUCCESS semantic change	✅ Correct, breaking change documented
Proto reservation of value 4	✅ Correct
Integration test `else` branch	🔹 Minor: use `t.Skip` instead of just logging

The core logic is sound. The two items worth a follow-up before merging to main are the ExecutionError dead-code confusion and the step-failure filter condition.

Copilot

Pull request overview

Updates execution outcome semantics across the AVS proto + taskengine to remove ambiguous “partial success”, and hardens the aggregator auth flow against zero-address subjects.

Changes:

Redesign execution status to SUCCESS / FAILED / ERROR, reserving the old PARTIAL_SUCCESS wire value.
Refactor taskengine execution analysis + tests to treat branch/conditional skips as SUCCESS and step failures as FAILED.
Add defense-in-depth validation to reject the zero address throughout the API key auth flow, plus tests.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
protobuf/avs.proto	Replaces PARTIAL_SUCCESS with reserved value + adds ERROR.
protobuf/avs.pb.go	Regenerated Go bindings reflecting the new enum semantics.
core/taskengine/vm.go	Refactors execution result analysis and status conversion.
core/taskengine/executor.go	Uses new AnalyzeExecutionResult signature; attempts to set ERROR on VM-level failure.
core/taskengine/engine.go	Same as executor.go for SimulateTask.
core/taskengine/vm_runner_rest.go	Updates UI/email status text + icons to remove partial success.
core/taskengine/vm_scheduler_fix_test.go	Updates branch-workflow expectations to SUCCESS with empty error.
core/taskengine/executor_test.go	Updates branch-workflow expectations to SUCCESS with empty error.
core/taskengine/partial_success_test.go	Renames/updates tests to FAILED semantics and new return signature.
docs/changes/0001-execution-status-redesign.md	Adds migration notes and rationale for the status redesign.
aggregator/auth.go	Rejects zero-address wallet/subject in GetSignatureFormat, GetKey, and verifyAuth.
aggregator/auth_test.go	Adds tests asserting zero-address rejection at all three auth steps.
pkg/erc4337/preset/builder_execution_success_test.go	Makes withdrawal conditional when balance is already below reserve.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-13T23:00:31Z

+			execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR
 		}


When vm.Run() returns a non-nil runTaskErr (system-level failure), the execution should be marked as EXECUTION_STATUS_ERROR regardless of whether AnalyzeExecutionResult already populated execution.Error. As written, the ERROR status is only set when execution.Error == "", so VM crashes/compilation errors that also produce an AnalyzeExecutionResult message (e.g., "no execution steps found") will incorrectly remain FAILED and may never surface as ERROR.

Suggested change

execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR

}

}

execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR

Copilot · 2026-04-13T23:00:31Z

 	if runErr != nil {
-		// This should not happen if AnalyzeExecutionResult is working correctly,
-		// but handle it as a fallback for VM-level errors
 		n.logger.Error("workflow simulation had VM-level error", "vm_error", runErr, "task_id", task.Id, "simulation_id", simulationID)
 		if execution.Error == "" {
 			execution.Error = fmt.Sprintf("VM execution error: %s", runErr.Error())
-			execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_FAILED
+			execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR
 		}


The system-level runErr path should set the execution status to EXECUTION_STATUS_ERROR even if execution.Error is already populated by AnalyzeExecutionResult. With the current guard (only setting ERROR when execution.Error == ""), VM-level failures can be misreported as FAILED (e.g., "no execution steps found"), which conflicts with the new SUCCESS/FAILED/ERROR semantics.

Copilot · 2026-04-13T23:00:31Z

+	// If balance is already below the reserve, the precondition (insufficient
+	// funds for reimbursement) is already satisfied — skip the withdrawal.
+	if withdrawalAmount.Sign() > 0 {
+		calldata, err := aa.PackExecute(secondaryWallet, withdrawalAmount, []byte{})
+		require.NoError(t, err, "Failed to pack execute calldata")

-	paymasterRequest := GetVerifyingPaymasterRequestForDuration(
-		smartWalletConfig.PaymasterAddress,
-		15*time.Minute,
-	)
+		paymasterRequest := GetVerifyingPaymasterRequestForDuration(
+			smartWalletConfig.PaymasterAddress,
+			15*time.Minute,
+		)

-	// Withdrawal should succeed — system skips reimbursement when balance is insufficient
-	userOp, receipt, err := SendUserOp(
-		smartWalletConfig,
-		owner,
-		calldata,
-		paymasterRequest,
-		&primaryWallet,
-		nil,
-		nil, // executionFeeWei
-		nil,
-	)
-	require.NoError(t, err, "Withdrawal should succeed even without reimbursement")
-	require.NotNil(t, userOp, "UserOp should be built")
-	if receipt == nil {
-		t.Skip("UserOp sent but receipt not available (confirmation timeout)")
+		// Withdrawal should succeed — system skips reimbursement when balance is insufficient
+		userOp, receipt, err := SendUserOp(
+			smartWalletConfig,
+			owner,
+			calldata,
+			paymasterRequest,
+			&primaryWallet,
+			nil,
+			nil, // executionFeeWei
+			nil,
+		)
+		require.NoError(t, err, "Withdrawal should succeed even without reimbursement")
+		require.NotNil(t, userOp, "UserOp should be built")
+		if receipt == nil {
+			t.Skip("UserOp sent but receipt not available (confirmation timeout)")
+		}
+		t.Logf("Withdrawal succeeded. TX Hash: %s Gas used: %d", receipt.TxHash.Hex(), receipt.GasUsed)
+	} else {
+		t.Logf("Balance already below reserve (%s < %s), skipping withdrawal", balance.String(), reserve.String())
 	}


This test can now become a no-op success: if primary balance is already below the reserve, it logs and skips the withdrawal entirely, but still passes without exercising the reimbursement-skip behavior described by the test name. Consider using t.Skip in the withdrawalAmount <= 0 branch (or explicitly funding the wallet / adjusting the reserve) so CI doesn’t report a false positive when the precondition isn’t met.

Copilot · 2026-04-13T23:00:32Z

+Three execution statuses, orthogonal to step count:
+
+| Scenario                        | Status    | `steps.length` vs task node count | `execution.error`                  |
+|---------------------------------|-----------|-----------------------------------|------------------------------------|
+| All nodes ran, all succeeded    | `SUCCESS` | equal                             | empty                              |
+| Branch skipped nodes, all OK    | `SUCCESS` | less than total                   | empty                              |
+| Some nodes failed               | `FAILED`  | any                               | `"N of M steps failed: node1, …"`  |
+| All nodes failed                | `FAILED`  | equal                             | `"N of N steps failed: node1, …"`  |
+| No steps executed               | `FAILED`  | zero                              | `"no execution steps found"`       |
+| System-level failure (VM crash) | `ERROR`   | zero (or partial if crash mid-run) | `"VM execution error: …"`          |


The markdown table under “Decision” uses a double leading pipe (|| ...) on the header and rows, which renders as an extra empty first column in many markdown renderers. Use a standard table format with a single leading pipe (| Scenario | ... |) so the table displays correctly in GitHub.

…low-balance test VM-level errors (runTaskErr/runErr) now always override execution status to ERROR, even when AnalyzeExecutionResult already populated the error field. Previously, the ERROR status was skipped if execution.Error was non-empty, causing VM crashes to be misreported as FAILED. Also changed the reimbursement test's low-balance branch from t.Logf to t.Skipf so CI doesn't report a false positive when the precondition is already met.

weilicious and others added 4 commits April 11, 2026 01:45

refactor: replace PARTIAL_SUCCESS with clear SUCCESS/FAILED/ERROR exe…

aa6b3dd

…cution status (#521) Co-authored-by: Wei Lin <wei@avaprotocol.org> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

chrisli30 requested a review from Copilot April 13, 2026 22:57

chrisli30 temporarily deployed to Test April 13, 2026 22:57 — with GitHub Actions Inactive

Copilot started reviewing on behalf of chrisli30 April 13, 2026 22:57 View session

Copilot AI reviewed Apr 13, 2026

View reviewed changes

chrisli30 temporarily deployed to Test April 13, 2026 23:05 — with GitHub Actions Inactive

chrisli30 changed the title ~~staging~~ feat: replace PARTIAL_SUCCESS with SUCCESS/FAILED/ERROR execution status and harden auth Apr 13, 2026

chrisli30 changed the title ~~feat: replace PARTIAL_SUCCESS with SUCCESS/FAILED/ERROR execution status and harden auth~~ fix: replace PARTIAL_SUCCESS with SUCCESS/FAILED/ERROR execution status and harden auth Apr 13, 2026

chrisli30 merged commit edf1621 into main Apr 13, 2026
18 checks passed

chrisli30 mentioned this pull request Apr 13, 2026

fix: loop node GraphQL runner extraction and execution status hardening #524

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: replace PARTIAL_SUCCESS with SUCCESS/FAILED/ERROR execution status and harden auth#522

fix: replace PARTIAL_SUCCESS with SUCCESS/FAILED/ERROR execution status and harden auth#522
chrisli30 merged 5 commits into
mainfrom
staging

chrisli30 commented Apr 13, 2026

Uh oh!

claude Bot commented Apr 13, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 13, 2026

Uh oh!

Copilot AI Apr 13, 2026

Uh oh!

Copilot AI Apr 13, 2026

Uh oh!

Copilot AI Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		execution.Status = avsproto.ExecutionStatus_EXECUTION_STATUS_ERROR
		}

Conversation

chrisli30 commented Apr 13, 2026

Uh oh!

claude Bot commented Apr 13, 2026

Code Review

aggregator/auth.go — zero-address defense

core/taskengine/vm.go — PARTIAL_SUCCESS removal

core/taskengine/vm_runner_rest.go — email notification

pkg/erc4337/preset/builder_execution_success_test.go

Proto changes

Summary

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

`aggregator/auth.go` — zero-address defense

`core/taskengine/vm.go` — PARTIAL_SUCCESS removal

`core/taskengine/vm_runner_rest.go` — email notification

`pkg/erc4337/preset/builder_execution_success_test.go`