Skip to content

release: staging → main (event trigger fixes, fee classification, JWT API key, sentry logging)#509

Merged
chrisli30 merged 10 commits into
mainfrom
staging
Apr 8, 2026
Merged

release: staging → main (event trigger fixes, fee classification, JWT API key, sentry logging)#509
chrisli30 merged 10 commits into
mainfrom
staging

Conversation

@chrisli30
Copy link
Copy Markdown
Member

@chrisli30 chrisli30 commented Apr 8, 2026

Release contents

8 commits from staging since the last main merge (plus 3 follow-ups in this PR):

SHA Type Summary
9f93321 fix Address PR review comments (template error-on-noop, doc path, typo)
2dd4f71 ci Strip scheme from CHAIN_ENDPOINT before prepending in test config gen
de3c15f chore Add EventTrigger main loop diagnostics (heartbeat, FilterLogs cross-check, startup log)
7c05214 docs Note ava-sdk-js getWorkflow as ground truth for stored task state
3371a57 fix Resolve EventTrigger {{settings.*}} templates in CreateTask
54b0f05 fix Require EOA address as JWT API key subject
58325be fix Pass error values (not strings) to logger so Sentry captures real cause
e992701 ci Read CHAIN_ENDPOINT and BUNDLER_RPC from secrets
db8c735 fix Classify workflow value fee from definition, not runtime gas (#508)
ceb690e fix Improve event trigger simulation accuracy and response schema (#507)

⚠️ Breaking API change: TransferEventResponse.Value semantics

TransferEventResponse.Value is now the raw uint256 base-units string, not the formatted decimal-applied amount it used to be. The formatted-decimal value is now exposed as a new field ValueFormatted. Any client code doing arithmetic or display on value directly will break for tokens with non-zero decimals (e.g. USDC):

// Before — value was already decimal-formatted
const usdcAmount = parseFloat(event.value);   // worked, e.g. "1.5"

// After — value is raw base units
const usdcAmount = parseFloat(event.value);   // BROKEN, returns 1500000
const usdcAmount = parseFloat(event.valueFormatted);  // correct

Origin: commit ceb690e (already merged via #507). The wire-proto field value was repurposed; no proto field rename happened so there is no protobuf compatibility break, but the semantic meaning of the field changed. SDK consumers must be updated to read valueFormatted for human-readable amounts. This is the main thing reviewers should sanity-check before approving.

Highlights

EventTrigger correctness fix (3371a57 + follow-up 9f93321)

engine.CreateTask was persisting EventTrigger topics[] verbatim, including unresolved template literals like "{{settings.runner}}". The operator received the literal string, could not pad it as a 32-byte topic, and the slot effectively became a wildcard — so deployed event-triggered tasks fired on unrelated transfers and exhausted their max_execution budget on noise. This commit adds resolveEventTriggerTemplates() mirroring the simulate-path resolver and calls it before persistence. The follow-up fix removes the no-op short-circuit so an unresolvable {{...}} (e.g. missing inputVariables key) is now a hard error instead of a silent passthrough. Includes 5 regression tests (resolved-OK, manual-trigger no-op, nil-trigger, unresolved-topic-errors, unresolved-address-errors).

Operator observability (de3c15f)

Three new diagnostic surfaces in the EventTrigger main loop: startup confirmation, 60-second heartbeat (active subs / logs received / channel buffer), and per-subscription FilterLogs cross-check against the HTTP RPC. These let us distinguish "loop never started" / "loop alive but WS silent" / "WS delivering but filter wrong" definitively. Used to characterize a Dwellir Sepolia WS gateway bug (out-of-tree workaround: swap to Tenderly Sepolia in local configs).

CI test config gen fix (2dd4f71)

The Test-environment secret CHAIN_ENDPOINT may be set with or without a leading https:// scheme. The previous generate-test-config.sh unconditionally prepended https://, producing https://https//api-... and breaking config parsing. Now strips any leading scheme before prepending — tolerant to both forms.

Storage migration check

go run scripts/compare_storage_structure.go main results:

  • No new / removed / modified storage keys
  • No SmartWallet struct changes
  • No storage key definition changes
  • Proto schema change in TransferEventResponse (see Breaking API change section above) — runtime response shape, no BadgerDB data migration required.

Test plan

  • go test ./core/taskengine/ -run TestResolveEventTriggerTemplates — 5 regression tests pass (3 happy-path + 2 negative-case for unresolved templates)
  • End-to-end verified on Sepolia: deployed a USDC Transfer trigger, sent 1.5 USDC to the runner wallet, observed task triggered in operator log and successful execution in aggregator log
  • Reviewers: confirm SDK consumers read TransferEventResponse.ValueFormatted instead of Value for human amounts
  • Run make test after CI re-runs with the test-config-gen fix

chrisli30 and others added 8 commits April 7, 2026 23:17
)

Co-authored-by: Will Zimmerman <will@avaprotocol.org>
These values embed credentials (Dwellir API key, bundler URL) and were
previously exposed as plaintext repo variables. Move them to environment
secrets so they are masked in logs and encrypted at rest.
…l cause

SentryLogger only uses tag values typed as `error` as the captured exception;
passing `err.Error()` made Sentry fall back to the literal log message,
collapsing distinct failures into one issue with an unactionable title.
The auth layer derives a smart wallet from the JWT `sub` claim, so a
non-address subject (e.g. the previous "admin" default) silently coerced
to the zero address. Validate the subject is a hex address both at key
creation and at token verification, and propagate creation errors as a
non-zero exit from the CLI.
The EventTrigger query topics array supports template literals like
"{{settings.runner}}" that should be substituted against the task's
inputVariables before persistence. The simulate path
(run_node_immediately.go) does this resolution; CreateTask did not.

Result: deployed event-triggered tasks shipped the literal string
"{{settings.runner}}" to the operator. The operator could not pad it as
a 32-byte topic, the slot effectively became a wildcard, and the filter
matched any event on the contract instead of the intended to=runner
constraint. Tasks fired on unrelated transfers and exhausted their
max_execution budget on noise events.

Add resolveEventTriggerTemplates() that mirrors the simulate-path
resolver: walks query addresses and topics, runs each {{...}}-bearing
string through the same VM-based preprocessor, validates the result
(common.IsHexAddress for addresses, validateTopicHexFormat for topics),
and mutates the proto in place before storage. Called from CreateTask
immediately after model.NewTaskFromProtobuf and before persistence so
the operator receives a fully-resolved trigger via the sync stream.

Includes regression tests for topic-address resolution, manual-trigger
no-op, and nil-trigger safety.
Add a section under Storage describing how to use the ava-sdk-js
examples/example.ts CLI (yarn start getWorkflow <task_id>) to inspect
the live aggregator's BadgerDB state. CreateTask's log lines only emit
a high-level summary; the SDK getWorkflow path returns the full stored
trigger payload (topics, addresses, inputVariables verbatim), which is
the source of truth when debugging trigger filter issues.
Adds three diagnostic surfaces to the operator's event trigger loop to
make WS log-stream health visible:

1. Startup confirmation log when the main goroutine enters its select
   loop, so we can distinguish "never started" from "started but idle".

2. 60-second heartbeat showing active subscription count, logs received
   in the last interval, total logs received, and logs channel buffer
   fill. The crucial signal: active_subscriptions > 0 with
   logs_received_last_60s == 0 for several minutes means the WS log
   stream is silent despite registered subscriptions.

3. Per-subscription FilterLogs cross-check that runs the same
   FilterQuery against the HTTP RPC for the last ~50 blocks immediately
   after each new WS subscription is created. If FilterLogs returns
   matches but the heartbeat shows zero received, the WS layer is
   broken independently of the filter shape or chain state.

These diagnostics were used to characterize a Dwellir Sepolia WS
gateway bug where eth_subscribe(logs, ...) with a multi-element topics
array silently accepts the subscribe call and returns a real-looking
subscription ID but never delivers matching events. The combination of
"FilterLogs cross-check returns >0 matches" + "heartbeat shows 0
received" was the smoking gun.
@blacksmith-sh
Copy link
Copy Markdown
Contributor

blacksmith-sh Bot commented Apr 8, 2026

Found 5 test failures on Blacksmith runners:

Failures

Test View Logs
github.com/AvaProtocol/EigenLayer-AVS/aggregator/TestGetSignatureFormat View Logs
github.com/AvaProtocol/EigenLayer-AVS/core/taskengine/
TestBranchNode_EmailSummaryGeneration
View Logs
github.com/AvaProtocol/EigenLayer-AVS/core/taskengine/macros/TestQueryContract View Logs
github.com/AvaProtocol/EigenLayer-AVS/core/taskengine/trigger/
TestBlockTrigger_AddCheck_InvalidInterval
View Logs
github.com/AvaProtocol/EigenLayer-AVS/pkg/erc4337/preset/
TestUserOpWithdrawalSkipsReimbursementWhenBalanceInsufficient
View Logs

Fix in Cursor

@claude
Copy link
Copy Markdown

claude Bot commented Apr 8, 2026

PR Review: staging → main (event trigger fixes, fee classification, JWT API key, sentry logging)

Overall this is solid work — clear bug fixes with good regression tests and well-documented intent. A few issues worth addressing before merge.


Issues

1. Hardcoded personal path in CLAUDE.md

The new "Inspecting Aggregator Data" section in CLAUDE.md contains:

cd /Users/mikasa/Code/ava-sdk-js/examples

This is a personal machine path. CLAUDE.md is a shared developer document committed to the repo — future contributors won't have this path. Replace with a relative or repo-relative reference, or describe it as <path-to-ava-sdk-js>/examples.


2. Breaking API change in TransferEventResponse.Value needs explicit callout

TransferEventResponse.Value has silently changed semantics: it was "formatted amount (decimals applied)" and is now "raw uint256 base units". Any existing client code doing:

const humanAmount = parseFloat(event.value); // previously safe, now broken for USDC etc.

…will now get the raw base-unit integer instead of the formatted number. The PR mentions checking SDK consumers for ValueFormatted, but the deeper concern is that value itself now means something different. This should be called out prominently in the PR/changelog and ideally in the wire proto comment too (not just the Go struct comment).


3. hasOnChainNodes doesn't recurse into nested loops

fee_estimator.go:472:

func hasOnChainNodes(nodes []*avsproto.TaskNode) bool {
    for _, node := range nodes {
        switch {
        case node.GetEthTransfer() != nil, node.GetContractWrite() != nil:
            return true
        case node.GetLoop() != nil:
            loop := node.GetLoop()
            if loop.GetEthTransfer() != nil || loop.GetContractWrite() != nil {
                return true
            }
        }
    }
    return false
}

A Loop whose Runner is itself a Loop containing a ContractWrite would not be detected. This isn't in the current proto definition (loops likely can't be nested in practice today), but if the data model allows it in the future this silently under-charges. At minimum add a comment acknowledging the limitation.


4. Diagnostic FilterLogs goroutine fires on every subscription restart

trigger/event.go (new goroutine block around line 677+):

go func(q ethereum.FilterQuery, desc string) {
    // ... fetches last 50 blocks via HTTP RPC ...
}(qi.Query, qi.Description)

This goroutine fires once per subscription start. If subscriptions reconnect frequently (e.g. under WS instability), this could generate a burst of HTTP RPC calls. For a diagnostic tool this is probably fine — but it might be worth adding a comment that it's a one-shot diagnostic, not ongoing polling, so future readers don't mistake it for production logic that needs rate limiting.

The goroutine closure correctly captures qi.Query and qi.Description by value — no loop-variable capture bug here. Good.


Minor Observations

err.Error()err sweep is correct. Passing error values (not strings) to structured loggers lets Sentry capture the full error type and stack. The change is consistent and mechanical throughout.

JWT subject validation placement is correct. The common.IsHexAddress check is added before common.HexToAddress is called, which previously would silently produce 0x0000…0000 for any non-address subject. The fix is safe and the error message is clear.

resolveEventTriggerTemplates ordering in CreateTask is correct: template resolution must happen before the smart wallet ownership validation (since it resolves the runner address). The comments explain this well.

propagateLoopExecutionContext is straightforward and well-tested. The "pick first non-nil" strategy is justified by the "all iterations use same runner" invariant.

SimulateEventTriggerWithDecimals pointer semantics for tokenDecimals *uint32 is the right call to distinguish "unknown" from "zero". The switch/case handling of nil, 0, and positive values is clear.

cmd/createAdminKey.go now properly propagates the error from CreateAdminKey and exits with code 1. Previously silently ignoring the return value was a real operator footgun.


Summary

Severity Count
Should fix before merge 1 (personal path in CLAUDE.md)
Worth addressing 2 (Value semantics callout, hasOnChainNodes comment)
Minor 1 (diagnostic goroutine comment)

The core bug fixes (EventTrigger template resolution, fee classification, JWT validation) are well-implemented and well-tested. Recommend fixing the CLAUDE.md path before merging.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Release merge from stagingmain focused on EventTrigger correctness/observability, aligning value-fee classification between simulation and production, tightening JWT API key subject handling, and improving Sentry error capture.

Changes:

  • Fix EventTrigger template resolution before persistence; add loop ExecutionContext propagation and richer EventTrigger diagnostics.
  • Rework value-fee classification to use workflow definition (task.Nodes) instead of runtime gas/steps; add regression tests.
  • Require JWT sub to be an EOA address; pass error objects (not strings) to structured logger; adjust CI secrets usage.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
core/taskengine/vm.go Propagate loop ExecutionContext from iteration steps to parent loop step.
core/taskengine/vm_runner_contract_write.go Log error as err (not err.Error()) for better Sentry cause capture.
core/taskengine/value_fee_simulation_test.go Engine-level SimulateTask regression tests for value-fee classification.
core/taskengine/trigger/event.go Add EventTrigger heartbeat and WS-vs-HTTP FilterLogs cross-check diagnostics.
core/taskengine/tenderly_client.go Add SimulateEventTriggerWithDecimals and scale synthetic Transfer amounts by decimals.
core/taskengine/shared_event_enrichment.go Ensure Transfer value is raw base units and add valueFormatted when decimals known.
core/taskengine/run_node_immediately.go Deprecate applyDecimalsTo for Transfer, compute valueFormatted, pass token decimals into Tenderly simulation.
core/taskengine/loop_helpers.go Add propagateLoopExecutionContext helper.
core/taskengine/loop_helpers_test.go Unit tests for loop ExecutionContext propagation behavior.
core/taskengine/fee_estimator.go Introduce hasOnChainNodes; classify value fee from workflow definition; update buildValueFee signature.
core/taskengine/fee_estimator_test.go Add regression test ensuring buildValueFee uses definition not gas.
core/taskengine/executor.go Use task.Nodes for buildValueFee in real executions.
core/taskengine/event_trigger_constants.go Update Transfer response schema: value raw + new valueFormatted.
core/taskengine/engine.go Resolve EventTrigger templates on CreateTask; use nodes-based value-fee classification in SimulateTask.
core/taskengine/engine_resolve_event_trigger_test.go Regression tests for CreateTask-time EventTrigger template resolution.
core/apqueue/worker.go Log error as err object for Sentry.
cmd/createAdminKey.go Require subject to be an EOA address; exit non-zero on failure; improve command docs.
CLAUDE.md Document how to inspect stored workflow/task state via ava-sdk-js CLI.
aggregator/task_engine.go Log error as err object for Sentry.
aggregator/rpc_server.go Log error as err object for Sentry.
aggregator/key.go Validate JWT subject is a hex EOA address when creating admin key.
aggregator/auth.go Enforce JWT sub is a valid hex address before deriving smart wallet.
aggregator/aggregator.go Log migration error as err object for Sentry.
.github/workflows/run-test-on-pr.yml Read CHAIN_ENDPOINT and BUNDLER_RPC from GitHub Secrets instead of Vars.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread core/taskengine/engine.go Outdated
Comment on lines +889 to +895
if !strings.Contains(addr, "{{") {
continue
}
resolved := tempVM.preprocessTextWithVariableMapping(addr)
if resolved == addr {
continue
}
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolveEventTriggerTemplates currently skips validation when a string contains "{{" but preprocessTextWithVariableMapping returns it unchanged. That means unresolved templates (e.g. missing inputVariables) can still be persisted into q.Addresses, reintroducing the deploy-time/operator wildcard bug. Instead of continue on resolved == addr, return an error indicating the template could not be resolved (or always validate and fail when the result is not a hex address).

Copilot uses AI. Check for mistakes.
Comment thread core/taskengine/engine.go Outdated
}
resolved := tempVM.preprocessTextWithVariableMapping(topic)
if resolved == topic {
continue
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue for topics: if topic contains "{{" but resolution is a no-op, the code continues and persists the unresolved template string. This bypasses validateTopicHexFormat and can ship invalid/unresolved topic values to the operator. Consider treating resolved == topic as an error when the original contains template markers, and/or always validate the (resolved) topic whenever the original contained "{{".

Suggested change
continue
return fmt.Errorf("query[%d].topics[%d]: unresolved template in topic value: %s", queryIdx, i, topic)

Copilot uses AI. Check for mistakes.
Comment thread core/taskengine/trigger/event.go Outdated
Comment on lines +680 to +686
// DIAGNOSTIC: Cross-check the WS subscription against a
// one-shot historical FilterLogs over the last ~50 blocks
// via the HTTP RPC client. If FilterLogs returns >0 events
// but the WS subscription delivers 0 (heartbeat shows 0
// received), the WS layer is the problem, not the filter.
go func(q ethereum.FilterQuery, desc string) {
headerCtx, headerCancel := context.WithTimeout(ctx, 5*time.Second)
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The FilterLogs cross-check runs automatically in a goroutine for every newly created subscription. This adds extra RPC calls (HeaderByNumber + FilterLogs over ~50 blocks) that can be expensive if many tasks/subscriptions are created or frequently updated. Consider gating this behind a debug/diagnostic config flag, sampling, or a rate limiter to avoid unexpected RPC load in production.

Copilot uses AI. Check for mistakes.
Comment thread aggregator/auth.go Outdated

subject, _ := claims["sub"].(string)
if !common.IsHexAddress(subject) {
r.config.Logger.Error("API key has non-address subject; refusing to derive smart wallet for zero address",
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log message says "refusing to derive smart wallet for zero address", but this branch triggers for any non-hex subject (including empty or malformed strings). Updating the message to reflect "invalid/non-address subject" will make auth failures easier to diagnose.

Suggested change
r.config.Logger.Error("API key has non-address subject; refusing to derive smart wallet for zero address",
r.config.Logger.Error("API key has invalid/non-address subject; refusing to derive smart wallet",

Copilot uses AI. Check for mistakes.
Comment thread cmd/createAdminKey.go Outdated
@@ -11,15 +14,20 @@ var (
createApiKey = &cobra.Command{
Use: "create-api-key",
Short: "Create a long live JWT key to interact with userdata of AVS",
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Command description typo: "long live" should be "long-lived" (and typically "user data" instead of "userdata").

Suggested change
Short: "Create a long live JWT key to interact with userdata of AVS",
Short: "Create a long-lived JWT key to interact with user data of AVS",

Copilot uses AI. Check for mistakes.
… gen

The Test-environment secret CHAIN_ENDPOINT may be set with or without
a leading scheme depending on how it was provisioned. The previous
script unconditionally prepended https://, producing malformed URLs
like https://https//api-... when the secret already contained the
scheme. config.NewConfig then failed to parse the resulting yaml,
testConfig was nil, and every test that depended on it panicked.

Strip any leading https://, http://, wss://, ws:// from
CHAIN_ENDPOINT before composing CHAIN_RPC and CHAIN_WS. The script
now tolerates both hostname-only and full-URL forms.
Copilot + Claude review feedback on PR #509:

1. resolveEventTriggerTemplates: error on unresolved templates instead
   of silently passing through. preprocessTextWithVariableMapping
   returns the original {{...}} string when the referenced variable is
   missing; the previous "if resolved == addr/topic { continue }"
   short-circuit let those literals leak into the persisted trigger,
   reintroducing the operator wildcard bug. Now: always validate the
   resolved value (IsHexAddress for addresses, validateTopicHexFormat
   for topics), so a no-op resolution becomes a hard error. Adds two
   negative test cases.

2. CLAUDE.md: replace hardcoded /Users/mikasa/Code/ava-sdk-js/examples
   path with <path-to-ava-sdk-js>/examples placeholder.

3. trigger/event.go: expand the FilterLogs cross-check comment to
   clarify the goroutine fires once per subscription creation, not on
   a timer or per event. Cost is bounded by subscription churn.

4. fee_estimator.go: document that hasOnChainNodes only inspects the
   immediate Loop runner. Sufficient today (proto disallows nested
   Loops) but flagged as a future risk.

5. aggregator/auth.go: clarify error message for invalid JWT subject
   (was "zero address", now "invalid/non-address subject").

6. cmd/createAdminKey.go: typo in command Short — "long live" →
   "long-lived", "userdata" → "user data".
@chrisli30 chrisli30 merged commit f91ff27 into main Apr 8, 2026
18 checks passed
chrisli30 pushed a commit that referenced this pull request Apr 9, 2026
The verifier (aggregator/auth.go::verifyAuth, tightened in #509) requires
the JWT to have an `aud` claim containing the chain ID string. But
`CreateAdminKey` was building the JWT with only ExpiresAt, Issuer,
Subject, and Roles. As a result, every key generated by the
`create-api-key` CLI was rejected with "API key is invalid" — breaking
ava-sdk-js's test:core CI which generates a fresh key per run.

Dial the eth RPC inside CreateAdminKey to fetch the chain ID (NewAggregator
does not run the lifecycle init that would populate agg.chainID), then
add `Audience: jwt.ClaimStrings{chainID.String()}` to the registered claims.

Discovered while debugging AvaProtocol/ava-sdk-js#209 CI failures.
chrisli30 pushed a commit that referenced this pull request Apr 9, 2026
GetKey (the API key exchange handler used by SDK authWithAPIKey) calls
VerifyJwtKeyForUser, which had two bugs that combined to reject every
key generated by the modern create-api-key CLI:

1. It only treated the literal string "apikey" as an admin subject,
   but PR #509 changed CreateAdminKey to require an EOA address as
   the JWT subject — so admin keys never matched the role-based path.
2. For EOA-subject JWTs, even when claimAddress == userWallet, the
   function fell through to "Malform JWT Key Claim" because the
   address-match branch was missing a `return true, nil`.

Add an EOA-subject branch that mirrors the modern verifyAuth contract:
admin-role keys may manage any wallet; non-admin keys are valid only
when subject == userWallet.

Discovered while debugging AvaProtocol/ava-sdk-js#209 CI failures.
chrisli30 pushed a commit that referenced this pull request Apr 9, 2026
GetKey (the API key exchange handler used by SDK authWithAPIKey) calls
VerifyJwtKeyForUser, which had two bugs that combined to reject every
key generated by the modern create-api-key CLI:

1. It only treated the literal string "apikey" as an admin subject,
   but PR #509 changed CreateAdminKey to require an EOA address as
   the JWT subject — so admin keys never matched the role-based path.
2. For EOA-subject JWTs, even when claimAddress == userWallet, the
   function fell through to "Malform JWT Key Claim" because the
   address-match branch was missing a `return true, nil`.

Add an EOA-subject branch that mirrors the modern verifyAuth contract:
admin-role keys may manage any wallet; non-admin keys are valid only
when subject == userWallet.

Discovered while debugging AvaProtocol/ava-sdk-js#209 CI failures.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants