release: staging → main (event trigger fixes, fee classification, JWT API key, sentry logging)#509
Conversation
These values embed credentials (Dwellir API key, bundler URL) and were previously exposed as plaintext repo variables. Move them to environment secrets so they are masked in logs and encrypted at rest.
…l cause SentryLogger only uses tag values typed as `error` as the captured exception; passing `err.Error()` made Sentry fall back to the literal log message, collapsing distinct failures into one issue with an unactionable title.
The auth layer derives a smart wallet from the JWT `sub` claim, so a non-address subject (e.g. the previous "admin" default) silently coerced to the zero address. Validate the subject is a hex address both at key creation and at token verification, and propagate creation errors as a non-zero exit from the CLI.
The EventTrigger query topics array supports template literals like
"{{settings.runner}}" that should be substituted against the task's
inputVariables before persistence. The simulate path
(run_node_immediately.go) does this resolution; CreateTask did not.
Result: deployed event-triggered tasks shipped the literal string
"{{settings.runner}}" to the operator. The operator could not pad it as
a 32-byte topic, the slot effectively became a wildcard, and the filter
matched any event on the contract instead of the intended to=runner
constraint. Tasks fired on unrelated transfers and exhausted their
max_execution budget on noise events.
Add resolveEventTriggerTemplates() that mirrors the simulate-path
resolver: walks query addresses and topics, runs each {{...}}-bearing
string through the same VM-based preprocessor, validates the result
(common.IsHexAddress for addresses, validateTopicHexFormat for topics),
and mutates the proto in place before storage. Called from CreateTask
immediately after model.NewTaskFromProtobuf and before persistence so
the operator receives a fully-resolved trigger via the sync stream.
Includes regression tests for topic-address resolution, manual-trigger
no-op, and nil-trigger safety.
Add a section under Storage describing how to use the ava-sdk-js examples/example.ts CLI (yarn start getWorkflow <task_id>) to inspect the live aggregator's BadgerDB state. CreateTask's log lines only emit a high-level summary; the SDK getWorkflow path returns the full stored trigger payload (topics, addresses, inputVariables verbatim), which is the source of truth when debugging trigger filter issues.
Adds three diagnostic surfaces to the operator's event trigger loop to make WS log-stream health visible: 1. Startup confirmation log when the main goroutine enters its select loop, so we can distinguish "never started" from "started but idle". 2. 60-second heartbeat showing active subscription count, logs received in the last interval, total logs received, and logs channel buffer fill. The crucial signal: active_subscriptions > 0 with logs_received_last_60s == 0 for several minutes means the WS log stream is silent despite registered subscriptions. 3. Per-subscription FilterLogs cross-check that runs the same FilterQuery against the HTTP RPC for the last ~50 blocks immediately after each new WS subscription is created. If FilterLogs returns matches but the heartbeat shows zero received, the WS layer is broken independently of the filter shape or chain state. These diagnostics were used to characterize a Dwellir Sepolia WS gateway bug where eth_subscribe(logs, ...) with a multi-element topics array silently accepts the subscribe call and returns a real-looking subscription ID but never delivers matching events. The combination of "FilterLogs cross-check returns >0 matches" + "heartbeat shows 0 received" was the smoking gun.
|
Found 5 test failures on Blacksmith runners: Failures
|
PR Review: staging → main (event trigger fixes, fee classification, JWT API key, sentry logging)Overall this is solid work — clear bug fixes with good regression tests and well-documented intent. A few issues worth addressing before merge. Issues1. Hardcoded personal path in CLAUDE.mdThe new "Inspecting Aggregator Data" section in cd /Users/mikasa/Code/ava-sdk-js/examplesThis is a personal machine path. CLAUDE.md is a shared developer document committed to the repo — future contributors won't have this path. Replace with a relative or repo-relative reference, or describe it as 2. Breaking API change in
|
| Severity | Count |
|---|---|
| Should fix before merge | 1 (personal path in CLAUDE.md) |
| Worth addressing | 2 (Value semantics callout, hasOnChainNodes comment) |
| Minor | 1 (diagnostic goroutine comment) |
The core bug fixes (EventTrigger template resolution, fee classification, JWT validation) are well-implemented and well-tested. Recommend fixing the CLAUDE.md path before merging.
There was a problem hiding this comment.
Pull request overview
Release merge from staging → main focused on EventTrigger correctness/observability, aligning value-fee classification between simulation and production, tightening JWT API key subject handling, and improving Sentry error capture.
Changes:
- Fix EventTrigger template resolution before persistence; add loop ExecutionContext propagation and richer EventTrigger diagnostics.
- Rework value-fee classification to use workflow definition (
task.Nodes) instead of runtime gas/steps; add regression tests. - Require JWT
subto be an EOA address; pass error objects (not strings) to structured logger; adjust CI secrets usage.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| core/taskengine/vm.go | Propagate loop ExecutionContext from iteration steps to parent loop step. |
| core/taskengine/vm_runner_contract_write.go | Log error as err (not err.Error()) for better Sentry cause capture. |
| core/taskengine/value_fee_simulation_test.go | Engine-level SimulateTask regression tests for value-fee classification. |
| core/taskengine/trigger/event.go | Add EventTrigger heartbeat and WS-vs-HTTP FilterLogs cross-check diagnostics. |
| core/taskengine/tenderly_client.go | Add SimulateEventTriggerWithDecimals and scale synthetic Transfer amounts by decimals. |
| core/taskengine/shared_event_enrichment.go | Ensure Transfer value is raw base units and add valueFormatted when decimals known. |
| core/taskengine/run_node_immediately.go | Deprecate applyDecimalsTo for Transfer, compute valueFormatted, pass token decimals into Tenderly simulation. |
| core/taskengine/loop_helpers.go | Add propagateLoopExecutionContext helper. |
| core/taskengine/loop_helpers_test.go | Unit tests for loop ExecutionContext propagation behavior. |
| core/taskengine/fee_estimator.go | Introduce hasOnChainNodes; classify value fee from workflow definition; update buildValueFee signature. |
| core/taskengine/fee_estimator_test.go | Add regression test ensuring buildValueFee uses definition not gas. |
| core/taskengine/executor.go | Use task.Nodes for buildValueFee in real executions. |
| core/taskengine/event_trigger_constants.go | Update Transfer response schema: value raw + new valueFormatted. |
| core/taskengine/engine.go | Resolve EventTrigger templates on CreateTask; use nodes-based value-fee classification in SimulateTask. |
| core/taskengine/engine_resolve_event_trigger_test.go | Regression tests for CreateTask-time EventTrigger template resolution. |
| core/apqueue/worker.go | Log error as err object for Sentry. |
| cmd/createAdminKey.go | Require subject to be an EOA address; exit non-zero on failure; improve command docs. |
| CLAUDE.md | Document how to inspect stored workflow/task state via ava-sdk-js CLI. |
| aggregator/task_engine.go | Log error as err object for Sentry. |
| aggregator/rpc_server.go | Log error as err object for Sentry. |
| aggregator/key.go | Validate JWT subject is a hex EOA address when creating admin key. |
| aggregator/auth.go | Enforce JWT sub is a valid hex address before deriving smart wallet. |
| aggregator/aggregator.go | Log migration error as err object for Sentry. |
| .github/workflows/run-test-on-pr.yml | Read CHAIN_ENDPOINT and BUNDLER_RPC from GitHub Secrets instead of Vars. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if !strings.Contains(addr, "{{") { | ||
| continue | ||
| } | ||
| resolved := tempVM.preprocessTextWithVariableMapping(addr) | ||
| if resolved == addr { | ||
| continue | ||
| } |
There was a problem hiding this comment.
resolveEventTriggerTemplates currently skips validation when a string contains "{{" but preprocessTextWithVariableMapping returns it unchanged. That means unresolved templates (e.g. missing inputVariables) can still be persisted into q.Addresses, reintroducing the deploy-time/operator wildcard bug. Instead of continue on resolved == addr, return an error indicating the template could not be resolved (or always validate and fail when the result is not a hex address).
| } | ||
| resolved := tempVM.preprocessTextWithVariableMapping(topic) | ||
| if resolved == topic { | ||
| continue |
There was a problem hiding this comment.
Same issue for topics: if topic contains "{{" but resolution is a no-op, the code continues and persists the unresolved template string. This bypasses validateTopicHexFormat and can ship invalid/unresolved topic values to the operator. Consider treating resolved == topic as an error when the original contains template markers, and/or always validate the (resolved) topic whenever the original contained "{{".
| continue | |
| return fmt.Errorf("query[%d].topics[%d]: unresolved template in topic value: %s", queryIdx, i, topic) |
| // DIAGNOSTIC: Cross-check the WS subscription against a | ||
| // one-shot historical FilterLogs over the last ~50 blocks | ||
| // via the HTTP RPC client. If FilterLogs returns >0 events | ||
| // but the WS subscription delivers 0 (heartbeat shows 0 | ||
| // received), the WS layer is the problem, not the filter. | ||
| go func(q ethereum.FilterQuery, desc string) { | ||
| headerCtx, headerCancel := context.WithTimeout(ctx, 5*time.Second) |
There was a problem hiding this comment.
The FilterLogs cross-check runs automatically in a goroutine for every newly created subscription. This adds extra RPC calls (HeaderByNumber + FilterLogs over ~50 blocks) that can be expensive if many tasks/subscriptions are created or frequently updated. Consider gating this behind a debug/diagnostic config flag, sampling, or a rate limiter to avoid unexpected RPC load in production.
|
|
||
| subject, _ := claims["sub"].(string) | ||
| if !common.IsHexAddress(subject) { | ||
| r.config.Logger.Error("API key has non-address subject; refusing to derive smart wallet for zero address", |
There was a problem hiding this comment.
The log message says "refusing to derive smart wallet for zero address", but this branch triggers for any non-hex subject (including empty or malformed strings). Updating the message to reflect "invalid/non-address subject" will make auth failures easier to diagnose.
| r.config.Logger.Error("API key has non-address subject; refusing to derive smart wallet for zero address", | |
| r.config.Logger.Error("API key has invalid/non-address subject; refusing to derive smart wallet", |
| @@ -11,15 +14,20 @@ var ( | |||
| createApiKey = &cobra.Command{ | |||
| Use: "create-api-key", | |||
| Short: "Create a long live JWT key to interact with userdata of AVS", | |||
There was a problem hiding this comment.
Command description typo: "long live" should be "long-lived" (and typically "user data" instead of "userdata").
| Short: "Create a long live JWT key to interact with userdata of AVS", | |
| Short: "Create a long-lived JWT key to interact with user data of AVS", |
… gen The Test-environment secret CHAIN_ENDPOINT may be set with or without a leading scheme depending on how it was provisioned. The previous script unconditionally prepended https://, producing malformed URLs like https://https//api-... when the secret already contained the scheme. config.NewConfig then failed to parse the resulting yaml, testConfig was nil, and every test that depended on it panicked. Strip any leading https://, http://, wss://, ws:// from CHAIN_ENDPOINT before composing CHAIN_RPC and CHAIN_WS. The script now tolerates both hostname-only and full-URL forms.
Copilot + Claude review feedback on PR #509: 1. resolveEventTriggerTemplates: error on unresolved templates instead of silently passing through. preprocessTextWithVariableMapping returns the original {{...}} string when the referenced variable is missing; the previous "if resolved == addr/topic { continue }" short-circuit let those literals leak into the persisted trigger, reintroducing the operator wildcard bug. Now: always validate the resolved value (IsHexAddress for addresses, validateTopicHexFormat for topics), so a no-op resolution becomes a hard error. Adds two negative test cases. 2. CLAUDE.md: replace hardcoded /Users/mikasa/Code/ava-sdk-js/examples path with <path-to-ava-sdk-js>/examples placeholder. 3. trigger/event.go: expand the FilterLogs cross-check comment to clarify the goroutine fires once per subscription creation, not on a timer or per event. Cost is bounded by subscription churn. 4. fee_estimator.go: document that hasOnChainNodes only inspects the immediate Loop runner. Sufficient today (proto disallows nested Loops) but flagged as a future risk. 5. aggregator/auth.go: clarify error message for invalid JWT subject (was "zero address", now "invalid/non-address subject"). 6. cmd/createAdminKey.go: typo in command Short — "long live" → "long-lived", "userdata" → "user data".
The verifier (aggregator/auth.go::verifyAuth, tightened in #509) requires the JWT to have an `aud` claim containing the chain ID string. But `CreateAdminKey` was building the JWT with only ExpiresAt, Issuer, Subject, and Roles. As a result, every key generated by the `create-api-key` CLI was rejected with "API key is invalid" — breaking ava-sdk-js's test:core CI which generates a fresh key per run. Dial the eth RPC inside CreateAdminKey to fetch the chain ID (NewAggregator does not run the lifecycle init that would populate agg.chainID), then add `Audience: jwt.ClaimStrings{chainID.String()}` to the registered claims. Discovered while debugging AvaProtocol/ava-sdk-js#209 CI failures.
GetKey (the API key exchange handler used by SDK authWithAPIKey) calls VerifyJwtKeyForUser, which had two bugs that combined to reject every key generated by the modern create-api-key CLI: 1. It only treated the literal string "apikey" as an admin subject, but PR #509 changed CreateAdminKey to require an EOA address as the JWT subject — so admin keys never matched the role-based path. 2. For EOA-subject JWTs, even when claimAddress == userWallet, the function fell through to "Malform JWT Key Claim" because the address-match branch was missing a `return true, nil`. Add an EOA-subject branch that mirrors the modern verifyAuth contract: admin-role keys may manage any wallet; non-admin keys are valid only when subject == userWallet. Discovered while debugging AvaProtocol/ava-sdk-js#209 CI failures.
GetKey (the API key exchange handler used by SDK authWithAPIKey) calls VerifyJwtKeyForUser, which had two bugs that combined to reject every key generated by the modern create-api-key CLI: 1. It only treated the literal string "apikey" as an admin subject, but PR #509 changed CreateAdminKey to require an EOA address as the JWT subject — so admin keys never matched the role-based path. 2. For EOA-subject JWTs, even when claimAddress == userWallet, the function fell through to "Malform JWT Key Claim" because the address-match branch was missing a `return true, nil`. Add an EOA-subject branch that mirrors the modern verifyAuth contract: admin-role keys may manage any wallet; non-admin keys are valid only when subject == userWallet. Discovered while debugging AvaProtocol/ava-sdk-js#209 CI failures.
Release contents
8 commits from staging since the last main merge (plus 3 follow-ups in this PR):
9f933212dd4f71de3c15f7c052143371a57{{settings.*}}templates in CreateTask54b0f0558325bee992701db8c735ceb690eTransferEventResponse.ValuesemanticsTransferEventResponse.Valueis now the raw uint256 base-units string, not the formatted decimal-applied amount it used to be. The formatted-decimal value is now exposed as a new fieldValueFormatted. Any client code doing arithmetic or display onvaluedirectly will break for tokens with non-zero decimals (e.g. USDC):Origin: commit
ceb690e(already merged via #507). The wire-proto fieldvaluewas repurposed; no proto field rename happened so there is no protobuf compatibility break, but the semantic meaning of the field changed. SDK consumers must be updated to readvalueFormattedfor human-readable amounts. This is the main thing reviewers should sanity-check before approving.Highlights
EventTrigger correctness fix (
3371a57+ follow-up9f93321)engine.CreateTaskwas persisting EventTriggertopics[]verbatim, including unresolved template literals like"{{settings.runner}}". The operator received the literal string, could not pad it as a 32-byte topic, and the slot effectively became a wildcard — so deployed event-triggered tasks fired on unrelated transfers and exhausted theirmax_executionbudget on noise. This commit addsresolveEventTriggerTemplates()mirroring the simulate-path resolver and calls it before persistence. The follow-up fix removes the no-op short-circuit so an unresolvable{{...}}(e.g. missinginputVariableskey) is now a hard error instead of a silent passthrough. Includes 5 regression tests (resolved-OK, manual-trigger no-op, nil-trigger, unresolved-topic-errors, unresolved-address-errors).Operator observability (
de3c15f)Three new diagnostic surfaces in the EventTrigger main loop: startup confirmation, 60-second heartbeat (active subs / logs received / channel buffer), and per-subscription
FilterLogscross-check against the HTTP RPC. These let us distinguish "loop never started" / "loop alive but WS silent" / "WS delivering but filter wrong" definitively. Used to characterize a Dwellir Sepolia WS gateway bug (out-of-tree workaround: swap to Tenderly Sepolia in local configs).CI test config gen fix (
2dd4f71)The
Test-environment secretCHAIN_ENDPOINTmay be set with or without a leadinghttps://scheme. The previousgenerate-test-config.shunconditionally prependedhttps://, producinghttps://https//api-...and breaking config parsing. Now strips any leading scheme before prepending — tolerant to both forms.Storage migration check
go run scripts/compare_storage_structure.go mainresults:TransferEventResponse(see Breaking API change section above) — runtime response shape, no BadgerDB data migration required.Test plan
go test ./core/taskengine/ -run TestResolveEventTriggerTemplates— 5 regression tests pass (3 happy-path + 2 negative-case for unresolved templates)TransferEventResponse.ValueFormattedinstead ofValuefor human amountsmake testafter CI re-runs with the test-config-gen fix