Skip to content

Loop runner discards partial-success results array on per-iteration failure, returning data: null instead of [null, {...}] #511

@chrisli30

Description

@chrisli30

Summary

The Loop runner builds a partial-success results array (with nil for failed iterations) but then discards it on the error return path, causing the SDK to receive data: null instead of the per-iteration array.

This breaks the documented contract where a Loop with one failing inner-node iteration and one successful iteration should return [null, {…}] and report partial success.

Reproduction

Test in ava-sdk-js:

// tests/nodes/loopNode.test.ts:2380
test("should return flattened data format for contract_read runner in loop", ...)

The test creates a Loop whose Runner is a ContractRead with two contracts:

  1. Chainlink ETH/USD price feed (does not have name()/symbol()) → expected to revert per iteration → expected null entry
  2. USDC (has name()/symbol()) → expected to succeed → expected {name: "USDC", symbol: "USDC"} entry

Expected SDK response:

{
  success: true, // or partial — see "Open question" below
  data: [null, { name: "USDC", symbol: "USDC" }],
  ...
}

Actual SDK response:

{
  success: false,
  data: null,
  error: 'invalid request: contract call failed: execution reverted',
  errorCode: undefined,
  metadata: undefined,
  executionContext: undefined
}

The test fails deterministically in ~1.4s on every run, both in isolation and as part of yarn test:nodes.

Root cause

In core/taskengine/vm.go, the loop execution path:

// Lines ~4711-4738 (parallel) and ~4771-4796 (sequential)
for i, resultChannel := range resultChannels {
    select {
    case result := <-resultChannel:
        if result.Error != nil {
            success = false
            if firstError == nil {
                firstError = result.Error
            }
            log.WriteString(fmt.Sprintf("\nError in iteration %d: %s", i, result.Error.Error()))
            // ← results[i] is left as nil (intentional partial-success placeholder)
        } else {
            results[i] = result.Data
        }
        // ...
    case <-time.After(iterationTimeout):
        // similar
    }
}

After the loop, partial-success aggregation works correctly:

// Lines ~4811-4824
jsonSerializableResults := make([]interface{}, len(results))
for i, result := range results {
    if result != nil {
        jsonSerializableResults[i] = convertToJSONCompatible(result)
    } else {
        jsonSerializableResults[i] = nil  // ← preserves the per-iteration null
    }
}
outputValue, err := structpb.NewValue(jsonSerializableResults)
// ...
loopOutput := &avsproto.LoopNode_Output{Data: outputValue}
s.OutputData = &avsproto.Execution_Step_Loop{Loop: loopOutput}

At this point s.OutputData correctly holds [null, {name: "USDC", symbol: "USDC"}]. But then:

// Lines ~4833-4836
if !success && firstError != nil {
    finalizeStep(s, false, nil, firstError.Error(), log.String())
    return s, firstError
}

finalizeStep(s, true, nil, "", log.String())
return s, nil

The loop returns firstError whenever any iteration failed, and the runNodeWithInputs / runNodeImmediately response builder sees the non-nil error and discards s.OutputData, replacing the response's data field with null. The partial-success array that was just built is silently thrown away.

Proposed fix

The intent of results[i] = nil on a per-iteration error is clearly partial-success bookkeeping. The fix is to honor that intent on the return path:

  1. Distinguish between "loop infrastructure failure" (queue submit error, timeout collecting results, internal panic) and "per-iteration runner failure" (the inner ContractRead/ContractWrite reverted).
  2. For per-iteration runner failures, finalize the step as success or partial-success, preserve OutputData, and return (s, nil) so downstream sees the data.
  3. For infrastructure failures, keep the current behavior (return the error).

Sketch:

allFailed := success == false && countNonNilResults(results) == 0
hardInfraFailure := loopSetupErr != nil // not the per-iteration errors

if hardInfraFailure {
    // current behavior — propagate
    finalizeStep(s, false, nil, firstError.Error(), log.String())
    return s, firstError
}

if allFailed {
    // every iteration failed — finalize as failed but still preserve the
    // results array so the client can see which iterations failed and why
    finalizeStep(s, false, nil, firstError.Error(), log.String())
    return s, nil // ← key change: return nil error so OutputData is preserved
}

// At least one iteration succeeded → partial or full success
finalizeStep(s, true, nil, "", log.String())
return s, nil

The exact success: true vs success: partial decision is a separate API choice (see "Open question" below), but at minimum the response data field must contain the per-iteration array, not null.

Open question: success vs partial-success vs failure

Three possible response shapes for "one iteration succeeded, one failed":

Choice success field data field error field
A true [null, {…}] ""
B partial (new tri-state) [null, {…}] "iteration 0 failed: …"
C false [null, {…}] "iteration 0 failed: …"

Currently the operator effectively does D: success: false, data: null, error: "..." — which loses information.

The test expects A (success: true), suggesting the convention is "if the loop ran to completion, it succeeded; per-iteration outcomes are reflected in the data array, not the top-level success flag." That seems consistent with how loop runners are commonly used (you want to process all items and inspect failures after, not bail on first error). I'd recommend A, but B or C also fix the bug as long as data is preserved.

Impact

  • ava-sdk-js test loopNode > flattened data format for contract_read runner in loop has been failing deterministically since at least the 720b0a2 commit that added it. Currently the only test exercising the documented per-iteration partial-failure contract.
  • Any production workflow that uses Loop > ContractRead against a heterogeneous list of contracts (e.g., reading metadata for a list of arbitrary tokens, some of which might not implement the requested method) silently loses all results when any one contract reverts. The user has no way to know which contracts succeeded — the entire array is null.
  • Symmetric bug likely exists in Loop > ContractWrite and Loop > GraphQL/Loop > REST runners — anywhere the per-iteration runner can fail individually. Worth grepping for the same pattern in those code paths.

Repro environment

Related code

  • core/taskengine/vm.go — loop result aggregation (lines ~4690-4840)
  • core/taskengine/loop_helpers.go — per-iteration template processing
  • The downstream runNodeWithInputs response builder that drops OutputData when the step returns a non-nil error

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions