
trap host function to avoid wrong atomically() behavior#3309

Open
vigoo wants to merge 3 commits intomainfrom
atomically-fix

Conversation

@vigoo

@vigoo vigoo commented Apr 29, 2026

Before this change, atomically { ... } worked in the following way:

  • emit a begin atomic region marker in the oplog
  • if the block succeeds, emit an end atomic region marker in the oplog
  • if it fails (= traps), the region is not closed; when Golem replays the oplog, it detects the missing end marker, invalidates the partial oplog region with a jump, switches back to live mode, and thereby retries the atomic region from scratch
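The replay detection in the last bullet can be sketched as follows. Note that `OplogEntry` and `findUnclosedRegion` are invented for illustration and do not match Golem's real oplog format or internals:

```typescript
// Hypothetical oplog entry shape (illustration only, not Golem's actual format)
type OplogEntry =
  | { kind: 'begin-atomic'; index: number }
  | { kind: 'end-atomic'; begin: number }
  | { kind: 'op'; payload: string };

// Returns the index replay should jump back to if the most recent atomic
// region was never closed, or null if every region is properly closed.
function findUnclosedRegion(oplog: OplogEntry[]): number | null {
  const open: number[] = [];
  for (const entry of oplog) {
    if (entry.kind === 'begin-atomic') open.push(entry.index);
    else if (entry.kind === 'end-atomic') open.pop();
  }
  return open.length > 0 ? open[open.length - 1] : null;
}
```

A region that traps before its end marker is emitted shows up as an unclosed region, so replay knows to invalidate it and retry from the begin marker.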

This was mostly tested in Rust with panics as the failure source, in which case it works well.

The problem is that in the TS SDK (and the Scala one), errors manifest as JS exceptions thrown out of the atomic region, and they only become a trap if uncaught at the top level.

Until recently (within the last 2 weeks), the TS atomically always closed the atomic region, even when the inner promise failed. This made it completely unusable, because during recovery the atomic region was considered successfully closed and the switch-to-live event never happened. So I changed it earlier to not close the region on failure, which matches the Rust SDK behavior.

The problem is that the exception can be caught outside of the atomic region and that's what Maxim's example was doing:

```typescript
export async function llmCall(
  client: OpenAI,
  params: OpenAI.Chat.ChatCompletionCreateParamsNonStreaming,
): Promise<string> {
  let lastError: unknown;

  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    if (attempt > 0) {
      await sleep(backoffMs(attempt - 1));
    }

    try {
      // Each attempt is its own atomic block so Golem records a clean
      // checkpoint rather than resuming from an incomplete HTTP write.
      const content = await atomically(async () => {
        const response = await client.chat.completions.create(params);
        return response.choices[0]?.message?.content ?? '';
      });
      return content;
    } catch (err) {
      lastError = err;
      if (!isRetryable(err)) {
        throw err;
      }
      console.error(
        `LLM call failed (attempt ${attempt + 1}/${MAX_ATTEMPTS}): ` +
        `${err instanceof APIError ? `HTTP ${err.status} ${err.message}` : String(err)}`
      );
    }
  }

  throw lastError;
}
```

Here, in case of a failure the atomic region is NOT closed, but we never trap because we catch the exception. That leaves the atomic region open, which leads to all kinds of unwanted behavior.

I investigated ways to make this work. It would probably be possible to introduce new oplog entries and host functions to record the exception and close the region with it, and then during replay re-emit the exception so our external retry path gets replayed. I decided this is too hard to do, especially because it requires reconstructing the exception completely and adding new oplog entries.

So what I did in this PR instead is to never allow exceptions out of the atomically block (this will need to be further documented in a follow-up docs/skills PR) and always trap. To be able to trap in JS without having to bubble the exception to the top level, I introduced a trap host function. This way the atomic regions work as intended in all languages.
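The new behavior can be sketched like this. The `markBeginAtomicRegion`, `markEndAtomicRegion`, and `trap` functions below are hypothetical stand-ins for the real host functions (names and signatures invented here; the stubs only model the behavior for illustration):

```typescript
// Hypothetical host-function stubs (not the real Golem host API)
function markBeginAtomicRegion(): number {
  return 0; // real impl: emits a begin marker into the oplog
}
function markEndAtomicRegion(_begin: number): void {
  // real impl: emits the end marker, closing the region
}
function trap(message: string): never {
  // real impl: aborts the whole worker so recovery retries the open region;
  // modeled here as a thrown error purely for illustration
  throw new Error(`trap: ${message}`);
}

// Sketch of the fixed atomically(): the region is closed only on success;
// any exception becomes a trap, so user code can never catch it and
// accidentally leave the region open.
async function atomically<T>(f: () => Promise<T>): Promise<T> {
  const begin = markBeginAtomicRegion();
  let result: T;
  try {
    result = await f();
  } catch (err) {
    trap(`atomic region failed: ${String(err)}`);
  }
  markEndAtomicRegion(begin);
  return result;
}
```

With this shape, a retry loop like the one in Maxim's example never sees the inner exception; the worker traps, the unclosed region is invalidated on replay, and the block is retried from scratch.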

@mschuwalow

Note that the outer retry was only needed in the example because retryPolicy did not work here (the region was closed automatically, making the error happen outside the retryPolicy region).

But I think the changes make sense either way

@mschuwalow

Does it make sense to apply the same "trap when an error bubbles up" to the retry policies? I think it would solve all issues we have there as well

@vigoo

vigoo commented Apr 30, 2026

> Does it make sense to apply the same "trap when an error bubbles up" to the retry policies? I think it would solve all issues we have there as well

maybe, but I'd like to think that through separately (there is another ticket about that problem)


Successfully merging this pull request may close these issues:

Check how http retry is falling to "non-idempotent non-retryable" case, may be too strict