[pull] main from triggerdotdev:main#103

Merged
pull[bot] merged 2 commits into Dustin4444:main from triggerdotdev:main
May 11, 2026
pull[bot] commented May 11, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

ericallam added 2 commits May 11, 2026 07:17
…over (#3548)

## Summary

During an ElastiCache role swap (failover) or node-type change (vertical
scale), the ioredis TCP/TLS connection stays open but the server starts
answering with `READONLY` (the client is talking to a node that became a
replica) or `LOADING` (node still loading data from disk). Without an
explicit hook, those errors surface to caller code as `ReplyError`
instances — every write op on the affected connection fails until the
cluster fully cuts over.

This PR adds `reconnectOnError` to every prod ioredis client so the
disconnect + reconnect + retry cycle absorbs these errors and caller
code never sees them.

## Fix

```ts
export function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
  const msg = err.message ?? "";
  if (msg.startsWith("READONLY") || msg.startsWith("LOADING")) return 2;
  return false;
}
```

Returning `2` tells ioredis to disconnect, reconnect, and re-issue the
failed command. After reconnect, DNS / SG state routes the new socket to
a writable node.
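
The disconnect, reconnect, and re-issue cycle described above can be modelled with a toy example. This is a minimal sketch, not ioredis's internal code: `FakeConnection`, `sendWithReconnect`, and `connectFake` are hypothetical names invented for illustration, and the `READONLY` message text is only an example of a typical reply.

```ts
type ReconnectDecision = boolean | 1 | 2;

function defaultReconnectOnError(err: Error): ReconnectDecision {
  const msg = err.message ?? "";
  if (msg.startsWith("READONLY") || msg.startsWith("LOADING")) return 2;
  return false;
}

// Toy stand-in for a socket to one node (hypothetical, for illustration only).
class FakeConnection {
  private readonly writable: boolean;

  constructor(writable: boolean) {
    this.writable = writable;
  }

  send(command: string): string {
    if (!this.writable) {
      throw new Error("READONLY You can't write against a read only replica.");
    }
    return `OK ${command}`;
  }
}

// Hypothetical driver that models how a return value of 2 is handled.
function sendWithReconnect(
  command: string,
  connect: () => FakeConnection,
  reconnectOnError: (err: Error) => ReconnectDecision
): string {
  let conn = connect();
  try {
    return conn.send(command);
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    if (reconnectOnError(error) === 2) {
      conn = connect(); // drop the old socket and open a new one
      return conn.send(command); // re-issue the failed command once
    }
    // Any other decision: in this toy model the error reaches the caller.
    throw error;
  }
}

// The first connection lands on a replica, the second on a writable node.
let attempts = 0;
const connectFake = () => new FakeConnection(attempts++ > 0);
const firstReply = sendWithReconnect("SET key value", connectFake, defaultReconnectOnError);
console.log(firstReply, attempts);
```

In this model the caller only ever sees the successful reply from the second connection, which is the behavior the PR is aiming for.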

The helper lives in `@internal/redis` and is wired into both the shared
`createRedisClient` (which covers RunQueue, schedule-engine,
redis-worker, and every other internal-package consumer) and the direct
`new Redis(...)` call sites in the webapp.
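
As a rough illustration of the wiring, the sketch below applies the helper as a default on a client options object. It is not the actual `createRedisClient` code: `MinimalRedisOptions` is a local stand-in for the real ioredis options type, and `withDefaultReconnect` is a hypothetical helper name.

```ts
type ReconnectDecision = boolean | 1 | 2;

// Local stand-in for the subset of ioredis options used here (assumption:
// real code would use the options type exported by "ioredis").
interface MinimalRedisOptions {
  host?: string;
  port?: number;
  reconnectOnError?: ((err: Error) => ReconnectDecision) | null;
}

function defaultReconnectOnError(err: Error): ReconnectDecision {
  const msg = err.message ?? "";
  if (msg.startsWith("READONLY") || msg.startsWith("LOADING")) return 2;
  return false;
}

// Hypothetical helper: apply the default hook only when the caller has not
// set one, so an explicit hook (or an explicit null) is left untouched.
function withDefaultReconnect(options: MinimalRedisOptions): MinimalRedisOptions {
  return {
    ...options,
    reconnectOnError:
      options.reconnectOnError === undefined
        ? defaultReconnectOnError
        : options.reconnectOnError,
  };
}

// With ioredis installed, the result would be passed to `new Redis(...)`.
const clientOptions = withDefaultReconnect({ host: "localhost", port: 6379 });
console.log(clientOptions.reconnectOnError === defaultReconnectOnError);
```

Applying the hook as a default rather than an override is a design assumption of this sketch; it lets a call site opt out without touching the shared helper.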

V1-only marqs files are intentionally not migrated.

## Test plan

- [x] `pnpm run typecheck --filter webapp`
- [x] `pnpm run typecheck --filter @internal/run-engine`
- [x] Verified end-to-end against a live ElastiCache vertical-scale
event — caller-surfaced errors went from tens of thousands during the
cutover window down to a handful per ioredis client
- [ ] Confirm steady-state behavior unchanged after deploy
…3549)

## Summary

When ElastiCache demotes a primary to replica — during a Multi-AZ
failover or a vertical node-type change — the demoting primary issues an
`UNBLOCKED` reply to any in-flight blocking commands (`BLPOP`, `BRPOP`,
`BLMOVE`, `XREADGROUP ... BLOCK`, etc.) to clear them before the role
flips. ioredis surfaces these as `ReplyError` to caller code.

The shared `defaultReconnectOnError` added in #3548 only matches
`READONLY` and `LOADING`. This extends it to `UNBLOCKED` so the
disconnect-reconnect-retry cycle handles errors from blocking commands
the same way the existing two cases handle errors from non-blocking
commands.

## Fix

```ts
export function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
  const msg = err.message ?? "";
  if (
    msg.startsWith("READONLY") ||
    msg.startsWith("LOADING") ||
    msg.startsWith("UNBLOCKED")
  ) {
    return 2;
  }
  return false;
}
```

Returning `2` tells ioredis to disconnect, reconnect, and re-issue the
command. For a BLPOP that means a fresh BLPOP against the new primary
instead of the `UNBLOCKED` error escaping to the caller.
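
A quick table-driven check of the extended helper might look like the sketch below. The sample messages are illustrative examples of typical Redis error replies, not captured logs from this PR, and `sampleReplies` and `mismatches` are names chosen for this example.

```ts
function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
  const msg = err.message ?? "";
  if (
    msg.startsWith("READONLY") ||
    msg.startsWith("LOADING") ||
    msg.startsWith("UNBLOCKED")
  ) {
    return 2;
  }
  return false;
}

// Illustrative error messages paired with the decision the helper should return.
const sampleReplies: Array<[string, boolean | 1 | 2]> = [
  ["READONLY You can't write against a read only replica.", 2],
  ["LOADING Redis is loading the dataset in memory", 2],
  ["UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)", 2],
  ["ERR wrong number of arguments for 'get' command", false],
  ["WRONGTYPE Operation against a key holding the wrong kind of value", false],
];

// Collect every sample whose classification differs from the expectation.
const mismatches = sampleReplies.filter(
  ([message, expected]) => defaultReconnectOnError(new Error(message)) !== expected
);
console.log(mismatches.length === 0 ? "all samples classified as expected" : mismatches);
```

Because the helper matches on prefixes only, errors that merely mention one of these words later in the message are not retried.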

## Test plan

- [ ] CI green
- [ ] Trigger a Multi-AZ failover or a vertical scale event on an
ElastiCache replication group whose clients are running blocking
commands and confirm no `UNBLOCKED` errors surface to caller code during
the cutover.
pull[bot] locked and limited conversation to collaborators May 11, 2026
pull[bot] added the ⤵️ pull label May 11, 2026
pull[bot] merged commit a5ba406 into Dustin4444:main May 11, 2026
0 of 3 checks passed