Skip to content

[pull] main from triggerdotdev:main#162

Merged
pull[bot] merged 1 commit into
Dustin4444:mainfrom
triggerdotdev:main
May 23, 2026
Merged

[pull] main from triggerdotdev:main#162
pull[bot] merged 1 commit into
Dustin4444:mainfrom
triggerdotdev:main

Conversation

@pull

@pull pull Bot commented May 23, 2026

Copy link
Copy Markdown

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

…ication (#3708)

## Summary

On a ClickHouse `Cannot parse JSON object` rejection,
`RunsReplicationService` now sanitizes lone UTF-16 surrogates across the
failing batch via the existing `sanitizeRows` helper and retries once.
If the sanitizer found nothing or the retry also fails, the batch is
dropped loudly with a counter increment, so the surrounding
`#insertWithRetry` layer doesn't spin three more times on a
deterministic failure. Non-parse errors propagate unchanged.

Mirrors the pattern from #3659 (for `ClickhouseEventRepository`) — same
root cause (lone UTF-16 surrogates in user-provided JSON), same recovery
shape, **reusing the same shared helpers** (`sanitizeRows`,
`isClickHouseJsonParseError`, `parseRowNumberFromError`).

Fixes the customer-facing symptom from
[TRI-9755](https://linear.app/triggerdotdev/issue/TRI-9755): a single
row's poisoned `output` JSON used to take down the
`COMPLETED_SUCCESSFULLY` UPDATE events for its 50+ batch-mates,
stranding them in `EXECUTING` in ClickHouse forever and inflating
"Running" counts on the Tasks page. Confirmed in production this is
ongoing — ~120k stale rows accumulated in a single 5-hour burst on
2026-05-18; smaller continuous leak before and after.

## What changed

`apps/webapp/app/services/runsReplicationService.server.ts`:

- Imports the three helpers from
`~/v3/eventRepository/sanitizeRowsOnParseError.server` (no duplication;
no move).
- New private `#insertWithJsonParseRecovery<T>(rows, doInsert,
contextLabel, attempt)` — generic over `TaskRunInsertArray[]` and
`PayloadInsertArray[]`, structurally identical to
`ClickhouseEventRepository.#insertWithJsonParseRecovery`. Try → on parse
error sanitize the whole batch (the `at row N` hint is logged but not
used to slice — semantics under `input_format_parallel_parsing` aren't
stable) → retry once → drop with loud log if sanitizer found nothing OR
retry still fails.
- `#insertTaskRunInserts` and `#insertPayloadInserts` extract a
`doInsert` closure and hand it to the wrapper. Existing error logging,
span recording, and `recordSpanError` are preserved inside the closure.
- New `private _permanentlyDroppedBatches = 0` counter with a public
getter, for ops dashboards and tests (matches the events-repo
convention). One shared counter for both insert sites — granularity
comes from the `contextLabel` (`task_runs_v2` /
`raw_task_runs_payload_v1`) on every log line.

`.server-changes/runs-replication-utf16-recovery.md` — release notes
entry.

## Why no new tests

The shared helpers already have full unit + real-ClickHouse contract
coverage from #3659
(`apps/webapp/test/sanitizeRowsOnParseError.test.ts`,
`apps/webapp/test/otlpUtf16Sanitization.integration.test.ts`). The new
wrapper is a line-for-line structural port. Adding a parallel
integration test would require synthesizing bad data that *escapes* the
preemptive `detectBadJsonStrings` check in `#prepareJson` but still
trips ClickHouse — non-trivial without hand-crafted fixtures and
wouldn't cover any new logic.

## What this does NOT do

- Doesn't touch the ~120k existing stale `EXECUTING` rows in production.
That needs a reconciliation/backfill sweep (separate ticket — TRI-9755
fix #3).
- Doesn't sanitize the `error` column path
(`runsReplicationService.server.ts:932 const errorData = { data:
run.error };`). Reactive recovery will catch it if it ever poisons a
batch, but feeding it through `#prepareJson` like `output` is a cheap
follow-up.

## Test plan

- [x] `pnpm run typecheck --filter webapp` — clean
- [ ] Post-deploy: confirm `permanentlyDroppedBatches` counter stays at
zero (or near-zero) in
`/stp/trigger-app-prod/ecs/replication/service-container/process-logs`,
and watch for `Sanitizing batch after ClickHouse JSON parse error` warns
to confirm recovery is firing on real traffic
- [ ] Post-deploy: confirm the rate of new
"EXECUTING-but-actually-COMPLETED" zombies in ClickHouse flattens
(current rate ≈ tens-to-hundreds per hour platform-wide)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@pull pull Bot locked and limited conversation to collaborators May 23, 2026
@pull pull Bot added the ⤵️ pull label May 23, 2026
@pull pull Bot merged commit eefb96c into Dustin4444:main May 23, 2026
0 of 3 checks passed
@pull pull Bot had a problem deploying to dependabot-summary May 24, 2026 09:55 Failure
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant