fix(ilp): surface WebSocket frame-size errors to sender callers by jerrinot · Pull Request #13 · questdb/java-questdb-client

jerrinot · 2026-04-21T14:18:56Z

Improve QWP/WebSocket sender error reporting so failures observed by the I/O thread are surfaced back to the caller instead of getting hidden behind generic queue state errors.

The main user-visible case is an oversized server response frame: callers now get the underlying WebSocket/QWP failure, with actionable context, when they next call into the sender.

Also clarifies the failure contract: after a WebSocket connection is established, send failures, ACK errors, server error ACKs, invalid ACKs, timeouts, or server closes put the sender into a terminal failed state. The sender retains the first failure and rethrows it from subsequent public calls; callers should close it and create a new sender to resume sending.

Report oversized WebSocket frames with the received frame size, configured max size, and guidance to decrease batch size. Also propagate terminal WebSocket sender failures at Sender level so async ACK errors surface promptly.

Swap them for zero-GC Utf8s helpers.

…ting

bluestreak01 · 2026-04-22T10:40:16Z

@jerrinot — critical review follow-up. After double-checking, here are the confirmed findings (a couple of my earlier points turned out to be wrong and have been withdrawn).

Confirmed concerns worth addressing before merge

1. Terminal sender state is permanent — document the contract
connected is set to true once in ensureConnected and never reset, so once connectionError is non-null, every subsequent call hits checkNotClosed() → checkConnectionError(). The user has no recovery API and must construct a new sender. The test testConnectionFailureIsSenderLevelTerminalState confirms this is intentional, but the contract should be in the Javadoc of QwpWebSocketSender / flush().

2. Stale stack trace on rethrow
checkConnectionError() does throw error; where error is the stored LineSenderException. In Java, re-throwing a pre-constructed exception does not update its stack trace. So every subsequent user-thread call that fails shows the original I/O-thread stack, not the call site — confusing for users debugging table() / symbol() calls that appear to fail "inside the WebSocket I/O loop." Either document this, or call error.fillInStackTrace() before rethrow (O(depth), only on the failure path).

3. PR title mismatch
Title says "fix(ilp): report actionable QWP websocket frame-size errors", but that work already shipped in be79e84. This PR's actual contents are zero-GC UTF-8 helpers + terminal sender-level connection error propagation. Retitle for reviewers / future git log readers.

Smaller items (not blockers)

4. WebSocketSendBuffer.putUtf8 wastes work on non-ASCII strings — core/src/main/java/io/questdb/client/cutlass/http/client/WebSocketSendBuffer.java:333-354
The ASCII fast-path encodes the prefix, then on the first non-ASCII char falls back to NativeBufferWriter.utf8Length(value) + Utf8s.strCpyUtf8(value, ...) which re-scan and re-encode from index 0. A string with é at position 1000 is effectively double-scanned. Unsafe.realloc preserves the ASCII bytes, so correctness is fine — but the "single-pass ASCII path" comment is misleading. Easy fix: pass i into the fallback and resume encoding from the first non-ASCII char (mirroring what NativeBufferWriter.putUtf8 already does).

5. WebSocketResponse scans errorMessage up to 3× per response
Callers typically invoke serializedSize() + writeTo(), which scan with utf8Bytes twice and then strCpyUtf8 once. Not hot path (error response only), but worth noting.

6. Utf8s.utf8Bytes(CharSequence) and utf8Bytes(CharSequence, int) duplicate the branching logic
Minor DRY — the unlimited variant could delegate to the limited one with Integer.MAX_VALUE.

7. WebSocketSendQueue.failConnection is non-atomic, safe only because it's I/O-thread-only
The read-modify-write of lastError is correct because all callers (safeSendBatch, tryReceiveAcks, ResponseHandler.onBinaryMessage, ResponseHandler.onClose) run on the I/O thread. A one-line comment stating this invariant would protect against a future refactor that moves a call site off-thread and silently breaks the first-failure guarantee.

8. checkConnectionError() in catch blocks — document intent
In flush(), flushSync(), and sealAndSwapBuffer() catch blocks, checkConnectionError() is called before throw e. If it throws, e is silently shadowed. This is actually the desired behavior: WebSocketSendQueue.failConnection sets lastError before invoking the listener, so by the time the user thread reaches the catch, connectionError holds the original unwrapped error (cleaner message than e, which is checkError()'s wrapped version). A one-line comment on the pattern would make the intent obvious.

9. WebSocketResponse.readFrom still allocates per error ACK
Line 202 new byte[msgLen] + line 206 new String(...). Out of scope for this PR, but the remaining GC site on the response path if you want the pair to be symmetric with the zero-GC send side.

Points I initially raised but had to withdraw

"Race on connectionError.set(null) during reconnect" — not reachable. connected never resets, so ensureConnected's inner block runs at most once per sender, and on first entry connectionError is already null. The I/O thread starts in IDLE state (new inFlightWindow count=0, pending buffer null) and immediately wait(100) on the lock — it can't fire recordConnectionFailure during that window.
"Redundant checkError() calls in enqueue" — not redundant. Comparing old vs. new pattern, the extras are deliberate: prefer surfacing the I/O root cause over the generic "Send queue is not running" message, and the inner call after !running covers the race window where the I/O thread fails mid-method. My suggestion to drop the outer call was wrong.

Recommendation

Land after (1), (2), (3). The rest are comments / follow-ups.

length field for non-OK responses, including empty messages.

mtopolnik · 2026-04-22T14:53:18Z

[PR Coverage check]

😍 pass : 153 / 179 (85.47%)

file detail

	path	covered line	new line	coverage
🔵	io/questdb/client/cutlass/qwp/client/QwpWebSocketSender.java	23	35	65.71%
🔵	io/questdb/client/cutlass/http/client/WebSocketSendBuffer.java	14	20	70.00%
🔵	io/questdb/client/cutlass/qwp/client/WebSocketSendQueue.java	17	22	77.27%
🔵	io/questdb/client/cutlass/qwp/client/WebSocketResponse.java	16	19	84.21%
🔵	io/questdb/client/cutlass/qwp/client/NativeBufferWriter.java	6	6	100.00%
🔵	io/questdb/client/std/str/Utf8s.java	77	77	100.00%

Wires the drainer runtime onto the orphan-scanner foundation. With drain_orphans=true the foreground sender now actually empties sibling slots holding unacked data instead of just logging that they exist. Per-drainer lifecycle: 1. Open CursorSendEngine on the slot — its constructor takes the slot lock; if another sender or drainer holds it, the engine throws and the drainer exits silently (LOCKED_BY_OTHER, not a failure). 2. Open a fresh WebSocketClient via the foreground sender's connect factory — separate connection, same auth/host/port/TLS config. 3. Run a CursorWebSocketSendLoop until ackedFsn catches up to the publishedFsn snapshot taken at startup. 4. On terminal failure (auth, recovery, budget), drop a .failed sentinel into the slot. Future scans skip it until an operator clears it manually — bounded retry, then human-in-the-loop. Pool: bounded fixed-thread executor, daemon threads, sized by max_background_drainers (default 4). Closes via cooperative stop + 3s grace; daemon threads ensure no JVM-exit blocking. Visibility: QwpWebSocketSender#getBackgroundDrainers returns a snapshot list of live drainers with {slot, target, acked, outcome, lastError}. Test: ghost sender writes 30 distinct rows against a silent server and closes fast — leaves an unacked slot. Foreground sender opens the same group root with a different sender_id and drain_orphans=true against an ack server; asserts every distinct payload reaches the new server. Plus a sentinel-skip test confirming an operator-set .failed file disqualifies the slot from the next foreground run's scan. Empty active segments and stale hot spares are left in the slot dir per spec decision #13 ("no automatic cleanup of empty slot dirs"); the scanner's no-op behavior on empty slots makes this cheap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(ilp): report actionable QWP websocket frame-size errors

be79e84

Report oversized WebSocket frames with the received frame size, configured max size, and guidance to decrease batch size. Also propagate terminal WebSocket sender failures at Sender level so async ACK errors surface promptly.

jerrinot marked this pull request as draft April 21, 2026 14:20

jerrinot marked this pull request as ready for review April 21, 2026 14:21

jerrinot added 2 commits April 21, 2026 17:10

the close-frame and bad-request paths allocated a byte[] per response.

90b0b70

Swap them for zero-GC Utf8s helpers.

Merge remote-tracking branch 'origin/main' into jh_better_error_repor…

d4f449f

…ting

jerrinot added 4 commits April 22, 2026 14:47

contract clarification

be30924

refresh QWP WebSocket terminal error stack traces

ab680be

optimize UTF-8 fallback after ASCII prefix

6a6698d

cache encoded error-message length and always emit the error

699fe26

length field for non-OK responses, including empty messages.

jerrinot changed the title ~~fix(ilp): report actionable QWP websocket frame-size errors~~ fix(qwp): surface WebSocket frame-size errors to sender callers Apr 22, 2026

jerrinot changed the title ~~fix(qwp): surface WebSocket frame-size errors to sender callers~~ fix(ilp): surface WebSocket frame-size errors to sender callers Apr 22, 2026

bluestreak01 approved these changes Apr 24, 2026

View reviewed changes

bluestreak01 merged commit f1dd21c into main Apr 24, 2026
14 checks passed

bluestreak01 deleted the jh_better_error_reporting branch April 24, 2026 01:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(ilp): surface WebSocket frame-size errors to sender callers - #13

fix(ilp): surface WebSocket frame-size errors to sender callers#13
bluestreak01 merged 7 commits into
mainfrom
jh_better_error_reporting

jerrinot commented Apr 21, 2026 •

edited

Loading

Uh oh!

bluestreak01 commented Apr 22, 2026

Uh oh!

mtopolnik commented Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

jerrinot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bluestreak01 commented Apr 22, 2026

Confirmed concerns worth addressing before merge

Smaller items (not blockers)

Points I initially raised but had to withdraw

Recommendation

Uh oh!

mtopolnik commented Apr 22, 2026

[PR Coverage check]

file detail

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jerrinot commented Apr 21, 2026 •

edited

Loading