Skip to content

fix(ilp): surface WebSocket frame-size errors to sender callers#13

Merged
bluestreak01 merged 7 commits intomainfrom
jh_better_error_reporting
Apr 24, 2026
Merged

fix(ilp): surface WebSocket frame-size errors to sender callers#13
bluestreak01 merged 7 commits intomainfrom
jh_better_error_reporting

Conversation

@jerrinot
Copy link
Copy Markdown
Contributor

@jerrinot jerrinot commented Apr 21, 2026

Improve QWP/WebSocket sender error reporting so failures observed by the I/O thread are surfaced back to the caller instead of getting hidden behind generic queue state errors.

The main user-visible case is an oversized server response frame: callers now get the underlying WebSocket/QWP failure, with actionable context, when they next call into the sender.

Also clarifies the failure contract: after a WebSocket connection is established, send failures, ACK errors, server error ACKs, invalid ACKs, timeouts, or server closes put the sender into a terminal failed state. The sender retains the first failure and rethrows it from subsequent public calls; callers should close it and create a new sender to resume sending.

Report oversized WebSocket frames with the received frame size,
configured max size, and guidance to decrease batch size.

Also propagate terminal WebSocket sender failures at Sender level
so async ACK errors surface promptly.
@jerrinot jerrinot marked this pull request as draft April 21, 2026 14:20
@jerrinot jerrinot marked this pull request as ready for review April 21, 2026 14:21
@bluestreak01
Copy link
Copy Markdown
Member

@jerrinot — critical review follow-up. After double-checking, here are the confirmed findings (a couple of my earlier points turned out to be wrong and have been withdrawn).

Confirmed concerns worth addressing before merge

1. Terminal sender state is permanent — document the contract
connected is set to true once in ensureConnected and never reset, so once connectionError is non-null, every subsequent call hits checkNotClosed()checkConnectionError(). The user has no recovery API and must construct a new sender. The test testConnectionFailureIsSenderLevelTerminalState confirms this is intentional, but the contract should be in the Javadoc of QwpWebSocketSender / flush().

2. Stale stack trace on rethrow
checkConnectionError() does throw error; where error is the stored LineSenderException. In Java, re-throwing a pre-constructed exception does not update its stack trace. So every subsequent user-thread call that fails shows the original I/O-thread stack, not the call site — confusing for users debugging table() / symbol() calls that appear to fail "inside the WebSocket I/O loop." Either document this, or call error.fillInStackTrace() before rethrow (O(depth), only on the failure path).

3. PR title mismatch
Title says "fix(ilp): report actionable QWP websocket frame-size errors", but that work already shipped in be79e84. This PR's actual contents are zero-GC UTF-8 helpers + terminal sender-level connection error propagation. Retitle for reviewers / future git log readers.

Smaller items (not blockers)

4. WebSocketSendBuffer.putUtf8 wastes work on non-ASCII stringscore/src/main/java/io/questdb/client/cutlass/http/client/WebSocketSendBuffer.java:333-354
The ASCII fast-path encodes the prefix, then on the first non-ASCII char falls back to NativeBufferWriter.utf8Length(value) + Utf8s.strCpyUtf8(value, ...) which re-scan and re-encode from index 0. A string with é at position 1000 is effectively double-scanned. Unsafe.realloc preserves the ASCII bytes, so correctness is fine — but the "single-pass ASCII path" comment is misleading. Easy fix: pass i into the fallback and resume encoding from the first non-ASCII char (mirroring what NativeBufferWriter.putUtf8 already does).

5. WebSocketResponse scans errorMessage up to 3× per response
Callers typically invoke serializedSize() + writeTo(), which scan with utf8Bytes twice and then strCpyUtf8 once. Not hot path (error response only), but worth noting.

6. Utf8s.utf8Bytes(CharSequence) and utf8Bytes(CharSequence, int) duplicate the branching logic
Minor DRY — the unlimited variant could delegate to the limited one with Integer.MAX_VALUE.

7. WebSocketSendQueue.failConnection is non-atomic, safe only because it's I/O-thread-only
The read-modify-write of lastError is correct because all callers (safeSendBatch, tryReceiveAcks, ResponseHandler.onBinaryMessage, ResponseHandler.onClose) run on the I/O thread. A one-line comment stating this invariant would protect against a future refactor that moves a call site off-thread and silently breaks the first-failure guarantee.

8. checkConnectionError() in catch blocks — document intent
In flush(), flushSync(), and sealAndSwapBuffer() catch blocks, checkConnectionError() is called before throw e. If it throws, e is silently shadowed. This is actually the desired behavior: WebSocketSendQueue.failConnection sets lastError before invoking the listener, so by the time the user thread reaches the catch, connectionError holds the original unwrapped error (cleaner message than e, which is checkError()'s wrapped version). A one-line comment on the pattern would make the intent obvious.

9. WebSocketResponse.readFrom still allocates per error ACK
Line 202 new byte[msgLen] + line 206 new String(...). Out of scope for this PR, but the remaining GC site on the response path if you want the pair to be symmetric with the zero-GC send side.

Points I initially raised but had to withdraw

  • "Race on connectionError.set(null) during reconnect" — not reachable. connected never resets, so ensureConnected's inner block runs at most once per sender, and on first entry connectionError is already null. The I/O thread starts in IDLE state (new inFlightWindow count=0, pending buffer null) and immediately wait(100) on the lock — it can't fire recordConnectionFailure during that window.
  • "Redundant checkError() calls in enqueue" — not redundant. Comparing old vs. new pattern, the extras are deliberate: prefer surfacing the I/O root cause over the generic "Send queue is not running" message, and the inner call after !running covers the race window where the I/O thread fails mid-method. My suggestion to drop the outer call was wrong.

Recommendation

Land after (1), (2), (3). The rest are comments / follow-ups.

@jerrinot jerrinot changed the title fix(ilp): report actionable QWP websocket frame-size errors fix(qwp): surface WebSocket frame-size errors to sender callers Apr 22, 2026
@jerrinot jerrinot changed the title fix(qwp): surface WebSocket frame-size errors to sender callers fix(ilp): surface WebSocket frame-size errors to sender callers Apr 22, 2026
@mtopolnik
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 153 / 179 (85.47%)

file detail

path covered line new line coverage
🔵 io/questdb/client/cutlass/qwp/client/QwpWebSocketSender.java 23 35 65.71%
🔵 io/questdb/client/cutlass/http/client/WebSocketSendBuffer.java 14 20 70.00%
🔵 io/questdb/client/cutlass/qwp/client/WebSocketSendQueue.java 17 22 77.27%
🔵 io/questdb/client/cutlass/qwp/client/WebSocketResponse.java 16 19 84.21%
🔵 io/questdb/client/cutlass/qwp/client/NativeBufferWriter.java 6 6 100.00%
🔵 io/questdb/client/std/str/Utf8s.java 77 77 100.00%

@bluestreak01 bluestreak01 merged commit f1dd21c into main Apr 24, 2026
14 checks passed
@bluestreak01 bluestreak01 deleted the jh_better_error_reporting branch April 24, 2026 01:46
bluestreak01 added a commit that referenced this pull request Apr 27, 2026
Wires the drainer runtime onto the orphan-scanner foundation. With
drain_orphans=true the foreground sender now actually empties sibling
slots holding unacked data instead of just logging that they exist.

Per-drainer lifecycle:
1. Open CursorSendEngine on the slot — its constructor takes the slot
   lock; if another sender or drainer holds it, the engine throws and
   the drainer exits silently (LOCKED_BY_OTHER, not a failure).
2. Open a fresh WebSocketClient via the foreground sender's connect
   factory — separate connection, same auth/host/port/TLS config.
3. Run a CursorWebSocketSendLoop until ackedFsn catches up to the
   publishedFsn snapshot taken at startup.
4. On terminal failure (auth, recovery, budget), drop a .failed
   sentinel into the slot. Future scans skip it until an operator
   clears it manually — bounded retry, then human-in-the-loop.

Pool: bounded fixed-thread executor, daemon threads, sized by
max_background_drainers (default 4). Closes via cooperative stop +
3s grace; daemon threads ensure no JVM-exit blocking.

Visibility: QwpWebSocketSender#getBackgroundDrainers returns a snapshot
list of live drainers with {slot, target, acked, outcome, lastError}.

Test: ghost sender writes 30 distinct rows against a silent server and
closes fast — leaves an unacked slot. Foreground sender opens the same
group root with a different sender_id and drain_orphans=true against an
ack server; asserts every distinct payload reaches the new server. Plus
a sentinel-skip test confirming an operator-set .failed file disqualifies
the slot from the next foreground run's scan.

Empty active segments and stale hot spares are left in the slot dir per
spec decision #13 ("no automatic cleanup of empty slot dirs"); the
scanner's no-op behavior on empty slots makes this cheap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants