Skip to content

fix(streaming): bound socket writes, seed historical deadlines, arm self-join guard, and own the wait-strategy surface#819

Merged
userFRM merged 2 commits into
mainfrom
fix/streaming-historical-hardening
Jun 16, 2026
Merged

fix(streaming): bound socket writes, seed historical deadlines, arm self-join guard, and own the wait-strategy surface#819
userFRM merged 2 commits into
mainfrom
fix/streaming-historical-hardening

Conversation

@userFRM

@userFRM userFRM commented Jun 16, 2026

Copy link
Copy Markdown
Owner

Five reliability and public-surface fixes across the real-time streaming and historical paths.

Socket write timeout (FPSS)

The connection set a read timeout but never a write timeout. The credentials write drives the lazy TLS handshake, and steady-state ping/subscribe writes can block indefinitely against a peer whose receive window has stalled (alive enough to ACK at the kernel but not draining the socket); the connect timeout only bounds the SYN/ACK. A write deadline is now set alongside the read deadline, sourced from the same connect-window budget. It persists for the life of the socket, so a write TimedOut surfaces as a fatal I/O error and the existing reconnect path takes over, mirroring the read-timeout liveness contract.

Historical per-request deadline (MDDS)

Historical requests had no default deadline: a server holding the HTTP/2 stream open while sending no chunks would hang the collect/stream drain, since gRPC keepalive only detects a fully dead peer, not a live-but-silent one. A new HistoricalConfig::request_timeout_secs (default 300s) seeds the per-request deadline when the caller set none. An explicit with_deadline(...) still overrides it, and with_deadline(Duration::ZERO) opts a single request out; a configured default of 0 disables the fallback.

Production self-join guard

The Drop self-join guard read a consumer thread id that only the test harness ever recorded, so the documented protection against an inline join deadlock — when a user callback drops the last client handle while running on the consumer thread — was inert in shipped builds. The draining thread id is now captured at the single drain entry point every path routes through (next_event / for_each* / poll_batch), so the guard fires in production.

Wait-strategy surface no longer leaks the ring crate

The bring-your-own wait-strategy drain exposed the underlying ring crate's trait in its public generic bound, and the migration doc instructed users to name that crate directly in a use. The wait-strategy trait and presets are now re-exported under thetadatadx::streaming::wait, and both the generic bound and the migration doc point at the crate-owned path. A real Rust caller never adds the ring crate to their own manifest. A clean re-export was chosen over collapsing to the FFI-safe preset enum because the trait and presets re-export trivially, preserving the existing zero-cost generic escape hatch with no API break.

Heartbeat send consistency

The ping heartbeat used a blocking send while the rest of the control plane uses non-blocking try_send. The heartbeat is an idempotent liveness signal: a momentarily full channel means the I/O thread is already draining a large outbound backlog (hence demonstrably alive), so skipping one beat and firing the next an interval later stays well inside any server-side ping deadline. It now uses try_send, so a backpressured channel cannot pin the Drop path for up to one interval.

Verification

cargo fmt --all -- --check clean. cargo test -p thetadatadx --lib — 814 passed. cargo build -p thetadatadx clean. cargo clippy -p thetadatadx --lib -- -D warnings clean. The pre-existing --all-targets clippy failures (thetadatadx::wire unresolved, decode private, in tests/ and examples/) reproduce identically on the base branch and are unrelated. Both doc gates pass: the docs-site generator --check reports generated pages match the registries, and check_docs_consistency.py reports ok.

🤖 Generated with Claude Code

…elf-join guard, and own the wait-strategy surface

The FPSS socket set a read timeout but never a write timeout, so the credentials write that drives the lazy TLS handshake and every steady-state ping/subscribe write could wedge the I/O thread against an alive-but-not-draining peer; the connect timeout only bounds the SYN/ACK. Set a write deadline alongside the read deadline, sourced from the same connect-window budget, so a write `TimedOut` surfaces as a fatal I/O error and the established reconnect path takes over.

Historical (MDDS) requests had no default per-request deadline: a server holding the HTTP/2 stream open while sending no chunks would hang the collect/stream drain, since gRPC keepalive only catches a fully dead peer. Seed the deadline from a new `HistoricalConfig::request_timeout_secs` (default 300s) when the caller set none; an explicit `with_deadline(...)` still overrides it and `with_deadline(Duration::ZERO)` opts a single request out.

The Drop self-join guard read a consumer thread id that only the test harness ever recorded, so the protection against an inline join deadlock when a user callback drops the last client handle on the consumer thread was inert in shipped builds. Record the draining thread id at the single drain entry point every path routes through, so the guard fires in production.

The bring-your-own wait-strategy drain exposed the underlying ring crate's trait in its public bound and the migration doc told users to name that crate directly. Re-export the wait-strategy trait and presets under `thetadatadx::streaming::wait` and point the generic bound and the doc at the crate-owned path, so a real Rust caller never adds the ring crate to their own manifest. A clean re-export was chosen over collapsing to the FFI-safe preset enum because the trait and presets re-export trivially, preserving the existing zero-cost generic escape hatch without an API break.

The ping heartbeat used a blocking send while the rest of the control plane uses non-blocking try_send. The heartbeat is an idempotent liveness signal, so a momentarily full channel (the I/O thread is already draining a large outbound backlog, hence demonstrably alive) can skip one beat and fire the next an interval later, well inside any server ping deadline. Switch to try_send so a backpressured channel cannot pin the Drop path for up to one interval.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@userFRM userFRM enabled auto-merge (squash) June 16, 2026 13:17
`HistoricalConfig.request_timeout_secs` is the default per-request deadline for historical queries; the cross-binding mandate requires every client-facing knob be reachable on each binding or declared rust-only with tracking. Bind it like the sibling `concurrent_requests` / `warn_on_buffered_threshold_bytes` knobs: C-ABI `thetadatadx_config_set_request_timeout_secs` + getter, Python `Config.request_timeout_secs` property, TypeScript `setRequestTimeoutSecs` / `requestTimeoutSecs` (BigInt seconds, matching the other `*Secs` knobs), and the C++ `Config::set_request_timeout_secs` / `get_request_timeout_secs` forwarders. Add the parity row and round-trip tests on each surface.

Convert the wait-strategy re-export doc comment to inline-code spans so the broken-intra-doc-links gate resolves cleanly; the re-exported external items are not linkable targets from the owning crate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@userFRM userFRM merged commit 7fd9cd5 into main Jun 16, 2026
42 of 43 checks passed
@userFRM userFRM deleted the fix/streaming-historical-hardening branch June 16, 2026 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants