fix(streaming): bound socket writes, seed historical deadlines, arm self-join guard, and own the wait-strategy surface#819
Merged
Conversation
…elf-join guard, and own the wait-strategy surface The FPSS socket set a read timeout but never a write timeout, so the credentials write that drives the lazy TLS handshake and every steady-state ping/subscribe write could wedge the I/O thread against an alive-but-not-draining peer; the connect timeout only bounds the SYN/ACK. Set a write deadline alongside the read deadline, sourced from the same connect-window budget, so a write `TimedOut` surfaces as a fatal I/O error and the established reconnect path takes over. Historical (MDDS) requests had no default per-request deadline: a server holding the HTTP/2 stream open while sending no chunks would hang the collect/stream drain, since gRPC keepalive only catches a fully dead peer. Seed the deadline from a new `HistoricalConfig::request_timeout_secs` (default 300s) when the caller set none; an explicit `with_deadline(...)` still overrides it and `with_deadline(Duration::ZERO)` opts a single request out. The Drop self-join guard read a consumer thread id that only the test harness ever recorded, so the protection against an inline join deadlock when a user callback drops the last client handle on the consumer thread was inert in shipped builds. Record the draining thread id at the single drain entry point every path routes through, so the guard fires in production. The bring-your-own wait-strategy drain exposed the underlying ring crate's trait in its public bound and the migration doc told users to name that crate directly. Re-export the wait-strategy trait and presets under `thetadatadx::streaming::wait` and point the generic bound and the doc at the crate-owned path, so a real Rust caller never adds the ring crate to their own manifest. A clean re-export was chosen over collapsing to the FFI-safe preset enum because the trait and presets re-export trivially, preserving the existing zero-cost generic escape hatch without an API break. The ping heartbeat used a blocking send while the rest of the control plane uses non-blocking try_send. The heartbeat is an idempotent liveness signal, so a momentarily full channel (the I/O thread is already draining a large outbound backlog, hence demonstrably alive) can skip one beat and fire the next an interval later, well inside any server ping deadline. Switch to try_send so a backpressured channel cannot pin the Drop path for up to one interval. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`HistoricalConfig.request_timeout_secs` is the default per-request deadline for historical queries; the cross-binding mandate requires every client-facing knob be reachable on each binding or declared rust-only with tracking. Bind it like the sibling `concurrent_requests` / `warn_on_buffered_threshold_bytes` knobs: C-ABI `thetadatadx_config_set_request_timeout_secs` + getter, Python `Config.request_timeout_secs` property, TypeScript `setRequestTimeoutSecs` / `requestTimeoutSecs` (BigInt seconds, matching the other `*Secs` knobs), and the C++ `Config::set_request_timeout_secs` / `get_request_timeout_secs` forwarders. Add the parity row and round-trip tests on each surface. Convert the wait-strategy re-export doc comment to inline-code spans so the broken-intra-doc-links gate resolves cleanly; the re-exported external items are not linkable targets from the owning crate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Five reliability and public-surface fixes across the real-time streaming and historical paths.
Socket write timeout (FPSS)
The connection set a read timeout but never a write timeout. The credentials write drives the lazy TLS handshake, and steady-state ping/subscribe writes can block indefinitely against a peer whose receive window has stalled (alive enough to ACK at the kernel but not draining the socket); the connect timeout only bounds the SYN/ACK. A write deadline is now set alongside the read deadline, sourced from the same connect-window budget. It persists for the life of the socket, so a write
TimedOutsurfaces as a fatal I/O error and the existing reconnect path takes over, mirroring the read-timeout liveness contract.Historical per-request deadline (MDDS)
Historical requests had no default deadline: a server holding the HTTP/2 stream open while sending no chunks would hang the collect/stream drain, since gRPC keepalive only detects a fully dead peer, not a live-but-silent one. A new
HistoricalConfig::request_timeout_secs(default 300s) seeds the per-request deadline when the caller set none. An explicitwith_deadline(...)still overrides it, andwith_deadline(Duration::ZERO)opts a single request out; a configured default of0disables the fallback.Production self-join guard
The
Dropself-join guard read a consumer thread id that only the test harness ever recorded, so the documented protection against an inline join deadlock — when a user callback drops the last client handle while running on the consumer thread — was inert in shipped builds. The draining thread id is now captured at the single drain entry point every path routes through (next_event/for_each*/poll_batch), so the guard fires in production.Wait-strategy surface no longer leaks the ring crate
The bring-your-own wait-strategy drain exposed the underlying ring crate's trait in its public generic bound, and the migration doc instructed users to name that crate directly in a
use. The wait-strategy trait and presets are now re-exported underthetadatadx::streaming::wait, and both the generic bound and the migration doc point at the crate-owned path. A real Rust caller never adds the ring crate to their own manifest. A clean re-export was chosen over collapsing to the FFI-safe preset enum because the trait and presets re-export trivially, preserving the existing zero-cost generic escape hatch with no API break.Heartbeat send consistency
The ping heartbeat used a blocking
sendwhile the rest of the control plane uses non-blockingtry_send. The heartbeat is an idempotent liveness signal: a momentarily full channel means the I/O thread is already draining a large outbound backlog (hence demonstrably alive), so skipping one beat and firing the next an interval later stays well inside any server-side ping deadline. It now usestry_send, so a backpressured channel cannot pin theDroppath for up to one interval.Verification
cargo fmt --all -- --checkclean.cargo test -p thetadatadx --lib— 814 passed.cargo build -p thetadatadxclean.cargo clippy -p thetadatadx --lib -- -D warningsclean. The pre-existing--all-targetsclippy failures (thetadatadx::wireunresolved,decodeprivate, intests/andexamples/) reproduce identically on the base branch and are unrelated. Both doc gates pass: the docs-site generator--checkreports generated pages match the registries, andcheck_docs_consistency.pyreports ok.🤖 Generated with Claude Code