
fix(fpss): treat Windows ERROR_IO_PENDING as transient read #470

Merged
userFRM merged 2 commits into main from fix/469-windows-error-io-pending
May 5, 2026

Conversation

@userFRM
Owner

@userFRM userFRM commented May 5, 2026

Summary

Fixes #469. A Python user on Windows reported a constant stream of:

FPSS read error error=IO error: Overlapped I/O operation is in progress. (os error 997)

followed by a reconnect storm.

ERROR_IO_PENDING (Win32 error 997) is what the Windows overlapped I/O layer returns while a non-blocking read is still in flight. Rust std maps raw OS error 997 to ErrorKind::Uncategorized, so the three existing transient-read checks — which only matched WouldBlock | TimedOut — fell through to the fatal arm and tore the connection down.
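To make the failure mode concrete, here is a minimal sketch of a kind-only check of the sort described above. The names `kind_only_is_transient` and `io_pending` are hypothetical and exist only for illustration; this is not the crate's pre-fix code.

```rust
use std::io::{Error, ErrorKind};

/// Kind-only transient check, as described in the summary above.
/// Hypothetical name, used for illustration only.
fn kind_only_is_transient(err: &Error) -> bool {
    matches!(err.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut)
}

/// Builds an error from raw OS code 997, the code reported in issue #469.
fn io_pending() -> Error {
    Error::from_raw_os_error(997)
}
```

With this check, `kind_only_is_transient(&io_pending())` is false, so a caller that treats every non-transient error as fatal would tear the connection down.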

Fix

New helper crates/thetadatadx/src/fpss/framing.rs::is_transient_read(&io::Error) matches:

  • ErrorKind::WouldBlock
  • ErrorKind::TimedOut
  • raw_os_error() == Some(997) (Windows ERROR_IO_PENDING)
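A minimal sketch of a helper with these three match conditions might look like the following. The function and constant names come from this PR, but the body is an assumption reconstructed from the description, not a copy of framing.rs. In particular, the sketch checks 997 unconditionally on every platform; whether the real helper does so is not stated here.

```rust
use std::io;

/// Win32 ERROR_IO_PENDING: an overlapped read is still in flight.
const ERROR_IO_PENDING: i32 = 997;

/// Returns true for read errors that should be retried rather than
/// treated as a disconnect. Sketch only; the real body may differ.
fn is_transient_read(err: &io::Error) -> bool {
    matches!(
        err.kind(),
        io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut
    ) || err.raw_os_error() == Some(ERROR_IO_PENDING)
}
```

Checking `raw_os_error()` sidesteps the fact that `std` offers no stable `ErrorKind` variant for this code.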

Three patched sites all delegate to it:

  1. crates/thetadatadx/src/fpss/io_loop.rs:687-696: is_read_timeout (drives the I/O loop's command-drain branch)
  2. crates/thetadatadx/src/fpss/framing.rs:236-242 — pre-header retry decision
  3. crates/thetadatadx/src/fpss/framing.rs:349-355 — mid-frame retry decision

Behaviour on Linux / macOS is unchanged: WouldBlock and TimedOut continue to take the same path. Windows now joins them — the I/O loop drains queued commands and retries instead of escalating to a reconnect.
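The drain-and-retry behaviour can be sketched with a toy loop. Everything here (`LoopStats`, `run_loop`, the string commands, one byte per "frame") is hypothetical scaffolding and not the structure of io_loop.rs; it only shows how a transient read leads to a command drain while any other error ends in a reconnect.

```rust
use std::collections::VecDeque;
use std::io;

/// Counters for what the toy loop did. Hypothetical type.
#[derive(Debug, Default, PartialEq, Eq)]
struct LoopStats {
    frames: usize,
    drained_commands: usize,
    reconnects: usize,
}

/// Same classification as described in this PR, inlined for self-containment.
fn is_transient(err: &io::Error) -> bool {
    matches!(err.kind(), io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut)
        || err.raw_os_error() == Some(997)
}

/// Walks a scripted sequence of read results. Transient errors drain the
/// queued commands and continue; any other error counts as a reconnect
/// and stops the loop.
fn run_loop(
    reads: Vec<io::Result<u8>>,
    commands: &mut VecDeque<&'static str>,
) -> LoopStats {
    let mut stats = LoopStats::default();
    for read in reads {
        match read {
            Ok(_) => stats.frames += 1,
            Err(e) if is_transient(&e) => {
                stats.drained_commands += commands.len();
                commands.clear();
            }
            Err(_) => {
                stats.reconnects += 1;
                break;
            }
        }
    }
    stats
}
```

Feeding the loop a 997 error between two successful reads leaves `reconnects` at zero, which is the behaviour Windows gains from this change.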

Tests

  • is_transient_read_recognises_windows_error_io_pending — unit test on io::Error::from_raw_os_error(997). Asserts the helper returns true, that ECONNRESET (104) does NOT, and that WouldBlock / TimedOut still match.
  • pre_header_error_io_pending_propagates_as_io: read_frame on a reader that returns os error 997 with zero bytes delivered must surface as Error::Io with raw_os_error() == Some(997), the exact path is_read_timeout then drains on.
  • mid_header_error_io_pending_retries_and_recovers — header byte 1 arrives, three os-error-997 stalls, then byte 2 + payload. Frame must decode cleanly.
  • mid_payload_error_io_pending_retries_and_recovers — header + 2 of 4 payload bytes, three os-error-997 stalls, then the remaining 2 payload bytes. Frame must decode with payload [0x01, 0x02, 0x03, 0x04].
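The retry-and-recover shape of the mid-payload test can be sketched with a scripted reader. `ScriptedReader`, `stall`, `read_exact_retrying`, and the `max_stalls` cap are all hypothetical; the PR does not say whether the real retry arm is bounded, and the crate's existing mock readers are not shown here.

```rust
use std::collections::VecDeque;
use std::io::{self, Read};

/// Reader that replays a fixed script of chunks and errors. Hypothetical.
struct ScriptedReader {
    steps: VecDeque<io::Result<Vec<u8>>>,
}

impl Read for ScriptedReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        match self.steps.pop_front() {
            Some(Ok(chunk)) => {
                let n = chunk.len().min(buf.len());
                buf[..n].copy_from_slice(&chunk[..n]);
                // Keep any bytes that did not fit for the next call.
                if n < chunk.len() {
                    self.steps.push_front(Ok(chunk[n..].to_vec()));
                }
                Ok(n)
            }
            Some(Err(e)) => Err(e),
            None => Ok(0), // script exhausted: report EOF
        }
    }
}

/// One scripted stall carrying raw OS error 997.
fn stall() -> io::Result<Vec<u8>> {
    Err(io::Error::from_raw_os_error(997))
}

fn is_transient(err: &io::Error) -> bool {
    matches!(err.kind(), io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut)
        || err.raw_os_error() == Some(997)
}

/// Fills `buf`, retrying transient errors. Gives up after `max_stalls`
/// consecutive stalls (an assumed cap, not taken from the PR).
fn read_exact_retrying<R: Read>(
    reader: &mut R,
    buf: &mut [u8],
    max_stalls: usize,
) -> io::Result<()> {
    let mut filled = 0;
    let mut stalls = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => return Err(io::ErrorKind::UnexpectedEof.into()),
            Ok(n) => {
                filled += n;
                stalls = 0;
            }
            Err(e) if is_transient(&e) && stalls < max_stalls => stalls += 1,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```

Scripting two payload bytes, three stalls, then the last two bytes should fill a four-byte buffer with `[0x01, 0x02, 0x03, 0x04]`, mirroring the mid-payload case.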

Versioning

8.0.26 per the patch-only v8 line.

Local CI

  • cargo fmt --all -- --check clean
  • cargo clippy --workspace --all-targets -- -D warnings clean (also checked tools/server and tools/mcp sub-workspaces)
  • cargo test --workspace 304/304 main + 109 ffi + sub-suites all green; the 4 new tests pass
  • cargo deny check advisories / bans / licenses / sources all ok
  • cargo run -p thetadatadx --bin generate_sdk_surfaces --features config-file -- --check clean

Test plan

On Windows the overlapped socket layer surfaces in-flight reads as
ERROR_IO_PENDING (raw OS error 997) rather than WSAEWOULDBLOCK. Rust
std maps 997 to ErrorKind::Uncategorized, so the existing kind matches
in fpss/io_loop.rs::is_read_timeout and the two retry arms in
fpss/framing.rs (pre-header and mid-payload) treated it as fatal.
Python users on Windows saw FPSS read error error=IO error: Overlapped
I/O operation is in progress. (os error 997) spam followed by a
reconnect storm.

Centralise transient-read detection in framing::is_transient_read,
which matches WouldBlock | TimedOut plus raw_os_error() == Some(997)
(ERROR_IO_PENDING). All three sites delegate to it so the I/O loop
drains queued commands and retries the way it does on Linux and macOS.

Tests: unit test pinning the helper on os_error(997), plus three
integration-style tests against the existing mock readers covering the
pre-header propagate-as-Io path and the mid-header / mid-payload
retry-and-recover paths under raw OS error 997.

Bumps tdbe 0.12.5 -> 0.12.7 and the workspace 8.0.24 -> 8.0.26.

Closes #469

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR fixes a Windows-specific FPSS reconnect/log spam issue by classifying Win32 ERROR_IO_PENDING (os error 997) as a transient read condition (similar to WouldBlock/TimedOut), preventing benign in-flight reads from being treated as fatal disconnects. It also bumps crate/SDK versions to ship the fix.

Changes:

  • Added a shared is_transient_read(&io::Error) helper (including raw_os_error == 997) and routed FPSS read-timeout logic through it.
  • Updated FPSS framing retry branches to use the shared transient-read classification and added regression tests covering pre-header/mid-header/mid-payload behavior.
  • Bumped versions across Rust crates/tools and TypeScript/Python SDK packaging metadata; refreshed lockfiles and changelogs.

Reviewed changes

Copilot reviewed 16 out of 21 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| crates/thetadatadx/src/fpss/framing.rs | Adds ERROR_IO_PENDING constant + is_transient_read helper; uses it in mid-frame retry logic; adds Windows-997 regression tests. |
| crates/thetadatadx/src/fpss/io_loop.rs | Delegates read-timeout classification to framing::is_transient_read to keep transient logic consistent. |
| crates/thetadatadx/Cargo.toml | Bumps thetadatadx version and tdbe dependency version. |
| crates/tdbe/Cargo.toml | Bumps tdbe crate version. |
| tools/server/Cargo.toml | Bumps thetadatadx-server and tdbe dependency versions. |
| tools/server/Cargo.lock | Lockfile refresh for new thetadatadx/tdbe versions. |
| tools/mcp/Cargo.toml | Bumps thetadatadx-mcp and tdbe dependency versions. |
| tools/mcp/Cargo.lock | Lockfile refresh for new thetadatadx/tdbe versions. |
| tools/cli/Cargo.toml | Bumps thetadatadx-cli and tdbe dependency versions. |
| ffi/Cargo.toml | Bumps thetadatadx-ffi and tdbe dependency versions. |
| sdks/python/Cargo.toml | Bumps thetadatadx-py and tdbe dependency versions. |
| sdks/python/Cargo.lock | Lockfile refresh for new thetadatadx/tdbe versions. |
| sdks/typescript/Cargo.toml | Bumps thetadatadx-napi and tdbe dependency versions. |
| sdks/typescript/Cargo.lock | Lockfile refresh for new thetadatadx/tdbe versions. |
| sdks/typescript/package.json | Bumps TypeScript SDK version + optional native package versions. |
| sdks/typescript/npm/win32-x64-msvc/package.json | Bumps Windows prebuilt package version. |
| sdks/typescript/npm/linux-x64-gnu/package.json | Bumps Linux prebuilt package version. |
| sdks/typescript/npm/darwin-arm64/package.json | Bumps macOS ARM64 prebuilt package version. |
| CHANGELOG.md | Adds 8.0.26 release notes describing the Windows FPSS fix + tdbe bump. |
| docs-site/docs/changelog.md | Mirrors 8.0.26 release notes for the docs site. |
| Cargo.lock | Workspace lockfile refresh for new thetadatadx/tdbe versions. |


Comment on lines +1427 to +1438
/// transient read. Rust `std` maps 997 to `ErrorKind::Uncategorized`,
/// so a plain `kind()` match would miss it and treat the in-flight
/// overlapped read as a fatal disconnect — which is exactly what the
/// Python user reported in issue #469.
#[test]
fn is_transient_read_recognises_windows_error_io_pending() {
    let err = std::io::Error::from_raw_os_error(ERROR_IO_PENDING);
    // Sanity: confirm the precondition that motivates this fix —
    // `std` does not map 997 to a recognisable kind on any platform.
    assert_ne!(err.kind(), std::io::ErrorKind::WouldBlock);
    assert_ne!(err.kind(), std::io::ErrorKind::TimedOut);
    assert_eq!(err.raw_os_error(), Some(997));
PR #470 ships before PR #468; reclaim the next sequential patch
(8.0.25) so release tags stay chronological. PR #468 will be
rebased to 8.0.26 once this lands.
@userFRM userFRM merged commit d891109 into main May 5, 2026
32 checks passed
@userFRM userFRM deleted the fix/469-windows-error-io-pending branch May 5, 2026 15:34

Development

Successfully merging this pull request may close these issues.

fix(fpss-windows): treat ERROR_IO_PENDING (os error 997) as a transient read like WouldBlock

2 participants