Skip to content

Commit d891109

Browse files
userFRMclaude
andauthored
fix(fpss): treat Windows ERROR_IO_PENDING as transient read (#470)
* fix(fpss): treat Windows ERROR_IO_PENDING as transient read On Windows the overlapped socket layer surfaces in-flight reads as ERROR_IO_PENDING (raw OS error 997) rather than WSAEWOULDBLOCK. Rust std maps 997 to ErrorKind::Uncategorized, so the existing kind matches in fpss/io_loop.rs::is_read_timeout and the two retry arms in fpss/framing.rs (pre-header and mid-payload) treated it as fatal. Python users on Windows saw FPSS read error error=IO error: Overlapped I/O operation is in progress. (os error 997) spam followed by a reconnect storm. Centralise transient-read detection in framing::is_transient_read, which matches WouldBlock | TimedOut plus raw_os_error() == Some(997) (ERROR_IO_PENDING). All three sites delegate to it so the I/O loop drains queued commands and retries the way it does on Linux and macOS. Tests: unit test pinning the helper on os_error(997), plus three integration-style tests against the existing mock readers covering the pre-header propagate-as-Io path and the mid-header / mid-payload retry-and-recover paths under raw OS error 997. Bumps tdbe 0.12.5 -> 0.12.7 and the workspace 8.0.24 -> 8.0.26. Closes #469 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * release: rebump to 8.0.25 (tdbe 0.12.6) — chronological release order PR #470 ships before PR #468; reclaim the next sequential patch (8.0.25) so release tags stay chronological. PR #468 will be rebased to 8.0.26 once this lands. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent e231783 commit d891109

22 files changed

Lines changed: 293 additions & 62 deletions

File tree

CHANGELOG.md

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
## [8.0.25] - 2026-05-05
11+
12+
### Fixed
13+
14+
- **Windows `ERROR_IO_PENDING` (os error 997) no longer trips a fatal
15+
FPSS read error.** On Windows the overlapped socket layer surfaces
16+
in-flight reads as `ERROR_IO_PENDING` instead of `WSAEWOULDBLOCK`.
17+
Rust `std` maps raw OS error 997 to `ErrorKind::Uncategorized`, so
18+
the existing `WouldBlock | TimedOut` matches in
19+
`crates/thetadatadx/src/fpss/io_loop.rs::is_read_timeout` and the
20+
two retry arms in `crates/thetadatadx/src/fpss/framing.rs`
21+
(pre-header and mid-payload) treated it as fatal — Python users on
22+
Windows saw `FPSS read error error=IO error: Overlapped I/O
23+
operation is in progress. (os error 997)` spam followed by a
24+
reconnect storm. A new `is_transient_read` helper in `framing.rs`
25+
matches `WouldBlock`, `TimedOut`, and `raw_os_error() == Some(997)`;
26+
all three sites delegate to it so the I/O loop drains queued
27+
commands and retries the way it does on Linux and macOS.
28+
Closes #469.
29+
30+
### Changed
31+
32+
- `tdbe` 0.12.5 → 0.12.7.
33+
1034
## [8.0.24] - 2026-05-04
1135

1236
### Added

Cargo.lock

Lines changed: 4 additions & 4 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

crates/tdbe/Cargo.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
[package]
22
name = "tdbe"
3-
version = "0.12.5"
3+
version = "0.12.6"
44
edition.workspace = true
55
rust-version.workspace = true
66
authors.workspace = true

crates/thetadatadx/Cargo.toml

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
[package]
22
name = "thetadatadx"
3-
version = "8.0.24"
3+
version = "8.0.25"
44
edition.workspace = true
55
rust-version.workspace = true
66
authors.workspace = true
@@ -40,7 +40,7 @@ frames = ["polars", "arrow"]
4040
live-tests = []
4141

4242
[dependencies]
43-
tdbe = { version = "0.12.5", path = "../tdbe" }
43+
tdbe = { version = "0.12.6", path = "../tdbe" }
4444

4545
# gRPC + protobuf (tonic 0.14 extracted prost codec into tonic-prost)
4646
tonic = { version = "=0.14.5", features = ["tls-ring", "tls-native-roots", "channel", "transport"] }
@@ -132,7 +132,7 @@ prost-build = "=0.14.3"
132132
regex = "1.12.3"
133133
toml = "1.1.2"
134134
serde = { version = "1.0.228", features = ["derive"] }
135-
tdbe = { version = "0.12.5", path = "../tdbe" }
135+
tdbe = { version = "0.12.6", path = "../tdbe" }
136136

137137
[[bench]]
138138
name = "bench_decode"

crates/thetadatadx/src/fpss/framing.rs

Lines changed: 192 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,38 @@ use tdbe::types::enums::StreamMsgType;
3131

3232
use super::protocol::READ_TIMEOUT_MS;
3333

34+
/// Windows `ERROR_IO_PENDING` raw OS error code.
35+
///
36+
/// On Windows the overlapped socket layer surfaces in-flight reads as
37+
/// `ERROR_IO_PENDING` (Win32 error 997) instead of `WSAEWOULDBLOCK`. Rust
38+
/// `std` maps 997 to `ErrorKind::Uncategorized`, so a `kind()` match on
39+
/// `WouldBlock | TimedOut` misses it and a benign in-flight read appears as
40+
/// a fatal I/O error. Callers must check the `raw_os_error()` to recognise
41+
/// it as transient.
42+
///
43+
/// Reference: <https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--500-999->
44+
pub(crate) const ERROR_IO_PENDING: i32 = 997;
45+
46+
/// Classify a raw `std::io::Error` returned by `read()` as a transient
47+
/// "no data right now, try again" condition.
48+
///
49+
/// Returns `true` for the three cases the FPSS framing and I/O loops must
50+
/// retry / drain on rather than escalate to a fatal disconnect:
51+
///
52+
/// - `ErrorKind::WouldBlock` — Linux, macOS `SO_RCVTIMEO` on a non-blocking
53+
/// socket.
54+
/// - `ErrorKind::TimedOut` — macOS `SO_RCVTIMEO` on a blocking socket.
55+
/// - `raw_os_error() == Some(997)` — Windows `ERROR_IO_PENDING` from the
56+
/// overlapped I/O layer (issue #469). Maps to `ErrorKind::Uncategorized`
57+
/// in `std`, so a `kind()` match alone misses it.
58+
#[must_use]
59+
pub(crate) fn is_transient_read(io_err: &std::io::Error) -> bool {
60+
matches!(
61+
io_err.kind(),
62+
std::io::ErrorKind::WouldBlock | std::io::ErrorKind::TimedOut
63+
) || io_err.raw_os_error() == Some(ERROR_IO_PENDING)
64+
}
65+
3466
/// Maximum payload length (single unsigned byte).
3567
///
3668
/// Source: `PacketStream.java` -- the length field is one byte.
@@ -231,13 +263,7 @@ fn read_header_with_timeout<R: Read>(
231263
})
232264
}
233265
Err(e) if e.kind() == std::io::ErrorKind::Interrupted => continue,
234-
Err(e)
235-
if n > 0
236-
&& matches!(
237-
e.kind(),
238-
std::io::ErrorKind::WouldBlock | std::io::ErrorKind::TimedOut
239-
) =>
240-
{
266+
Err(e) if n > 0 && is_transient_read(&e) => {
241267
// Drain-yield: the aggregate wall-clock cap exists so
242268
// the command drain cannot be starved by a trickling
243269
// sender. The partial header bytes are preserved on
@@ -345,12 +371,7 @@ fn read_exact_payload_with_timeout<R: Read>(
345371
state.payload_read = n + k;
346372
}
347373
Err(e) if e.kind() == std::io::ErrorKind::Interrupted => continue,
348-
Err(e)
349-
if matches!(
350-
e.kind(),
351-
std::io::ErrorKind::WouldBlock | std::io::ErrorKind::TimedOut
352-
) =>
353-
{
374+
Err(e) if is_transient_read(&e) => {
354375
// Drain-yield: the aggregate wall-clock cap exists so
355376
// the command drain cannot be starved by a trickling
356377
// sender. The partial payload bytes are preserved
@@ -1401,4 +1422,162 @@ mod tests {
14011422
));
14021423
assert!(!is_drain_yield(&io));
14031424
}
1425+
1426+
/// Windows `ERROR_IO_PENDING` (raw OS error 997) must classify as a
1427+
/// transient read. Rust `std` maps 997 to `ErrorKind::Uncategorized`,
1428+
/// so a plain `kind()` match would miss it and treat the in-flight
1429+
/// overlapped read as a fatal disconnect — which is exactly what the
1430+
/// Python user reported in issue #469.
1431+
#[test]
1432+
fn is_transient_read_recognises_windows_error_io_pending() {
1433+
let err = std::io::Error::from_raw_os_error(ERROR_IO_PENDING);
1434+
// Sanity: confirm the precondition that motivates this fix —
1435+
// `std` does not map 997 to a recognisable kind on any platform.
1436+
assert_ne!(err.kind(), std::io::ErrorKind::WouldBlock);
1437+
assert_ne!(err.kind(), std::io::ErrorKind::TimedOut);
1438+
assert_eq!(err.raw_os_error(), Some(997));
1439+
assert!(
1440+
is_transient_read(&err),
1441+
"ERROR_IO_PENDING (os error 997) must be classified as transient"
1442+
);
1443+
1444+
// Other raw OS errors (e.g. ECONNRESET on Linux) must NOT be
1445+
// classified as transient — they are real disconnects.
1446+
let real_err = std::io::Error::from_raw_os_error(104); // ECONNRESET
1447+
assert!(
1448+
!is_transient_read(&real_err),
1449+
"ECONNRESET must not be classified as transient"
1450+
);
1451+
1452+
// The classic kinds still match.
1453+
let wb = std::io::Error::new(std::io::ErrorKind::WouldBlock, "x");
1454+
let to = std::io::Error::new(std::io::ErrorKind::TimedOut, "x");
1455+
assert!(is_transient_read(&wb));
1456+
assert!(is_transient_read(&to));
1457+
}
1458+
1459+
/// Reader that yields a prefix, then `n_stalls` errors of the given
1460+
/// raw OS error code, then a suffix. Models a Windows TLS socket
1461+
/// surfacing `ERROR_IO_PENDING` (997) between the header and payload.
1462+
struct PrefixThenOsErrThenResume {
1463+
prefix: Vec<u8>,
1464+
suffix: Vec<u8>,
1465+
prefix_pos: usize,
1466+
suffix_pos: usize,
1467+
remaining_stalls: usize,
1468+
os_error: i32,
1469+
}
1470+
1471+
impl std::io::Read for PrefixThenOsErrThenResume {
1472+
fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
1473+
if self.prefix_pos < self.prefix.len() {
1474+
let remaining = &self.prefix[self.prefix_pos..];
1475+
let n = remaining.len().min(buf.len());
1476+
buf[..n].copy_from_slice(&remaining[..n]);
1477+
self.prefix_pos += n;
1478+
return Ok(n);
1479+
}
1480+
if self.remaining_stalls > 0 {
1481+
self.remaining_stalls -= 1;
1482+
return Err(std::io::Error::from_raw_os_error(self.os_error));
1483+
}
1484+
if self.suffix_pos < self.suffix.len() {
1485+
let remaining = &self.suffix[self.suffix_pos..];
1486+
let n = remaining.len().min(buf.len());
1487+
buf[..n].copy_from_slice(&remaining[..n]);
1488+
self.suffix_pos += n;
1489+
return Ok(n);
1490+
}
1491+
Err(std::io::Error::new(
1492+
std::io::ErrorKind::UnexpectedEof,
1493+
"reader exhausted",
1494+
))
1495+
}
1496+
}
1497+
1498+
/// Reader that always returns a raw OS error after delivering a
1499+
/// prefix. Models a Windows socket where the read goes pending and
1500+
/// never completes within the test window.
1501+
struct AlwaysOsErrAfter {
1502+
prefix: Vec<u8>,
1503+
pos: usize,
1504+
os_error: i32,
1505+
}
1506+
1507+
impl std::io::Read for AlwaysOsErrAfter {
1508+
fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
1509+
if self.pos < self.prefix.len() {
1510+
let remaining = &self.prefix[self.pos..];
1511+
let n = remaining.len().min(buf.len());
1512+
buf[..n].copy_from_slice(&remaining[..n]);
1513+
self.pos += n;
1514+
Ok(n)
1515+
} else {
1516+
Err(std::io::Error::from_raw_os_error(self.os_error))
1517+
}
1518+
}
1519+
}
1520+
1521+
/// Pre-header `ERROR_IO_PENDING` (zero bytes delivered) must propagate
1522+
/// as `Error::Io` — same path `WouldBlock` takes — so
1523+
/// `io_loop::is_read_timeout` can drain queued commands and retry on
1524+
/// the next poll instead of escalating to a reconnect storm. Issue
1525+
/// #469: this is exactly the case where the Python user on Windows
1526+
/// saw `Overlapped I/O operation is in progress. (os error 997)` spam
1527+
/// followed by repeated reconnect attempts.
1528+
#[test]
1529+
fn pre_header_error_io_pending_propagates_as_io() {
1530+
let mut reader = AlwaysOsErrAfter {
1531+
prefix: Vec::new(),
1532+
pos: 0,
1533+
os_error: ERROR_IO_PENDING,
1534+
};
1535+
let err = read_frame(&mut reader).unwrap_err();
1536+
match err {
1537+
crate::error::Error::Io(e) => {
1538+
assert_eq!(e.raw_os_error(), Some(ERROR_IO_PENDING));
1539+
}
1540+
other => panic!("expected Error::Io(ERROR_IO_PENDING), got {other:?}"),
1541+
}
1542+
}
1543+
1544+
/// Mid-header `ERROR_IO_PENDING` (one byte delivered, second stalls
1545+
/// briefly with os error 997) must retry within the per-stall
1546+
/// deadline and return the complete frame. Without the fix this
1547+
/// arm fell through to `Err(e) => Err(e.into())` and surfaced as a
1548+
/// fatal `FPSS read error` to the user.
1549+
#[test]
1550+
fn mid_header_error_io_pending_retries_and_recovers() {
1551+
let mut reader = PrefixThenOsErrThenResume {
1552+
prefix: vec![0x01],
1553+
suffix: vec![StreamMsgType::Ping as u8, 0xAA],
1554+
prefix_pos: 0,
1555+
suffix_pos: 0,
1556+
remaining_stalls: 3,
1557+
os_error: ERROR_IO_PENDING,
1558+
};
1559+
let frame = read_frame(&mut reader).unwrap().unwrap();
1560+
assert_eq!(frame.code, StreamMsgType::Ping);
1561+
assert_eq!(frame.payload, vec![0xAA]);
1562+
}
1563+
1564+
/// Mid-payload `ERROR_IO_PENDING` (header + partial payload, brief
1565+
/// stall with os error 997, rest arrives) must retry and complete.
1566+
/// This is the most common shape on Windows: a real frame whose
1567+
/// payload bytes finish arriving 50–76 ms after the first overlapped
1568+
/// pending notification.
1569+
#[test]
1570+
fn mid_payload_error_io_pending_retries_and_recovers() {
1571+
let mut reader = PrefixThenOsErrThenResume {
1572+
prefix: vec![0x04, StreamMsgType::Ping as u8, 0x01, 0x02],
1573+
suffix: vec![0x03, 0x04],
1574+
prefix_pos: 0,
1575+
suffix_pos: 0,
1576+
remaining_stalls: 3,
1577+
os_error: ERROR_IO_PENDING,
1578+
};
1579+
let frame = read_frame(&mut reader).unwrap().unwrap();
1580+
assert_eq!(frame.code, StreamMsgType::Ping);
1581+
assert_eq!(frame.payload, vec![0x01, 0x02, 0x03, 0x04]);
1582+
}
14041583
}

crates/thetadatadx/src/fpss/io_loop.rs

Lines changed: 9 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -684,13 +684,17 @@ pub(super) fn io_loop<F>(
684684
tracing::debug!("fpss-io thread exiting");
685685
}
686686

687-
/// Check if an error is a read timeout (`WouldBlock` or `TimedOut`).
687+
/// Check if an error is a transient read condition that should drain
688+
/// commands and retry rather than tear the connection down.
689+
///
690+
/// Delegates to [`super::framing::is_transient_read`] for the kind
691+
/// classification so all three FPSS read sites (this loop, mid-header,
692+
/// mid-payload) share one definition. Recognises `WouldBlock`,
693+
/// `TimedOut`, and Windows `ERROR_IO_PENDING` (raw OS 997, surfaced as
694+
/// `ErrorKind::Uncategorized` by `std`) — see issue #469.
688695
fn is_read_timeout(e: &Error) -> bool {
689696
match e {
690-
Error::Io(io_err) => matches!(
691-
io_err.kind(),
692-
std::io::ErrorKind::WouldBlock | std::io::ErrorKind::TimedOut
693-
),
697+
Error::Io(io_err) => super::framing::is_transient_read(io_err),
694698
_ => false,
695699
}
696700
}

docs-site/docs/changelog.md

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
## [8.0.25] - 2026-05-05
11+
12+
### Fixed
13+
14+
- **Windows `ERROR_IO_PENDING` (os error 997) no longer trips a fatal
15+
FPSS read error.** On Windows the overlapped socket layer surfaces
16+
in-flight reads as `ERROR_IO_PENDING` instead of `WSAEWOULDBLOCK`.
17+
Rust `std` maps raw OS error 997 to `ErrorKind::Uncategorized`, so
18+
the existing `WouldBlock | TimedOut` matches in
19+
`crates/thetadatadx/src/fpss/io_loop.rs::is_read_timeout` and the
20+
two retry arms in `crates/thetadatadx/src/fpss/framing.rs`
21+
(pre-header and mid-payload) treated it as fatal — Python users on
22+
Windows saw `FPSS read error error=IO error: Overlapped I/O
23+
operation is in progress. (os error 997)` spam followed by a
24+
reconnect storm. A new `is_transient_read` helper in `framing.rs`
25+
matches `WouldBlock`, `TimedOut`, and `raw_os_error() == Some(997)`;
26+
all three sites delegate to it so the I/O loop drains queued
27+
commands and retries the way it does on Linux and macOS.
28+
Closes #469.
29+
30+
### Changed
31+
32+
- `tdbe` 0.12.5 → 0.12.7.
33+
1034
## [8.0.24] - 2026-05-04
1135

1236
### Added

0 commit comments

Comments
 (0)