Skip to content

Commit e4b9112

Browse files
committed
fix(v9.1.0): close PR #514 external audit findings
External audit closed-out: HIGH-001 — multi-generation drain barrier. The single-slot `prev_drained: Mutex<Option<Arc<AtomicBool>>>` overwrote on every retire, so a stacked stop/start/stop where the earlier session was still draining when the later one retired silently lost the earlier flag. `await_drain()` then returned `true` based on the latest generation while the earlier callback could still fire on a freed FFI `ctx` — a use-after-free hazard under reconnect-storm scenarios. Replaced with `Mutex<Vec<Arc<AtomicBool>>>` on both `ThetaDataDx` and `TdxFpssHandle`; every retired session's flag is pushed onto the Vec, and `await_drain()` / `tdx_*_free` walk the full set with lazy GC. Soak coverage at `streaming_soak_tests::multi_gen_drain_waits_for_all_retired_sessions` drives three flags through the production poll cadence with a staggered drain order so the pre-fix code path is a hard regression gate. MED-001 — WS payload now carries `unresolved_contract_id` for pre-`ContractAssigned` ticks. The decoder builds an unresolved- contract sentinel whose `symbol` is `__pending:<id>` (the canonical `sec_type == SecType::Unknown` check still gates consumer code paths); the WS formatter detects the prefix, emits `contract: {"status": "pending"}`, and surfaces the parsed wire id as a top-level integer. Public SDK callback signature unchanged. LOW-001 — WS `/subscribe` option path now runs the canonical Gregorian validator alongside the bounds check. Impossible dates like `20260230` (Feb 30) or `20260431` (Apr 31) no longer leak through. REPO-MED-001 — documented the explicit-handoff contract on Python and TypeScript `stop_streaming`, `shutdown`, and `reconnect`. Sourced from the codegen surface (`sdk_surface.toml` + `build_support/sdk_surface/{python,typescript}.rs`) so the generated `streaming_methods.rs` files stay in sync. CHANGELOG, `.github/release-notes/v9.1.0.md`, and `docs-site/docs/changelog.md` updated. Codegen `--check` clean, banned-vocab grep zero matches, full workspace tests green.
1 parent 6845f4a commit e4b9112

17 files changed

Lines changed: 932 additions & 198 deletions

File tree

.github/release-notes/v9.1.0.md

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -210,6 +210,38 @@ the user callback directly from the Disruptor consumer thread.
210210

211211
### Fixed
212212

213+
- **External audit: single-slot drain barrier could falsely report
214+
quiescence under stacked lifecycle transitions.** `prev_drained`
215+
tracked only the most-recently-retired session's flag; a stacked
216+
`start → stop → start → stop` cycle in which the earlier session
217+
was still draining when the later one retired silently lost the
218+
earlier flag, and `await_drain()` returned `true` while the
219+
earlier callback could still be firing on the FFI `ctx`. The slot
220+
is now a `Mutex<Vec<Arc<AtomicBool>>>`; `await_drain()` and
221+
`tdx_*_free` walk every retired generation, lazily GC'ing flags
222+
that have flipped. Multi-generation soak coverage in
223+
`streaming_soak_tests.rs::multi_gen_drain_waits_for_all_retired_sessions`.
224+
- **WS payload now carries `unresolved_contract_id` for
225+
pre-`ContractAssigned` ticks.** The decoder builds an unresolved-
226+
contract sentinel whose `symbol` is `__pending:<id>`; the WS
227+
formatter detects the prefix, emits `contract: {"status":
228+
"pending"}`, and surfaces the parsed wire id as a top-level
229+
`unresolved_contract_id` integer. The public SDK callback signature
230+
is unchanged — `__pending:` is a diagnostic payload, not a stable
231+
identifier; downstream consumers continue to pattern-match on
232+
`sec_type == SecType::Unknown`.
233+
- **WS `/subscribe` option path now runs the canonical Gregorian
234+
validator.** The Wave H `tdbe::time::is_valid_yyyymmdd` calendar
235+
check ran on the historical / REST surfaces but not on the WS
236+
option-subscribe path; impossible dates like `20260230` (Feb 30)
237+
or `20260431` (Apr 31) leaked through. Both gates now run.
238+
- **Python and TypeScript bindings: stop / shutdown clear the
239+
registered callback.** Documented the explicit-handoff contract
240+
on `stop_streaming`, `shutdown`, and `reconnect` for both
241+
bindings. The unified C API preserves the callback by design; the
242+
high-level bindings clear it deliberately to enforce the explicit
243+
re-handoff model shared with the C++ RAII wrapper and the Python
244+
`with`-block `__exit__`.
213245
- **`stop_streaming()` race that could resurrect streaming after
214246
stop returned.** Each `stop_streaming` now bumps an `AtomicU64`
215247
generation counter; each `start_streaming*()` snapshots the counter

CHANGELOG.md

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -171,6 +171,51 @@ code does not change.
171171

172172
### Fixed
173173

174+
- **External audit: single-slot drain barrier could falsely report
175+
quiescence under stacked lifecycle transitions.** The
176+
`prev_drained` slot tracked only the most recently retired session's
177+
flag, so a `start → stop → start → stop` sequence in which the
178+
earlier session was still draining when the later one retired
179+
silently lost the earlier flag. `await_drain()` then returned `true`
180+
based on the latest generation while the earlier callback could
181+
still be firing on the FFI `ctx` — a use-after-free vector under
182+
reconnect-storm scenarios. The slot is now a
183+
`Mutex<Vec<Arc<AtomicBool>>>`; every retired session's flag is
184+
pushed onto the Vec, and `await_drain()` / `tdx_*_free` walk the
185+
full set, lazily GC'ing flags that have flipped. Mirrored on the
186+
FFI handle's `prev_drained` field. Regression coverage:
187+
`crates/thetadatadx/src/fpss/streaming_soak_tests.rs::multi_gen_drain_waits_for_all_retired_sessions`
188+
drives three real `FpssClient` instances back-to-back with slow
189+
callbacks and asserts the barrier waits for every generation.
190+
- **WS payload now carries `unresolved_contract_id` for
191+
pre-`ContractAssigned` ticks.** Pre-Wave-G the WS bridge surfaced
192+
the wire-internal numeric id; post-removal of the public `contract_id`
193+
field, ticks that arrived before the matching `ContractAssigned`
194+
frame serialised as an empty `Contract` envelope with no diagnostic
195+
channel for operators to correlate. The decoder now builds an
196+
unresolved-contract sentinel whose `symbol` is `__pending:<id>`
197+
(the canonical `sec_type == SecType::Unknown` check still gates
198+
consumer code paths); the WS formatter detects the prefix, emits
199+
`contract: {"status": "pending"}`, and surfaces the parsed wire id
200+
as a top-level `unresolved_contract_id` integer. The public SDK
201+
callback signature is unchanged — `__pending:` is a diagnostic
202+
payload, not a stable identifier.
203+
- **WS `/subscribe` option path now runs the canonical Gregorian
204+
validator.** The Wave H `tdbe::time::is_valid_yyyymmdd` calendar
205+
check ran on the historical / REST surfaces but not on the WS
206+
option-subscribe path, which only applied the cheap
207+
`is_valid_yyyymmdd_range` bounds check. Impossible dates like
208+
`20260230` (Feb 30), `20260431` (Apr 31), or `20251301` (month 13)
209+
leaked through. Both gates now run; the bounds check is the
210+
precheck, the calendar validator is the real gate.
211+
- **Python and TypeScript bindings: stop / shutdown clear the
212+
registered callback.** The unified C API preserves the callback
213+
across stop/reconnect, but the high-level bindings deliberately
214+
diverge: `stop_streaming()` and `shutdown()` clear the stored
215+
callback, so a subsequent `reconnect()` raises until the caller
216+
re-registers via `start_streaming(callback)`. Documented on every
217+
affected method (`stop_streaming`, `shutdown`, `reconnect` on both
218+
bindings) so the explicit-handoff model is no longer surprising.
174219
- **`stop_streaming()` race that could resurrect streaming after
175220
stop returned.** The `ArcSwap` slot accepted `Stopped → Live`, so
176221
an in-flight `start_streaming*()` that began before

crates/thetadatadx/build_support/sdk_surface/python.rs

Lines changed: 19 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -255,7 +255,25 @@ fn python_streaming_method(method: &MethodSpec) -> String {
255255
`RuntimeError` if no callback is registered. All\n\
256256
active subscriptions are restored on the new\n\
257257
connection — see `thetadatadx::ThetaDataDx::reconnect_streaming`\n\
258-
for partial-failure semantics.",
258+
for partial-failure semantics.\n\
259+
\n\
260+
# Callback lifetime across `stop_streaming`\n\
261+
\n\
262+
`stop_streaming()` and `shutdown()` clear the registered\n\
263+
callback. To resume streaming on this client after\n\
264+
`stop_streaming()`, you MUST call `start_streaming(callback)`\n\
265+
again with a freshly bound callable; `reconnect()` raises\n\
266+
`RuntimeError` because no callback is held.\n\
267+
\n\
268+
This explicit-handoff model matches the C++ wrapper's RAII\n\
269+
destructor and the Python `with` block's `__exit__`: the\n\
270+
resource (the callback closure plus its captured environment)\n\
271+
is cleared at the same scope boundary the user observes. The\n\
272+
unified C API preserves the callback across stop/reconnect,\n\
273+
but the Python and TypeScript bindings deliberately diverge\n\
274+
to enforce the explicit handoff and avoid retaining captured\n\
275+
references past a teardown the application has already\n\
276+
observed.",
259277
);
260278
writeln!(out, " fn {}(&self) -> PyResult<()> {{", method.name).unwrap();
261279
out.push_str(include_str!("templates/python/reconnect_body.rs.tmpl"));

crates/thetadatadx/build_support/sdk_surface/typescript.rs

Lines changed: 18 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -189,7 +189,24 @@ fn ts_streaming_method(method: &MethodSpec) -> String {
189189
no callback is registered. All active subscriptions are\n\
190190
restored on the new connection — see\n\
191191
`thetadatadx::ThetaDataDx::reconnect_streaming` for\n\
192-
partial-failure semantics.",
192+
partial-failure semantics.\n\
193+
\n\
194+
# Callback lifetime across `stopStreaming`\n\
195+
\n\
196+
`stopStreaming()` and `shutdown()` clear the registered\n\
197+
callback. To resume streaming on this client after\n\
198+
`stopStreaming()`, you MUST call `startStreaming(callback)`\n\
199+
again with a freshly bound function; `reconnect()` throws\n\
200+
because no callback is held.\n\
201+
\n\
202+
This explicit-handoff model matches the C++ wrapper's RAII\n\
203+
destructor and the Python `with` block's `__exit__`: the\n\
204+
resource (the JS callback handle) is cleared at the same\n\
205+
scope boundary the application observes. The unified C API\n\
206+
preserves the callback across stop/reconnect, but the\n\
207+
TypeScript and Python bindings deliberately diverge to enforce\n\
208+
the explicit handoff and avoid retaining captured references\n\
209+
past a teardown the caller has already observed.",
193210
);
194211
writeln!(out, " #[napi(js_name = \"reconnect\")]").unwrap();
195212
writeln!(

crates/thetadatadx/sdk_surface.toml

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -215,13 +215,17 @@ targets = ["python_unified", "typescript_napi", "cpp_fpss"]
215215
[[methods]]
216216
name = "stop_streaming"
217217
kind = "stop_streaming"
218-
doc = "Stop streaming while keeping the historical client usable."
218+
doc = """Stop streaming while keeping the historical client usable.
219+
220+
Clears the registered callback. To resume streaming, call `start_streaming(callback)` again with a freshly bound callable -- `reconnect()` will fail because no callback is held. See the `reconnect` docs for the rationale (explicit-handoff model shared with the C++ RAII wrapper and the Python `with`-block `__exit__`; the unified C API keeps the callback by design, but the high-level bindings clear it deliberately)."""
219221
targets = ["python_unified", "typescript_napi"]
220222

221223
[[methods]]
222224
name = "shutdown"
223225
kind = "shutdown"
224-
doc = "Shut down the FPSS streaming connection."
226+
doc = """Shut down the FPSS streaming connection.
227+
228+
On the Python and TypeScript bindings, this clears the registered callback (same explicit-handoff semantics as `stop_streaming`); a subsequent `reconnect()` will fail until the caller re-registers via `start_streaming(callback)`. The C++ binding preserves the unified C API's behaviour."""
225229
targets = ["python_unified", "typescript_napi", "cpp_fpss"]
226230

227231
[[methods]]

0 commit comments

Comments
 (0)