Skip to content

Commit ed0befa

Browse files
committed
Align lookup dispatch error handling
1 parent 6e8c02f commit ed0befa

18 files changed

Lines changed: 231 additions & 117 deletions

.agents/sow/current/SOW-0021-20260613-netipc-at-scale.md

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1049,6 +1049,14 @@ Tests or equivalent validation:
10491049
- External review found that Go raw APPS_LOOKUP and CGROUPS_LOOKUP dispatch duplicated the protocol dispatch path and missed the protocol layer's `Finish()==0` overflow guard. Fixed `src/go/pkg/netipc/service/raw/apps_lookup.go` and `src/go/pkg/netipc/service/raw/cgroups_lookup.go` to delegate to protocol dispatchers while preserving raw-service `errHandlerFailed` semantics for typed handler failure.
10501050
- Removed the now-unused Go raw-service `lookupMinRequired()` helper after the protocol dispatchers became the single production path for lookup response minimum-size checks.
10511051
- Validation passed: POSIX `cd src/go && go test -count=1 -timeout=300s ./pkg/netipc/service/raw`; POSIX `cd src/go && go test -count=1 -timeout=300s ./pkg/netipc/protocol`; Windows `cd src/go && "/c/Program Files/Go/bin/go.exe" test -count=1 -timeout=300s ./pkg/netipc/service/raw`; Windows focused raw lookup/defaults tests also passed on `win11`.
1052+
- Post-push Rust/Go reviewer fixes:
1053+
- External review found that Rust raw APPS_LOOKUP and CGROUPS_LOOKUP service dispatch still duplicated protocol dispatch logic, unlike C and Go. Fixed `src/crates/netipc/src/service/raw/apps_lookup.rs` and `src/crates/netipc/src/service/raw/cgroups_lookup.rs` to delegate to the Rust protocol dispatchers.
1054+
- External review found that Rust raw lookup dispatch returned `HandlerFailed` before checking whether the builder had already recorded `Overflow`. This could convert a recoverable response-size overflow into an internal service failure. Fixed the Rust protocol-to-service mapping and added platform-neutral tests in `src/crates/netipc/src/service/raw_lookup_dispatch_tests.rs` proving builder `Overflow` wins over handler-failed while plain handler false still maps to service `HandlerFailed`.
1055+
- External review found that C protocol lookup dispatch reports a no-builder-error handler false as `NIPC_ERR_HANDLER_FAILED`, while Go and Rust protocol lookup dispatch reported `BadLayout`. Fixed Go and Rust protocol dispatch to expose `ErrHandlerFailed` / `NipcError::HandlerFailed` for this case while preserving `BadLayout` when the builder itself records a layout error. This is API error parity only; no wire format changes.
1056+
- External review found that Rust `ensure_lookup_request_capacity()` did not check abort before disconnect/reconnect work. Fixed it to return `Aborted`, disconnect, and mark the client `Broken` before any capacity growth attempt when abort is already requested. Added a platform-neutral Rust test proving no reconnect is attempted in this path.
1057+
- Validation passed on POSIX: `cd src/go && go test -count=1 -timeout=300s ./pkg/netipc/protocol`; `cd src/go && go test -count=1 -timeout=300s ./pkg/netipc/service/raw`; `cargo test --manifest-path src/crates/netipc/Cargo.toml protocol::lookup -- --nocapture`; `cargo test --manifest-path src/crates/netipc/Cargo.toml lookup_dispatch_tests -- --nocapture`; `cargo test --manifest-path src/crates/netipc/Cargo.toml service::raw:: -- --nocapture`.
1058+
- Validation passed on Windows `win11`: `cd src/go && "/c/Program Files/Go/bin/go.exe" test -count=1 -timeout=300s ./pkg/netipc/protocol`; `cd src/go && "/c/Program Files/Go/bin/go.exe" test -count=1 -timeout=300s ./pkg/netipc/service/raw`; `PATH=/c/Users/costa/.cargo/bin:$PATH cargo test --manifest-path src/crates/netipc/Cargo.toml protocol::lookup -- --nocapture`; `PATH=/c/Users/costa/.cargo/bin:$PATH cargo test --manifest-path src/crates/netipc/Cargo.toml service::raw:: -- --nocapture`.
1059+
- Reviewer concern about suffix-reservation still losing partial responses was not reproduced: C POSIX `timeout 900 build-coverage/bin/test_service` passed with `642 passed, 0 failed`; focused POSIX Go `TestLookupLargeResponseSplit` passed; focused POSIX Rust `test_lookup_large_response_split_calls` passed; Windows Rust `test_lookup_large_response_split_calls_windows` passed. The concern was treated as rejected unless a failing reproducer is produced.
10521060
- Note: `ctest` on `$PATH` resolved to a broken local Python wrapper missing the `cmake` module; validation used `/usr/bin/ctest`.
10531061

10541062
Real-use evidence:
@@ -1066,7 +1074,13 @@ Reviewer findings:
10661074
- External reviewer concern: hidden fixed payload ceilings would contradict initialization-tunable budgets. Reviewed against code and docs. `NIPC_MAX_PAYLOAD_CAP` / `MaxPayloadCap` remains a named default growth ceiling; request/response payload budgets are exposed through initialization config. Server learned-capacity growth now also honors those configured ceilings.
10671075
- External reviewer finding: Go zero-config raw-server growth ceilings were derived after applying `MaxPayloadDefault`, effectively preventing default servers from learning above `1024` bytes. Handled by deriving growth ceilings from the original config and validating default versus explicit ceiling behavior.
10681076
- External reviewer finding: Go raw lookup dispatchers duplicated protocol dispatcher logic and could diverge from protocol overflow handling. Handled by delegating raw APPS_LOOKUP and CGROUPS_LOOKUP dispatch to protocol dispatchers, preserving raw typed-handler failure semantics, and deleting the now-dead local minimum-size helper.
1069-
- One requested external review run timed out before returning a usable final report. The completed reviewer findings above were triaged against the codebase; concrete defects were fixed and validated, while non-blocking style or broader design suggestions remain subject to the existing SOW close gates.
1077+
- External reviewer finding: Rust raw lookup dispatchers duplicated protocol dispatcher logic and could diverge from protocol overflow handling. Handled by delegating raw APPS_LOOKUP and CGROUPS_LOOKUP dispatch to protocol dispatchers.
1078+
- External reviewer finding: Rust raw lookup service dispatch hid builder `Overflow` when the typed handler returned false after a builder failure. Handled by mapping protocol builder errors before service handler-failed behavior and adding tests proving builder `Overflow` remains `DispatchError::Overflow`.
1079+
- External reviewer finding: C, Go, and Rust protocol lookup dispatch returned different errors for handler false with no builder error. Handled by aligning Go and Rust with C's handler-failed protocol error while preserving builder-error propagation.
1080+
- External reviewer finding: Rust lookup request-capacity growth did not check abort before reconnecting. Handled by checking abort at helper entry and validating no reconnect is attempted after abort.
1081+
- External reviewer concern: APPS_LOOKUP/CGROUPS_LOOKUP suffix-reservation could still lose partial progress if compact `PAYLOAD_EXCEEDED` suffix entries fill the response. Rejected after direct validation: C POSIX full service tests, POSIX Go focused large-response split, POSIX Rust focused large-response split, and Windows Rust focused large-response split all pass with stitched complete responses.
1082+
- External reviewer concern: C's service-layer zero-config response payload default is larger than Rust/Go raw defaults. Triaged as documented policy, not a hidden hardcoded protocol cap: Level 1 default/cap values remain named defaults, C service uses a named service response default, and all languages expose request/response payload budgets through initialization config.
1083+
- Two requested external review runs timed out before returning usable final reports, and one run ended before a final report could be retrieved. Completed reviewer findings were triaged against the codebase; concrete defects were fixed and validated, while non-blocking style or broader design suggestions remain subject to the existing SOW close gates.
10701084
- External reviewer finding: coverage scripts and broader adversarial matrix still need updates before SOW completion. Coverage script expansion is now implemented; C/Rust/Go coverage gates pass; representative `8192` and `32768` logical-call tests now pass in C/Rust/Go on POSIX and Windows; POSIX and Windows baseline/SHM lookup-scale interop now pass across all C/Rust/Go directed pairs, including mixed-status lookup cases and heavier `65536` stress-only runs; lookup status codec interop now proves `PAYLOAD_EXCEEDED` and `OVERSIZED_ITEM` wire parity across C/Rust/Go; Rust malformed typed lookup response parity is now covered on POSIX and Windows; malformed follow-up responses after partial progress are now covered in C/Rust/Go on POSIX and Windows; reordered and duplicate response-item corruption is now covered in C/Rust/Go on POSIX and Windows; invalid status enum, invalid status-dependent field, and invalid label-table corruption are now covered in C/Rust/Go on POSIX and Windows; huge valid label isolation is now covered in C/Rust/Go on POSIX and Windows; endpoint absence before call, endpoint disappearance after partial progress, and endpoint disappearance before the first subcall are now covered in C/Rust/Go on POSIX and Windows; zero-item typed lookup calls are now covered in C/Rust/Go on POSIX and Windows; duplicate and unsorted request keys under request splitting are now covered in C/Rust/Go on POSIX and Windows; full POSIX and Windows benchmark regenerations now pass; downstream topology-containers post-vendor validation now passes. Broader adversarial matrix review remains open.
10711085

10721086
Same-failure scan:

src/crates/netipc/src/protocol/lookup/apps_lookup.rs

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -786,7 +786,7 @@ where
786786
]);
787787
}
788788
if !handler(&request, &mut builder) {
789-
return Err(builder.error().unwrap_or(NipcError::BadLayout));
789+
return Err(builder.error().unwrap_or(NipcError::HandlerFailed));
790790
}
791791
if let Some(err) = builder.error() {
792792
return Err(err);
@@ -1030,6 +1030,17 @@ mod tests {
10301030
);
10311031
}
10321032

1033+
#[test]
1034+
fn apps_lookup_dispatch_reports_handler_failed() {
1035+
let mut req = [0u8; 64];
1036+
let n = encode_apps_lookup_request(&[1234], &mut req).unwrap();
1037+
let mut resp = vec![0u8; 256];
1038+
assert_eq!(
1039+
dispatch_apps_lookup(&req[..n], &mut resp, |_, _| false).unwrap_err(),
1040+
NipcError::HandlerFailed
1041+
);
1042+
}
1043+
10331044
#[test]
10341045
fn apps_lookup_request_rejects_bad_layouts() {
10351046
let mut buf = [0u8; 128];

src/crates/netipc/src/protocol/lookup/cgroups_lookup.rs

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -631,7 +631,7 @@ where
631631
builder.set_payload_exceeded_item_lens(payload_exceeded_item_lens);
632632
}
633633
if !handler(&request, &mut builder) {
634-
return Err(builder.error().unwrap_or(NipcError::BadLayout));
634+
return Err(builder.error().unwrap_or(NipcError::HandlerFailed));
635635
}
636636
if let Some(err) = builder.error() {
637637
return Err(err);
@@ -783,6 +783,17 @@ mod tests {
783783
);
784784
}
785785

786+
#[test]
787+
fn cgroups_lookup_dispatch_reports_handler_failed() {
788+
let mut req = [0u8; 64];
789+
let n = encode_cgroups_lookup_request(&[b"/x"], &mut req).unwrap();
790+
let mut resp = vec![0u8; 256];
791+
assert_eq!(
792+
dispatch_cgroups_lookup(&req[..n], &mut resp, |_, _| false).unwrap_err(),
793+
NipcError::HandlerFailed
794+
);
795+
}
796+
786797
#[test]
787798
fn cgroups_lookup_request_rejects_bad_layouts() {
788799
let mut buf = [0u8; 128];

src/crates/netipc/src/protocol/mod.rs

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -104,6 +104,8 @@ pub enum NipcError {
104104
BadItemCount,
105105
/// Builder ran out of space.
106106
Overflow,
107+
/// Typed dispatch handler rejected an otherwise valid request.
108+
HandlerFailed,
107109
/// Synchronous call timed out before a complete response arrived.
108110
Timeout,
109111
/// Synchronous call was aborted by the caller.
@@ -124,6 +126,7 @@ impl core::fmt::Display for NipcError {
124126
NipcError::BadAlignment => write!(f, "item not 8-byte aligned"),
125127
NipcError::BadItemCount => write!(f, "item count inconsistent"),
126128
NipcError::Overflow => write!(f, "builder out of space"),
129+
NipcError::HandlerFailed => write!(f, "dispatch handler failed"),
127130
NipcError::Timeout => write!(f, "synchronous call timed out"),
128131
NipcError::Aborted => write!(f, "synchronous call aborted"),
129132
}

src/crates/netipc/src/protocol/tests.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1311,6 +1311,7 @@ fn nipc_error_display_all_variants() {
13111311
(NipcError::BadAlignment, "item not 8-byte aligned"),
13121312
(NipcError::BadItemCount, "item count inconsistent"),
13131313
(NipcError::Overflow, "builder out of space"),
1314+
(NipcError::HandlerFailed, "dispatch handler failed"),
13141315
];
13151316
for (err, expected) in cases {
13161317
let msg = format!("{}", err);

src/crates/netipc/src/service/raw.rs

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -70,6 +70,10 @@ use server_session_unix::poll_fd;
7070
#[path = "raw_unix_tests.rs"]
7171
mod tests;
7272

73+
#[cfg(test)]
74+
#[path = "raw_lookup_dispatch_tests.rs"]
75+
mod lookup_dispatch_tests;
76+
7377
#[cfg(all(test, windows))]
7478
#[path = "raw_windows_tests.rs"]
7579
mod windows_tests;

src/crates/netipc/src/service/raw/apps_lookup.rs

Lines changed: 6 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,10 @@
11
use super::client::{ClientConfig, RawCallKind, RawClient};
22
use super::common::lookup_raw_response_size;
3-
use super::dispatch::{DispatchError, DispatchHandler};
3+
use super::dispatch::{dispatch_error_from_protocol, DispatchHandler};
44
use crate::protocol::{
55
self, AppsLookupBuilder, AppsLookupRequestView, AppsLookupResponseView, NipcError,
6-
APPS_LOOKUP_ITEM_HDR_SIZE, APPS_LOOKUP_KEY_SIZE, APPS_LOOKUP_REQ_HDR_SIZE,
7-
APPS_LOOKUP_RESP_HDR_SIZE, LOOKUP_DIR_ENTRY_SIZE, METHOD_APPS_LOOKUP,
8-
PID_LOOKUP_PAYLOAD_EXCEEDED,
6+
APPS_LOOKUP_KEY_SIZE, APPS_LOOKUP_REQ_HDR_SIZE, APPS_LOOKUP_RESP_HDR_SIZE,
7+
LOOKUP_DIR_ENTRY_SIZE, METHOD_APPS_LOOKUP, PID_LOOKUP_PAYLOAD_EXCEEDED,
98
};
109
use std::sync::Arc;
1110

@@ -182,39 +181,9 @@ impl RawClient {
182181

183182
pub fn apps_lookup_dispatch(handler: AppsLookupHandler) -> DispatchHandler {
184183
Arc::new(move |request, response_buf| {
185-
let request =
186-
AppsLookupRequestView::decode(request).map_err(|_| DispatchError::BadEnvelope)?;
187-
let dir_size = (request.item_count as usize)
188-
.checked_mul(LOOKUP_DIR_ENTRY_SIZE)
189-
.ok_or(DispatchError::Overflow)?;
190-
let min_required = protocol::APPS_LOOKUP_RESP_HDR_SIZE
191-
.checked_add(dir_size)
192-
.ok_or(DispatchError::Overflow)?;
193-
if response_buf.len() < min_required {
194-
return Err(DispatchError::Overflow);
195-
}
196-
let mut builder = AppsLookupBuilder::new(response_buf, request.item_count, 0);
197-
if request.item_count > 0 {
198-
builder.set_payload_exceeded_item_lens(vec![
199-
APPS_LOOKUP_ITEM_HDR_SIZE + 3;
200-
request.item_count as usize
201-
]);
202-
}
203-
if !handler(&request, &mut builder) {
204-
return Err(DispatchError::HandlerFailed);
205-
}
206-
if let Some(err) = builder.error() {
207-
return match err {
208-
NipcError::Overflow => Err(DispatchError::Overflow),
209-
_ => Err(DispatchError::BadEnvelope),
210-
};
211-
}
212-
if builder.item_count() != request.item_count {
213-
return Err(DispatchError::BadEnvelope);
214-
}
215-
builder.finish().map_err(|err| match err {
216-
NipcError::Overflow => DispatchError::Overflow,
217-
_ => DispatchError::BadEnvelope,
184+
protocol::dispatch_apps_lookup(request, response_buf, |request, builder| {
185+
handler(request, builder)
218186
})
187+
.map_err(dispatch_error_from_protocol)
219188
})
220189
}

src/crates/netipc/src/service/raw/cgroups_lookup.rs

Lines changed: 4 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
use super::client::{ClientConfig, RawCallKind, RawClient};
22
use super::common::lookup_raw_response_size;
3-
use super::dispatch::{DispatchError, DispatchHandler};
3+
use super::dispatch::{dispatch_error_from_protocol, DispatchHandler};
44
use crate::protocol::{
55
self, CgroupsLookupBuilder, CgroupsLookupRequestView, CgroupsLookupResponseView, NipcError,
66
CGROUPS_LOOKUP_ITEM_HDR_SIZE, CGROUPS_LOOKUP_REQ_HDR_SIZE, CGROUPS_LOOKUP_RESP_HDR_SIZE,
@@ -233,45 +233,9 @@ impl RawClient {
233233

234234
pub fn cgroups_lookup_dispatch(handler: CgroupsLookupHandler) -> DispatchHandler {
235235
Arc::new(move |request, response_buf| {
236-
let request =
237-
CgroupsLookupRequestView::decode(request).map_err(|_| DispatchError::BadEnvelope)?;
238-
let dir_size = (request.item_count as usize)
239-
.checked_mul(LOOKUP_DIR_ENTRY_SIZE)
240-
.ok_or(DispatchError::Overflow)?;
241-
let min_required = protocol::CGROUPS_LOOKUP_RESP_HDR_SIZE
242-
.checked_add(dir_size)
243-
.ok_or(DispatchError::Overflow)?;
244-
if response_buf.len() < min_required {
245-
return Err(DispatchError::Overflow);
246-
}
247-
let mut builder = CgroupsLookupBuilder::new(response_buf, request.item_count, 0);
248-
if request.item_count > 0 {
249-
let mut payload_exceeded_item_lens = Vec::with_capacity(request.item_count as usize);
250-
for i in 0..request.item_count {
251-
let item = request.item(i).map_err(|_| DispatchError::BadEnvelope)?;
252-
let item_len = CGROUPS_LOOKUP_ITEM_HDR_SIZE
253-
.checked_add(item.as_bytes().len())
254-
.and_then(|v| v.checked_add(2))
255-
.ok_or(DispatchError::Overflow)?;
256-
payload_exceeded_item_lens.push(item_len);
257-
}
258-
builder.set_payload_exceeded_item_lens(payload_exceeded_item_lens);
259-
}
260-
if !handler(&request, &mut builder) {
261-
return Err(DispatchError::HandlerFailed);
262-
}
263-
if let Some(err) = builder.error() {
264-
return match err {
265-
NipcError::Overflow => Err(DispatchError::Overflow),
266-
_ => Err(DispatchError::BadEnvelope),
267-
};
268-
}
269-
if builder.item_count() != request.item_count {
270-
return Err(DispatchError::BadEnvelope);
271-
}
272-
builder.finish().map_err(|err| match err {
273-
NipcError::Overflow => DispatchError::Overflow,
274-
_ => DispatchError::BadEnvelope,
236+
protocol::dispatch_cgroups_lookup(request, response_buf, |request, builder| {
237+
handler(request, builder)
275238
})
239+
.map_err(dispatch_error_from_protocol)
276240
})
277241
}

src/crates/netipc/src/service/raw/client.rs

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -226,6 +226,12 @@ impl RawClient {
226226
&mut self,
227227
required: usize,
228228
) -> Result<bool, NipcError> {
229+
if self.abort_requested() {
230+
self.disconnect();
231+
self.state = ClientState::Broken;
232+
self.error_count += 1;
233+
return Err(NipcError::Aborted);
234+
}
229235
let required = u32::try_from(required).map_err(|_| NipcError::Overflow)?;
230236
let max_request_payload_bytes = if self.transport_config.max_request_payload_bytes == 0 {
231237
crate::protocol::MAX_PAYLOAD_DEFAULT

src/crates/netipc/src/service/raw/dispatch.rs

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
use super::common::next_power_of_2_u32;
2+
use crate::protocol::NipcError;
23
use std::sync::atomic::{AtomicU32, Ordering};
34
use std::sync::Arc;
45

@@ -12,6 +13,14 @@ pub enum DispatchError {
1213
pub type DispatchHandler =
1314
Arc<dyn Fn(&[u8], &mut [u8]) -> Result<usize, DispatchError> + Send + Sync>;
1415

16+
pub(super) fn dispatch_error_from_protocol(err: NipcError) -> DispatchError {
17+
match err {
18+
NipcError::Overflow => DispatchError::Overflow,
19+
NipcError::HandlerFailed => DispatchError::HandlerFailed,
20+
_ => DispatchError::BadEnvelope,
21+
}
22+
}
23+
1524
pub(super) fn dispatch_single_internal(
1625
expected_method_code: u16,
1726
handler: Option<&DispatchHandler>,

0 commit comments

Comments
 (0)