Fix CI: Windows errno clobbering, event loop timeout, socket test race, NIO connect verification#434
Conversation
…tinePool, skip preemptive on ARM Agent-Logs-Url: https://github.com/acl-dev/open-coroutine/sessions/781aee65-5255-455a-912c-4ab565340ca8 Co-authored-by: loongs-zhang <38336731+loongs-zhang@users.noreply.github.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #434 +/- ##
==========================================
+ Coverage 75.32% 75.74% +0.41%
==========================================
Files 44 44
Lines 4698 4709 +11
==========================================
+ Hits 3539 3567 +28
+ Misses 1159 1142 -17 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Agent-Logs-Url: https://github.com/acl-dev/open-coroutine/sessions/b3dc5ac0-1d31-477b-861f-e71ed72a279c Co-authored-by: loongs-zhang <38336731+loongs-zhang@users.noreply.github.com>
Agent-Logs-Url: https://github.com/acl-dev/open-coroutine/sessions/5985d417-fe0c-435b-8f00-24092713b948 Co-authored-by: loongs-zhang <38336731+loongs-zhang@users.noreply.github.com>
loongs-zhang
left a comment
There was a problem hiding this comment.
Focus on did not completed within the specified time and fix them.
Add CoroutinePool::is_local_empty() method and use it in event loop thread exit condition and do_stop_sync() shutdown check. After PR #432, is_empty() checks all queues globally (including siblings), causing the event loop to block until all sibling queues drain. Using is_local_empty() restores local-only exit semantics so each event loop exits when its own work is done. Agent-Logs-Url: https://github.com/acl-dev/open-coroutine/sessions/a25cbffc-7ce7-4db8-a618-776a75374e56 Co-authored-by: loongs-zhang <38336731+loongs-zhang@users.noreply.github.com>
Fixed in 5722944. Added |
loongs-zhang
left a comment
There was a problem hiding this comment.
Do not modify the code I have changed, continue to investigate and fix the CI errors.
…connects Agent-Logs-Url: https://github.com/acl-dev/open-coroutine/sessions/a2ca3c38-53d1-46ce-9e2c-d5e91dba0877 Co-authored-by: loongs-zhang <38336731+loongs-zhang@users.noreply.github.com>
Understood — |
loongs-zhang
left a comment
There was a problem hiding this comment.
I think your change is effective, but strangely, according to https://github.com/acl-dev/open-coroutine/actions/runs/23886892084/job/69651487944, we see Thu, 02 Apr 2026 14:38:35 +0800 INFO ThreadId(04) open_coroutine_core::syscall::windows::connect: 22: exit syscall connect 0 The operation completed successfully. (os error 0), then why Thu, 02 Apr 2026 14:38:35 +0800 INFO ThreadId(04) open_coroutine_core::syscall::windows::send: 20: exit syscall send -1 A request to send or receive data was disallowed because the socket is not connected and (when sending on a datagram socket using a sendto call) no address was supplied. (os error 10057)?
…rt conflicts on CI Agent-Logs-Url: https://github.com/acl-dev/open-coroutine/sessions/686cdaf5-d091-4828-8fbf-0095b00365ee Co-authored-by: loongs-zhang <38336731+loongs-zhang@users.noreply.github.com>
Analyzed the CI log from that run. The NIO connect handler ( Two issues contribute:
Fixed in |
loongs-zhang
left a comment
There was a problem hiding this comment.
Can NioConnectSyscall add code to ensure success and avoid fake success? Both check them on unix and windows.
…oth Unix and Windows Agent-Logs-Url: https://github.com/acl-dev/open-coroutine/sessions/da16b3c8-429d-4dd1-b997-44804647bbc2 Co-authored-by: loongs-zhang <38336731+loongs-zhang@users.noreply.github.com>
Added in |
loongs-zhang
left a comment
There was a problem hiding this comment.
How about simply swapping the order of getsockopt and getpeername in the original code? Both on unix and windows.
…irst, SO_ERROR last as final authority Agent-Logs-Url: https://github.com/acl-dev/open-coroutine/sessions/1eefc664-521f-4198-9f12-9833642e6633 Co-authored-by: loongs-zhang <38336731+loongs-zhang@users.noreply.github.com>
Done in |
loongs-zhang
left a comment
There was a problem hiding this comment.
Good, now I need you to remove all let port = 7060; and let port = 7061; on mod operator, just like your modifications to socket_co.
…event port conflicts Agent-Logs-Url: https://github.com/acl-dev/open-coroutine/sessions/034a9a08-a55e-4f25-b34b-5d4982dfa51a Co-authored-by: loongs-zhang <38336731+loongs-zhang@users.noreply.github.com>
Done in |
CI run 23856315454 fails across Windows socket tests and emulated Linux targets. Four distinct root causes identified.
Windows
impl_facade!errno clobberingThe facade macro logs via
tracing::info!()after the NIO layer returns, which triggersCreateFileWthrough tracing-appender. This clobbersGetLastError(), so callers see stale errors (e.g., error 6 "The handle is invalid" from aCreateFileWfailure) instead of the actual socket error.Fix: capture
GetLastError()immediately after the inner syscall, use the saved value for logging, restore before returning.Event loop timeout ("did not completed within the specified time")
PR #432 renamed
OrderedLocalQueue::len()→local_len()and added a newlen()that iterates all queues (local + shared + siblings).is_empty()now means "entire work-steal system is empty" rather than "this pool's local queue is empty." The event loop exit condition!consumer.is_empty()anddo_stop_sync()check block until all sibling queues drain — causing timeouts on Windows and slower emulated targets.Fix: add a new
CoroutinePool::is_local_empty()method that delegates totask_queue.is_local_empty(), and use it in the two event loop exit conditions (event_loop.rslines 394 and 447).CoroutinePool::is_empty(),try_grow(), and the Drop impl remain unchanged with global semantics.Socket test race condition and port conflicts
All four socket test examples (
socket_not_co,socket_co,socket_co_server,socket_co_client) and the operator tests (core/src/net/operator/linux/tests.rs,core/src/net/operator/windows/tests.rs) had two issues:Race condition: Server and client threads spawned simultaneously without ensuring the server is listening before the client connects. On Windows, the first syscall through the tracing facade takes ~1s (tracing-appender's
CreateFileWinitialization), making this race nearly 100% reproducible.Port conflicts: Hardcoded ports (7060, 7061, 8888, 8889, 8899, 8999) can collide with other services on CI runners. The NIO connect handler may report a false success (
connectreturns 0) against a foreign listener, but the subsequentsendfails with WSAENOTCONN (10057) because the foreign process resets the connection. This also interacts withEventLoops::wait_write_event()returning on any event from the shared selector (not specifically the write event for the connecting fd), which can produce false positives fromgetsockopt(SO_ERROR)during a pending connect.Fix: use dynamic port allocation (
TcpListener::bind("127.0.0.1:0")) and pass the OS-assignedSocketAddrfrom server to client through aMutex<Option<SocketAddr>>condvar barrier. The server signals afterTcpListener::bind()returns with the actual address, and the client waits for this signal before attempting to connect. This eliminates both the race condition and port conflict issues. Applied to all socket examples and both Linux and Windows operator tests.NIO connect false success prevention
NioConnectSyscallon both Unix and Windows could report a false connect success. WhenEventLoops::wait_write_event()returns due to an unrelated event on the shared selector while the connect is still pending,getsockopt(SO_ERROR)reads 0 (no error yet) andgetpeername()may transiently succeed, causing the handler to report success. The subsequentsendthen fails with ENOTCONN/WSAENOTCONN (10057).Fix: swap the order of
getsockopt(SO_ERROR)andgetpeername()in the NIO connect handler on both Unix and Windows.getpeername()is now checked first as a quick connectivity test, andgetsockopt(SO_ERROR)is checked last as the final authority. Ifgetpeername()fails butSO_ERRORis 0, the connect is still pending — errno is set toEINPROGRESS/WSAEINPROGRESSand the wait loop continues. This ensuresSO_ERRORalways has the last word on connection status.