fix: retry PTY drain on empty select() to prevent data loss on macOS#826
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe PTY output drain logic in ChangesPTY drain consecutive-empty retry
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
289c42c to
c24c73a
Compare
The PTY drain logic in read_pty_output's finally block broke out of the drain loop after a single empty select() poll (100ms). On macOS CI runners under load (make test -j4 with concurrent package installs), PTY kernel buffer delivery can be delayed beyond 100ms after subprocess exit, causing the drain to miss the output entirely. Replace the immediate break with a consecutive empty poll counter that requires DRAIN_MAX_EMPTY_POLLS (10) empty select() calls (~1s) before concluding the drain is complete. This extends the tolerance window while still respecting the existing DRAIN_TIMEOUT_SECONDS (2s) and MAX_DRAIN_BYTES limits. Isolated reproduction confirms: - Original: 0/20 pass rate when data arrives >100ms after drain starts - Fixed: 20/20 pass rate for delays up to 250ms+ Fixes #821
c24c73a to
7a8ef9a
Compare
Add three tests exercising the consecutive empty poll counter in the drain loop, as suggested in PR review: - test_drain_retries_empty_select_then_captures_data: verifies data is captured after N empty select() calls (N < DRAIN_MAX_EMPTY_POLLS) - test_drain_terminates_after_max_empty_polls: verifies the loop terminates after DRAIN_MAX_EMPTY_POLLS consecutive empties - test_drain_empty_counter_resets_on_data: verifies the counter resets when data arrives between empty polls (empty-data-empty pattern)
There was a problem hiding this comment.
🧹 Nitpick comments (1)
python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py (1)
905-1025: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winStrengthen assertions to validate drain-path behavior, not just main-loop output.
At Line 943, Line 980, and Line 1024, assertions only verify normal hook output, which can be logged before drain starts. These tests can still pass if consecutive-empty retry/reset behavior regresses. Assert on post-EOF drain marker presence/absence to directly validate the new logic.
Proposed assertion-focused diff
@@ async def test_drain_retries_empty_select_then_captures_data(self, lease_scope) -> None: @@ with ( @@ ): result = await executor.execute_before_lease_hook(lease_scope) assert result is None info_calls = [str(call) for call in mock_logger.info.call_args_list] assert any("DELAYED_DRAIN_OK" in call for call in info_calls) + # Proves drain eventually read post-EOF data after initial empty polls. + assert any("SHOULD_NOT_APPEAR" in call for call in info_calls) @@ - state = _PtyTracker(return_drain_data=False) + state = _PtyTracker() @@ with ( @@ ): result = await executor.execute_before_lease_hook(lease_scope) assert result is None # Main loop should have captured the output before drain info_calls = [str(call) for call in mock_logger.info.call_args_list] assert any("MAX_EMPTY_TEST" in call for call in info_calls) + # Proves drain terminated on max consecutive empties before reading drain data. + assert not any("SHOULD_NOT_APPEAR" in call for call in info_calls) @@ with ( @@ ): result = await executor.execute_before_lease_hook(lease_scope) assert result is None info_calls = [str(call) for call in mock_logger.info.call_args_list] assert any("INTERLEAVE_TEST" in call for call in info_calls) + # Proves empty counter reset allowed later readable cycle in drain. + assert any("SHOULD_NOT_APPEAR" in call for call in info_calls)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py` around lines 905 - 1025, The three test methods test_drain_retries_empty_select_then_captures_data, test_drain_terminates_after_max_empty_polls, and test_drain_empty_counter_resets_on_data currently assert only on normal hook output (the echo commands), which executes before drain starts and doesn't validate drain-specific behavior. Strengthen each test by adding assertions that check for post-EOF drain markers or drain-specific logging. For test_drain_retries_empty_select_then_captures_data, verify drain data capture after empty polls. For test_drain_terminates_after_max_empty_polls, verify that drain data is absent to confirm termination occurred. For test_drain_empty_counter_resets_on_data, verify drain handling of the interleaved empty/data pattern. Replace the generic output checks with assertions that directly validate the drain logic being tested.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py`:
- Around line 905-1025: The three test methods
test_drain_retries_empty_select_then_captures_data,
test_drain_terminates_after_max_empty_polls, and
test_drain_empty_counter_resets_on_data currently assert only on normal hook
output (the echo commands), which executes before drain starts and doesn't
validate drain-specific behavior. Strengthen each test by adding assertions that
check for post-EOF drain markers or drain-specific logging. For
test_drain_retries_empty_select_then_captures_data, verify drain data capture
after empty polls. For test_drain_terminates_after_max_empty_polls, verify that
drain data is absent to confirm termination occurred. For
test_drain_empty_counter_resets_on_data, verify drain handling of the
interleaved empty/data pattern. Replace the generic output checks with
assertions that directly validate the drain logic being tested.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 368bca62-70d7-4999-b750-6cba07db7185
📒 Files selected for processing (2)
python/packages/jumpstarter/jumpstarter/exporter/hooks.pypython/packages/jumpstarter/jumpstarter/exporter/hooks_test.py
🚧 Files skipped from review as they are similar to previous changes (1)
- python/packages/jumpstarter/jumpstarter/exporter/hooks.py
|
@mangelajo It seems like the beforeLease hook seems to be failing in the Ubuntu 24 tests here now... |
it's the BLE test that fails :D (we are fixing it on the other PR...) |
|
@mangelajo Humm, now macOS pytests are failing due to failed assertions of output. Ahh, the BLE makes sense :) |
Add mathio/gha-cleanup to remove unused preinstalled tools (Java, .NET, Android SDK, browsers, etc.) and prune Docker to free disk space on GitHub Actions runners. This reduces resource contention during parallel test execution. Pinned to SHA for supply-chain safety. Renovate will auto-detect it via the existing helpers:pinGitHubActionDigests preset and the github-actions-other package rule.
## Problem The alleged macOS PTY kernel buffer timing race condition continues to cause random test failures in CI despite multiple fix attempts: - #560 — Added synchronous drain after async reader cancellation - #733 — Added `select()`-based drain with timeout - #826 — Increased `DRAIN_MAX_EMPTY_POLLS` from 1 to 10 **Latest failing CI run:** https://github.com/jumpstarter-dev/jumpstarter/actions/runs/28084374100/job/83146665529 Each fix reduced the frequency but did not eliminate the race. The CI runs parallel test suites with concurrent package installations, creating enough I/O contention for the macOS PTY buffer to miss the drain window. We haven't managed to reproduce it locally, but it happens on CI. ## Workaround Mark all 19 tests that spawn real subprocesses via PTY and assert on captured logger output as `xfail(strict=False)` on macOS. This unblocks CI while the root cause is investigated. Using `strict=False` means: - If the test **passes** on macOS → reported as `xpass` (no failure) - If the test **fails** on macOS → reported as `xfail` (no failure) - On **Linux** → marker has no effect, tests run normally ### Tests marked with `@macos_pty_xfail` **Output capture tests:** - `test_hook_environment_variables` - `test_real_time_output_logging` - `test_post_lease_hook_execution_on_completion` - `test_exec_bash` - `test_exec_python3` - `test_script_file_sh` - `test_script_file_py_autodetects_python` - `test_script_file_py_exec_override` - `test_noninteractive_environment` - `test_drain_captures_output_without_trailing_newline` **Drain behavior tests (patched PTY):** - `test_drain_reads_data_remaining_in_pty_buffer` - `test_drain_select_oserror_exits_gracefully` - `test_drain_select_valueerror_exits_gracefully` - `test_drain_exits_when_deadline_exceeded_before_select` - `test_drain_exception_is_suppressed` - `test_drain_retries_empty_select_then_captures_data` - `test_drain_terminates_after_max_empty_polls` - `test_drain_empty_counter_resets_on_data` **Regression tests:** - `test_infrastructure_messages_at_debug_not_info` ## Related - Re-opened #821 with updated description and moved to **0.10.0** milestone - Added comments to #733 and #560 noting the issue persists Handles #821 (targeted for proper fix in 0.10.0)
Replace start_new_session=True with process_group=0 to prevent macOS PTY revocation on subprocess exit. Restructure the reader loop to always attempt os.read() before checking the stop flag, preventing event-loop scheduling starvation from skipping all reads. These two changes address the root cause of the flaky macOS PTY tests (jumpstarter-dev#560, jumpstarter-dev#733, jumpstarter-dev#821, jumpstarter-dev#826) rather than the symptoms. The drain retry logic from jumpstarter-dev#826 is no longer needed and is removed. Closes jumpstarter-dev#821 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace start_new_session=True with process_group=0 to prevent macOS PTY revocation on subprocess exit. Restructure the reader loop to always attempt os.read() before checking the stop flag, preventing event-loop scheduling starvation from skipping all reads. These two changes address the root cause of the flaky macOS PTY tests (jumpstarter-dev#560, jumpstarter-dev#733, jumpstarter-dev#821, jumpstarter-dev#826) rather than the symptoms. The drain retry logic from jumpstarter-dev#826 is no longer needed and is removed. Closes jumpstarter-dev#821 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problem
test_exec_bashis flaky on macOS CI runners (macos-15, Python 3.11). The PTY drain logic gives up after a single emptyselect()poll (100ms), but on loaded CI runners, macOS PTY kernel buffer delivery can be delayed beyond 100ms after subprocess exit.Failing CI run: https://github.com/jumpstarter-dev/jumpstarter/actions/runs/27971589029/job/82778910902
This is the third occurrence of this issue (#560, #733, #821). The
select()-based drain from #733 was the right approach but insufficient — it breaks after a single empty poll.Root Cause
In the
read_pty_outputfinally block, the drain loop does:CI runs
make test -j4(4 parallel test suites) with concurrentuvpackage installations, creating enough I/O and CPU contention for the macOS PTY buffer to miss this single 100ms window.Fix
Add a
DRAIN_MAX_EMPTY_POLLS = 10constant and track consecutive emptyselect()polls, only giving up after 10 in a row (~1s worst-case). This is well within the existingDRAIN_TIMEOUT_SECONDS(2s) deadline and gives ample headroom for heavily-loaded CI runners.Reproduction
Isolated the drain logic and tested with delayed PTY writes:
When data arrives after 105ms, the original drain has a 100% failure rate. The fix passes all cases.
Testing
test_exec_bashin isolation (Python 3.11)Fixes #821