Skip to content

fix: retry PTY drain on empty select() to prevent data loss on macOS#826

Merged
mangelajo merged 3 commits into
mainfrom
fix/pty-drain-race-821
Jun 23, 2026
Merged

fix: retry PTY drain on empty select() to prevent data loss on macOS#826
mangelajo merged 3 commits into
mainfrom
fix/pty-drain-race-821

Conversation

@mangelajo

@mangelajo mangelajo commented Jun 23, 2026

Copy link
Copy Markdown
Member

Problem

test_exec_bash is flaky on macOS CI runners (macos-15, Python 3.11). The PTY drain logic gives up after a single empty select() poll (100ms), but on loaded CI runners, macOS PTY kernel buffer delivery can be delayed beyond 100ms after subprocess exit.

Failing CI run: https://github.com/jumpstarter-dev/jumpstarter/actions/runs/27971589029/job/82778910902

This is the third occurrence of this issue (#560, #733, #821). The select()-based drain from #733 was the right approach but insufficient — it breaks after a single empty poll.

Root Cause

In the read_pty_output finally block, the drain loop does:

if not readable:
    # Timed out with no data — drain is complete
    break  # <-- gives up after ONE empty 100ms poll

CI runs make test -j4 (4 parallel test suites) with concurrent uv package installations, creating enough I/O and CPU contention for the macOS PTY buffer to miss this single 100ms window.

Fix

Add a DRAIN_MAX_EMPTY_POLLS = 10 constant and track consecutive empty select() polls, only giving up after 10 in a row (~1s worst-case). This is well within the existing DRAIN_TIMEOUT_SECONDS (2s) deadline and gives ample headroom for heavily-loaded CI runners.

Reproduction

Isolated the drain logic and tested with delayed PTY writes:

Delayed PTY write: original drain vs fixed drain
   Delay      Original         Fixed
--------------------------------------
     0ms          PASS          PASS
    50ms          PASS          PASS
    80ms          PASS          PASS
   105ms          0/20          PASS
   120ms          0/20          PASS
   150ms          0/20          PASS
   200ms          0/20          PASS
   250ms          0/20          PASS

When data arrives after 105ms, the original drain has a 100% failure rate. The fix passes all cases.

Testing

  • All 52 hooks tests pass (Python 3.11, macOS)
  • 100/100 pass rate for test_exec_bash in isolation (Python 3.11)
  • 41/41 pass rate under CPU + I/O stress (Python 3.11)

Fixes #821

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 079f4b01-3c97-44b5-a7f3-13176bdfa0dd

📥 Commits

Reviewing files that changed from the base of the PR and between 29dc538 and 9f530d8.

📒 Files selected for processing (1)
  • .github/workflows/python-tests.yaml

📝 Walkthrough

Walkthrough

The PTY output drain logic in hooks.py is updated to handle macOS timing races after subprocess exit. The drain loop now tracks consecutive empty select() polls and only terminates after reaching a threshold, replacing the prior behavior that exited immediately on the first empty poll. Comprehensive regression tests validate the retry logic against delayed PTY buffer delivery scenarios. CI runner disk space is freed before test execution to reduce contention.

Changes

PTY drain consecutive-empty retry

Layer / File(s) Summary
PTY drain retry mechanism
python/packages/jumpstarter/jumpstarter/exporter/hooks.py
Introduces DRAIN_MAX_EMPTY_POLLS constant to bound retry attempts; initializes consecutive_empty counter during drain phase; replaces immediate break-on-empty with increment-and-continue logic that resets the counter when data becomes readable.
Drain retry test coverage
python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py
Imports the new constant and expands constants validation to assert DRAIN_MAX_EMPTY_POLLS == 10; adds three regression tests simulating empty select() sequences, verifying continuation until threshold, and confirming counter resets on readable data.
CI runner disk space cleanup
.github/workflows/python-tests.yaml
Adds a disk-cleanup step to the pytest-matrix job before test execution to reduce I/O and CPU contention on shared runner resources.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • jumpstarter-dev/jumpstarter#561: Modifies the same read_pty_output() PTY drain path in hooks.py, introducing bounded drain via MAX_DRAIN_BYTES/DRAIN_TIMEOUT_SECONDS that this PR further refines with the consecutive-empty retry strategy.

Suggested reviewers

  • kirkbrauer

Poem

🐇 Hoppity hop through the PTY pipe,
One empty poll? No, don't gripe!
Count to ten on misses in a row,
Let the buffer's data safely flow.
macOS races lose their bite — hooray! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 63.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main fix: retrying PTY drain on empty select() to prevent data loss on macOS. It is directly related to the primary changeset.
Description check ✅ Passed The description is well-related to the changeset, clearly explaining the problem, root cause, fix, reproduction steps, and testing results.
Linked Issues check ✅ Passed The PR fully addresses issue #821 by implementing the retry logic for PTY drain with DRAIN_MAX_EMPTY_POLLS constant, retrying up to 10 consecutive empty polls instead of giving up after one, and includes comprehensive tests validating the fix.
Out of Scope Changes check ✅ Passed The GitHub Actions workflow change adding disk cleanup is directly related to addressing the root CI conditions (resource contention) that trigger the PTY race condition, making it in-scope and complementary to the core fix.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/pty-drain-race-821

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@mangelajo mangelajo force-pushed the fix/pty-drain-race-821 branch from 289c42c to c24c73a Compare June 23, 2026 11:28
The PTY drain logic in read_pty_output's finally block broke out of the
drain loop after a single empty select() poll (100ms). On macOS CI
runners under load (make test -j4 with concurrent package installs),
PTY kernel buffer delivery can be delayed beyond 100ms after subprocess
exit, causing the drain to miss the output entirely.

Replace the immediate break with a consecutive empty poll counter that
requires DRAIN_MAX_EMPTY_POLLS (10) empty select() calls (~1s) before
concluding the drain is complete. This extends the tolerance window
while still respecting the existing DRAIN_TIMEOUT_SECONDS (2s) and
MAX_DRAIN_BYTES limits.

Isolated reproduction confirms:
- Original: 0/20 pass rate when data arrives >100ms after drain starts
- Fixed: 20/20 pass rate for delays up to 250ms+

Fixes #821
Comment thread python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py
Add three tests exercising the consecutive empty poll counter in the
drain loop, as suggested in PR review:

- test_drain_retries_empty_select_then_captures_data: verifies data is
  captured after N empty select() calls (N < DRAIN_MAX_EMPTY_POLLS)
- test_drain_terminates_after_max_empty_polls: verifies the loop
  terminates after DRAIN_MAX_EMPTY_POLLS consecutive empties
- test_drain_empty_counter_resets_on_data: verifies the counter resets
  when data arrives between empty polls (empty-data-empty pattern)

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py (1)

905-1025: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Strengthen assertions to validate drain-path behavior, not just main-loop output.

At Line 943, Line 980, and Line 1024, assertions only verify normal hook output, which can be logged before drain starts. These tests can still pass if consecutive-empty retry/reset behavior regresses. Assert on post-EOF drain marker presence/absence to directly validate the new logic.

Proposed assertion-focused diff
@@
     async def test_drain_retries_empty_select_then_captures_data(self, lease_scope) -> None:
@@
         with (
@@
         ):
             result = await executor.execute_before_lease_hook(lease_scope)
             assert result is None
             info_calls = [str(call) for call in mock_logger.info.call_args_list]
             assert any("DELAYED_DRAIN_OK" in call for call in info_calls)
+            # Proves drain eventually read post-EOF data after initial empty polls.
+            assert any("SHOULD_NOT_APPEAR" in call for call in info_calls)

@@
-        state = _PtyTracker(return_drain_data=False)
+        state = _PtyTracker()
@@
         with (
@@
         ):
             result = await executor.execute_before_lease_hook(lease_scope)
             assert result is None
             # Main loop should have captured the output before drain
             info_calls = [str(call) for call in mock_logger.info.call_args_list]
             assert any("MAX_EMPTY_TEST" in call for call in info_calls)
+            # Proves drain terminated on max consecutive empties before reading drain data.
+            assert not any("SHOULD_NOT_APPEAR" in call for call in info_calls)

@@
         with (
@@
         ):
             result = await executor.execute_before_lease_hook(lease_scope)
             assert result is None
             info_calls = [str(call) for call in mock_logger.info.call_args_list]
             assert any("INTERLEAVE_TEST" in call for call in info_calls)
+            # Proves empty counter reset allowed later readable cycle in drain.
+            assert any("SHOULD_NOT_APPEAR" in call for call in info_calls)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py` around lines
905 - 1025, The three test methods
test_drain_retries_empty_select_then_captures_data,
test_drain_terminates_after_max_empty_polls, and
test_drain_empty_counter_resets_on_data currently assert only on normal hook
output (the echo commands), which executes before drain starts and doesn't
validate drain-specific behavior. Strengthen each test by adding assertions that
check for post-EOF drain markers or drain-specific logging. For
test_drain_retries_empty_select_then_captures_data, verify drain data capture
after empty polls. For test_drain_terminates_after_max_empty_polls, verify that
drain data is absent to confirm termination occurred. For
test_drain_empty_counter_resets_on_data, verify drain handling of the
interleaved empty/data pattern. Replace the generic output checks with
assertions that directly validate the drain logic being tested.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py`:
- Around line 905-1025: The three test methods
test_drain_retries_empty_select_then_captures_data,
test_drain_terminates_after_max_empty_polls, and
test_drain_empty_counter_resets_on_data currently assert only on normal hook
output (the echo commands), which executes before drain starts and doesn't
validate drain-specific behavior. Strengthen each test by adding assertions that
check for post-EOF drain markers or drain-specific logging. For
test_drain_retries_empty_select_then_captures_data, verify drain data capture
after empty polls. For test_drain_terminates_after_max_empty_polls, verify that
drain data is absent to confirm termination occurred. For
test_drain_empty_counter_resets_on_data, verify drain handling of the
interleaved empty/data pattern. Replace the generic output checks with
assertions that directly validate the drain logic being tested.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 368bca62-70d7-4999-b750-6cba07db7185

📥 Commits

Reviewing files that changed from the base of the PR and between 289c42c and 29dc538.

📒 Files selected for processing (2)
  • python/packages/jumpstarter/jumpstarter/exporter/hooks.py
  • python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/packages/jumpstarter/jumpstarter/exporter/hooks.py

@kirkbrauer

Copy link
Copy Markdown
Member

@mangelajo It seems like the beforeLease hook seems to be failing in the Ubuntu 24 tests here now...

@mangelajo

Copy link
Copy Markdown
Member Author

@mangelajo It seems like the beforeLease hook seems to be failing in the Ubuntu 24 tests here now...

it's the BLE test that fails :D (we are fixing it on the other PR...)

@mangelajo mangelajo enabled auto-merge (squash) June 23, 2026 15:33
@kirkbrauer

kirkbrauer commented Jun 23, 2026

Copy link
Copy Markdown
Member

@mangelajo Humm, now macOS pytests are failing due to failed assertions of output.

Ahh, the BLE makes sense :)

Add mathio/gha-cleanup to remove unused preinstalled tools (Java,
.NET, Android SDK, browsers, etc.) and prune Docker to free disk
space on GitHub Actions runners. This reduces resource contention
during parallel test execution.

Pinned to SHA for supply-chain safety. Renovate will auto-detect it
via the existing helpers:pinGitHubActionDigests preset and the
github-actions-other package rule.
@mangelajo mangelajo merged commit 86c308e into main Jun 23, 2026
37 of 39 checks passed
@mangelajo mangelajo deleted the fix/pty-drain-race-821 branch June 23, 2026 17:29
github-merge-queue Bot pushed a commit that referenced this pull request Jun 25, 2026
## Problem

The alleged macOS PTY kernel buffer timing race condition continues to
cause random test failures in CI despite multiple fix attempts:
- #560 — Added synchronous drain after async reader cancellation
- #733 — Added `select()`-based drain with timeout
- #826 — Increased `DRAIN_MAX_EMPTY_POLLS` from 1 to 10

**Latest failing CI run:**
https://github.com/jumpstarter-dev/jumpstarter/actions/runs/28084374100/job/83146665529

Each fix reduced the frequency but did not eliminate the race. The CI
runs parallel test suites with concurrent package installations,
creating enough I/O contention for the macOS PTY buffer to miss the
drain window. We haven't managed to reproduce it locally, but it happens
on CI.

## Workaround

Mark all 19 tests that spawn real subprocesses via PTY and assert on
captured logger output as `xfail(strict=False)` on macOS. This unblocks
CI while the root cause is investigated.

Using `strict=False` means:
- If the test **passes** on macOS → reported as `xpass` (no failure)
- If the test **fails** on macOS → reported as `xfail` (no failure)
- On **Linux** → marker has no effect, tests run normally

### Tests marked with `@macos_pty_xfail`

**Output capture tests:**
- `test_hook_environment_variables`
- `test_real_time_output_logging`
- `test_post_lease_hook_execution_on_completion`
- `test_exec_bash`
- `test_exec_python3`
- `test_script_file_sh`
- `test_script_file_py_autodetects_python`
- `test_script_file_py_exec_override`
- `test_noninteractive_environment`
- `test_drain_captures_output_without_trailing_newline`

**Drain behavior tests (patched PTY):**
- `test_drain_reads_data_remaining_in_pty_buffer`
- `test_drain_select_oserror_exits_gracefully`
- `test_drain_select_valueerror_exits_gracefully`
- `test_drain_exits_when_deadline_exceeded_before_select`
- `test_drain_exception_is_suppressed`
- `test_drain_retries_empty_select_then_captures_data`
- `test_drain_terminates_after_max_empty_polls`
- `test_drain_empty_counter_resets_on_data`

**Regression tests:**
- `test_infrastructure_messages_at_debug_not_info`

## Related

- Re-opened #821 with updated description and moved to **0.10.0**
milestone
- Added comments to #733 and #560 noting the issue persists

Handles #821 (targeted for proper fix in 0.10.0)
raballew added a commit to raballew/jumpstarter that referenced this pull request Jun 25, 2026
Replace start_new_session=True with process_group=0 to prevent macOS
PTY revocation on subprocess exit. Restructure the reader loop to
always attempt os.read() before checking the stop flag, preventing
event-loop scheduling starvation from skipping all reads.

These two changes address the root cause of the flaky macOS PTY tests
(jumpstarter-dev#560, jumpstarter-dev#733, jumpstarter-dev#821, jumpstarter-dev#826) rather than the symptoms. The drain retry
logic from jumpstarter-dev#826 is no longer needed and is removed.

Closes jumpstarter-dev#821

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
raballew added a commit to raballew/jumpstarter that referenced this pull request Jun 30, 2026
Replace start_new_session=True with process_group=0 to prevent macOS
PTY revocation on subprocess exit. Restructure the reader loop to
always attempt os.read() before checking the stop flag, preventing
event-loop scheduling starvation from skipping all reads.

These two changes address the root cause of the flaky macOS PTY tests
(jumpstarter-dev#560, jumpstarter-dev#733, jumpstarter-dev#821, jumpstarter-dev#826) rather than the symptoms. The drain retry
logic from jumpstarter-dev#826 is no longer needed and is removed.

Closes jumpstarter-dev#821

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test_exec_bash in hooks_test.py on macOS still reproducing after previous fixes

4 participants