Skip to content

test(windows): test infrastructure + Windows production fixes (stacked on #946)#947

Open
lkomali wants to merge 7 commits into
lkomali/aiperf-windows-portfrom
lkomali/aiperf-windows-test-cleanup
Open

test(windows): test infrastructure + Windows production fixes (stacked on #946)#947
lkomali wants to merge 7 commits into
lkomali/aiperf-windows-portfrom
lkomali/aiperf-windows-test-cleanup

Conversation

@lkomali
Copy link
Copy Markdown
Contributor

@lkomali lkomali commented May 15, 2026

Summary

Adds the Windows production fixes that surfaced through testing + the test infrastructure changes to make unit/integration tests pass on Windows.

Production fixes (uncovered through testing)

  • SO_SNDBUF/SO_RCVBUF skip on Windows — without this, aiohttp stalls 9+ minutes at --concurrency 6 on Py 3.13
  • signal.SIGKILLSIGTERM fallback on Windows (no SIGKILL on Windows; would raise AttributeError)
  • Subprocess stdio redirect to NUL on Windows + per-process stderr file (UUID-suffixed, atexit-cleaned if empty)
  • path.is_absolute()path.anchor for Windows path detection (WindowsPath("/tmp/foo").is_absolute() == False)
  • DCGM auto-skip on Windows when --gpu-telemetry not passed (DCGM is Linux-only; saves ~200MB per run)
  • --ui dashboard fallback to --ui simple on non-TTY (Textual hangs on piped stdout on Windows)
  • Wait for all spawned services (not just required) before configure broadcast — fixes silent data loss for optional services on slow targets
  • URL parser drive-letter filter (urlparse("C:\\...") would crash)
  • UTF-8 enforced on log FileHandler, dataset loader file reads, GPU metrics CSV reads, test file IO
  • ASCII bullets in console exporters (cp1252 console safety)
  • Post-shutdown reporting guarded against Rich rendering errors

Test infrastructure

  • AIP-896 quote stripping in test harness shlex
  • Integration test timeouts bumped for slow Windows VDI + pytest-rerunfailures auto-retry for xdist races
  • Windows-specific skips: ctrl_c (CTRL_C_EVENT mapping), 100-worker stress (WinError 1450), DCGM tests
  • IS_WINDOWS conditional timeout in test_request_cancellation (300s on Win, 120s elsewhere)
  • 0.5% tolerance on 1000-concurrency stress assertion (VDI shutdown drops 1–5 in-flight requests)
  • New unit tests for: bootstrap _remove_if_empty, SocketDefaults platform gating, harness shlex modes

Test plan

  • uv run pytest tests/unit/ — 10697 pass on macOS
  • uv run pytest tests/unit/transports/test_socket_defaults.py + tests/unit/common/test_bootstrap_macos.py — covers new platform gating
  • Integration suite on Windows VM (Will/Annamalai validating)

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 15, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@82b0b193793c43c9ebd3431b3313017de82cadf1

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@82b0b193793c43c9ebd3431b3313017de82cadf1

Last updated for commit: 82b0b19Browse code

@github-actions github-actions Bot added the test label May 15, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

❌ Patch coverage is 81.42857% with 13 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/aiperf/orchestrator/subprocess_runner.py 18.18% 9 Missing ⚠️
src/aiperf/controller/system_controller.py 75.00% 1 Missing and 1 partial ⚠️
src/aiperf/common/base_service.py 50.00% 1 Missing ⚠️
src/aiperf/plot/dashboard/callbacks.py 0.00% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@lkomali lkomali force-pushed the lkomali/aiperf-windows-port branch from 4bb3ba9 to bff6081 Compare May 15, 2026 20:30
Comment thread src/aiperf/common/bootstrap.py Outdated
f"{tempfile.gettempdir()}{os.sep}"
f"aiperf_child_{os.getpid()}_{uuid.uuid4().hex[:8]}_stderr.log"
)
err_fd = os.open(err_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating the stderr crash log with os.open(..., O_CREAT, ...) and no mode uses the default permissive file mode in a shared temp directory, which can expose tracebacks or secrets. Fix: create it with restrictive permissions such as 0o600 via os.open(err_path, flags, 0o600).

1: Benchmark failed (errors occurred)
Other: Unexpected error or signal
"""
if IS_WINDOWS:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This redirects Windows subprocess stdout and stderr to NUL unconditionally before validation and error handling, so normal sweep output and parent-captured diagnostics are silently lost. Fix: only release inherited handles when stdio is actually a non-TTY pipe and preserve stderr in the parent capture or a per-run log.

Copy link
Copy Markdown

@dynamo-ops dynamo-ops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous review comments (bootstrap.py file permissions, subprocess_runner.py NUL redirect) are still pending. No new high-confidence issues found in this revision — the Windows porting changes are well-reasoned with good test coverage.

…of windows-port

Signed-off-by: lkomali <lkomali@nvidia.com>
@lkomali lkomali force-pushed the lkomali/aiperf-windows-test-cleanup branch from 268ac11 to 5d4e35f Compare May 15, 2026 21:38
lkomali and others added 5 commits May 18, 2026 10:41
…x-only

Signed-off-by: lkomali <lkomali@nvidia.com>
… timeout multiplier

Signed-off-by: lkomali <lkomali@nvidia.com>
…nk/chmod gating, CSV encoding, path comparison, socket buffer expectations, and matplotlib backend

Signed-off-by: lkomali <lkomali@nvidia.com>
…r disclosure

Signed-off-by: lkomali <lkomali@nvidia.com>
… scheduler precision at high QPS

Signed-off-by: lkomali <lkomali@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants