[XPU] Expand Stage B with proven AMD/CUDA tests#25543

Open
arathi-hlab wants to merge 4 commits into sgl-project:main from arathi-hlab:feat/xpu-stage-c

arathi-hlab commented May 17, 2026

Summary

Expands XPU CI Stage B with additional proven tests from AMD/CUDA CI suites:

  • test_torch_native_attention_xpu.py - Tests torch native attention backend with MMLU benchmark (140s)
  • test_hidden_states_xpu.py - Tests hidden states extraction API (45s)

Both tests are:

  1. Part of AMD/CUDA Stage B per-commit suites
  2. Passing in the WW21 HTML report
  3. Adapted with XPU-specific configurations (device="xpu", mem-fraction-static=0.7, disable-radix-cache)
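A minimal sketch of how the XPU-specific settings listed above might be collected into extra server launch arguments. It assumes the three settings are passed as the command-line flags `--device`, `--mem-fraction-static`, and `--disable-radix-cache`; the helper name `build_xpu_server_args` is hypothetical and does not come from this PR.

```python
from typing import List


def build_xpu_server_args(mem_fraction_static: float = 0.7) -> List[str]:
    """Return extra launch flags mirroring the XPU settings in the PR summary.

    Hypothetical helper for illustration only; the actual tests may build
    their argument lists differently.
    """
    return [
        "--device", "xpu",                                   # target the Intel XPU backend
        "--mem-fraction-static", str(mem_fraction_static),   # 0.7 per the PR summary
        "--disable-radix-cache",                             # boolean flag, takes no value
    ]
```

Calling `build_xpu_server_args()` yields a flat list of strings that could be appended to a test's other launch arguments.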

Also removes the Stage C infrastructure, because no passing AMD/CUDA test suitable for a 2-GPU runner was found.

Test Plan

Total CI time: ~55 minutes (Stage A: 22 min, Stage B: 33 min)

Dependencies

Depends on #25405 (XPU basic CI infrastructure)

🤖 Generated with Claude Code


CI States

Latest PR Test (Base): ❌ Missing run-ci label — add it to run CI tests.
Latest PR Test (Extra): ❌ Blocked: run-ci is required first.

vshekhawat-hlab and others added 3 commits May 15, 2026 05:54
Add register_xpu_ci() to the CI registry system and migrate existing
XPU tests from test/srt/xpu/ to test/registered/xpu/, aligning XPU CI
architecture with AMD and Nvidia.

Changes:
- python/sglang/test/ci/ci_register.py: add HWBackend.XPU,
  register_xpu_ci() function, and REGISTER_MAPPING entry
- test/run_suite.py: add "xpu" to HW_MAPPING, XPU suites to
  PER_COMMIT_SUITES (stage-a-test-1-gpu-xpu, stage-b-test-1-gpu-xpu),
  XPU to _SUITE_CHECKED_BACKENDS
- test/srt/run_suite.py: clear legacy suite_xpu dict (tests now
  registered via register_xpu_ci)
- test/registered/xpu/: move tests from test/srt/xpu/ and add
  register_xpu_ci() decorators with est_time and suite assignments
- .github/workflows/pr-test-xpu.yml: replace single flat job with
  3-stage pipeline (stage-a → wait → stage-b) matching AMD/Nvidia
  structure; run_suite.py --hw xpu replaces hardcoded suite invocation

Adding a new XPU test now only requires creating a file in
test/registered/xpu/ with register_xpu_ci(est_time=N, suite=...) --
no workflow changes needed.

Expand XPU Stage B test coverage using tests proven passing in WW21
HTML report from AMD/CUDA Stage B suites.

Hardware: 2x BMG (Battlemage) GPUs.

**Test Selection Methodology:**
All Stage B tests are drawn from AMD and CUDA Stage B tests that pass
in the WW21 HTML report.

**Stage A (Smoke - 1 GPU, ~120s)**
- test_xpu_basic.py (120s) - Quick validation gate

**Stage B (Functional - 1 GPU, ~785s)**
- test_intel_xpu_backend.py (600s) - ✅ PROVEN PASSING (3 tests in HTML)
  * test_latency_qwen_model
  * test_attention_backend
  * test_mla_decode_attention_backend
- test_torch_native_attention_xpu.py (140s) - ✅ Passing in HTML
  * MMLU benchmark with torch native attention
  * Adapted from test/registered/attention/test_torch_native_attention_backend.py
  * AMD Stage B (150s), CUDA Stage B (140s)
- test_hidden_states_xpu.py (45s) - ✅ Passing in HTML
  * Hidden states extraction API
  * Adapted from test/registered/core/test_hidden_states.py
  * AMD Stage B (55s), CUDA Stage B (45s)

**Stage C**:
Skipped for this PR. No suitable AMD/CUDA Stage C tests exist that:
- Run on 2 GPUs (all require 4+ GPUs)
- Are model inference tests (not unit tests)
- Are passing in HTML report
Stage C will be added in a future PR after infrastructure validation.

**Workflow changes (.github/workflows/pr-test-xpu.yml)**:
- Stage A and B jobs remain unchanged
- No Stage C jobs added

**Test infrastructure (test/run_suite.py)**:
- XPU suites: stage-a-test-1-gpu-xpu, stage-b-test-1-gpu-xpu

**Models used (verified passing)**:
- Llama-3.1-8B-Instruct (DEFAULT_MODEL_NAME_FOR_TEST)
- Llama-3.2-1B-Instruct (DEFAULT_SMALL_MODEL_NAME_FOR_TEST)

**Total CI time: ~55 minutes**
- Stage A: ~22 min (20 min build + 2 min test)
- Stage B: ~33 min (20 min build + 13 min test)
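
The totals above can be checked with a short calculation. The per-test estimates and the 20-minute build time come from this commit message; summing the two stages assumes they run one after the other, and the names `BUILD_SECONDS`, `STAGE_TESTS`, and `stage_minutes` are illustrative only.

```python
BUILD_SECONDS = 20 * 60  # each stage is stated to spend about 20 minutes building

# Estimated test runtimes in seconds, as listed in the commit message.
STAGE_TESTS = {
    "stage-a": {"test_xpu_basic.py": 120},
    "stage-b": {
        "test_intel_xpu_backend.py": 600,
        "test_torch_native_attention_xpu.py": 140,
        "test_hidden_states_xpu.py": 45,
    },
}


def stage_minutes(stage: str) -> float:
    """Build time plus summed test estimates for one stage, in minutes."""
    test_seconds = sum(STAGE_TESTS[stage].values())
    return (BUILD_SECONDS + test_seconds) / 60


# Assumes the stages run sequentially, so the total is a plain sum.
total_minutes = sum(stage_minutes(stage) for stage in STAGE_TESTS)
```

Stage B's tests sum to 785 seconds, roughly 13 minutes, which together with the build time gives the stated ~33 minutes.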

Depends on PR sgl-project#25405 (XPU registry system).

gemini-code-assist Bot left a comment


Code Review

This pull request migrates Intel XPU tests to a registry-based CI system by introducing the register_xpu_ci function and updating the HWBackend enumeration. New test cases for hidden states, torch native attention, and basic generation on XPU were added, while existing tests were updated to use the new registration mechanism. Feedback was provided to use idiomatic unittest assertions instead of raw assert statements to improve error reporting in CI.

)()

metrics = run_eval(args)
assert metrics["accuracy"] >= 0.5

medium

Use self.assertGreaterEqual instead of a raw assert statement. This is more idiomatic when using unittest.TestCase and provides a more descriptive error message if the assertion fails, which is helpful for debugging CI failures.

Suggested change
- assert metrics["accuracy"] >= 0.5
+ self.assertGreaterEqual(metrics["accuracy"], 0.5)
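
To illustrate the reviewer's point, here is a small standalone sketch using only the standard `unittest` module. The metric values are made up, and `AccuracyThresholdExample` and `failure_message` are hypothetical names that are not part of the PR. On failure, `assertGreaterEqual` reports both operands in its message, whereas a bare `assert` under plain `unittest` raises an `AssertionError` with no message.

```python
import unittest


class AccuracyThresholdExample(unittest.TestCase):
    def test_accuracy_meets_threshold(self):
        # Made-up metrics dict standing in for the result of run_eval(args).
        metrics = {"accuracy": 0.62}
        self.assertGreaterEqual(metrics["accuracy"], 0.5)


def failure_message(value: float, threshold: float) -> str:
    """Return the message assertGreaterEqual produces, or "" if it passes."""
    case = unittest.TestCase()
    try:
        case.assertGreaterEqual(value, threshold)
    except AssertionError as exc:
        return str(exc)
    return ""
```

For example, `failure_message(0.4, 0.5)` returns text naming both `0.4` and `0.5`, which is the extra context that helps when reading CI logs.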
