[FlyDSL AOT] Skip kernels for unrequested arches when GPU_ARCHS is set #3321

Open
eppaneamd wants to merge 8 commits into main from feat/flydsl-aot-gpu-archs-filter

@eppaneamd
Contributor

Summary

Previously, FlyDSL AOT compiled all kernels unconditionally, even when GPU_ARCHS was set to a specific arch at build time (e.g. gfx942). This included hundreds of kernels tuned for other arches that would never be used.

  • Add a _job_arch(job) helper that returns the target arch from any job dict: cu_num mapped through cu_num_to_arch for GEMM/MoE; the explicit "arch" field for CHUNK_GDN_H; None for untuned/arch-agnostic jobs, which must always compile.
  • In start_aot(), filter all_jobs against GPU_ARCHS after collection. Uses _parse_gpu_archs_env from build_targets.py (;-separated, consistent with the rest of the codebase). Import is deferred inside the branch to avoid triggering aiter/__init__ during setup.py's early import of common.py.
  • GPU_ARCHS unset or "native" preserves existing behaviour.
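
The three cases handled by the helper can be sketched roughly as follows. This is an illustration, not the PR's code: the name job_arch, the CU_NUM_TO_ARCH table and its CU counts, the order in which the fields are checked, and the choice to return None for an unknown CU count are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical CU-count -> arch table, for illustration only. The PR uses its
# own cu_num_to_arch mapping, whose contents may differ.
CU_NUM_TO_ARCH = {304: "gfx942", 256: "gfx950"}


def job_arch(job: dict) -> Optional[str]:
    """Return the arch a job targets, or None if the job is arch-agnostic."""
    # Jobs with an explicit "arch" field (the CHUNK_GDN_H case).
    if job.get("arch"):
        return job["arch"]
    # Tuned GEMM/MoE jobs identify their arch through a CU count.
    cu_num = job.get("cu_num")
    if cu_num is not None:
        # Assumption: an unknown CU count is treated as arch-agnostic.
        return CU_NUM_TO_ARCH.get(cu_num)
    # Untuned jobs carry no arch information and must always compile.
    return None
```

Returning None rather than raising keeps the filter conservative: a job whose arch cannot be determined is compiled rather than silently dropped.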

Test plan

  • Unit-tested _job_arch and filter logic: tuned GEMM/MoE, CHUNK_GDN_H explicit arch, untuned jobs, single arch, multi-arch (gfx942;gfx950), unset, and native.
  • Verified end-to-end on AITER main with GPU_ARCHS=gfx942: filter fires correctly, 2001 gfx950-only kernels skipped, build completes without errors.
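
The filter cases listed in the test plan (unset, native, single arch, multi-arch) could be exercised with a standalone sketch like the one below. It does not import aiter: parse_archs is a simplified stand-in for _parse_gpu_archs_env, and filter_jobs, get_arch, and the sample jobs are hypothetical names introduced only for this example.

```python
from typing import Callable, List, Optional, Tuple

Job = Tuple[str, dict]


def parse_archs(value: str) -> List[str]:
    """Simplified stand-in for _parse_gpu_archs_env: split on ';'."""
    return [a.strip() for a in value.split(";") if a.strip()]


def filter_jobs(
    all_jobs: List[Job],
    gpu_archs_env: str,
    job_arch: Callable[[dict], Optional[str]],
) -> List[Job]:
    """Keep jobs whose arch is requested, plus all arch-agnostic jobs."""
    gpu_archs_env = gpu_archs_env.strip()
    # Unset or "native" preserves the existing behaviour: compile everything.
    if not gpu_archs_env or gpu_archs_env.lower() == "native":
        return list(all_jobs)
    requested = set(parse_archs(gpu_archs_env))
    return [
        (kind, job)
        for kind, job in all_jobs
        if (arch := job_arch(job)) is None or arch in requested
    ]


# Hypothetical sample jobs; real job dicts have more fields.
jobs = [
    ("gemm", {"arch": "gfx942"}),
    ("moe", {"arch": "gfx950"}),
    ("untuned", {}),
]


def get_arch(job: dict) -> Optional[str]:
    return job.get("arch")


print([kind for kind, _ in filter_jobs(jobs, "gfx942", get_arch)])
```

With GPU_ARCHS=gfx942 the sketch keeps the gfx942 job and the untuned job and drops the gfx950 one, which mirrors the behaviour described in the summary.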

@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3321 --add-label <label>

Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes FlyDSL AOT compilation by honoring GPU_ARCHS (when explicitly set) to avoid compiling tuned kernels for architectures that won’t be used, reducing build time and work.

Changes:

  • Added a _job_arch(job) helper to derive a target arch from different FlyDSL AOT job shapes (cu_num-based vs explicit "arch").
  • Updated start_aot() to filter the collected AOT jobs against GPU_ARCHS (excluding "native"), and emit a summary of skipped kernels.

Comment on lines +323 to +341
gpu_archs_env = os.environ.get("GPU_ARCHS", "").strip()
if gpu_archs_env and gpu_archs_env.lower() != "native":
    from aiter.jit.utils.build_targets import _parse_gpu_archs_env

    requested = set(_parse_gpu_archs_env(gpu_archs_env))
    before = len(all_jobs)
    all_jobs = [
        (kind, job)
        for kind, job in all_jobs
        if (arch := _job_arch(job)) is None or arch in requested
    ]
    filtered = before - len(all_jobs)
    if filtered:
        print(
            f"[aiter] FlyDSL AOT: GPU_ARCHS={gpu_archs_env!r} skipped"
            f" {filtered} kernels for unrequested arches"
            f" ({len(all_jobs)} remaining)"
        )

@coderfeli coderfeli requested a review from zhiding512 May 25, 2026 01:29
@coderfeli
Collaborator

@zhiding512 take a look?
