[qwen3_5_moe][ci] Track export GPU peak memory and gate it in CI

Gasoonjia · Gasoonjia · commit b3d6df2e1821 · 2026-04-29T00:40:20.000-07:00
## Summary

Add a GPU memory regression guard so that the Qwen3.5 MoE export keeps
fitting on consumer-grade 24 GB GPUs (RTX 4090 / 3090 / A5000 …).

## What this diff does

1. `examples/models/qwen3_5_moe/export.py`
   - Reset CUDA peak memory stats at the start of the CUDA backend setup.
   - At the end of `main()`, when running with `--backend cuda`, print a
     stable, machine-parseable marker line:
       `EXPORT_GPU_PEAK_MEMORY_MB: <peak_in_MB>`
     This makes the actual peak GPU memory consumed by the entire
     load + quantize + lower pipeline visible to both humans and CI.

2. `.ci/scripts/export_model_artifact.sh` (qwen3_5_moe path)
   - Tee the export output to a temp log.
   - Grep the `EXPORT_GPU_PEAK_MEMORY_MB` marker and compare against
     `EXPORT_GPU_PEAK_MB_LIMIT` (default 20480 MB = 20 GB; overridable
     via env var).
   - Fail the job with an explanatory error if the budget is exceeded,
     so any future regression that reintroduces the ~18 GB unnecessary
     GPU clone (or comparable leak) is caught at PR time rather than
     silently breaking 24 GB-class GPUs.
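
For local experimentation, the gate described above can be approximated in Python. This is a minimal sketch and not the CI implementation, which is the shell logic in `.ci/scripts/export_model_artifact.sh`. The helper names `parse_peak_mb` and `check_budget` are hypothetical; only the marker name and the 20480 MB default come from this change.

```python
import re
from typing import Optional, Tuple

# Marker format printed by export.py: "EXPORT_GPU_PEAK_MEMORY_MB: <float>"
MARKER_RE = re.compile(
    r"^EXPORT_GPU_PEAK_MEMORY_MB:\s*([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE
)


def parse_peak_mb(log_text: str) -> Optional[float]:
    """Return the last reported peak in MB, or None if no marker line exists."""
    matches = MARKER_RE.findall(log_text)
    return float(matches[-1]) if matches else None


def check_budget(log_text: str, limit_mb: float = 20480.0) -> Tuple[bool, str]:
    """Hypothetical helper mirroring the CI gate: (passed, message)."""
    peak_mb = parse_peak_mb(log_text)
    if peak_mb is None:
        # A missing marker is a failure, since the budget cannot be enforced.
        return False, "export did not emit EXPORT_GPU_PEAK_MEMORY_MB marker"
    if peak_mb > limit_mb:
        return False, (
            f"export exceeded GPU memory budget "
            f"({peak_mb:.2f} MB > {limit_mb:.0f} MB)"
        )
    return True, f"Export GPU peak memory: {peak_mb:.2f} MB (limit {limit_mb:.0f} MB)"


if __name__ == "__main__":
    sample_log = "lowering...\nEXPORT_GPU_PEAK_MEMORY_MB: 18432.00\n"
    print(check_budget(sample_log))                   # under the default limit
    print(check_budget(sample_log, limit_mb=1000.0))  # forced failure path
```

As in the shell version, a missing marker is treated as a failure rather than a pass, so an export that stops printing the line cannot slip through the gate.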

## Notes

- Current measured peak with the CUDA backend memory fixes (see prior
  commit on this branch) is ~18 GB, leaving ~2 GB headroom under the
  20 GB limit. Without those fixes the peak shoots to ~37 GB and CI
  will fail loudly.
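- The threshold is intentionally tighter than the 24 GB physical capacity
  to leave room for measurement noise and for memory that
  `torch.cuda.max_memory_allocated` does not count, such as allocator
  caching overhead and the CUDA context.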
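
The budget arithmetic behind these notes can be checked in a few lines. This is an illustrative sketch: the ~18 GB and ~37 GB figures are the approximate measurements quoted above, rounded to whole GB, and 1 GB is taken as 1024 MB to match the 20480 MB = 20 GB default.

```python
MB_PER_GB = 1024  # export.py divides bytes by 1024 * 1024, so 1 GB = 1024 MB here

limit_mb = 20480                # default EXPORT_GPU_PEAK_MB_LIMIT
physical_mb = 24 * MB_PER_GB    # 24 GB consumer GPU
measured_mb = 18 * MB_PER_GB    # approximate peak with the memory fixes
regressed_mb = 37 * MB_PER_GB   # approximate peak without the fixes

headroom_mb = limit_mb - measured_mb  # slack before CI fails
margin_mb = physical_mb - limit_mb    # gap between the limit and the card size

print(f"headroom under limit: {headroom_mb} MB")   # 2048 MB
print(f"limit below physical: {margin_mb} MB")     # 4096 MB
print(f"regression trips gate: {regressed_mb > limit_mb}")  # True
```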

## Test Plan

- Manual: ran `python -m executorch.examples.models.qwen3_5_moe.export
  --prequantized <hqq-int4-bundle> --backend cuda` and confirmed the
  marker line is printed at the end with a sensible value (~18 GB).
- Manual: simulated CI gate logic locally with the marker line and
  confirmed both the success path and the failure path (forced
  threshold below the actual peak) behave as expected.
diff --git a/.ci/scripts/export_model_artifact.sh b/.ci/scripts/export_model_artifact.sh
@@ -415,12 +415,38 @@ if [ "$MODEL_NAME" = "qwen3_5_moe" ]; then
 
   # Export to .pte/.ptd (short cache dir avoids objcopy symbol length issues)
   echo "::group::Export"
+  EXPORT_LOG=$(mktemp)
   TORCHINDUCTOR_CACHE_DIR="$INDUCTOR_CACHE" \
   python -m executorch.examples.models.qwen3_5_moe.export \
       --prequantized "$LOCAL_MODEL_DIR" \
-      --output-dir "${OUTPUT_DIR}"
+      --output-dir "${OUTPUT_DIR}" 2>&1 | tee "$EXPORT_LOG"
+  EXPORT_RC=${PIPESTATUS[0]}
   echo "::endgroup::"
 
+  if [ "$EXPORT_RC" -ne 0 ]; then
+    echo "ERROR: Qwen3.5 MoE export failed (exit $EXPORT_RC)"
+    rm -f "$EXPORT_LOG"
+    exit "$EXPORT_RC"
+  fi
+
+  # Gate peak GPU memory so we keep the export viable on consumer GPUs
+  # (e.g. RTX 4090 with 24 GB). The export script prints a machine-
+  # parseable marker line "EXPORT_GPU_PEAK_MEMORY_MB: <float>".
+  EXPORT_GPU_PEAK_MB_LIMIT="${EXPORT_GPU_PEAK_MB_LIMIT:-20480}"
+  PEAK_LINE=$(grep -E '^EXPORT_GPU_PEAK_MEMORY_MB:' "$EXPORT_LOG" | tail -1)
+  rm -f "$EXPORT_LOG"
+  if [ -z "$PEAK_LINE" ]; then
+    echo "ERROR: export did not emit EXPORT_GPU_PEAK_MEMORY_MB marker; cannot enforce GPU memory budget"
+    exit 1
+  fi
+  PEAK_MB=$(echo "$PEAK_LINE" | awk '{print $2}')
+  echo "Export GPU peak memory: ${PEAK_MB} MB (limit ${EXPORT_GPU_PEAK_MB_LIMIT} MB)"
+  if awk -v p="$PEAK_MB" -v l="$EXPORT_GPU_PEAK_MB_LIMIT" 'BEGIN{exit !(p>l)}'; then
+    echo "ERROR: export exceeded GPU memory budget (${PEAK_MB} MB > ${EXPORT_GPU_PEAK_MB_LIMIT} MB)"
+    echo "       — this would prevent the model from being exported on a 24 GB consumer GPU."
+    exit 1
+  fi
+
   test -f "${OUTPUT_DIR}/model.pte"
   test -f "${OUTPUT_DIR}/aoti_cuda_blob.ptd"
   ls -al "${OUTPUT_DIR}"
diff --git a/examples/models/qwen3_5_moe/export.py b/examples/models/qwen3_5_moe/export.py
@@ -967,6 +967,13 @@ def main():  # noqa: C901
         # Register FLA Triton kernel (CUDA only)
         import executorch.backends.cuda.triton.kernels  # noqa: F401
 
+        # Reset peak GPU memory stats so we can report the actual peak
+        # consumed during the export pipeline (load + quantize + lowering)
+        # at the very end. This is also gated by CI to make sure low-VRAM
+        # GPUs (e.g. RTX 4090, 24 GB) can still complete the export.
+        if torch.cuda.is_available():
+            torch.cuda.reset_peak_memory_stats(0)
+
     if args.backend == "mlx":
         if args.prequantized:
             parser.error("--prequantized is not supported with --backend mlx")
@@ -989,6 +996,13 @@ def main():  # noqa: C901
 
     export_and_lower(model, config, args)
 
+    # Report peak GPU memory consumed during the export so CI / users can
+    # gate this against a known budget (e.g. 24 GB consumer GPUs).
+    if args.backend == "cuda" and torch.cuda.is_available():
+        peak_mb = torch.cuda.max_memory_allocated(0) / (1024 * 1024)
+        # Stable, machine-parseable marker for CI grep.
+        print(f"EXPORT_GPU_PEAK_MEMORY_MB: {peak_mb:.2f}")
+
 
 if __name__ == "__main__":
     main()