Commit 82cf851

[2/N] PTQ skill change for transformers 5.0 (#1229)
### What does this PR do?

**Type of change:** Improve

**Summary:**

- Update MoE Pattern 2 for transformers 5.0 unified fused experts (`_QuantFusedExperts` auto-detection)
- Add `PIP_CONSTRAINT` workaround and `PYTHONPATH` guidance for NGC containers
- Add pip error diagnostic tip (`ResolutionImpossible` ≠ network failure)
- Remove duplicated warnings across files — single source of truth per topic

**Changes by file:**

| File | Change |
| --- | --- |
| references/slurm-setup-ptq.md | Container dependency section: PYTHONPATH preferred, PIP_CONSTRAINT workaround, `--no-deps` fallback |
| references/unsupported-models.md | MoE Pattern 2 updated for transformers 5.0 auto-detection. Pip install advice points to slurm-setup-ptq.md. Pip error diagnostic added |
| SKILL.md | Common Pitfalls simplified — warnings point to references instead of duplicating |

### Usage

### Testing

Tested on gemma4 dense and MoE models.

### Before your PR is "*Ready for review*"

- Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`?: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **Documentation**
  * Clarified Transformers-version checks (prefer config.json) and warned container upgrades can be blocked by PIP_CONSTRAINT; added pointer to remediation.
  * Shortened Docker/NFS guidance by cross-referencing setup docs instead of explicit commands.
  * Reworked SLURM/container workflow to prefer existing images and add an import → pull fallback.
  * Added in-job dependency remediation steps and clarified MoE auto-detection differences and pip conflict troubleshooting.

---------

Signed-off-by: Meng Xin <mxin@nvidia.com>
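The version-gated auto-detection in the first summary bullet can be sketched as a small dispatch helper. The hook names (`register_fused_experts_on_the_fly`, `register_sparse_moe_on_the_fly`) come from this PR's docs; the major-version cutoff logic below is an assumption for illustration, not ModelOpt's actual code:

```python
def pick_moe_register_hook(transformers_version: str) -> str:
    """Hypothetical dispatch: which ModelOpt MoE auto-detection hook applies.

    transformers >= 5.0 ships unified fused experts (3D gate_up_proj/down_proj
    tensors, handled by _QuantFusedExperts); older releases expose per-expert
    nn.Linear modules. The cutoff check here is assumed, not ModelOpt's code.
    """
    major = int(transformers_version.split(".")[0])
    if major >= 5:
        return "register_fused_experts_on_the_fly"
    return "register_sparse_moe_on_the_fly"
```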
1 parent 9050188 · commit 82cf851

File tree: 3 files changed, +51 −28 lines

.claude/skills/ptq/SKILL.md (2 additions & 2 deletions)

````diff
@@ -124,9 +124,9 @@ Report the path and size to the user.
 
 ## Common Pitfalls
 
-- **Transformers version**: Newer models (e.g., Devstral/ministral3) may require a transformers version not yet in the container. Check `config.json` for `transformers_version` and upgrade if needed. Install ModelOpt first, then upgrade transformers **with** deps (not `--no-deps`) to pull compatible `huggingface_hub`
+- **Transformers version**: New models may need a newer version of transformers than what's installed. Check `config.json` for `transformers_version`. In containers, beware of `PIP_CONSTRAINT` blocking upgrades — see `references/slurm-setup-ptq.md` for workarounds
 - **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
-- **NFS root_squash + Docker**: Docker runs as root, but NFS squashes root to `nobody`. Use `docker run --user $(id -u):$(id -g)`, or `chmod -R a+rwX` on needed directories as a fallback. See `skills/common/slurm-setup.md` section 5
+- **NFS root_squash + Docker**: See `skills/common/slurm-setup.md` section 5
 
 ## References
````
.claude/skills/ptq/references/slurm-setup-ptq.md (36 additions & 18 deletions)

````diff
@@ -7,29 +7,54 @@ monitoring), see `skills/common/slurm-setup.md`.
 
 ## 1. Container
 
-Get the recommended image version from `examples/llm_ptq/README.md`, then look for a `.sqsh` file in the workspace and common sibling directories:
+Get the recommended image version from `examples/llm_ptq/README.md`, then look for an existing `.sqsh` file:
 
 ```bash
 ls *.sqsh ../*.sqsh ~/containers/*.sqsh 2>/dev/null
 ```
 
-If you find a `.sqsh` but aren't sure of its version, check it:
+**If a `.sqsh` exists**, use it directly with `--container-image=<path>`. Skip import.
+
+**If no `.sqsh` exists**, import with enroot (caches for subsequent smoke tests and reruns):
 
 ```bash
-srun --container-image=<path/to/container.sqsh> --ntasks=1 bash -c \
-  "pip show tensorrt-llm 2>/dev/null | grep Version || cat /VERSION 2>/dev/null || echo unknown"
+export ENROOT_CACHE_PATH=/path/to/writable/enroot-cache
+export ENROOT_DATA_PATH=/path/to/writable/enroot-data
+mkdir -p "$ENROOT_CACHE_PATH" "$ENROOT_DATA_PATH"
+enroot import --output /path/to/container.sqsh docker://nvcr.io#nvidia/tensorrt-llm/release:<version>
 ```
 
-If no `.sqsh` exists, import it with enroot. Set writable cache paths first — the default `/raid/containers` is often not writable:
+If enroot import fails (e.g., permission errors on lustre), use pyxis inline pull as fallback — pass the NGC URI directly to `--container-image="nvcr.io/nvidia/tensorrt-llm/release:<version>"`. Note this re-pulls on every job.
+
+### Container dependency pitfalls
+
+**New models may need newer transformers** than what's in the container:
 
 ```bash
-export ENROOT_CACHE_PATH=/path/to/writable/enroot-cache
-export ENROOT_DATA_PATH=/path/to/writable/enroot-data
-export TMPDIR=/path/to/writable/tmp
-mkdir -p "$ENROOT_CACHE_PATH" "$ENROOT_DATA_PATH" "$TMPDIR"
+pip install -U transformers
+```
+
+For unlisted models that need unreleased transformers (e.g., from git), see `references/unsupported-models.md` Step A.
+
+**Prefer `PYTHONPATH`** to use the synced ModelOpt source instead of installing inside the container — this avoids risking dependency conflicts (e.g., `pip install -U nvidia-modelopt[hf]` can upgrade PyTorch and break other packages):
+
+```bash
+export PYTHONPATH=/path/to/Model-Optimizer:$PYTHONPATH
+```
+
+If `PYTHONPATH` doesn't work due to missing compiled extensions, fall back to `pip install -e ".[hf]" --no-build-isolation` (run from the Model-Optimizer repo root).
+
+**Watch for pip dependency conflicts** — NGC containers set `PIP_CONSTRAINT` to pin versions, causing `ResolutionImpossible` errors. Unset it first so pip can resolve freely:
+
+```bash
+unset PIP_CONSTRAINT
+pip install -U transformers  # now upgrades and resolves with new deps included
+```
 
-enroot import --output /path/to/container.sqsh \
-  docker://nvcr.io#nvidia/tensorrt-llm/release:<version>
+If that still conflicts, fall back to `--no-deps` (skips new deps — may need to add missing ones manually):
+
+```bash
+pip install -U transformers --no-deps
 ```
 
 ---
@@ -68,10 +93,3 @@ This catches script errors cheaply before using GPU quota on a real run.
 See `skills/common/slurm-setup.md` section 2 for the smoke test partition pattern.
 
 Only submit the full calibration job after the smoke test exits cleanly.
-
----
-
-## 4. PTQ-Specific Notes
-
-- **Gated datasets**: Some calibration datasets (e.g., `nvidia/Nemotron-Post-Training-Dataset-v2`) require HF authentication. Set `HF_TOKEN` in the job environment, or use `--dataset cnn_dailymail` to use a non-gated alternative.
-- **NFS permissions**: Docker + NFS root_squash causes `PermissionError` on output/cache dirs. See `skills/common/slurm-setup.md` section 5 for fixes.
````

.claude/skills/ptq/references/unsupported-models.md (13 additions & 8 deletions)

````diff
@@ -15,7 +15,11 @@ After download, inspect the model files on the target machine (use `remote_run`
 
 Write custom scripts locally (in `./workspaces/<model>/scripts/`), then sync to remote before running.
 
-**Then check `config.json`** (on the target machine):
+**Check transformers compatibility** (on the target machine):
+
+First, if README or `config.json` specifies a required transformers version, check if the installed version satisfies it. If not, upgrade: `pip install -U "transformers>=<required_version>"`.
+
+Then try loading:
 
 ```bash
 python -c "
@@ -40,16 +44,14 @@ print(type(cfg).__name__)
 
 Read the modeling file and proceed to Step B.
 
-- **Raises `ValueError` / `OSError` (unknown architecture)** → not in the installed transformers. Determine why:
-
-  1. **Check the transformers `main` branch** (not yet released):
+- **Raises `ValueError` / `OSError` (unknown architecture)** → not in the installed transformers. Try `pip install -U transformers` first. If still not found, check the `main` branch:
 
   ```bash
   git clone --depth 1 https://github.com/huggingface/transformers.git /tmp/transformers-main --quiet
   grep -r "class <ArchName>" /tmp/transformers-main/src/transformers/models/
   ```
 
-  - **Found** → install with deps: `pip install /tmp/transformers-main`, then re-run `AutoConfig.from_pretrained()`. **Important**: if ModelOpt is already installed, its `[hf]` extras may have pinned an older transformers. Install ModelOpt first, then upgrade transformers **after** (with deps, not `--no-deps`) so compatible `huggingface_hub` and other transitive deps are pulled in.
+  - **Found** → `pip install /tmp/transformers-main`, then re-run `AutoConfig`.
   - **Not found** → ask the user: *"The checkpoint uses `<ArchName>` which isn't in released or main-branch transformers. Do you have a private fork or custom modeling code?"*
 
 - **No `config.json`** → not a standard HF checkpoint. List the directory for README or `.py` files. If nothing useful, ask the user for the modeling code.
@@ -131,13 +133,15 @@ class QuantCustomModule(OriginalModule):
 
 ## Pattern 2: MoE Models
 
-**Standard MoE** (per-expert `nn.Linear` in a `ModuleList` with `gate` + `experts`): Auto-detected by `register_sparse_moe_on_the_fly`. No custom code needed — amax sync and calibration coverage are handled automatically.
+**Most MoE models are auto-detected** — ModelOpt handles two common patterns automatically:
+
+- **transformers >= 5.0**: Unified fused experts (`gate_up_proj` + `down_proj` 3D tensors) → auto-detected by `register_fused_experts_on_the_fly`, handled by `_QuantFusedExperts`. Covers Mixtral, Qwen, DeepSeek, Jamba, OlMoE, etc.
+- **transformers < 5.0**: Sequential per-expert `nn.Linear` with `gate` + `experts` → auto-detected by `register_sparse_moe_on_the_fly`.
 
-**Custom MoE** requires patching. Read the model source to understand how expert weights are stored and computed, then find the closest pattern in the plugin (`modelopt/torch/quantization/plugins/huggingface.py`):
+**Custom MoE** (non-standard layout not matching auto-detection) requires patching. Find the closest pattern in the plugin (`modelopt/torch/quantization/plugins/huggingface.py`):
 
 | MoE design | Strategy | Plugin example |
 | --- | --- | --- |
-| Fused weights + per-expert dispatch loop | Expand to per-expert `nn.Linear` | `_QuantQwen35MoeExperts` |
 | Fused weights + `torch.bmm` | Add `TensorQuantizer` around bmm | `_QuantLlama4TextExperts` |
 | Fused weights + functional interception | Intercept matmul ops | `_QuantGptOssExperts` |
 | Fused 2D weights (experts stacked in rows) | Two-level expansion | `_QuantDbrxExpertGLU` |
@@ -343,3 +347,4 @@ tokenizer.save_pretrained(output_path)
 - **Check quantizer summary**: `mtq.print_quant_summary(model)` shows which quantizers are enabled/disabled
 - **Inspect dtypes**: After loading, iterate `model.named_parameters()` and check for unexpected FP8 tensors
 - **Watch for silent disabling**: A misconfigured wildcard pattern can silently disable quantizers — always verify the summary
+- **Read pip errors carefully**: `ResolutionImpossible` means dependency conflict (try `--no-deps`), NOT network failure. Check for `Connection refused`/`Name resolution failed` before concluding network is down
````
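The pip-error triage from the new "Read pip errors carefully" bullet can be sketched as a log classifier. The substring heuristics are illustrative choices, not an exhaustive list — real pip output varies by version:

```python
def classify_pip_failure(pip_output: str) -> str:
    """Rough triage of a failed pip run: dependency conflict vs. network.
    Substring matching is a heuristic sketch, not exhaustive."""
    text = pip_output.lower()
    if "resolutionimpossible" in text or "conflicting dependencies" in text:
        return "dependency-conflict"  # try unset PIP_CONSTRAINT, then --no-deps
    if any(s in text for s in ("connection refused", "name resolution failed",
                               "read timed out")):
        return "network"
    return "unknown"
```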
