Commit 82cf851

[2/N] PTQ skill change for transformers 5.0 (#1229)
### What does this PR do?

**Type of change:** Improve

**Summary:**

- Update MoE Pattern 2 for transformers 5.0 unified fused experts (`_QuantFusedExperts` auto-detection)
- Add `PIP_CONSTRAINT` workaround and `PYTHONPATH` guidance for NGC containers
- Add pip error diagnostic tip (`ResolutionImpossible` ≠ network failure)
- Remove duplicated warnings across files — single source of truth per topic

**Changes by file:**

| File | Change |
| --- | --- |
| references/slurm-setup-ptq.md | Container dependency section: PYTHONPATH preferred, PIP_CONSTRAINT workaround, `--no-deps` fallback |
| references/unsupported-models.md | MoE Pattern 2 updated for transformers 5.0 auto-detection. Pip install advice points to slurm-setup-ptq.md. Pip error diagnostic added |
| SKILL.md | Common Pitfalls simplified — warnings point to references instead of duplicating |

### Usage

### Testing

Tested on gemma4 dense and MoE models.

### Before your PR is "*Ready for review*"

- Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`?: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **Documentation**
  * Clarified Transformers-version checks (prefer config.json) and warned container upgrades can be blocked by PIP_CONSTRAINT; added pointer to remediation.
  * Shortened Docker/NFS guidance by cross-referencing setup docs instead of explicit commands.
  * Reworked SLURM/container workflow to prefer existing images and add an import → pull fallback.
  * Added in-job dependency remediation steps and clarified MoE auto-detection differences and pip conflict troubleshooting.

---------

Signed-off-by: Meng Xin <mxin@nvidia.com>
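The version-gated auto-detection in the first summary bullet can be sketched as a small dispatch helper. The hook names (`register_fused_experts_on_the_fly`, `register_sparse_moe_on_the_fly`) come from this PR's docs; the major-version cutoff logic below is an assumption for illustration, not ModelOpt's actual code:

```python
def pick_moe_register_hook(transformers_version: str) -> str:
    """Hypothetical dispatch: which ModelOpt MoE auto-detection hook applies.

    transformers >= 5.0 ships unified fused experts (3D gate_up_proj/down_proj
    tensors, handled by _QuantFusedExperts); older releases expose per-expert
    nn.Linear modules. The cutoff check here is assumed, not ModelOpt's code.
    """
    major = int(transformers_version.split(".")[0])
    if major >= 5:
        return "register_fused_experts_on_the_fly"
    return "register_sparse_moe_on_the_fly"
```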
1 parent 9050188 · commit 82cf851

File tree: 3 files changed, +51 −28 lines

.claude/skills/ptq/SKILL.md (2 additions & 2 deletions)

````diff
@@ -124,9 +124,9 @@ Report the path and size to the user.
 
 ## Common Pitfalls
 
-- **Transformers version**: Newer models (e.g., Devstral/ministral3) may require a transformers version not yet in the container. Check `config.json` for `transformers_version` and upgrade if needed. Install ModelOpt first, then upgrade transformers **with** deps (not `--no-deps`) to pull compatible `huggingface_hub`
+- **Transformers version**: New models may need a newer version of transformers than what's installed. Check `config.json` for `transformers_version`. In containers, beware of `PIP_CONSTRAINT` blocking upgrades — see `references/slurm-setup-ptq.md` for workarounds
 - **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
-- **NFS root_squash + Docker**: Docker runs as root, but NFS squashes root to `nobody`. Use `docker run --user $(id -u):$(id -g)`, or `chmod -R a+rwX` on needed directories as a fallback. See `skills/common/slurm-setup.md` section 5
+- **NFS root_squash + Docker**: See `skills/common/slurm-setup.md` section 5
 
 ## References
````
.claude/skills/ptq/references/slurm-setup-ptq.md (36 additions & 18 deletions)

````diff
@@ -7,29 +7,54 @@ monitoring), see `skills/common/slurm-setup.md`.
 
 ## 1. Container
 
-Get the recommended image version from `examples/llm_ptq/README.md`, then look for a `.sqsh` file in the workspace and common sibling directories:
+Get the recommended image version from `examples/llm_ptq/README.md`, then look for an existing `.sqsh` file:
 
 ```bash
 ls *.sqsh ../*.sqsh ~/containers/*.sqsh 2>/dev/null
 ```
 
-If you find a `.sqsh` but aren't sure of its version, check it:
+**If a `.sqsh` exists**, use it directly with `--container-image=<path>`. Skip import.
+
+**If no `.sqsh` exists**, import with enroot (caches for subsequent smoke tests and reruns):
 
 ```bash
-srun --container-image=<path/to/container.sqsh> --ntasks=1 bash -c \
-  "pip show tensorrt-llm 2>/dev/null | grep Version || cat /VERSION 2>/dev/null || echo unknown"
+export ENROOT_CACHE_PATH=/path/to/writable/enroot-cache
+export ENROOT_DATA_PATH=/path/to/writable/enroot-data
+mkdir -p "$ENROOT_CACHE_PATH" "$ENROOT_DATA_PATH"
+enroot import --output /path/to/container.sqsh docker://nvcr.io#nvidia/tensorrt-llm/release:<version>
 ```
 
-If no `.sqsh` exists, import it with enroot. Set writable cache paths first — the default `/raid/containers` is often not writable:
+If enroot import fails (e.g., permission errors on lustre), use pyxis inline pull as fallback — pass the NGC URI directly to `--container-image="nvcr.io/nvidia/tensorrt-llm/release:<version>"`. Note this re-pulls on every job.
+
+### Container dependency pitfalls
+
+**New models may need newer transformers** than what's in the container:
 
 ```bash
-export ENROOT_CACHE_PATH=/path/to/writable/enroot-cache
-export ENROOT_DATA_PATH=/path/to/writable/enroot-data
-export TMPDIR=/path/to/writable/tmp
-mkdir -p "$ENROOT_CACHE_PATH" "$ENROOT_DATA_PATH" "$TMPDIR"
+pip install -U transformers
+```
+
+For unlisted models that need unreleased transformers (e.g., from git), see `references/unsupported-models.md` Step A.
+
+**Prefer `PYTHONPATH`** to use the synced ModelOpt source instead of installing inside the container — this avoids risking dependency conflicts (e.g., `pip install -U nvidia-modelopt[hf]` can upgrade PyTorch and break other packages):
+
+```bash
+export PYTHONPATH=/path/to/Model-Optimizer:$PYTHONPATH
+```
+
+If `PYTHONPATH` doesn't work due to missing compiled extensions, fall back to `pip install -e ".[hf]" --no-build-isolation` (run from the Model-Optimizer repo root).
+
+**Watch for pip dependency conflicts** — NGC containers set `PIP_CONSTRAINT` to pin versions, causing `ResolutionImpossible` errors. Unset it first so pip can resolve freely:
+
+```bash
+unset PIP_CONSTRAINT
+pip install -U transformers  # now upgrades and resolves with new deps included
+```
 
-enroot import --output /path/to/container.sqsh \
-  docker://nvcr.io#nvidia/tensorrt-llm/release:<version>
+If that still conflicts, fall back to `--no-deps` (skips new deps — may need to add missing ones manually):
+
+```bash
+pip install -U transformers --no-deps
 ```
 
 ---
@@ -68,10 +93,3 @@ This catches script errors cheaply before using GPU quota on a real run.
 See `skills/common/slurm-setup.md` section 2 for the smoke test partition pattern.
 
 Only submit the full calibration job after the smoke test exits cleanly.
-
----
-
-## 4. PTQ-Specific Notes
-
-- **Gated datasets**: Some calibration datasets (e.g., `nvidia/Nemotron-Post-Training-Dataset-v2`) require HF authentication. Set `HF_TOKEN` in the job environment, or use `--dataset cnn_dailymail` to use a non-gated alternative.
-- **NFS permissions**: Docker + NFS root_squash causes `PermissionError` on output/cache dirs. See `skills/common/slurm-setup.md` section 5 for fixes.
````

.claude/skills/ptq/references/unsupported-models.md (13 additions & 8 deletions)

````diff
@@ -15,7 +15,11 @@ After download, inspect the model files on the target machine (use `remote_run`
 
 Write custom scripts locally (in `./workspaces/<model>/scripts/`), then sync to remote before running.
 
-**Then check `config.json`** (on the target machine):
+**Check transformers compatibility** (on the target machine):
+
+First, if README or `config.json` specifies a required transformers version, check if the installed version satisfies it. If not, upgrade: `pip install -U "transformers>=<required_version>"`.
+
+Then try loading:
 
 ```bash
 python -c "
@@ -40,16 +44,14 @@ print(type(cfg).__name__)
 
 Read the modeling file and proceed to Step B.
 
-- **Raises `ValueError` / `OSError` (unknown architecture)** → not in the installed transformers. Determine why:
-
-  1. **Check the transformers `main` branch** (not yet released):
+- **Raises `ValueError` / `OSError` (unknown architecture)** → not in the installed transformers. Try `pip install -U transformers` first. If still not found, check the `main` branch:
 
   ```bash
   git clone --depth 1 https://github.com/huggingface/transformers.git /tmp/transformers-main --quiet
   grep -r "class <ArchName>" /tmp/transformers-main/src/transformers/models/
   ```
 
-  - **Found** → install with deps: `pip install /tmp/transformers-main`, then re-run `AutoConfig.from_pretrained()`. **Important**: if ModelOpt is already installed, its `[hf]` extras may have pinned an older transformers. Install ModelOpt first, then upgrade transformers **after** (with deps, not `--no-deps`) so compatible `huggingface_hub` and other transitive deps are pulled in.
+  - **Found** → `pip install /tmp/transformers-main`, then re-run `AutoConfig`.
   - **Not found** → ask the user: *"The checkpoint uses `<ArchName>` which isn't in released or main-branch transformers. Do you have a private fork or custom modeling code?"*
 
 - **No `config.json`** → not a standard HF checkpoint. List the directory for README or `.py` files. If nothing useful, ask the user for the modeling code.
@@ -131,13 +133,15 @@ class QuantCustomModule(OriginalModule):
 
 ## Pattern 2: MoE Models
 
-**Standard MoE** (per-expert `nn.Linear` in a `ModuleList` with `gate` + `experts`): Auto-detected by `register_sparse_moe_on_the_fly`. No custom code needed — amax sync and calibration coverage are handled automatically.
+**Most MoE models are auto-detected** — ModelOpt handles two common patterns automatically:
+
+- **transformers >= 5.0**: Unified fused experts (`gate_up_proj` + `down_proj` 3D tensors) → auto-detected by `register_fused_experts_on_the_fly`, handled by `_QuantFusedExperts`. Covers Mixtral, Qwen, DeepSeek, Jamba, OlMoE, etc.
+- **transformers < 5.0**: Sequential per-expert `nn.Linear` with `gate` + `experts` → auto-detected by `register_sparse_moe_on_the_fly`.
 
-**Custom MoE** requires patching. Read the model source to understand how expert weights are stored and computed, then find the closest pattern in the plugin (`modelopt/torch/quantization/plugins/huggingface.py`):
+**Custom MoE** (non-standard layout not matching auto-detection) requires patching. Find the closest pattern in the plugin (`modelopt/torch/quantization/plugins/huggingface.py`):
 
 | MoE design | Strategy | Plugin example |
 | --- | --- | --- |
-| Fused weights + per-expert dispatch loop | Expand to per-expert `nn.Linear` | `_QuantQwen35MoeExperts` |
 | Fused weights + `torch.bmm` | Add `TensorQuantizer` around bmm | `_QuantLlama4TextExperts` |
 | Fused weights + functional interception | Intercept matmul ops | `_QuantGptOssExperts` |
 | Fused 2D weights (experts stacked in rows) | Two-level expansion | `_QuantDbrxExpertGLU` |
@@ -343,3 +347,4 @@ tokenizer.save_pretrained(output_path)
 - **Check quantizer summary**: `mtq.print_quant_summary(model)` shows which quantizers are enabled/disabled
 - **Inspect dtypes**: After loading, iterate `model.named_parameters()` and check for unexpected FP8 tensors
 - **Watch for silent disabling**: A misconfigured wildcard pattern can silently disable quantizers — always verify the summary
+- **Read pip errors carefully**: `ResolutionImpossible` means dependency conflict (try `--no-deps`), NOT network failure. Check for `Connection refused`/`Name resolution failed` before concluding network is down
````
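The pip-error triage from the new "Read pip errors carefully" bullet can be sketched as a log classifier. The substring heuristics are illustrative choices, not an exhaustive list — real pip output varies by version:

```python
def classify_pip_failure(pip_output: str) -> str:
    """Rough triage of a failed pip run: dependency conflict vs. network.
    Substring matching is a heuristic sketch, not exhaustive."""
    text = pip_output.lower()
    if "resolutionimpossible" in text or "conflicting dependencies" in text:
        return "dependency-conflict"  # try unset PIP_CONSTRAINT, then --no-deps
    if any(s in text for s in ("connection refused", "name resolution failed",
                               "read timed out")):
        return "network"
    return "unknown"
```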
