
Commit b98f373

Merge branch 'feature/puzzletron' into jrausch/distillation-consolidation
2 parents 6abc8ab + 0d6eb7e commit b98f373

229 files changed

Lines changed: 7283 additions & 5484 deletions

File tree


.claude/skills/common/slurm-setup.md

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -74,6 +74,47 @@ include a multi-node-capable partition as the last fallback.
7474

7575
Only submit the full job after the smoke test exits cleanly.
7676

77+
### Docker (non-pyxis) variant
78+
79+
Some clusters don't have pyxis/enroot installed and instead use plain `docker run` on compute nodes. In this case, replace the `srun --container-image` pattern with `docker run` inside the job script:
80+
```bash
#!/bin/bash
#SBATCH --job-name=<name>
#SBATCH --account=<account>
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<N>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=<log_dir>/<name>_%j.log

docker run --rm \
  --gpus all \
  --shm-size=32g \
  --ulimit memlock=-1 \
  --network host \
  -v <data_root>:<data_root> \
  -e CALIB_SIZE="${CALIB_SIZE:-512}" \
  <container_image> \
  bash <path/to/run_script.sh>
```
102+
103+
**Key differences from pyxis**:
104+
105+
- No `srun` wrapper needed — SLURM just allocates the node, Docker runs the container
106+
- Mount paths with `-v` instead of `--container-mounts`
107+
- Pass env vars with `-e` instead of relying on SLURM env propagation
108+
- Use the two-script pattern: SLURM wrapper (sbatch directives + `docker run`) and inner runner (the actual work). The inner runner should unset SLURM env vars and set `HF_HOME`/`HF_DATASETS_OFFLINE` as needed
109+
- **NFS root_squash**: see section 5
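The inner-runner step in the list above (unset SLURM variables, set the Hugging Face cache variables) might be sketched as follows. This is a minimal illustration, not the repository's runner: the helper name `sanitized_env`, the choice to drop only variables prefixed with `SLURM_`, and the cache path are all assumptions.

```python
def sanitized_env(base_env, hf_home, offline=True):
    """Return a copy of base_env without SLURM_* variables, plus HF cache settings.

    Assumption: only variables whose names start with "SLURM_" are dropped.
    """
    env = {k: v for k, v in base_env.items() if not k.startswith("SLURM_")}
    env["HF_HOME"] = hf_home
    if offline:
        # Tells the `datasets` library to avoid network access.
        env["HF_DATASETS_OFFLINE"] = "1"
    return env


# Hypothetical input environment, used only for illustration.
example = sanitized_env(
    {"SLURM_JOB_ID": "12345", "PATH": "/usr/bin"},
    hf_home="/path/to/.hf_cache",
)
print(sorted(example))
```

In a real runner the result would typically be passed as `env=` to `subprocess.run`, with `os.environ` as the base environment.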
110+
111+
**How to detect which pattern to use**: Ask the user how they normally run containers, or check:
112+
```bash
which enroot 2>/dev/null && echo "pyxis/enroot available"
which docker 2>/dev/null && echo "docker available"
```
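The same check can be scripted when an agent needs the answer programmatically. The sketch below assumes that finding a binary on `PATH` is a good enough signal; it only inspects the machine it runs on, so a login node may differ from the compute nodes. The function name is hypothetical.

```python
import shutil


def detect_container_runtimes(which=shutil.which):
    """Return the container launchers found on PATH, in the order checked."""
    found = []
    if which("enroot"):
        found.append("pyxis/enroot")
    if which("docker"):
        found.append("docker")
    return found


print(detect_container_runtimes())
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it is `shutil.which`.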
117+
77118
---
78119

79120
## 3. Monitor Until Completion
@@ -126,3 +167,28 @@ srun \
126167
```
127168

128169
Adjust `--nodes`, `--gpus-per-node`, and the distributed launch command per your workload.
170+
171+
---
172+
173+
## 5. NFS root_squash and Docker Permissions
174+
175+
Docker containers typically run as root, but NFS filesystems with `root_squash` (the default) map root to `nobody`, blocking writes to directories owned by the user. This causes `PermissionError` when creating cache lock files, writing output, or saving logs.
176+
177+
This affects both pyxis/enroot (`srun --container-image`) and plain `docker run` workflows.
178+
179+
**Preferred fix** — run Docker with the host user's UID/GID to match NFS ownership:
180+
```bash
docker run --user "$(id -u):$(id -g)" ...
```
184+
185+
> Note: `--user` may cause issues if the container expects root for package installation. In that case, fall back to the chmod approach below.
186+
187+
**Fallback fix** — open permissions before submitting the job:
188+
```bash
chmod -R g+rwX /path/to/workspace/
chmod -R g+rwX /path/to/.hf_cache/
```
193+
194+
Scope `chmod` to only the directories the job needs — avoid world-writable paths on shared clusters.
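If the `docker run` command is assembled by a script rather than typed by hand, the preferred fix can be sketched as below. The helper name, the image tag, and the mount path are hypothetical; the only Docker flags used are `--rm`, `--user`, and `-v`, all of which appear in this document.

```python
import os
import shlex


def docker_run_argv(image, command, mounts=(), uid=None, gid=None):
    """Build a `docker run` argv that runs as the given (default: current) UID/GID."""
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    argv = ["docker", "run", "--rm", "--user", f"{uid}:{gid}"]
    for path in mounts:
        # Mount each host path at the same location inside the container.
        argv += ["-v", f"{path}:{path}"]
    argv.append(image)
    argv += list(command)
    return argv


# Hypothetical image and script names, for illustration only.
argv = docker_run_argv("my-image:latest", ["bash", "run.sh"], mounts=["/data"])
print(shlex.join(argv))
```

Building an argv list and passing it to `subprocess.run` avoids shell quoting problems with paths that contain spaces.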

.claude/skills/ptq/SKILL.md

Lines changed: 9 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -118,10 +118,16 @@ Report the path and size to the user.
118118
- `mtq.register()` classes **must** define `_setup()` and call it from `__init__`
119119
- Call `mto.enable_huggingface_checkpointing()` **before** quantization
120120
- Wildcard `*gate*` matches too broadly — use `*mlp.gate*` or `*router*`
121-
- VLMs need `AutoModel`, not `AutoModelForCausalLM`
122-
- FP8 loading: `FineGrainedFP8Config(dequantize=True)`, not a dict
121+
- VLMs: `hf_ptq.py` auto-extracts the language model via `extract_and_prepare_language_model_from_vl()` — no manual VLM handling needed in most cases
122+
- FP8 checkpoints: prefer `_QuantFP8Linear` (lazy dequant) over `FineGrainedFP8Config(dequantize=True)`, which uses ~2x the memory. See `references/unsupported-models.md` for details
123123
- Custom quantizer names must end with `_input_quantizer` or `_weight_quantizer`
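The `*gate*` pitfall in the list above can be illustrated with Python's `fnmatch`. This sketch assumes glob-style matching equivalent to `fnmatchcase`; the module names are hypothetical examples of a MoE router and an expert projection, not names taken from a specific model.

```python
from fnmatch import fnmatchcase

# Hypothetical module names for illustration.
module_names = [
    "model.layers.0.mlp.gate",                 # MoE router
    "model.layers.0.mlp.experts.0.gate_proj",  # expert projection
]


def matches(pattern, names):
    """Return the names that match a glob-style pattern (case-sensitive)."""
    return [n for n in names if fnmatchcase(n, pattern)]


broad = matches("*gate*", module_names)       # matches both modules
narrow = matches("*mlp.gate*", module_names)  # matches only the router
print(broad)
print(narrow)
```

Running a pattern against the real `model.named_modules()` names before quantizing is a cheap way to confirm what it actually selects.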
124124

125+
## Common Pitfalls
126+
127+
- **Transformers version**: Newer models (e.g., Devstral/ministral3) may require a transformers version not yet in the container. Check `config.json` for `transformers_version` and upgrade if needed. Install ModelOpt first, then upgrade transformers **with** deps (not `--no-deps`) to pull compatible `huggingface_hub`
128+
- **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
129+
- **NFS root_squash + Docker**: Docker runs as root, but NFS squashes root to `nobody`. Use `docker run --user $(id -u):$(id -g)`, or `chmod -R g+rwX` on needed directories as a fallback. See `skills/common/slurm-setup.md` section 5
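The transformers-version check from the first pitfall might look like the sketch below. It treats `transformers_version` in `config.json` as a hint about the version that saved the checkpoint, not a strict minimum, and it compares only the leading numeric components. Both function names are hypothetical.

```python
def version_tuple(version):
    """Parse leading numeric components, e.g. '4.57.0.dev0' -> (4, 57, 0)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def needs_upgrade(config, installed_version):
    """True if the checkpoint was saved with a newer transformers than installed.

    `config` is the parsed config.json as a dict.
    """
    saved_with = config.get("transformers_version")
    if saved_with is None:
        return False
    return version_tuple(installed_version) < version_tuple(saved_with)


# Hypothetical version numbers, for illustration only.
print(needs_upgrade({"transformers_version": "4.57.0"}, "4.56.2"))
```

In practice `config` would come from `json.load` on the checkpoint's `config.json`, and `installed_version` from `transformers.__version__`.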
130+
125131
## References
126132

127133
| Reference | When to read |
@@ -133,7 +139,7 @@ Report the path and size to the user.
133139
| `references/unsupported-models.md` | Step 4C only (unlisted model) |
134140
| `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
135141
| `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
136-
| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, FSDP2) |
142+
| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
137143
| `examples/llm_ptq/README.md` | Step 3: support matrix, CLI flags, accuracy |
138144
| `modelopt/torch/quantization/config.py` | Step 3: format definitions |
139145
| `modelopt/torch/export/model_utils.py` | Step 4C: TRT-LLM export type mapping |

.claude/skills/ptq/references/slurm-setup-ptq.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -68,3 +68,10 @@ This catches script errors cheaply before using GPU quota on a real run.
6868
See `skills/common/slurm-setup.md` section 2 for the smoke test partition pattern.
6969

7070
Only submit the full calibration job after the smoke test exits cleanly.
71+
72+
---
73+
74+
## 4. PTQ-Specific Notes
75+
76+
- **Gated datasets**: Some calibration datasets (e.g., `nvidia/Nemotron-Post-Training-Dataset-v2`) require HF authentication. Set `HF_TOKEN` in the job environment, or pass `--dataset cnn_dailymail` as a non-gated alternative.
77+
- **NFS permissions**: Docker + NFS root_squash causes `PermissionError` on output/cache dirs. See `skills/common/slurm-setup.md` section 5 for fixes.
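A job wrapper could apply the gated-dataset note before launching calibration, as in this sketch. The helper name and its behavior (fall back whenever `HF_TOKEN` is unset or empty) are assumptions; it does not verify that the token is valid or that the dataset is actually gated.

```python
import os


def pick_calib_dataset(gated_dataset, env=None, fallback="cnn_dailymail"):
    """Use the gated dataset only when HF_TOKEN is set; otherwise fall back."""
    env = os.environ if env is None else env
    return gated_dataset if env.get("HF_TOKEN") else fallback


# Explicit env dicts are used here so the example does not depend on the shell.
print(pick_calib_dataset("nvidia/Nemotron-Post-Training-Dataset-v2", env={}))
```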

.claude/skills/ptq/references/unsupported-models.md

Lines changed: 23 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -49,14 +49,16 @@ print(type(cfg).__name__)
4949
grep -r "class <ArchName>" /tmp/transformers-main/src/transformers/models/
5050
```
5151

52-
- **Found** → install from that clone: `pip install /tmp/transformers-main --quiet`, then re-run `AutoConfig.from_pretrained()`.
52+
- **Found** → install with deps: `pip install /tmp/transformers-main`, then re-run `AutoConfig.from_pretrained()`. **Important**: if ModelOpt is already installed, its `[hf]` extras may have pinned an older transformers. Install ModelOpt first, then upgrade transformers **after** (with deps, not `--no-deps`) so compatible `huggingface_hub` and other transitive deps are pulled in.
5353
- **Not found** → ask the user: *"The checkpoint uses `<ArchName>` which isn't in released or main-branch transformers. Do you have a private fork or custom modeling code?"*
5454
5555
- **No `config.json`** → not a standard HF checkpoint. List the directory for README or `.py` files. If nothing useful, ask the user for the modeling code.
5656
5757
## Step B — Is the checkpoint already FP8-quantized?
5858
59-
Check `config.json` for `"quantization_config"` or scan weight files for `*_scale_inv*` tensors. If found, the model must be dequantized before re-quantizing. HuggingFace's `WeightConverter` only handles standard `weight` / `weight_scale_inv` names and will silently miss non-standard parameter names (e.g., 3D expert tensors in MoE layers). See **Pattern 5** below.
59+
Check `config.json` for `"quantization_config"` with `"quant_method": "fp8"`, or scan weight files for `*_scale_inv*` tensors. If the model uses standard `FP8Linear` modules (2D weights with `weight` + `weight_scale_inv`), ModelOpt's `_QuantFP8Linear` plugin handles them automatically — no manual dequantization needed. The plugin keeps weights in FP8 and dequantizes lazily during calibration, which is memory-efficient.
60+
61+
Manual dequantization is only needed for **non-standard parameter names** (e.g., 3D expert tensors in MoE layers) that the plugin doesn't cover. See **Pattern 5** below.
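The Step B check can be sketched without loading any weights. The sketch assumes a sharded safetensors checkpoint whose `model.safetensors.index.json` contains a `weight_map` of tensor names; single-file checkpoints have no index and would need a different scan. All function names are hypothetical.

```python
import json
from pathlib import Path


def is_fp8_config(config):
    """True if the parsed config.json declares quant_method == 'fp8'."""
    qcfg = config.get("quantization_config") or {}
    return qcfg.get("quant_method") == "fp8"


def scale_inv_tensor_names(weight_map):
    """Return the tensor names that contain '_scale_inv', sorted."""
    return sorted(name for name in weight_map if "_scale_inv" in name)


def inspect_checkpoint(ckpt_dir):
    """Return (declares_fp8, scale_inv_names) for a checkpoint directory."""
    ckpt = Path(ckpt_dir)
    config = json.loads((ckpt / "config.json").read_text())
    index_path = ckpt / "model.safetensors.index.json"
    weight_map = {}
    if index_path.exists():
        weight_map = json.loads(index_path.read_text()).get("weight_map", {})
    return is_fp8_config(config), scale_inv_tensor_names(weight_map)
```

Scale tensors whose names do not follow the `weight` + `weight_scale_inv` pairing are the candidates for manual handling under Pattern 5.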
6062
6163
## Step C — Determine what custom patches are needed
6264
@@ -69,7 +71,7 @@ Custom patches are required when:
6971
- **Fused/batched expert weights** — experts stored as a single parameter (e.g., 3D `[num_experts, in, out]`) rather than separate `nn.Linear` modules → Pattern 1 + 3
7072
- **Self-defined weight parameters** (`nn.Parameter` used directly instead of `nn.Linear`) — common in non-HF or research models → Pattern 1 + 3
7173
- **VLM structure** (vision encoder that should be excluded) → Pattern 4
72-
- **FP8 checkpoint** that needs dequantization before re-quantizing → Pattern 5
74+
- **FP8 checkpoint with non-standard parameter names** (standard `FP8Linear` is handled automatically by the `_QuantFP8Linear` plugin) → Pattern 5
7375
7476
## Step D — Check weight names against ModelOpt's config patterns
7577
@@ -187,7 +189,9 @@ Both methods replace all instances of `original_cls` with `quantized_cls` during
187189
188190
## Pattern 4: VLM Language Model Extraction
189191
190-
For multimodal models, only quantize the language model backbone:
192+
**Note**: `hf_ptq.py` already handles VLMs automatically via `extract_and_prepare_language_model_from_vl()`. It detects multimodal models, extracts the language backbone, and disables quantization for vision/projector modules. This works for most VLMs (tested with Mistral3/Devstral, Nemotron VL, Llama VL, etc.) — try `hf_ptq.py` first before writing custom VLM handling.
193+
194+
For custom scripts or when `hf_ptq.py` doesn't handle the VLM correctly, only quantize the language model backbone:
191195
192196
```python
193197
from modelopt.torch.export.model_utils import get_language_model_from_vl, is_multimodal_model
@@ -218,30 +222,32 @@ quant_cfg["quant_cfg"]["*multi_modal_projector*"] = {"enable": False}
218222
219223
**Known VLM export issue**: The export step (`requantize_resmooth_fused_llm_layers` in `unified_export_hf.py`) may try to run a dummy forward pass on the full VLM instead of the language model backbone. The VLM check in that step currently only recognizes Nemotron VLMs. If you hit this, patch the export to use `is_multimodal_model()` for the VLM check instead of model-specific string matching.
220224
221-
## Pattern 5: FP8 Checkpoint Dequantization
225+
## Pattern 5: FP8 Checkpoint Handling
226+
227+
### Standard FP8Linear modules (preferred — no action needed)
222228
223-
### Standard nn.Linear weights
229+
ModelOpt's `_QuantFP8Linear` plugin (`modelopt/torch/quantization/plugins/huggingface.py`) automatically handles HuggingFace `FP8Linear` modules. It:
224230
225-
HuggingFace handles these automatically with `dequantize=True`:
231+
1. Keeps weights **compact in FP8** in GPU memory during calibration
232+
2. **Dequantizes lazily** on-the-fly during calibration forward passes via `weight_dequant()`
233+
3. Has `unpack_weight()` for full dequantization at export time
234+
235+
This is registered automatically for `transformers.integrations.finegrained_fp8.FP8Linear`. It requires **Triton** to be installed (used internally for FP8 dequantization kernels). Just load the model normally — no `FineGrainedFP8Config(dequantize=True)` needed:
226236
227237
```python
228-
from transformers.utils.quantization_config import FineGrainedFP8Config
229-
230-
model = AutoModel.from_pretrained(
231-
model_path,
232-
torch_dtype=torch.bfloat16,
233-
device_map="auto",
234-
quantization_config=FineGrainedFP8Config(dequantize=True),
235-
)
238+
model = AutoModel.from_pretrained(model_path, device_map="auto", torch_dtype="auto")
239+
# FP8Linear modules stay in FP8 → _QuantFP8Linear handles dequant during calibration
236240
```
237241
242+
**Do NOT use `FineGrainedFP8Config(dequantize=True)`** — it expands the entire model to BF16 upfront, wasting ~2x GPU memory. The plugin approach is both more memory-efficient and simpler.
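The ~2x figure follows from storage width: FP8 uses 1 byte per weight and BF16 uses 2. A rough sketch of the arithmetic, ignoring scale tensors, activations, and any non-quantized layers; the parameter count is a hypothetical example, not a measurement of a specific model.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate weight storage in GiB for a given element width."""
    return num_params * bytes_per_param / 2**30


num_params = 100_000_000_000  # hypothetical 100B-parameter model
fp8_gib = weight_memory_gib(num_params, 1)   # weights kept in FP8
bf16_gib = weight_memory_gib(num_params, 2)  # weights expanded to BF16 upfront
ratio = bf16_gib / fp8_gib
print(f"FP8: {fp8_gib:.1f} GiB, BF16: {bf16_gib:.1f} GiB, ratio: {ratio:.1f}x")
```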
243+
238244
### Non-standard parameter names (e.g., 3D expert weights)
239245
240-
HF's `WeightConverter` uses source patterns `["weight$", "weight_scale_inv", "activation_scale"]`. Parameters with names like `gate_up_proj`, `down_proj`, `w1`, `w2`, `w3` won't match these patterns and will remain in FP8 after loading. Dequantize them manually:
246+
The `_QuantFP8Linear` plugin only handles standard 2D `FP8Linear` modules with `weight` + `weight_scale_inv`. Parameters with non-standard names (e.g., `gate_up_proj`, `down_proj`, `w1`/`w2`/`w3` in fused MoE experts) won't be covered. For these, dequantize manually after loading:
241247
242248
```python
243249
def dequantize_fp8_params(model, param_names=("gate_up_proj", "down_proj")):
244-
"""Dequantize remaining FP8 parameters that HF's WeightConverter missed."""
250+
"""Dequantize remaining FP8 parameters that the plugin doesn't cover."""
245251
count = 0
246252
for name, module in model.named_modules():
247253
for param_name in param_names:
@@ -252,10 +258,8 @@ def dequantize_fp8_params(model, param_names=("gate_up_proj", "down_proj")):
252258
if scale is None:
253259
param.data = param.data.to(torch.bfloat16)
254260
elif scale.dim() == 1:
255-
# Per-tensor scale
256261
param.data = param.data.to(torch.bfloat16) * scale.data[:, None, None].to(torch.bfloat16)
257262
elif scale.dim() == 3:
258-
# Per-block scale: reshape, broadcast, multiply
259263
w = param.data
260264
s = scale.data
261265
assert w.shape[-2] % s.shape[-2] == 0 and w.shape[-1] % s.shape[-1] == 0, (

.claude/skills/ptq/tests.json

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -57,6 +57,21 @@
5757
"Runs hf_ptq.py (not a standalone custom script)",
5858
"Runs smoke test first, then full calibration"
5959
]
60+
},
61+
{
62+
"id": 5,
63+
"prompt": "Quantize MiniMax-M2.5 to nvfp4",
64+
"expected_output": "Agent detects FP8 pre-quantized checkpoint, relies on _QuantFP8Linear plugin for standard FP8Linear modules, dequantizes non-standard MoE expert weights manually, then runs PTQ",
65+
"files": [],
66+
"expectations": [
67+
"Checks README — MiniMax-M2.5 is NOT listed",
68+
"Reads unsupported-models.md (4C path)",
69+
"Detects FP8 quantization_config in config.json (Step B)",
70+
"Identifies _QuantFP8Linear plugin handles standard FP8Linear modules automatically",
71+
"Identifies non-standard 3D MoE expert weights that need manual dequantization (Pattern 5)",
72+
"Applies manual dequantize_fp8_params for fused expert tensors",
73+
"Runs smoke test first, then full calibration"
74+
]
6075
}
6176
]
6277
}

.github/workflows/example_tests.yml

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -70,19 +70,19 @@ jobs:
7070
uses: ./.github/workflows/_example_tests_runner.yml
7171
secrets: inherit
7272
with:
73-
docker_image: "nvcr.io/nvidia/pytorch:${{ matrix.docker_image || '26.01' }}-py3"
73+
docker_image: "nvcr.io/nvidia/pytorch:${{ matrix.docker_image || '26.03' }}-py3"
7474
example: ${{ matrix.example }}
7575
timeout_minutes: 30
7676
pip_install_extras: "[hf,dev-test]"
77-
runner: linux-amd64-gpu-h100-latest-1
77+
runner: linux-amd64-gpu-rtxpro6000-latest-1
7878

7979
torch-non-pr:
8080
if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
8181
strategy: *torch_strategy
8282
uses: ./.github/workflows/_example_tests_runner.yml
8383
secrets: inherit
8484
with:
85-
docker_image: "nvcr.io/nvidia/pytorch:${{ matrix.docker_image || '26.01' }}-py3"
85+
docker_image: "nvcr.io/nvidia/pytorch:${{ matrix.docker_image || '26.03' }}-py3"
8686
example: ${{ matrix.example }}
8787
timeout_minutes: 30
8888
pip_install_extras: "[hf,dev-test]"
@@ -99,7 +99,7 @@ jobs:
9999
uses: ./.github/workflows/_example_tests_runner.yml
100100
secrets: inherit
101101
with:
102-
docker_image: "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5"
102+
docker_image: "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10"
103103
example: ${{ matrix.example }}
104104
pip_install_extras: "[hf,dev-test]"
105105
runner: linux-amd64-gpu-rtxpro6000-latest-1
@@ -113,7 +113,7 @@ jobs:
113113
uses: ./.github/workflows/_example_tests_runner.yml
114114
secrets: inherit
115115
with:
116-
docker_image: "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5"
116+
docker_image: "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10"
117117
example: ${{ matrix.example }}
118118
pip_install_extras: "[hf,dev-test]"
119119
runner: linux-amd64-gpu-rtxpro6000-latest-2
@@ -161,7 +161,7 @@ jobs:
161161
docker_image: "nvcr.io/nvidia/tensorrt:26.02-py3"
162162
example: ${{ matrix.example }}
163163
pip_install_extras: "[all,dev-test]"
164-
runner: linux-amd64-gpu-l4-latest-1
164+
runner: linux-amd64-gpu-rtxpro6000-latest-1
165165

166166
onnx-non-pr:
167167
if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}

.github/workflows/gpu_tests.yml

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -65,18 +65,19 @@ jobs:
6565
- example: gpu
6666
timeout: 60
6767
container_image: pytorch:26.01-py3
68+
# tests/gpu/_extensions/test_onnx_extensions.py fails on newer containers until https://github.com/tbenthompson/cppimport/pull/98 is merged
6869
- example: gpu-megatron
6970
timeout: 45
7071
container_image: pytorch:26.01-py3
7172
- example: gpu-trtllm
7273
timeout: 30
73-
container_image: tensorrt-llm/release:1.3.0rc5
74+
container_image: tensorrt-llm/release:1.3.0rc10
7475
runs-on: linux-amd64-gpu-rtxpro6000-latest-1
7576
timeout-minutes: ${{ matrix.timeout }}
7677
container: &gpu_container
7778
image: nvcr.io/nvidia/${{ matrix.container_image }}
7879
env:
79-
GIT_DEPTH: 1000 # For correct version for tests/gpu/torch/quantization/plugins/test_megatron.py
80+
GIT_DEPTH: 1000 # For correct version
8081
PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages
8182
HF_TOKEN: ${{ secrets.HF_TOKEN }}
8283
steps: &gpu_steps

.github/workflows/unit_tests.yml

Lines changed: 8 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -38,7 +38,7 @@ jobs:
3838
- uses: actions/checkout@v6
3939
- uses: ./.github/actions/ubuntu-setup
4040
- name: Run unit tests
41-
run: pip install tox && COV_ARGS="--cov" tox -e py312-torch210-tf_latest-unit
41+
run: pip install tox && COV_ARGS="--cov" tox -e py312-torch211-tf_latest-unit
4242
- name: Upload coverage reports to Codecov
4343
uses: codecov/codecov-action@v5
4444
with:
@@ -65,6 +65,7 @@ jobs:
6565
runs-on: ubuntu-latest
6666
timeout-minutes: 30
6767
strategy:
68+
fail-fast: false
6869
matrix:
6970
py: [10, 11, 13]
7071
steps:
@@ -73,15 +74,16 @@ jobs:
7374
with:
7475
python-version: "3.${{ matrix.py }}"
7576
- name: Run unit tests
76-
run: pip install tox && tox -e py3${{ matrix.py }}-torch210-tf_latest-unit
77+
run: pip install tox && tox -e py3${{ matrix.py }}-torch211-tf_latest-unit
7778
multi-torch:
7879
if: github.event_name == 'pull_request'
7980
needs: [linux]
8081
runs-on: ubuntu-latest
8182
timeout-minutes: 30
8283
strategy:
84+
fail-fast: false
8385
matrix:
84-
torch: [26, 27, 28, 29]
86+
torch: [28, 29, 210]
8587
steps:
8688
- uses: actions/checkout@v6
8789
- uses: ./.github/actions/ubuntu-setup
@@ -93,13 +95,14 @@ jobs:
9395
runs-on: ubuntu-latest
9496
timeout-minutes: 30
9597
strategy:
98+
fail-fast: false
9699
matrix:
97100
tf: [min]
98101
steps:
99102
- uses: actions/checkout@v6
100103
- uses: ./.github/actions/ubuntu-setup
101104
- name: Run unit tests
102-
run: pip install tox && tox -e py312-torch210-tf_${{ matrix.tf }}-unit
105+
run: pip install tox && tox -e py312-torch211-tf_${{ matrix.tf }}-unit
103106
launcher:
104107
if: github.event_name == 'pull_request'
105108
needs: [linux]
@@ -123,6 +126,7 @@ jobs:
123126
runs-on: ubuntu-latest
124127
timeout-minutes: 30
125128
strategy:
129+
fail-fast: false
126130
matrix:
127131
test-env: [onnx, torch]
128132
steps:
