NVIDIA
diff --git a/.claude/skills/common/slurm-setup.md b/.claude/skills/common/slurm-setup.md
Lines changed: 66 additions & 0 deletions
diff --git a/.claude/skills/ptq/SKILL.md b/.claude/skills/ptq/SKILL.md
Lines changed: 9 additions & 3 deletions
diff --git a/.claude/skills/ptq/references/slurm-setup-ptq.md b/.claude/skills/ptq/references/slurm-setup-ptq.md
Lines changed: 7 additions & 0 deletions
diff --git a/.claude/skills/ptq/references/unsupported-models.md b/.claude/skills/ptq/references/unsupported-models.md
Lines changed: 23 additions & 19 deletions
diff --git a/.claude/skills/ptq/tests.json b/.claude/skills/ptq/tests.json
Lines changed: 15 additions & 0 deletions
diff --git a/.github/workflows/example_tests.yml b/.github/workflows/example_tests.yml
Lines changed: 6 additions & 6 deletions
diff --git a/.github/workflows/gpu_tests.yml b/.github/workflows/gpu_tests.yml
Lines changed: 3 additions & 2 deletions
diff --git a/.github/workflows/unit_tests.yml b/.github/workflows/unit_tests.yml
Lines changed: 8 additions & 4 deletions
@@ -74,6 +74,47 @@ include a multi-node-capable partition as the last fallback.
 
 Only submit the full job after the smoke test exits cleanly.
 
+### Docker (non-pyxis) variant
+
+Some clusters don't have pyxis/enroot installed and instead use plain `docker run` on compute nodes. In this case, replace the `srun --container-image` pattern with `docker run` inside the job script:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=<name>
+#SBATCH --account=<account>
+#SBATCH --partition=<partition>
+#SBATCH --nodes=1
+#SBATCH --ntasks-per-node=1
+#SBATCH --gpus-per-node=<N>
+#SBATCH --time=<HH:MM:SS>
+#SBATCH --output=<log_dir>/<name>_%j.log
+
+docker run --rm \
+    --gpus all \
+    --shm-size=32g \
+    --ulimit memlock=-1 \
+    --network host \
+    -v <data_root>:<data_root> \
+    -e CALIB_SIZE="${CALIB_SIZE:-512}" \
+    <container_image> \
+    bash <path/to/run_script.sh>
+```
+
+**Key differences from pyxis**:
+
+- No `srun` wrapper needed — SLURM just allocates the node, Docker runs the container
+- Mount paths with `-v` instead of `--container-mounts`
+- Pass env vars with `-e` instead of relying on SLURM env propagation
+- Use the two-script pattern: SLURM wrapper (sbatch directives + `docker run`) and inner runner (the actual work). The inner runner should unset SLURM env vars and set `HF_HOME`/`HF_DATASETS_OFFLINE` as needed
+- **NFS root_squash**: see section 5
+
+**How to detect which pattern to use**: Ask the user how they normally run containers, or check:
+
+```bash
+command -v enroot >/dev/null 2>&1 && echo "pyxis/enroot available"
+command -v docker >/dev/null 2>&1 && echo "docker available"
+```
+
 ---
 
 ## 3. Monitor Until Completion
@@ -126,3 +167,28 @@ srun \
 ```
 
 Adjust `--nodes`, `--gpus-per-node`, and the distributed launch command per your workload.
+
+---
+
+## 5. NFS root_squash and Docker Permissions
+
+Docker containers typically run as root, but NFS filesystems with `root_squash` (the default) map root to `nobody`, blocking writes to directories owned by the user. This causes `PermissionError` when creating cache lock files, writing output, or saving logs.
+
+This affects both pyxis/enroot (`srun --container-image`) and plain `docker run` workflows.
+
+**Preferred fix** — run Docker with the host user's UID/GID to match NFS ownership:
+
+```bash
+docker run --user "$(id -u):$(id -g)" ...
+```
+
+> Note: `--user` may cause issues if the container expects root for package installation. In that case, fall back to the chmod approach below.
+
+**Fallback fix** — open permissions before submitting the job:
+
+```bash
+chmod -R g+rwX /path/to/workspace/
+chmod -R g+rwX /path/to/.hf_cache/
+```
+
+Scope `chmod` to only the directories the job needs — avoid world-writable paths on shared clusters.
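
The preferred fix in section 5 can be sketched in Python for a job wrapper that assembles the `docker run` argument list. This is a minimal sketch, not code from this PR: the helper name `build_docker_cmd` and its parameters are hypothetical, and only `--rm`, `--gpus`, `--user`, and `-v` are standard Docker flags.

```python
import os


def build_docker_cmd(image, script, mounts, uid=None, gid=None):
    """Assemble a `docker run` command that runs as the host user.

    Hypothetical helper: matching the container UID/GID to the host user
    means files written to NFS are created as that user, not as squashed root.
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    cmd = ["docker", "run", "--rm", "--gpus", "all", "--user", f"{uid}:{gid}"]
    for path in mounts:
        # Mount each host path at the same location inside the container.
        cmd += ["-v", f"{path}:{path}"]
    cmd += [image, "bash", script]
    return cmd
```

Returning a list rather than a shell string keeps the command safe to pass to `subprocess.run` without quoting concerns. If the container needs root for package installation, this approach does not apply and the `chmod` fallback is the alternative.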
@@ -118,10 +118,16 @@ Report the path and size to the user.
 - `mtq.register()` classes **must** define `_setup()` and call it from `__init__`
 - Call `mto.enable_huggingface_checkpointing()` **before** quantization
 - Wildcard `*gate*` matches too broadly — use `*mlp.gate*` or `*router*`
-- VLMs need `AutoModel`, not `AutoModelForCausalLM`
-- FP8 loading: `FineGrainedFP8Config(dequantize=True)`, not a dict
+- VLMs: `hf_ptq.py` auto-extracts the language model via `extract_and_prepare_language_model_from_vl()` — no manual VLM handling needed in most cases
+- FP8 checkpoints: prefer `_QuantFP8Linear` (lazy dequant) over `FineGrainedFP8Config(dequantize=True)`, which wastes ~2x memory. See `references/unsupported-models.md` for details
 - Custom quantizer names must end with `_input_quantizer` or `_weight_quantizer`
 
+## Common Pitfalls
+
+- **Transformers version**: Newer models (e.g., Devstral/ministral3) may require a transformers version not yet in the container. Check `config.json` for `transformers_version` and upgrade if needed. Install ModelOpt first, then upgrade transformers **with** deps (not `--no-deps`) to pull compatible `huggingface_hub`
+- **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
+- **NFS root_squash + Docker**: Docker runs as root, but NFS squashes root to `nobody`. Use `docker run --user $(id -u):$(id -g)`, or `chmod -R a+rwX` on needed directories as a fallback. See `skills/common/slurm-setup.md` section 5
+
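
The transformers version pitfall can be checked before submitting a job. The sketch below is illustrative only: the helper names are hypothetical, it reads only the top-level `transformers_version` key, and it compares just the first three numeric components, ignoring pre-release qualifiers such as `.dev0`.

```python
import json
import re
from pathlib import Path


def parse_version(text):
    """Return up to three leading numeric components of a version string."""
    return tuple(int(part) for part in re.findall(r"\d+", text.split("+")[0])[:3])


def required_transformers_version(checkpoint_dir):
    """Read `transformers_version` from a checkpoint's config.json, if present."""
    config = json.loads((Path(checkpoint_dir) / "config.json").read_text())
    return config.get("transformers_version")


def needs_upgrade(required, installed):
    """True when the installed version is older than the one the checkpoint records."""
    if required is None:
        return False
    return parse_version(installed) < parse_version(required)
```

The installed version can be obtained with `importlib.metadata.version("transformers")`. A `True` result is only a hint to upgrade: the recorded version is the one the checkpoint was saved with, not necessarily the minimum that can load it.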
 ## References
 
 | Reference | When to read |
@@ -133,7 +139,7 @@ Report the path and size to the user.
 | `references/unsupported-models.md` | Step 4C only (unlisted model) |
 | `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
 | `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
-| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, FSDP2) |
+| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
 | `examples/llm_ptq/README.md` | Step 3: support matrix, CLI flags, accuracy |
 | `modelopt/torch/quantization/config.py` | Step 3: format definitions |
 | `modelopt/torch/export/model_utils.py` | Step 4C: TRT-LLM export type mapping |
 
@@ -68,3 +68,10 @@ This catches script errors cheaply before using GPU quota on a real run.
 See `skills/common/slurm-setup.md` section 2 for the smoke test partition pattern.
 
 Only submit the full calibration job after the smoke test exits cleanly.
+
+---
+
+## 4. PTQ-Specific Notes
+
+- **Gated datasets**: Some calibration datasets (e.g., `nvidia/Nemotron-Post-Training-Dataset-v2`) require HF authentication. Set `HF_TOKEN` in the job environment, or pass `--dataset cnn_dailymail` as a non-gated alternative.
+- **NFS permissions**: Docker + NFS root_squash causes `PermissionError` on output/cache dirs. See `skills/common/slurm-setup.md` section 5 for fixes.
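
The gated-dataset note can be turned into a small preflight check that runs before the job is submitted. This is a sketch under stated assumptions: the function name `choose_calib_dataset` is hypothetical, and the set of gated dataset names must be supplied by the caller, because nothing in this PR provides a way to query it.

```python
import os

GATED_EXAMPLE = "nvidia/Nemotron-Post-Training-Dataset-v2"  # named in the note above
FALLBACK = "cnn_dailymail"  # non-gated alternative named in the note above


def choose_calib_dataset(preferred, gated, env=None):
    """Return `preferred` unless it is gated and no HF_TOKEN is available."""
    env = os.environ if env is None else env
    if preferred in gated and not env.get("HF_TOKEN"):
        # No token: fall back rather than fail partway through calibration.
        return FALLBACK
    return preferred
```

Passing `env` explicitly makes the check easy to exercise without touching the real environment. Whether silently falling back is acceptable depends on the workflow, since a different calibration set can change accuracy.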
@@ -49,14 +49,16 @@ print(type(cfg).__name__)
      grep -r "class <ArchName>" /tmp/transformers-main/src/transformers/models/
      ```
 
-     - **Found** → install from that clone: `pip install /tmp/transformers-main --quiet`, then re-run `AutoConfig.from_pretrained()`.
+     - **Found** → install with deps: `pip install /tmp/transformers-main`, then re-run `AutoConfig.from_pretrained()`. **Important**: if ModelOpt is already installed, its `[hf]` extras may have pinned an older transformers. Install ModelOpt first, then upgrade transformers **after** (with deps, not `--no-deps`) so compatible `huggingface_hub` and other transitive deps are pulled in.
      - **Not found** → ask the user: *"The checkpoint uses `<ArchName>` which isn't in released or main-branch transformers. Do you have a private fork or custom modeling code?"*
 
 - **No `config.json`** → not a standard HF checkpoint. List the directory for README or `.py` files. If nothing useful, ask the user for the modeling code.
 
 ## Step B — Is the checkpoint already FP8-quantized?
 
-Check `config.json` for `"quantization_config"` or scan weight files for `*_scale_inv*` tensors. If found, the model must be dequantized before re-quantizing. HuggingFace's `WeightConverter` only handles standard `weight` / `weight_scale_inv` names and will silently miss non-standard parameter names (e.g., 3D expert tensors in MoE layers). See **Pattern 5** below.
+Check `config.json` for `"quantization_config"` with `"quant_method": "fp8"`, or scan weight files for `*_scale_inv*` tensors. If the model uses standard `FP8Linear` modules (2D weights with `weight` + `weight_scale_inv`), ModelOpt's `_QuantFP8Linear` plugin handles them automatically — no manual dequantization needed. The plugin keeps weights in FP8 and dequantizes lazily during calibration, which is memory-efficient.
+
+Manual dequantization is only needed for **non-standard parameter names** (e.g., 3D expert tensors in MoE layers) that the plugin doesn't cover. See **Pattern 5** below.
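
The two checks in Step B can be sketched as follows. The helper names are hypothetical, and the weight-name scan assumes a sharded safetensors checkpoint whose `model.safetensors.index.json` contains a `weight_map`; single-file checkpoints would need their tensor names read another way.

```python
import json
from pathlib import Path


def is_fp8_config(config):
    """True when config.json declares an FP8 quantization method."""
    quant = config.get("quantization_config") or {}
    return quant.get("quant_method") == "fp8"


def scale_inv_tensors(weight_names):
    """Return tensor names that carry an inverse-scale suffix."""
    return sorted(name for name in weight_names if "_scale_inv" in name)


def inspect_checkpoint(checkpoint_dir):
    """Run both checks against a checkpoint directory."""
    root = Path(checkpoint_dir)
    config = json.loads((root / "config.json").read_text())
    index_path = root / "model.safetensors.index.json"
    names = []
    if index_path.exists():
        # The index maps every tensor name to the shard file that stores it.
        names = list(json.loads(index_path.read_text())["weight_map"])
    return is_fp8_config(config), scale_inv_tensors(names)
```

Scale tensors whose names are not of the form `<module>.weight_scale_inv` are a hint that the checkpoint may contain non-standard parameters of the kind Pattern 5 addresses.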
 
 ## Step C — Determine what custom patches are needed
 
@@ -69,7 +71,7 @@ Custom patches are required when:
 - **Fused/batched expert weights** — experts stored as a single parameter (e.g., 3D `[num_experts, in, out]`) rather than separate `nn.Linear` modules → Pattern 1 + 3
 - **Self-defined weight parameters** (`nn.Parameter` used directly instead of `nn.Linear`) — common in non-HF or research models → Pattern 1 + 3
 - **VLM structure** (vision encoder that should be excluded) → Pattern 4
-- **FP8 checkpoint** that needs dequantization before re-quantizing → Pattern 5
+- **FP8 checkpoint with non-standard parameter names** (standard `FP8Linear` is handled automatically by the `_QuantFP8Linear` plugin) → Pattern 5
 
 ## Step D — Check weight names against ModelOpt's config patterns
 
@@ -187,7 +189,9 @@ Both methods replace all instances of `original_cls` with `quantized_cls` during
 
 ## Pattern 4: VLM Language Model Extraction
 
-For multimodal models, only quantize the language model backbone:
+**Note**: `hf_ptq.py` already handles VLMs automatically via `extract_and_prepare_language_model_from_vl()`. It detects multimodal models, extracts the language backbone, and disables quantization for vision/projector modules. This works for most VLMs (tested with Mistral3/Devstral, Nemotron VL, Llama VL, etc.), so try `hf_ptq.py` before writing custom VLM handling.
+
+For custom scripts or when `hf_ptq.py` doesn't handle the VLM correctly, only quantize the language model backbone:
 
 ```python
 from modelopt.torch.export.model_utils import get_language_model_from_vl, is_multimodal_model
@@ -218,30 +222,32 @@ quant_cfg["quant_cfg"]["*multi_modal_projector*"] = {"enable": False}
 
 **Known VLM export issue**: The export step (`requantize_resmooth_fused_llm_layers` in `unified_export_hf.py`) may try to run a dummy forward pass on the full VLM instead of the language model backbone. This currently only handles Nemotron VLMs. If hit, patch the export to use `is_multimodal_model()` for the VLM check instead of model-specific string matching.
 
-## Pattern 5: FP8 Checkpoint Dequantization
+## Pattern 5: FP8 Checkpoint Handling
+
+### Standard FP8Linear modules (preferred — no action needed)
 
-### Standard nn.Linear weights
+ModelOpt's `_QuantFP8Linear` plugin (`modelopt/torch/quantization/plugins/huggingface.py`) automatically handles HuggingFace `FP8Linear` modules. It:
 
-HuggingFace handles these automatically with `dequantize=True`:
+1. Keeps weights **compact in FP8** in GPU memory during calibration
+2. **Dequantizes lazily** on-the-fly during calibration forward passes via `weight_dequant()`
+3. Has `unpack_weight()` for full dequantization at export time
+
+This is registered automatically for `transformers.integrations.finegrained_fp8.FP8Linear`. It requires **Triton** to be installed (used internally for FP8 dequantization kernels). Just load the model normally — no `FineGrainedFP8Config(dequantize=True)` needed:
 
 ```python
-from transformers.utils.quantization_config import FineGrainedFP8Config
-
-model = AutoModel.from_pretrained(
-    model_path,
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    quantization_config=FineGrainedFP8Config(dequantize=True),
-)
+model = AutoModel.from_pretrained(model_path, device_map="auto", torch_dtype="auto")
+# FP8Linear modules stay in FP8 → _QuantFP8Linear handles dequant during calibration
 ```
 
+**Do NOT use `FineGrainedFP8Config(dequantize=True)`** — it expands the entire model to BF16 upfront, wasting ~2x GPU memory. The plugin approach is both more memory-efficient and simpler.
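
The ~2x figure follows from storage width alone: FP8 uses one byte per weight and BF16 uses two. A rough sketch of the arithmetic, with hypothetical helper names, ignoring scale tensors, activations, and any non-quantized layers:

```python
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}


def weight_gib(num_params, dtype):
    """Approximate weight memory in GiB for a given storage dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def expansion_factor(src="fp8", dst="bf16"):
    """Ratio of weight memory after converting from `src` to `dst`."""
    return BYTES_PER_PARAM[dst] / BYTES_PER_PARAM[src]
```

For example, `weight_gib(2**30, "fp8")` gives 1.0 GiB while `weight_gib(2**30, "bf16")` gives 2.0 GiB. Real checkpoints will deviate somewhat, because scale tensors and layers kept in higher precision add to the FP8 footprint.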
+
 ### Non-standard parameter names (e.g., 3D expert weights)
 
-HF's `WeightConverter` uses source patterns `["weight$", "weight_scale_inv", "activation_scale"]`. Parameters with names like `gate_up_proj`, `down_proj`, `w1`, `w2`, `w3` won't match these patterns and will remain in FP8 after loading. Dequantize them manually:
+The `_QuantFP8Linear` plugin only handles standard 2D `FP8Linear` modules with `weight` + `weight_scale_inv`. Parameters with non-standard names (e.g., `gate_up_proj`, `down_proj`, `w1`/`w2`/`w3` in fused MoE experts) won't be covered. For these, dequantize manually after loading:
 
 ```python
 def dequantize_fp8_params(model, param_names=("gate_up_proj", "down_proj")):
-    """Dequantize remaining FP8 parameters that HF's WeightConverter missed."""
+    """Dequantize remaining FP8 parameters that the plugin doesn't cover."""
     count = 0
     for name, module in model.named_modules():
         for param_name in param_names:
@@ -252,10 +258,8 @@ def dequantize_fp8_params(model, param_names=("gate_up_proj", "down_proj")):
             if scale is None:
                 param.data = param.data.to(torch.bfloat16)
             elif scale.dim() == 1:
-                # Per-tensor scale
                 param.data = param.data.to(torch.bfloat16) * scale.data[:, None, None].to(torch.bfloat16)
             elif scale.dim() == 3:
-                # Per-block scale: reshape, broadcast, multiply
                 w = param.data
                 s = scale.data
                 assert w.shape[-2] % s.shape[-2] == 0 and w.shape[-1] % s.shape[-1] == 0, (
 
@@ -57,6 +57,21 @@
         "Runs hf_ptq.py (not a standalone custom script)",
         "Runs smoke test first, then full calibration"
       ]
+    },
+    {
+      "id": 5,
+      "prompt": "Quantize MiniMax-M2.5 to nvfp4",
+      "expected_output": "Agent detects FP8 pre-quantized checkpoint, relies on _QuantFP8Linear plugin for standard FP8Linear modules, dequantizes non-standard MoE expert weights manually, then runs PTQ",
+      "files": [],
+      "expectations": [
+        "Checks README — MiniMax-M2.5 is NOT listed",
+        "Reads unsupported-models.md (4C path)",
+        "Detects FP8 quantization_config in config.json (Step B)",
+        "Identifies _QuantFP8Linear plugin handles standard FP8Linear modules automatically",
+        "Identifies non-standard 3D MoE expert weights that need manual dequantization (Pattern 5)",
+        "Applies manual dequantize_fp8_params for fused expert tensors",
+        "Runs smoke test first, then full calibration"
+      ]
     }
   ]
 }
@@ -70,19 +70,19 @@ jobs:
     uses: ./.github/workflows/_example_tests_runner.yml
     secrets: inherit
     with:
-      docker_image: "nvcr.io/nvidia/pytorch:${{ matrix.docker_image || '26.01' }}-py3"
+      docker_image: "nvcr.io/nvidia/pytorch:${{ matrix.docker_image || '26.03' }}-py3"
       example: ${{ matrix.example }}
       timeout_minutes: 30
       pip_install_extras: "[hf,dev-test]"
-      runner: linux-amd64-gpu-h100-latest-1
+      runner: linux-amd64-gpu-rtxpro6000-latest-1
 
   torch-non-pr:
     if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
     strategy: *torch_strategy
     uses: ./.github/workflows/_example_tests_runner.yml
     secrets: inherit
     with:
-      docker_image: "nvcr.io/nvidia/pytorch:${{ matrix.docker_image || '26.01' }}-py3"
+      docker_image: "nvcr.io/nvidia/pytorch:${{ matrix.docker_image || '26.03' }}-py3"
       example: ${{ matrix.example }}
       timeout_minutes: 30
       pip_install_extras: "[hf,dev-test]"
@@ -99,7 +99,7 @@ jobs:
     uses: ./.github/workflows/_example_tests_runner.yml
     secrets: inherit
     with:
-      docker_image: "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5"
+      docker_image: "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10"
       example: ${{ matrix.example }}
       pip_install_extras: "[hf,dev-test]"
       runner: linux-amd64-gpu-rtxpro6000-latest-1
@@ -113,7 +113,7 @@ jobs:
     uses: ./.github/workflows/_example_tests_runner.yml
     secrets: inherit
     with:
-      docker_image: "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5"
+      docker_image: "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10"
       example: ${{ matrix.example }}
       pip_install_extras: "[hf,dev-test]"
       runner: linux-amd64-gpu-rtxpro6000-latest-2
@@ -161,7 +161,7 @@ jobs:
       docker_image: "nvcr.io/nvidia/tensorrt:26.02-py3"
       example: ${{ matrix.example }}
       pip_install_extras: "[all,dev-test]"
-      runner: linux-amd64-gpu-l4-latest-1
+      runner: linux-amd64-gpu-rtxpro6000-latest-1
 
   onnx-non-pr:
     if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
 
@@ -65,18 +65,19 @@ jobs:
           - example: gpu
             timeout: 60
             container_image: pytorch:26.01-py3
+            # tests/gpu/_extensions/test_onnx_extensions.py fails for newer containers until https://github.com/tbenthompson/cppimport/pull/98
           - example: gpu-megatron
             timeout: 45
             container_image: pytorch:26.01-py3
           - example: gpu-trtllm
             timeout: 30
-            container_image: tensorrt-llm/release:1.3.0rc5
+            container_image: tensorrt-llm/release:1.3.0rc10
     runs-on: linux-amd64-gpu-rtxpro6000-latest-1
     timeout-minutes: ${{ matrix.timeout }}
     container: &gpu_container
       image: nvcr.io/nvidia/${{ matrix.container_image }}
       env:
-        GIT_DEPTH: 1000 # For correct version for tests/gpu/torch/quantization/plugins/test_megatron.py
+        GIT_DEPTH: 1000 # For correct version
         PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages
         HF_TOKEN: ${{ secrets.HF_TOKEN }}
     steps: &gpu_steps
 
@@ -38,7 +38,7 @@ jobs:
       - uses: actions/checkout@v6
       - uses: ./.github/actions/ubuntu-setup
       - name: Run unit tests
-        run: pip install tox && COV_ARGS="--cov" tox -e py312-torch210-tf_latest-unit
+        run: pip install tox && COV_ARGS="--cov" tox -e py312-torch211-tf_latest-unit
       - name: Upload coverage reports to Codecov
         uses: codecov/codecov-action@v5
         with:
@@ -65,6 +65,7 @@ jobs:
     runs-on: ubuntu-latest
     timeout-minutes: 30
     strategy:
+      fail-fast: false
       matrix:
         py: [10, 11, 13]
     steps:
@@ -73,15 +74,16 @@ jobs:
         with:
           python-version: "3.${{ matrix.py }}"
       - name: Run unit tests
-        run: pip install tox && tox -e py3${{ matrix.py }}-torch210-tf_latest-unit
+        run: pip install tox && tox -e py3${{ matrix.py }}-torch211-tf_latest-unit
   multi-torch:
     if: github.event_name == 'pull_request'
     needs: [linux]
     runs-on: ubuntu-latest
     timeout-minutes: 30
     strategy:
+      fail-fast: false
       matrix:
-        torch: [26, 27, 28, 29]
+        torch: [28, 29, 210]
     steps:
       - uses: actions/checkout@v6
       - uses: ./.github/actions/ubuntu-setup
@@ -93,13 +95,14 @@ jobs:
     runs-on: ubuntu-latest
     timeout-minutes: 30
     strategy:
+      fail-fast: false
       matrix:
         tf: [min]
     steps:
       - uses: actions/checkout@v6
       - uses: ./.github/actions/ubuntu-setup
       - name: Run unit tests
-        run: pip install tox && tox -e py312-torch210-tf_${{ matrix.tf }}-unit
+        run: pip install tox && tox -e py312-torch211-tf_${{ matrix.tf }}-unit
   launcher:
     if: github.event_name == 'pull_request'
     needs: [linux]
@@ -123,6 +126,7 @@ jobs:
     runs-on: ubuntu-latest
     timeout-minutes: 30
     strategy:
+      fail-fast: false
       matrix:
         test-env: [onnx, torch]
     steps: