
Commit 0f28f4c

Edwardf0t1 and claude committed
Address mxinO review: move common concerns out of PTQ-specific docs
- Move NFS root_squash section to common/slurm-setup.md (not PTQ-specific)
- Remove offline compute nodes section; replace with gated dataset HF_TOKEN note
- Compact FP8 rule in SKILL.md to core principle, defer to unsupported-models.md
- Promote `docker run --user` as preferred fix over chmod (CodeRabbit feedback)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent 8bcfe75 commit 0f28f4c

3 files changed

Lines changed: 33 additions & 64 deletions


.claude/skills/common/slurm-setup.md

Lines changed: 26 additions & 1 deletion
```diff
@@ -107,7 +107,7 @@ docker run --rm \
 - Mount paths with `-v` instead of `--container-mounts`
 - Pass env vars with `-e` instead of relying on SLURM env propagation
 - Use the two-script pattern: SLURM wrapper (sbatch directives + `docker run`) and inner runner (the actual work). The inner runner should unset SLURM env vars and set `HF_HOME`/`HF_DATASETS_OFFLINE` as needed
-- **NFS root_squash**: Docker runs as root by default, which NFS squashes to `nobody`. Run `chmod -R a+rwX` on all output/cache directories before submitting, or use `--user $(id -u):$(id -g)` in the `docker run` command
+- **NFS root_squash**: see section 5
 
 **How to detect which pattern to use**: Ask the user how they normally run containers, or check:
```

````diff
@@ -168,3 +168,28 @@ srun \
 ```
 
 Adjust `--nodes`, `--gpus-per-node`, and the distributed launch command per your workload.
+
+---
+
+## 5. NFS root_squash and Docker Permissions
+
+Docker containers typically run as root, but NFS filesystems with `root_squash` (the default) map root to `nobody`, blocking writes to directories owned by the user. This causes `PermissionError` when creating cache lock files, writing output, or saving logs.
+
+This affects both pyxis/enroot (`srun --container-image`) and plain `docker run` workflows.
+
+**Preferred fix** — run Docker with the host user's UID/GID to match NFS ownership:
+
+```bash
+docker run --user $(id -u):$(id -g) ...
+```
+
+> Note: `--user` may cause issues if the container expects root for package installation. In that case, fall back to the chmod approach below.
+
+**Fallback fix** — open permissions before submitting the job:
+
+```bash
+chmod -R a+rwX /path/to/workspace/
+chmod -R a+rwX /path/to/.hf_cache/
+```
+
+Scope `chmod` to only the directories the job needs — avoid world-writable paths on shared clusters.
````
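The two fixes added in this hunk can be sketched as a small pre-submit script. This is an illustrative sketch, not part of the commit: the workspace path and image name are placeholders.

```shell
#!/bin/sh
# Placeholder workspace path; on a real cluster this would be an NFS mount.
WORKSPACE=/tmp/ptq_workspace
mkdir -p "$WORKSPACE"

# Fallback fix: open permissions so a root-owned container process
# (squashed to nobody by NFS) can still write. Scoped to the job dir only.
chmod -R a+rwX "$WORKSPACE"

# Preferred fix: run the container as the host user so NFS sees the normal
# UID/GID. Printed here rather than executed; "my-image" is a placeholder.
echo "docker run --rm --user $(id -u):$(id -g) -v $WORKSPACE:/work my-image"
```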

.claude/skills/ptq/SKILL.md

Lines changed: 4 additions & 4 deletions
```diff
@@ -119,14 +119,14 @@ Report the path and size to the user.
 - Call `mto.enable_huggingface_checkpointing()` **before** quantization
 - Wildcard `*gate*` matches too broadly — use `*mlp.gate*` or `*router*`
 - VLMs: `hf_ptq.py` auto-extracts the language model via `extract_and_prepare_language_model_from_vl()` — no manual VLM handling needed in most cases
-- FP8 checkpoints: ModelOpt's `_QuantFP8Linear` plugin (in `modelopt/torch/quantization/plugins/huggingface.py`) handles `FP8Linear` modules automatically — it keeps weights compact in FP8 and dequantizes lazily during calibration. Do **not** use `FineGrainedFP8Config(dequantize=True)` as it expands the entire model to BF16 upfront, wasting ~2x memory
+- FP8 checkpoints: prefer `_QuantFP8Linear` (lazy dequant) over `FineGrainedFP8Config(dequantize=True)` which wastes ~2x memory. See `references/unsupported-models.md` for details
 - Custom quantizer names must end with `_input_quantizer` or `_weight_quantizer`
 
 ## Common Pitfalls
 
 - **Transformers version**: Newer models (e.g., Devstral/ministral3) may require a transformers version not yet in the container. Check `config.json` for `transformers_version` and upgrade if needed. Install ModelOpt first, then upgrade transformers **with** deps (not `--no-deps`) to pull compatible `huggingface_hub`
-- **SLURM compute nodes offline**: Many clusters block internet from compute nodes. Pre-cache calibration datasets on the login node and use `HF_HOME=<cache_path> HF_DATASETS_OFFLINE=1` in the job script. See `references/slurm-setup-ptq.md` section 4
-- **NFS root_squash + Docker**: Docker runs as root, but NFS squashes root to `nobody`. Run `chmod -R a+rwX` on workspace/cache directories before submitting jobs, or use `docker run --user $(id -u):$(id -g)`. See `references/slurm-setup-ptq.md` section 5
+- **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
+- **NFS root_squash + Docker**: Docker runs as root, but NFS squashes root to `nobody`. Use `docker run --user $(id -u):$(id -g)`, or `chmod -R a+rwX` on needed directories as a fallback. See `skills/common/slurm-setup.md` section 5
 
 ## References
 
@@ -139,7 +139,7 @@ Report the path and size to the user.
 | `references/unsupported-models.md` | Step 4C only (unlisted model) |
 | `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
 | `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
-| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, FSDP2) |
+| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
 | `examples/llm_ptq/README.md` | Step 3: support matrix, CLI flags, accuracy |
 | `modelopt/torch/quantization/config.py` | Step 3: format definitions |
 | `modelopt/torch/export/model_utils.py` | Step 4C: TRT-LLM export type mapping |
```
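The gated-datasets pitfall introduced in this file can be handled mechanically in a job script. A minimal sketch, with a hypothetical helper name (`pick_dataset`) and dataset flags taken from the diff:

```shell
#!/bin/sh
# Choose the gated calibration dataset only when an HF token is available;
# otherwise fall back to the non-gated cnn_dailymail, per the pitfall above.
pick_dataset() {
  if [ -n "${1:-}" ]; then
    echo "--dataset nemotron-post-training-dataset-v2"
  else
    echo "--dataset cnn_dailymail"
  fi
}

pick_dataset ""          # no token: non-gated fallback
pick_dataset "hf_demo"   # token present: gated dataset is usable
```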

.claude/skills/ptq/references/slurm-setup-ptq.md

Lines changed: 3 additions & 59 deletions
Original file line numberDiff line numberDiff line change
@@ -71,63 +71,7 @@ Only submit the full calibration job after the smoke test exits cleanly.
7171

7272
---
7373

74-
## 4. Dataset Caching for Offline Compute Nodes
74+
## 4. PTQ-Specific Notes
7575

76-
Many SLURM clusters block internet access from compute nodes. Calibration datasets (e.g., `cnn_dailymail`, `nemotron-post-training-dataset-v2`) must be pre-cached on the **login node** (which has internet), then accessed offline from jobs.
77-
78-
**Pre-cache on the login node:**
79-
80-
```bash
81-
# Install datasets library if not available
82-
pip install --user datasets huggingface_hub
83-
84-
# Download to a shared filesystem path
85-
HF_HOME=/path/to/shared/.hf_cache python3 -c "
86-
from datasets import load_dataset
87-
load_dataset('abisee/cnn_dailymail', '3.0.0', split='train', streaming=False)
88-
print('cnn_dailymail cached')
89-
"
90-
```
91-
92-
> **Gated datasets** (e.g., `nvidia/Nemotron-Post-Training-Dataset-v2`) require HF authentication. Either set `HF_TOKEN` before downloading, or use `--dataset cnn_dailymail` to skip the gated dataset.
93-
94-
**Fix permissions** (required for Docker — see section 5):
95-
96-
```bash
97-
chmod -R a+rwX /path/to/shared/.hf_cache/
98-
```
99-
100-
**Use in the job script:**
101-
102-
```bash
103-
export HF_HOME="/path/to/shared/.hf_cache"
104-
export HF_DATASETS_OFFLINE=1
105-
export HF_HUB_OFFLINE=1
106-
```
107-
108-
Then pass `--dataset cnn_dailymail` to `hf_ptq.py` to avoid attempting to download uncached datasets.
109-
110-
---
111-
112-
## 5. NFS root_squash and Docker Permissions
113-
114-
Docker containers typically run as root, but NFS filesystems with `root_squash` (the default) map root to `nobody`, blocking writes to directories owned by the user. This causes `PermissionError` when:
115-
116-
- Creating dataset cache lock files
117-
- Writing quantized checkpoint output
118-
- Saving quant summaries or logs
119-
120-
**Fix**: run `chmod -R a+rwX` on all directories the job will write to, **before** submitting the job:
121-
122-
```bash
123-
chmod -R a+rwX /path/to/workspace/
124-
chmod -R a+rwX /path/to/.hf_cache/
125-
```
126-
127-
Alternatively, run Docker with the host user's UID/GID to match NFS ownership:
128-
129-
```bash
130-
docker run --user $(id -u):$(id -g) ...
131-
```
132-
133-
> Note: `--user` may cause issues if the container expects root for package installation. In that case, prefer the `chmod` approach.
76+
- **Gated datasets**: Some calibration datasets (e.g., `nvidia/Nemotron-Post-Training-Dataset-v2`) require HF authentication. Set `HF_TOKEN` in the job environment, or use `--dataset cnn_dailymail` to use a non-gated alternative.
77+
- **NFS permissions**: Docker + NFS root_squash causes `PermissionError` on output/cache dirs. See `skills/common/slurm-setup.md` section 5 for fixes.
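The new HF_TOKEN note can be satisfied with a couple of lines in the SLURM job script. A sketch only; the token value is a placeholder supplied by the user:

```shell
#!/bin/sh
# Export HF_TOKEN so the containerized process can authenticate to
# Hugging Face for gated datasets. The value here is a placeholder.
export HF_TOKEN="hf_placeholder_token"

# It would then be forwarded into the container, for example:
#   docker run -e HF_TOKEN ...
echo "${HF_TOKEN:+HF_TOKEN is set}"
```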
