
Commit 0f28f4c

Edwardf0t1 and claude committed
Address mxinO review: move common concerns out of PTQ-specific docs
- Move NFS root_squash section to common/slurm-setup.md (not PTQ-specific)
- Remove offline compute nodes section; replace with gated dataset HF_TOKEN note
- Compact FP8 rule in SKILL.md to core principle, defer to unsupported-models.md
- Promote `docker run --user` as preferred fix over chmod (CodeRabbit feedback)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent 8bcfe75 commit 0f28f4c

3 files changed

Lines changed: 33 additions & 64 deletions


.claude/skills/common/slurm-setup.md

Lines changed: 26 additions & 1 deletion
```diff
@@ -107,7 +107,7 @@ docker run --rm \
 - Mount paths with `-v` instead of `--container-mounts`
 - Pass env vars with `-e` instead of relying on SLURM env propagation
 - Use the two-script pattern: SLURM wrapper (sbatch directives + `docker run`) and inner runner (the actual work). The inner runner should unset SLURM env vars and set `HF_HOME`/`HF_DATASETS_OFFLINE` as needed
-- **NFS root_squash**: Docker runs as root by default, which NFS squashes to `nobody`. Run `chmod -R a+rwX` on all output/cache directories before submitting, or use `--user $(id -u):$(id -g)` in the `docker run` command
+- **NFS root_squash**: see section 5
 
 **How to detect which pattern to use**: Ask the user how they normally run containers, or check:
```

````diff
@@ -168,3 +168,28 @@ srun \
 ```
 
 Adjust `--nodes`, `--gpus-per-node`, and the distributed launch command per your workload.
+
+---
+
+## 5. NFS root_squash and Docker Permissions
+
+Docker containers typically run as root, but NFS filesystems with `root_squash` (the default) map root to `nobody`, blocking writes to directories owned by the user. This causes `PermissionError` when creating cache lock files, writing output, or saving logs.
+
+This affects both pyxis/enroot (`srun --container-image`) and plain `docker run` workflows.
+
+**Preferred fix** — run Docker with the host user's UID/GID to match NFS ownership:
+
+```bash
+docker run --user $(id -u):$(id -g) ...
+```
+
+> Note: `--user` may cause issues if the container expects root for package installation. In that case, fall back to the chmod approach below.
+
+**Fallback fix** — open permissions before submitting the job:
+
+```bash
+chmod -R a+rwX /path/to/workspace/
+chmod -R a+rwX /path/to/.hf_cache/
+```
+
+Scope `chmod` to only the directories the job needs — avoid world-writable paths on shared clusters.
````
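The two fixes added in this hunk can be sketched as a small pre-submit script. This is an illustrative sketch, not part of the commit: the workspace path and image name are placeholders.

```shell
#!/bin/sh
# Placeholder workspace path; on a real cluster this would be an NFS mount.
WORKSPACE=/tmp/ptq_workspace
mkdir -p "$WORKSPACE"

# Fallback fix: open permissions so a root-owned container process
# (squashed to nobody by NFS) can still write. Scoped to the job dir only.
chmod -R a+rwX "$WORKSPACE"

# Preferred fix: run the container as the host user so NFS sees the normal
# UID/GID. Printed here rather than executed; "my-image" is a placeholder.
echo "docker run --rm --user $(id -u):$(id -g) -v $WORKSPACE:/work my-image"
```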

.claude/skills/ptq/SKILL.md

Lines changed: 4 additions & 4 deletions
```diff
@@ -119,14 +119,14 @@ Report the path and size to the user.
 - Call `mto.enable_huggingface_checkpointing()` **before** quantization
 - Wildcard `*gate*` matches too broadly — use `*mlp.gate*` or `*router*`
 - VLMs: `hf_ptq.py` auto-extracts the language model via `extract_and_prepare_language_model_from_vl()` — no manual VLM handling needed in most cases
-- FP8 checkpoints: ModelOpt's `_QuantFP8Linear` plugin (in `modelopt/torch/quantization/plugins/huggingface.py`) handles `FP8Linear` modules automatically — it keeps weights compact in FP8 and dequantizes lazily during calibration. Do **not** use `FineGrainedFP8Config(dequantize=True)` as it expands the entire model to BF16 upfront, wasting ~2x memory
+- FP8 checkpoints: prefer `_QuantFP8Linear` (lazy dequant) over `FineGrainedFP8Config(dequantize=True)` which wastes ~2x memory. See `references/unsupported-models.md` for details
 - Custom quantizer names must end with `_input_quantizer` or `_weight_quantizer`
 
 ## Common Pitfalls
 
 - **Transformers version**: Newer models (e.g., Devstral/ministral3) may require a transformers version not yet in the container. Check `config.json` for `transformers_version` and upgrade if needed. Install ModelOpt first, then upgrade transformers **with** deps (not `--no-deps`) to pull compatible `huggingface_hub`
-- **SLURM compute nodes offline**: Many clusters block internet from compute nodes. Pre-cache calibration datasets on the login node and use `HF_HOME=<cache_path> HF_DATASETS_OFFLINE=1` in the job script. See `references/slurm-setup-ptq.md` section 4
-- **NFS root_squash + Docker**: Docker runs as root, but NFS squashes root to `nobody`. Run `chmod -R a+rwX` on workspace/cache directories before submitting jobs, or use `docker run --user $(id -u):$(id -g)`. See `references/slurm-setup-ptq.md` section 5
+- **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
+- **NFS root_squash + Docker**: Docker runs as root, but NFS squashes root to `nobody`. Use `docker run --user $(id -u):$(id -g)`, or `chmod -R a+rwX` on needed directories as a fallback. See `skills/common/slurm-setup.md` section 5
 
 ## References
 
@@ -139,7 +139,7 @@ Report the path and size to the user.
 | `references/unsupported-models.md` | Step 4C only (unlisted model) |
 | `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
 | `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
-| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, FSDP2) |
+| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
 | `examples/llm_ptq/README.md` | Step 3: support matrix, CLI flags, accuracy |
 | `modelopt/torch/quantization/config.py` | Step 3: format definitions |
 | `modelopt/torch/export/model_utils.py` | Step 4C: TRT-LLM export type mapping |
```
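The gated-datasets pitfall introduced in this file can be handled mechanically in a job script. A minimal sketch, with a hypothetical helper name (`pick_dataset`) and dataset flags taken from the diff:

```shell
#!/bin/sh
# Choose the gated calibration dataset only when an HF token is available;
# otherwise fall back to the non-gated cnn_dailymail, per the pitfall above.
pick_dataset() {
  if [ -n "${1:-}" ]; then
    echo "--dataset nemotron-post-training-dataset-v2"
  else
    echo "--dataset cnn_dailymail"
  fi
}

pick_dataset ""          # no token: non-gated fallback
pick_dataset "hf_demo"   # token present: gated dataset is usable
```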

.claude/skills/ptq/references/slurm-setup-ptq.md

Lines changed: 3 additions & 59 deletions
Original file line numberDiff line numberDiff line change
@@ -71,63 +71,7 @@ Only submit the full calibration job after the smoke test exits cleanly.
7171

7272
---
7373

74-
## 4. Dataset Caching for Offline Compute Nodes
74+
## 4. PTQ-Specific Notes
7575

76-
Many SLURM clusters block internet access from compute nodes. Calibration datasets (e.g., `cnn_dailymail`, `nemotron-post-training-dataset-v2`) must be pre-cached on the **login node** (which has internet), then accessed offline from jobs.
77-
78-
**Pre-cache on the login node:**
79-
80-
```bash
81-
# Install datasets library if not available
82-
pip install --user datasets huggingface_hub
83-
84-
# Download to a shared filesystem path
85-
HF_HOME=/path/to/shared/.hf_cache python3 -c "
86-
from datasets import load_dataset
87-
load_dataset('abisee/cnn_dailymail', '3.0.0', split='train', streaming=False)
88-
print('cnn_dailymail cached')
89-
"
90-
```
91-
92-
> **Gated datasets** (e.g., `nvidia/Nemotron-Post-Training-Dataset-v2`) require HF authentication. Either set `HF_TOKEN` before downloading, or use `--dataset cnn_dailymail` to skip the gated dataset.
93-
94-
**Fix permissions** (required for Docker — see section 5):
95-
96-
```bash
97-
chmod -R a+rwX /path/to/shared/.hf_cache/
98-
```
99-
100-
**Use in the job script:**
101-
102-
```bash
103-
export HF_HOME="/path/to/shared/.hf_cache"
104-
export HF_DATASETS_OFFLINE=1
105-
export HF_HUB_OFFLINE=1
106-
```
107-
108-
Then pass `--dataset cnn_dailymail` to `hf_ptq.py` to avoid attempting to download uncached datasets.
109-
110-
---
111-
112-
## 5. NFS root_squash and Docker Permissions
113-
114-
Docker containers typically run as root, but NFS filesystems with `root_squash` (the default) map root to `nobody`, blocking writes to directories owned by the user. This causes `PermissionError` when:
115-
116-
- Creating dataset cache lock files
117-
- Writing quantized checkpoint output
118-
- Saving quant summaries or logs
119-
120-
**Fix**: run `chmod -R a+rwX` on all directories the job will write to, **before** submitting the job:
121-
122-
```bash
123-
chmod -R a+rwX /path/to/workspace/
124-
chmod -R a+rwX /path/to/.hf_cache/
125-
```
126-
127-
Alternatively, run Docker with the host user's UID/GID to match NFS ownership:
128-
129-
```bash
130-
docker run --user $(id -u):$(id -g) ...
131-
```
132-
133-
> Note: `--user` may cause issues if the container expects root for package installation. In that case, prefer the `chmod` approach.
76+
- **Gated datasets**: Some calibration datasets (e.g., `nvidia/Nemotron-Post-Training-Dataset-v2`) require HF authentication. Set `HF_TOKEN` in the job environment, or use `--dataset cnn_dailymail` to use a non-gated alternative.
77+
- **NFS permissions**: Docker + NFS root_squash causes `PermissionError` on output/cache dirs. See `skills/common/slurm-setup.md` section 5 for fixes.
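The new HF_TOKEN note can be satisfied with a couple of lines in the SLURM job script. A sketch only; the token value is a placeholder supplied by the user:

```shell
#!/bin/sh
# Export HF_TOKEN so the containerized process can authenticate to
# Hugging Face for gated datasets. The value here is a placeholder.
export HF_TOKEN="hf_placeholder_token"

# It would then be forwarded into the container, for example:
#   docker run -e HF_TOKEN ...
echo "${HF_TOKEN:+HF_TOKEN is set}"
```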
