Commit c296e30

docs: document inter-node MPI fix (FI_TCP_IFACE) and dash3 renderD128 permission issue
1 parent faa9bbb commit c296e30

1 file changed

Lines changed: 68 additions & 0 deletions

docs/documentation/intel-gpu-max.md

@@ -309,6 +309,74 @@ allocation cannot open `/dev/dri/renderD128`. Always request the GPU resource:
Without `--gres`, `omp_get_num_devices()` returns 0 and the process aborts with
integer divide-by-zero in `s_initialize_mpi_domain` (rank % num_devices with 0 devices).

**Per-node renderD128 permissions on CRNCH**: On `dash4`, `/dev/dri/renderD128` has
mode `crwxrwxrwx` (world-accessible); on `dash3` it has mode `crw-rw----` (render
group only). With the current SLURM configuration, `--gres=gpu:max_1100:1` does NOT
grant cgroup access to the device on dash3, so `omp_get_num_devices()` returns 0
there even within a SLURM GPU allocation. Contact the CRNCH admin to either fix the
device permissions on dash3 or configure SLURM device cgroups to grant renderD128
access for GPU jobs. Until this is fixed, 2-node GPU simulation using dash3+dash4 is
not possible.

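To tell which of the two permission states a node is in, the device node can be tested directly from inside an allocation. The helper below is a minimal sketch and not part of the documented setup: `check_render_node` is a hypothetical name, and it only tests POSIX read/write permission for the current user. That is a necessary condition, but a SLURM device cgroup may still block the actual `open()`.

```bash
# Hypothetical helper: report whether the current user can access a DRM render node.
# Only POSIX permission bits are tested; a device cgroup may still deny open().
check_render_node() {
    local dev="${1:-/dev/dri/renderD128}"
    if [ ! -e "$dev" ]; then
        echo "missing: $dev"
        return 2
    fi
    if [ -r "$dev" ] && [ -w "$dev" ]; then
        echo "accessible: $dev"
        return 0
    fi
    # Show the mode string and the caller's groups to make the cause visible.
    echo "denied: $dev (mode $(stat -c '%A' "$dev"), groups: $(id -nG))"
    return 1
}
```

Running this on each node inside a `--gres=gpu:max_1100:1` allocation gives the admin a concrete mode string and group list to work from.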
**Inter-node MPI: FI_TCP_IFACE must be set dynamically**: The CRNCH dash nodes
have multiple network interfaces (high-speed 10GbE at `10.10.10.x`, public 1GbE
at `143.215.138.x/25`). Intel MPI's OFI tcp provider selects the highest-speed
interface by default. On dash3, this picks `enp200s0f1np1` (10.10.10.32), which
has no corresponding active interface on dash4. This causes the inter-node MPI
broadcast to hang silently after `MPI_Init` succeeds.

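To see this mismatch on a given node, it can help to list each interface with its IPv4 address and compare the output between dash3 and dash4. The function below is an illustrative sketch (the name `list_ipv4_ifaces` is an assumption). It reads `ip -o -4 addr show` output on stdin, so the parsing can be checked without access to the cluster.

```bash
# Hypothetical helper: print "<interface> <address/prefix>" for each IPv4 address.
# Expects the one-line-per-address format produced by `ip -o -4 addr show`.
list_ipv4_ifaces() {
    awk '$3 == "inet" { print $2, $4 }'
}

# Typical use on a node:
#   ip -o -4 addr show | list_ipv4_ifaces
```

An interface whose subnet appears on one node but not on the other is a candidate for the kind of hang described here.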
Fix: set `FI_TCP_IFACE` to the name of the interface with the public IP (which
is accessible from all nodes). The interface name differs per node, so set it
dynamically in each rank's startup script:

```bash
IFACE=$(ip -o addr show | awk '/143\.215\.138\.[0-9]+\// {print $2; exit}')
export FI_TCP_IFACE="${IFACE}"
```

This selects `enp3s0f0` on dash3 and `enp3s0f0np0` on dash4. Combined with
`srun --mpi=pmi2` for SLURM-native MPI bootstrap (avoiding Intel MPI hydra/SSH),
this enables successful inter-node MPI communication.

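If no address on the public subnet is found, the two-line snippet above leaves `FI_TCP_IFACE` empty, and the job may then fall back to the provider's default choice and hang as before. A more defensive variant, sketched below on the assumption that failing fast is preferable, refuses to start the program when nothing matches. `run_with_public_iface` is a hypothetical name, not something Intel MPI or SLURM provides.

```bash
# Hypothetical wrapper: resolve this node's interface on the public subnet,
# then run the given command with FI_TCP_IFACE set; fail if nothing matches.
run_with_public_iface() {
    local iface
    iface=$(ip -o addr show | awk '/143\.215\.138\.[0-9]+\// {print $2; exit}')
    if [ -z "$iface" ]; then
        echo "no interface with a 143.215.138.x address on ${HOSTNAME:-this node}" >&2
        return 1
    fi
    # The assignment applies only to the command being launched.
    FI_TCP_IFACE="$iface" "$@"
}

# Example (the path is a placeholder):
#   run_with_public_iface /path/to/bin/simulation
```

Because the command is passed as arguments, the same wrapper could serve both the pre-process and simulation steps.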
**Recommended 2-node run script pattern** (for when dash3's GPU access is fixed):

```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH -p rg-nextgen-hpc
#SBATCH -w dash3,dash4
#SBATCH --gres=gpu:max_1100:1
#SBATCH --time=01:00:00

INTEL=/net/projects/tools/x86_64/rhel-8/intel-oneapi/2025.1
export PATH=${INTEL}/compiler/2025.0/bin:${INTEL}/mpi/2021.14/bin:${PATH}
export LD_LIBRARY_PATH=${INTEL}/mkl/2025.0/lib:${INTEL}/compiler/2025.0/lib:${INTEL}/2025.0/lib:${INTEL}/mpi/2021.14/lib:${INTEL}/mpi/2021.14/libfabric/lib:${LD_LIBRARY_PATH}
export FI_PROVIDER_PATH=${INTEL}/mpi/2021.14/libfabric/lib/prov
export I_MPI_FABRICS="shm:ofi"
export FI_PROVIDER=tcp

cd /path/to/case || exit 1

# Every node must be able to read the wrapper. mktemp defaults to /tmp, which is
# usually node-local, so create it in the case directory instead (assumed here
# to be on a shared filesystem).
WRAP_SCRIPT=$(mktemp -p "$PWD" wrap.XXXXXX)

# Step 1: pre-process
cat > "$WRAP_SCRIPT" << 'EOF'
#!/bin/bash
# Pick this node's interface on the public 143.215.138.x subnet.
IFACE=$(ip -o addr show | awk '/143\.215\.138\.[0-9]+\// {print $2; exit}')
export FI_TCP_IFACE="$IFACE"
exec /path/to/build/install/<hash>/bin/pre_process
EOF
chmod +x "$WRAP_SCRIPT"
srun --mpi=pmi2 -n 2 --ntasks-per-node=1 "$WRAP_SCRIPT"

# Step 2: simulation (the rewritten wrapper keeps its executable bit)
cat > "$WRAP_SCRIPT" << 'EOF'
#!/bin/bash
IFACE=$(ip -o addr show | awk '/143\.215\.138\.[0-9]+\// {print $2; exit}')
export FI_TCP_IFACE="$IFACE"
exec /path/to/build/install/<hash>/bin/simulation
EOF
srun --mpi=pmi2 -n 2 --ntasks-per-node=1 "$WRAP_SCRIPT"
rm "$WRAP_SCRIPT"
```

### `libumf.so.1` not found at runtime
The 2026.0 Level Zero and OpenCL UR adapters link against `libumf.so.1`.
If not in `LD_LIBRARY_PATH`, all adapters fail silently and sycl-ls reports
