Detect MPI with Singularity #2216

Open
kinow wants to merge 1 commit into main from detect-mpi-with-singularity

Conversation

@kinow
Member

@kinow kinow commented Mar 6, 2026

Related to #2208 (not sure if this is enough to close that issue)

@mr-c, I started some tests yesterday and completed them today after work. Once I managed to reproduce on my machine the error we see on HPCs, debugging showed that the changes needed were fairly simple.

I managed to keep the --contain, but in order to do that I had to include /dev/shm. I couldn't find any files in /tmp or in the temporary directory created by cwltool. And actually, cwltool worked without having to remove that temporary directory (i.e. I was wrong: the process manager wasn't using the temp dir to sync with the processes, but apparently shared memory).

I kept the --cleanenv for jobs that aren't using the cwltool MPIRequirement. When MPIRequirement is declared as a requirement (not a hint) and is used by the job, I do not set --cleanenv.

I tried keeping it, and even tested the MPI config file using the variable patterns from the MPICH documentation:

runner: mpirun.mpich
extra_flags: [
# "likwid-perfctr",
# "-C", "L:N:0",
# "-g", "FLOPS_DP",
# "-o", "/output/path/likwid_%j_%h_%r.json"
]
nproc_flag: -n
env_pass_regex: [
  "MPIEXEC_.*",
  "HYDRA_.*",
  "SMPD_.*",
  "MPICH_.*",
  "DCMF_.*",
  "MPIO_.*",
  "RMS_.*",
  "MPITEST_.*",
  "PMI_*",
  "MPI_*"
  #  "SLURM_.*"
]

(The MPI config file was needed on my laptop because I have both OpenMPI and MPICH: mpirun defaults to OpenMPI on my machine, while the container I'm using has MPICH, so I need to override the runner. I also confirmed with the debugger that the MPI config file values were picked up.)
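
For reference, this is roughly how I'd expect env_pass and env_pass_regex to select host variables. The helper below is only an illustrative Python sketch (hypothetical name, not cwltool's actual code):

import os
import re
from typing import Dict, List

def select_passed_env(env_pass: List[str], env_pass_regex: List[str]) -> Dict[str, str]:
    # Illustrative only: pick host environment variables that are either listed
    # explicitly in env_pass or whose names match one of the env_pass_regex patterns.
    selected = {name: os.environ[name] for name in env_pass if name in os.environ}
    patterns = [re.compile(p) for p in env_pass_regex]
    for name, value in os.environ.items():
        if any(p.fullmatch(name) for p in patterns):
            selected[name] = value
    return selected

# With the config above, this selects nothing at the time cwltool builds the job,
# because the MPICH/Hydra variables only exist after the launcher has started.
passed = select_passed_env([], ["MPIEXEC_.*", "HYDRA_.*", "PMI_.*", "MPI_.*"])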

The issue with the environment variables, @mr-c, is that when cwltool runs, none of the MPI environment variables exist yet.

When we create the CWL job and the runtime config, and even when we pass through the env vars from mpi.py, there are no MPI variables set yet.

The Edinburgh paper discusses how to use some variables, but I believe those are only the variables that are already available, for instance when you have a Slurm allocation and SLURM_JOB_ID, SLURM_MEM_PER_CPU, etc. are already set.

It's only when the CWL job is executed, with the runtime (command + args) already built, and Python launches mpirun.mpich, that MPICH (or Hydra) creates those environment variables:

(venv) kinow@ranma:~/Development/python/workspace/cwltool$ env | grep MPI
(venv) kinow@ranma:~/Development/python/workspace/cwltool$ mpirun.mpich -n 1 env | grep MPI
MPIR_CVAR_CH3_INTERFACE_HOSTNAME=ranma
MPI_LOCALNRANKS=1
MPI_LOCALRANKID=0
(venv) kinow@ranma:~/Development/python/workspace/cwltool$ mpirun.mpich -n 1 env | grep PMI
PMI_RANK=0
PMI_FD=9
PMI_SIZE=1

I tried setting these variables in the command line of singularity without success. This is what I did:

$ mpirun.mpich -n 1 env | sort > /tmp/mpirun.txt
$ env | sort > /tmp/run.txt
$ meld /tmp/run.txt /tmp/mpirun.txt

That gave me the list of variables that appeared only in the mpirun environment. There were 8 extra variables:

  • GFORTRAN_UNBUFFERED_PRECONNECTED=y
  • HYDI_CONTROL_FD=7
  • MPIR_CVAR_CH3_INTERFACE_HOSTNAME=ranma
  • MPI_LOCALNRANKS=1
  • MPI_LOCALRANKID=0
  • PMI_FD=9
  • PMI_RANK=0
  • PMI_SIZE=1

I added them to the Singularity command with just their names, to see if it would pass the host env vars through to the container (as it wouldn't make sense to hard-code random file descriptors, ranks, etc.).

diff --git a/cwltool/singularity.py b/cwltool/singularity.py
index 8e391d64..128df6e2 100644
--- a/cwltool/singularity.py
+++ b/cwltool/singularity.py
@@ -601,6 +601,12 @@ class SingularityCommandLineJob(ContainerCommandLineJob):
         if not mpi_req or not is_req:
             runtime.append("--cleanenv")
         else:
+            runtime.append("--cleanenv")
+            for var in ["PMI_RANK", "PMI_FD", "PMI_SIZE",
+                        "MPI_LOCALNRANKS", "MPI_LOCALRANKID", "MPIR_CVAR_CH3_INTERFACE_HOSTNAME",
+                        "HYDI_CONTROL_FD", "GFORTRAN_UNBUFFERED_PRECONNECTED"]:
+                runtime.append("--env")
+                runtime.append(var)
             self.append_volume(
                 runtime,
                 runtime_context.create_tmpdir(),

The resulting command has the --cleanenv and all the environment variables from the diff:

INFO [job run] /tmp/hib1sp8z$ mpirun.mpich \
    -n \
    2 \
    singularity \
    --quiet \
    run \
    --ipc \
    --contain \
    --cleanenv \
    --env \
    PMI_RANK \
    --env \
    PMI_FD \
    --env \
    PMI_SIZE \
    --env \
    MPI_LOCALNRANKS \
    --env \
    MPI_LOCALRANKID \
    --env \
    MPIR_CVAR_CH3_INTERFACE_HOSTNAME \
    --env \
    HYDI_CONTROL_FD \
    --env \
    GFORTRAN_UNBUFFERED_PRECONNECTED \
    --mount=type=bind,source=/tmp/g7lo1zku,target=/dev/shm \
    --no-eval \
    --userns \
    --home \
    /tmp/hib1sp8z:/PCIKJh \
    --mount=type=bind,source=/tmp/ub89clkk,target=/tmp \
    --mount=type=bind,source=/tmp/h8j2gy1j/a.out,target=/var/lib/cwl/stgfb9c3552-0c10-424c-82ba-284742e69287/a.out,readonly \
    --pwd \
    /PCIKJh \
    --net \
    --network \
    none \
    /home/kinow/Development/python/workspace/cwl-mpi/examples/mpich-sr/cwltool/mfisherman_mpich:4.3.2.sif \
    /var/lib/cwl/stgfb9c3552-0c10-424c-82ba-284742e69287/a.out \
    0 \
    1 > /tmp/hib1sp8z/sr.out 2> /tmp/hib1sp8z/sr.err
WARNING [job run] exited with status: 1
WARNING [job run] completed permanentFail
WARNING [step run] completed permanentFail

That failed again. And running a smaller test, it looks like only key=value is accepted:

$ ARUBA=111 mpirun.mpich -n 2 singularity run --cleanenv --env ARUBA mfisherman_mpich\:4.3.2.sif env | grep ARUBA
Error for command "run": invalid argument "ARUBA" for "--env" flag: ARUBA must be formatted as key=value

Error for command "run": invalid argument "ARUBA" for "--env" flag: ARUBA must be formatted as key=value

But since I don't know the file descriptor, rank, etc., I can't see a way to define PMI_RANK, PMI_SIZE, PMI_FD, and other variables for MPICH.

@mr-c I did not write tests as I wanted to check with you first if this is going in the right direction. I'll run a couple of tests on MN5 and CESGA FT3. My main concern is --contain and InfiniBand. Even without --cleanenv, I think --contain may force the container to use ethernet instead of InfiniBand.

@kinow kinow force-pushed the detect-mpi-with-singularity branch 2 times, most recently from d2b1e14 to bff8585 Compare March 6, 2026 20:31
@codecov

codecov Bot commented Mar 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.25%. Comparing base (8b78b90) to head (e0d9b20).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2216      +/-   ##
==========================================
+ Coverage   85.12%   85.25%   +0.13%     
==========================================
  Files          46       46              
  Lines        8591     8614      +23     
  Branches     2011     2014       +3     
==========================================
+ Hits         7313     7344      +31     
+ Misses        812      806       -6     
+ Partials      466      464       -2     


@kinow
Member Author

kinow commented Mar 6, 2026

Test successful on BSC MN5! 🎉

  • ✅ Running cwltool on the login node with Singularity
  • ✅ Running cwltool in a Slurm job with 2 CPUs, 1 node with Singularity
  • ✅ Running cwltool in a Slurm job with 4 CPUs, 2 per node, 2 nodes with Singularity
$ cwltool --singularity --enable-ext --enable-dev sr-workflow.cwl sr-workflow-job.yml
...
INFO Using local copy of Singularity image mfisherman_mpich:4.3.2.sif found in /gpfs/home/bsc/$USER/cwl/cwl-mpi/examples/mpich-sr/cwltool
INFO [job run] /scratch/tmp/x8ha_22l$ mpirun \
    -n \
    2 \
    singularity \
    --quiet \
    run \
    --ipc \
    --contain \
    --mount=type=bind,source=/scratch/tmp/r96ss6xn,target=/dev/shm \
    --no-eval \
    --userns \
    --home \
    /scratch/tmp/x8ha_22l:/hjBIvx \
    --mount=type=bind,source=/scratch/tmp/uia0ia52,target=/tmp \
    --mount=type=bind,source=/scratch/tmp/zbn2ofqt/a.out,target=/var/lib/cwl/stg29c22cda-58aa-4d8e-bfc8-d69f33e05729/a.out,readonly \
    --pwd \
    /hjBIvx \
    --net \
    --network \
    none \
    /gpfs/home/bsc/$USER/cwl/cwl-mpi/examples/mpich-sr/cwltool/mfisherman_mpich:4.3.2.sif \
    /var/lib/cwl/stg29c22cda-58aa-4d8e-bfc8-d69f33e05729/a.out \
    0 \
    1 > /scratch/tmp/x8ha_22l/sr.out 2> /scratch/tmp/x8ha_22l/sr.err
...
...
    }
}INFO Final process status is success

For posterity, if you run cwltool on an HPC using multiple nodes, but they are not sharing the same temp directory, you'll get something similar to:

slurmstepd: error: couldn't chdir to `/scratch/tmp/37435333/l0q96kcm': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/scratch/tmp/37435333/l0q96kcm': No such file or directory: going to /tmp instead
WARNING: skipping mount of /scratch/tmp/37435333/1db6cool: stat /scratch/tmp/37435333/1db6cool: no such file or directory
WARNING: skipping mount of /scratch/tmp/37435333/e9hzplsn/a.out: stat /scratch/tmp/37435333/e9hzplsn/a.out: no such file or directory
WARNING: skipping mount of /scratch/tmp/37435333/1db6cool: stat /scratch/tmp/37435333/1db6cool: no such file or directory
WARNING: skipping mount of /scratch/tmp/37435333/e9hzplsn/a.out: stat /scratch/tmp/37435333/e9hzplsn/a.out: no such file or directory
FATAL:   container creation failed: mount /scratch/tmp/37435333/1db6cool->/dev/shm error: while mounting /scratch/tmp/37435333/1db6cool: mount source /scratch/tmp/37435333/1db6cool doesn't exist
FATAL:   container creation failed: mount /scratch/tmp/37435333/1db6cool->/dev/shm error: while mounting /scratch/tmp/37435333/1db6cool: mount source /scratch/tmp/37435333/1db6cool doesn't exist

And if you run without network access:

Abort(1017768207) on node 2: 
Fatal error in internal_Init: 
Other MPI error, error stack: 
internal_Init(70)....................: 
MPI_Init(argc=0x7ffd98ef954c, argv=0x7ffd98ef9540) failed MPII_Init_thread(282)................: 
MPIR_init_comm_world(34).............: 
MPIR_Comm_commit(794)................: 
MPIR_Comm_commit_internal(579).......: 
MPID_Comm_commit_pre_hook(151).......: 
MPIDI_world_pre_init(669)............: 
MPIDI_OFI_init_world(805)............: 
MPIDI_OFI_addr_exchange_root_ctx(143): 
MPIDU_bc_allgather(112)..............: 
MPIR_Allgatherv_intra_brucks(80).....: 
MPIC_Sendrecv(259)...................: 
MPID_Isend(60).......................: 
MPIDI_isend(32)......................: 
MPIDI_NM_mpi_isend(780)..............: 
MPIDI_OFI_send_fallback(483).........: 
OFI call tsendv failed (default nic=lo: 
No such file or directory) Abort(1017768207) on node 0: 
Fatal error in internal_Init: 
Other MPI error, error stack: 
internal_Init(70)....................: 
MPI_Init(argc=0x7ffe4702f11c, argv=0x7ffe4702f110) failed MPII_Init_thread(282)................: 
MPIR_init_comm_world(34).............: 
MPIR_Comm_commit(794)................: 
MPIR_Comm_commit_internal(579).......: 
MPID_Comm_commit_pre_hook(151).......: 
MPIDI_world_pre_init(669)............: 
MPIDI_OFI_init_world(805)............: 
MPIDI_OFI_addr_exchange_root_ctx(143): 
MPIDU_bc_allgather(112)..............: 
MPIR_Allgatherv_intra_brucks(80).....: 
MPIC_Sendrecv(259)...................: 
MPID_Isend(60).......................: 
MPIDI_isend(32)......................: 
MPIDI_NM_mpi_isend(780)..............: 
MPIDI_OFI_send_fallback(483).........: 
OFI call tsendv failed (default nic=lo: 
No such file or directory)

Using a temporary directory pointing to the parallel file system (GPFS in the case of MN5) and adding the NetworkAccess requirement with networkAccess: true solved both problems.
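
In CWL that's just the standard NetworkAccess requirement, e.g.:

requirements:
  - class: NetworkAccess
    networkAccess: true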

@kinow kinow self-assigned this Mar 6, 2026
@kinow kinow requested a review from mr-c March 6, 2026 22:13
@kinow
Member Author

kinow commented Mar 6, 2026

Test successful on CESGA FT3! 🎉

  • ✅ Running cwltool on the login node with Singularity
  • ✅ Running cwltool in a Slurm job with 2 CPUs, 1 node with Singularity
  • ✅ Running cwltool in a Slurm job with 4 CPUs, 2 per node, 2 nodes with Singularity

I used a folder on the Lustre parallel filesystem, $STORE/cwl/temp, as temporary storage, loaded the singularity module, and the tests ran fine.

Comment thread cwltool/singularity.py Outdated
Comment on lines +601 to +602
if not mpi_req or not is_req:
runtime.append("--cleanenv")
Member

Thank you for this PR! Let's always use --cleanenv; to cope with the environment variable sets, I recommend creating a shell script to be executed instead of the main command line.

The set of environment variables can be derived either by parsing env_pass and env_pass_regex from the --mpi-config-file, OR by passing in a list of the existing environment variable names and only passing along those newly set by the MPI runner.

Member Author

I've added --cleanenv back, and added a wrapper script. I will need some time now to test it, but in the meantime feel free to have a look at the code here and at the new Shell script to see if we are going in the right direction :) 👍

Comment thread cwltool/singularity.py
Comment on lines +603 to +608
else:
self.append_volume(
runtime,
runtime_context.create_tmpdir(),
"/dev/shm",
writable=True,
Member

Neat trick; is this universal for all known MPI implementations? Or do some not use the shared memory device for communication between MPI processes?

Member Author

That's not defined in the MPI specification. OpenMPI uses shared memory, and MPICH also uses shared memory. Cray MPI and Intel MPI are MPICH derivatives, so they likely use it by default too. There appear to be some environment variables to disable it, but I haven't seen any case where shared memory was disabled for MPI. If you think it's useful I can add a new option to disable it in cwltool, or we can do it later if a user requests it?

@kinow kinow force-pushed the detect-mpi-with-singularity branch from 29259b3 to fcc3175 Compare April 26, 2026 17:10
Comment thread cwltool/singularity.py
runtime = [
"singularity",
mpi_req, is_req = self.builder.get_requirement(MPIRequirementName)
mpi_enabled = mpi_req and is_req
Member Author

I think is_req excludes the case where MPIRequirement is used as a hint. I'm considering MPI used/enabled only when it's a requirement, not a hint.

@kinow kinow force-pushed the detect-mpi-with-singularity branch 2 times, most recently from 603e6d2 to b09bfbd Compare April 27, 2026 20:22
Comment thread cwltool/mpi.py
self.shm_enabled = shm_enabled
# POSIX only contains functions to handle shared memory, but it does not
# specify the directory to be used, nor if a directory needs to be used
# at all -- ref: https://pubs.opengroup.org/onlinepubs/9699919799/
Member Author

Looks like FreeBSD uses objects instead of files as in Linux:

https://man.freebsd.org/cgi/man.cgi?shm_open#:~:text=The%20shm%5Fopen%28%29%20and%20shm%5Funlink%28%29%20functions%20first%20appeared%20in%20FreeBSD%204%2E3%2E%20The%20functions%20were%20reimplemented%20as%20system%20calls%20using%20shared%20memory%20objects%20directly%20rather%20than%20files%20in%20FreeBSD%208%2E0%2E

These two, shm_enabled and shm_dir, are used to control whether shared memory will be used and which directory to use, respectively.

And MPICH allows users to disable the shared memory completely, or change its size. https://github.com/pmodels/mpich/blob/main/doc/wiki/faq/Frequently_Asked_Questions.md#:~:text=The%20work,stack

So with these settings I think users are able to match the behavior of that implementation at least.


This script tests a Shell script. This script does not contribute to the
project test coverage (although kcov, or bats+kcov could be used in the
future).
Member Author

I use kcov with bats to test and track coverage of shell code in a project. It can be combined with Python and pytest reports too, but it takes some time to set everything up. So here I used a pytest test that, from my own debugging, appears to cover all the branches and lines in the Bash code.

@kinow kinow force-pushed the detect-mpi-with-singularity branch 4 times, most recently from c9ac680 to 75eb43e Compare April 27, 2026 20:46
@kinow
Member Author

kinow commented Apr 27, 2026

Tested the current version, and it's passing OK on MN5 🎉

}INFO Final process status is success

I moved the shared memory settings to MpiConfig from MPIRequirement. There are two new settings there, shm_dir=/dev/shm and shm_enabled=True.
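
So an MPI config file could now look something like this (the two new settings shown with their stated defaults; the exact YAML spelling here is just my sketch):

runner: mpirun.mpich
nproc_flag: -n
shm_enabled: true
shm_dir: /dev/shm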

Also added an initial test for the wrapper Bash script.

@mr-c, I think what's missing now are the unit tests to improve coverage of the changed code, and making the macOS tests pass (I may have used some Linux-only command or argument).

It shouldn't be too hard; I think I can get the build passing and all tests green this week.

@kinow kinow force-pushed the detect-mpi-with-singularity branch 9 times, most recently from c15ed27 to e0d9b20 Compare April 28, 2026 21:47
It uses a wrapper script to detect environment variables added
by an MPI launcher program such as mpirun or srun, and exports
them as SINGULARITYENV_$KEY=$VALUE.

Updates the MpiConfig of the MPIRequirement extension to add
the shared memory directory, and a flag to enable or disable
shared memory with Singularity (on by default). When enabled,
it maps a volume for the directory used (default /dev/shm).
@kinow
Member Author

kinow commented Apr 28, 2026

@mr-c, I just tested this again on MareNostrum5, and it worked without issues 😌

The version of the shell script I had did not work on macOS. First, it failed because the function is_apptainer_1_1_or_newer() fails when executed on macOS (I think it was not called in the tests before). You can probably still see the log here: https://github.com/common-workflow-language/cwltool/actions/runs/25075581707/job/73467209090?pr=2216

To fix that I added a fake singularity binary to the $PATH for one of the tests. That fake script is not used in the Linux tests, as subprocess appears to find the real singularity first (it probably looks the binary up another way, and only falls back to searching the PATH if not found), but after that change the tests on macos-latest failed with a different error.

The macOS tests now failed because the fake singularity was printing a random string while the is_apptainer_... functions call singularity --version, so I had to make it return a valid Singularity version. The failure log for get_version should still be available here: https://github.com/common-workflow-language/cwltool/actions/runs/25075581707/job/73467209090?pr=2216.
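
Roughly, the idea is something like this (an illustrative pytest sketch, not the exact test code in this PR):

import os
import stat

def test_with_fake_singularity(tmp_path, monkeypatch):
    # Illustrative only: put a fake `singularity` first on PATH that reports a
    # valid version string, so the version checks can run on machines without
    # Singularity installed (e.g. the macOS runners).
    fake = tmp_path / "singularity"
    fake.write_text("#!/bin/sh\necho 'singularity version 3.8.5'\n")
    fake.chmod(fake.stat().st_mode | stat.S_IEXEC)
    monkeypatch.setenv("PATH", f"{tmp_path}{os.pathsep}{os.environ['PATH']}")
    # ... run the code under test here ...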

There's an atexit function to clean up the temporary file with the baseline env vars (used to diff against the mpirun environment). That function is not easy to test, and it has one statement wrapped in a contextlib.suppress(), so I marked it with # pragma: nocover to exclude it from coverage reports. Let me know if you prefer it to be reported as not covered instead (right now it's neither red nor green in the codecov report; it appears white, like comments). The coverage CI check is still failing because of a -0.17% diff (lines added/removed), but I think the test coverage should be OK.
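
For reference, the shape of that cleanup hook is roughly this (illustrative names, not the exact code):

import atexit
import contextlib
import os

def _remove_baseline_env_file(path: str) -> None:  # pragma: nocover
    # Best-effort removal of the temp file holding the baseline env var names;
    # ignore the error if the file is already gone.
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)

atexit.register(_remove_baseline_env_file, "/tmp/baseline-env.txt")  # path is illustrative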

At one point there was a failure in a 1.3 conformance test. Re-running with another commit (which fixed linting errors), the test passed.

There were also failures in the build_test_container job that I am not sure how to fix.

Oh, and about the singularity wrapper: initially the code was using --env MISSING_VAR=SOME_VALUE, but I realized that the command line was getting really, really long on MareNostrum5, and I think we could have issues with commands being truncated or failing to run in some cases (or terrible entries in logs). So I switched to exporting SINGULARITYENV_$MISSING_VAR=SOME_VALUE instead. 👍
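
Conceptually, the wrapper does something like this (just a Python sketch of the mapping; the actual wrapper in this PR is a Bash script):

import os

def singularity_env(baseline_names: set) -> dict:
    # Illustrative only: every variable that was NOT in the baseline captured
    # before the MPI launcher ran gets re-exported with the SINGULARITYENV_
    # prefix, so Singularity/Apptainer injects it into the container without
    # growing the command line with --env KEY=VALUE pairs.
    return {
        f"SINGULARITYENV_{name}": value
        for name, value in os.environ.items()
        if name not in baseline_names
    }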

This should now be ready for review 😌

Cheers

@kinow kinow requested a review from mr-c April 28, 2026 22:10