Conversation
Force-pushed from d2b1e14 to bff8585.
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #2216      +/-   ##
==========================================
+ Coverage   85.12%   85.25%   +0.13%
==========================================
  Files          46       46
  Lines        8591     8614      +23
  Branches     2011     2014       +3
==========================================
+ Hits         7313     7344      +31
+ Misses        812      806       -6
+ Partials     466      464       -2
```

☔ View full report in Codecov by Sentry.
Test successful on BSC MN5! 🎉
For posterity: if you run cwltool on an HPC using multiple nodes that do not share the same temp directory, you'll get something similar to:

```
slurmstepd: error: couldn't chdir to `/scratch/tmp/37435333/l0q96kcm': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/scratch/tmp/37435333/l0q96kcm': No such file or directory: going to /tmp instead
WARNING: skipping mount of /scratch/tmp/37435333/1db6cool: stat /scratch/tmp/37435333/1db6cool: no such file or directory
WARNING: skipping mount of /scratch/tmp/37435333/e9hzplsn/a.out: stat /scratch/tmp/37435333/e9hzplsn/a.out: no such file or directory
WARNING: skipping mount of /scratch/tmp/37435333/1db6cool: stat /scratch/tmp/37435333/1db6cool: no such file or directory
WARNING: skipping mount of /scratch/tmp/37435333/e9hzplsn/a.out: stat /scratch/tmp/37435333/e9hzplsn/a.out: no such file or directory
FATAL: container creation failed: mount /scratch/tmp/37435333/1db6cool->/dev/shm error: while mounting /scratch/tmp/37435333/1db6cool: mount source /scratch/tmp/37435333/1db6cool doesn't exist
FATAL: container creation failed: mount /scratch/tmp/37435333/1db6cool->/dev/shm error: while mounting /scratch/tmp/37435333/1db6cool: mount source /scratch/tmp/37435333/1db6cool doesn't exist
```

And if you run without network access:

```
Abort(1017768207) on node 2:
Fatal error in internal_Init:
Other MPI error, error stack:
internal_Init(70)....................:
MPI_Init(argc=0x7ffd98ef954c, argv=0x7ffd98ef9540) failed
MPII_Init_thread(282)................:
MPIR_init_comm_world(34).............:
MPIR_Comm_commit(794)................:
MPIR_Comm_commit_internal(579).......:
MPID_Comm_commit_pre_hook(151).......:
MPIDI_world_pre_init(669)............:
MPIDI_OFI_init_world(805)............:
MPIDI_OFI_addr_exchange_root_ctx(143):
MPIDU_bc_allgather(112)..............:
MPIR_Allgatherv_intra_brucks(80).....:
MPIC_Sendrecv(259)...................:
MPID_Isend(60).......................:
MPIDI_isend(32)......................:
MPIDI_NM_mpi_isend(780)..............:
MPIDI_OFI_send_fallback(483).........:
OFI call tsendv failed (default nic=lo:
No such file or directory)
Abort(1017768207) on node 0:
Fatal error in internal_Init:
Other MPI error, error stack:
internal_Init(70)....................:
MPI_Init(argc=0x7ffe4702f11c, argv=0x7ffe4702f110) failed
MPII_Init_thread(282)................:
MPIR_init_comm_world(34).............:
MPIR_Comm_commit(794)................:
MPIR_Comm_commit_internal(579).......:
MPID_Comm_commit_pre_hook(151).......:
MPIDI_world_pre_init(669)............:
MPIDI_OFI_init_world(805)............:
MPIDI_OFI_addr_exchange_root_ctx(143):
MPIDU_bc_allgather(112)..............:
MPIR_Allgatherv_intra_brucks(80).....:
MPIC_Sendrecv(259)...................:
MPID_Isend(60).......................:
MPIDI_isend(32)......................:
MPIDI_NM_mpi_isend(780)..............:
MPIDI_OFI_send_fallback(483).........:
OFI call tsendv failed (default nic=lo:
No such file or directory)
```

Using a temporary directory pointing to the parallel file system (GPFS in the case of MN5) and adding the […]
Test successful on CESGA FT3! 🎉
I used a folder from the Lustre parallel filesystem, […]
```python
if not mpi_req or not is_req:
    runtime.append("--cleanenv")
```
Thank you for this PR! Let's always use `--cleanenv`; to cope with the environment variables that get set, I recommend creating a shell script to be executed instead of the main command line.
The environment variables to set can be derived either by parsing `env_pass` and `env_pass_regex` from the `--mpi-config-file`, OR by passing in a list of the existing environment variable names and only passing along those newly set by the MPI runner.
I've added `--cleanenv` back, and added a wrapper script. I will need some time to test it, but in the meantime feel free to have a look at the code here and at the new shell script to see if we are going in the right direction :) 👍
```python
else:
    self.append_volume(
        runtime,
        runtime_context.create_tmpdir(),
        "/dev/shm",
        writable=True,
    )
```
Neat trick; is this universal for all known MPI implementations? Or do some not use the shared memory device for communication between MPI processes?
That's not defined in the MPI specification. OpenMPI uses shared memory, and MPICH does too. Cray and Intel MPI are derivatives of MPICH, so they likely use it by default as well. There appear to be some environment variables to disable it, but I haven't seen any case where shared memory was disabled for MPI. If you think it's useful I can add a new option to disable it in cwltool, or we can do it later if a user requests it?
Force-pushed from 29259b3 to fcc3175.
```python
runtime = [
    "singularity",
    # ...
]

mpi_req, is_req = self.builder.get_requirement(MPIRequirementName)
mpi_enabled = mpi_req and is_req
```
I think the is_req check excludes the case where MPIRequirement is used only as a hint. I'm considering MPI to be used/enabled only when it's a requirement, not a hint.
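For context, a minimal sketch of that gating, assuming `get_requirement` returns an `(obj, required)` pair as in cwltool's `Builder` (the helper function here is hypothetical, not the PR's exact code):

```python
# Hypothetical helper: treat MPI as enabled only when MPIRequirement
# appears under `requirements` (is_req is True), not under `hints`.
MPIRequirementName = "http://commonwl.org/cwltool#MPIRequirement"

def mpi_is_enabled(get_requirement) -> bool:
    mpi_req, is_req = get_requirement(MPIRequirementName)
    return bool(mpi_req and is_req)

# Usage sketch: mpi_is_enabled(self.builder.get_requirement)
```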
Force-pushed from 603e6d2 to b09bfbd.
```python
self.shm_enabled = shm_enabled
# POSIX only contains functions to handle shared memory, but it does not
# specify the directory to be used, nor if a directory needs to be used
# at all -- ref: https://pubs.opengroup.org/onlinepubs/9699919799/
```
Looks like FreeBSD implements POSIX shared memory as kernel objects, rather than as files under /dev/shm as on Linux.
These two settings, shm_enabled and shm_dir, control whether shared memory will be used and which directory to use, respectively.
And MPICH allows users to disable shared memory completely, or change its size: https://github.com/pmodels/mpich/blob/main/doc/wiki/faq/Frequently_Asked_Questions.md#:~:text=The%20work,stack
So with these settings I think users are able to match the behavior of at least that implementation.
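For illustration, a minimal sketch (hypothetical names, not the PR's exact code) of how `shm_enabled` and `shm_dir` could drive the bind mount; the `--mount` syntax matches the Singularity command captured later in this thread:

```python
import tempfile
from typing import List

def add_shm_mount(runtime: List[str], shm_enabled: bool, shm_dir: str = "/dev/shm") -> None:
    """Bind a fresh host tmpdir over the container's shared-memory directory."""
    if not shm_enabled:
        return
    host_tmp = tempfile.mkdtemp()  # host-side backing directory
    runtime.append(f"--mount=type=bind,source={host_tmp},target={shm_dir}")

runtime = ["singularity", "--quiet", "run"]
add_shm_mount(runtime, shm_enabled=True)
```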
```python
"""This script tests a Shell script. This script does not contribute to the
project test coverage (although kcov, or bats+kcov could be used in the
future)."""
```
I use kcov with bats to test and track coverage of shell code in a project. It can be combined with Python and pytest reports too, but it takes some time to set everything up. So here I used a pytest that, from my own debugging, appears to cover all the branches and lines in the Bash code.
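For illustration, a hypothetical pytest shape for exercising such a wrapper via subprocess; the script name and the `SINGULARITYENV_` re-export contract are assumptions, not the PR's actual test:

```python
import os
import subprocess

def test_wrapper_reexports_launcher_vars():
    # Pretend an MPI launcher already set a PMI variable for this process.
    env = dict(os.environ, PMI_RANK="0")
    result = subprocess.run(
        ["bash", "mpi_env_wrapper.sh", "printenv"],  # hypothetical script name
        capture_output=True, text=True, env=env, check=True,
    )
    # Assumed contract: launcher-added vars get re-exported with the
    # SINGULARITYENV_ prefix so Singularity forwards them into the container.
    assert "SINGULARITYENV_PMI_RANK=0" in result.stdout
```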
Force-pushed from c9ac680 to 75eb43e.
Tested the current version, and it's passing OK on MN5 🎉

I moved the shared memory settings to `MpiConfig`, and also added an initial test for the wrapper Bash script.

@mr-c, I think what's missing now are the unit tests to improve coverage of the changed code, and getting the macOS tests to pass (I may have used a Linux-only command or argument). It shouldn't be too hard; I think I can get the build passing and all tests green this week.
Force-pushed from c15ed27 to e0d9b20.
It uses a wrapper script to detect environment variables added by an MPI launcher program such as mpirun or srun, and exports them as `SINGULARITYENV_$KEY=$VALUE`. It also updates the MpiConfig of the MPIRequirement extension to add the shared memory directory, plus a flag to enable or disable shared memory with Singularity (on by default). When enabled, it maps a volume for the directory used (default `/dev/shm`).
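The change itself uses a Bash wrapper; as a rough illustration of the same env-diff idea, here is a minimal Python sketch (all names hypothetical). Since Singularity injects `SINGULARITYENV_*` variables into the container even under `--cleanenv`, this lets launcher-created variables through:

```python
import json
import os
import subprocess
import sys

def snapshot_env(path: str) -> None:
    """Save the environment before the MPI launcher runs."""
    with open(path, "w") as f:
        json.dump(dict(os.environ), f)

def reexport_new_vars(baseline_path: str) -> None:
    """Re-export vars added since the snapshot with the SINGULARITYENV_ prefix."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    for key, value in os.environ.items():
        if key not in baseline:  # added by mpirun/srun after the snapshot
            os.environ[f"SINGULARITYENV_{key}"] = value

if __name__ == "__main__":
    reexport_new_vars(sys.argv[1])            # arg 1: baseline snapshot file
    sys.exit(subprocess.call(sys.argv[2:]))   # rest: the singularity command
```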
@mr-c, I just tested this again on MareNostrum5, and it worked without issues 😌

The version of the shell script I had did not work on macOS. First, it failed because a function […]. To fix that I added a fake […]. The macOS tests then failed because the fake singularity was printing a random string, and the […].

There's an atexit function to clean up the temporary file with the baseline env vars (used to diff against the mpirun envs). That function is not easy to test, and it has one statement wrapped in a […].

Once, there was a failure in a conformance 1.3 test. Re-running with another commit (one that was fixing linting errors), the test passed. There were also failures in the build_test_container job that I am not sure how to fix.

Oh, and about the singularity wrapper: initially the code was using […].

This should now be ready for review 😌 Cheers
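A sketch of the fake-singularity-on-PATH trick mentioned above; the fake's output and the test shape are assumptions, not the PR's code:

```python
import subprocess

def test_with_fake_singularity(tmp_path, monkeypatch):
    fake = tmp_path / "singularity"
    # Print a fixed version string: the comment above notes that a random
    # string from the fake broke the macOS test assertions.
    fake.write_text("#!/bin/sh\necho 'singularity version 3.0-fake'\n")
    fake.chmod(0o755)
    monkeypatch.setenv("PATH", str(tmp_path), prepend=":")  # shadow the real binary
    out = subprocess.run(
        ["singularity", "--version"], capture_output=True, text=True
    ).stdout
    assert "3.0-fake" in out
```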
Related to #2208 (not sure if this is enough to close that issue)
@mr-c, I started some tests yesterday, and completed them today after work. Once I managed to run the workflow on my machine, reproducing the error we have on HPCs, debugging showed that the changes needed were simple.
I managed to keep `--contain`, but in order to do that I had to include `/dev/shm`. I couldn't find any files in `/tmp` or in the temporary directory created by cwltool. And actually, cwltool worked without having to remove its temporary directory (i.e. I was wrong: the process manager wasn't using the temp dir to sync between processes, but apparently shared memory).

I kept `--cleanenv` for when the job isn't using the cwltool MPIRequirement. When that requirement is an actual requirement (not a hint) and is used in the job, I do not set `--cleanenv`. I tried keeping it, and even tested the MPI config file using the variables from the MPICH documentation.
(The MPI config file was needed on my laptop as I have both OpenMPI and MPICH, so I need to override the runner: `mpirun` defaults to OpenMPI on my machine, while the container I'm using has MPICH. I also confirmed with the debugger that the MPI config file values were added.)
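For reference, a hedged sketch of such an override using cwltool's MpiConfig keys (the values are assumptions for this laptop scenario; on the command line the same keys would go in the YAML file passed to `--mpi-config-file`):

```python
# Assumed keyword names follow cwltool's MpiConfig; values are illustrative.
from cwltool.mpi import MpiConfig

config = MpiConfig(
    runner="mpirun.mpich",       # force the MPICH launcher instead of `mpirun`
    nproc_flag="-n",             # flag used to pass the process count
    env_pass=["USER"],           # variable names to pass through
    env_pass_regex=["SLURM_.*"], # regexes for variable names to pass through
)
```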
The issue with the environment variables, @mr-c, is that when cwltool runs, I don't have any of the MPI environment variables yet. When we create the CWL job and the runtime config, and even when we pass through the env vars from `mpi.py`, there are no MPI variables loaded yet.

The Edinburgh paper discusses how to use some variables, but I believe those are only the variables already available. For instance, when you have a Slurm allocation you have `SLURM_JOB_ID`, `SLURM_MEM_PER_CPU`, etc.

When the CWL job is executed, with the runtime (command + args) already created, and Python then launches `mpirun.mpich`, that's when the MPICH (or Hydra) program will create the environment variables. I tried setting these variables on the singularity command line without success. This is what I did:
Got the list of variables that appeared in the diff. There were 8 extra variables:

```
GFORTRAN_UNBUFFERED_PRECONNECTED=y
HYDI_CONTROL_FD=7
MPIR_CVAR_CH3_INTERFACE_HOSTNAME=ranma
MPI_LOCALNRANKS=1
MPI_LOCALRANKID=0
PMI_FD=9
PMI_RANK=0
PMI_SIZE=1
```

I added them to the Singularity command with just their names, to see if it would pass the host env vars to the container (as it wouldn't make sense to try to define random file descriptors, ranks, etc.).
The resulting command has `--cleanenv` and all the environment variables from the diff:

```
INFO [job run] /tmp/hib1sp8z$ mpirun.mpich \
    -n \
    2 \
    singularity \
    --quiet \
    run \
    --ipc \
    --contain \
    --cleanenv \
    --env \
    PMI_RANK \
    --env \
    PMI_FD \
    --env \
    PMI_SIZE \
    --env \
    MPI_LOCALNRANKS \
    --env \
    MPI_LOCALRANKID \
    --env \
    MPIR_CVAR_CH3_INTERFACE_HOSTNAME \
    --env \
    HYDI_CONTROL_FD \
    --env \
    GFORTRAN_UNBUFFERED_PRECONNECTED \
    --mount=type=bind,source=/tmp/g7lo1zku,target=/dev/shm \
    --no-eval \
    --userns \
    --home \
    /tmp/hib1sp8z:/PCIKJh \
    --mount=type=bind,source=/tmp/ub89clkk,target=/tmp \
    --mount=type=bind,source=/tmp/h8j2gy1j/a.out,target=/var/lib/cwl/stgfb9c3552-0c10-424c-82ba-284742e69287/a.out,readonly \
    --pwd \
    /PCIKJh \
    --net \
    --network \
    none \
    /home/kinow/Development/python/workspace/cwl-mpi/examples/mpich-sr/cwltool/mfisherman_mpich:4.3.2.sif \
    /var/lib/cwl/stgfb9c3552-0c10-424c-82ba-284742e69287/a.out \
    0 \
    1 > /tmp/hib1sp8z/sr.out 2> /tmp/hib1sp8z/sr.err
WARNING [job run] exited with status: 1
WARNING [job run] completed permanentFail
WARNING [step run] completed permanentFail
```

That failed again. And running a smaller test, it looks like only `KEY=VALUE` is valid for `--env`.
But since I don't know the file descriptor, rank, etc., I can't see a way to define PMI_RANK, PMI_SIZE, PMI_FD, and other variables for MPICH.
@mr-c, I did not write tests yet as I wanted to check with you first whether this is going in the right direction. I'll run a couple of tests on MN5 and CESGA FT3. My main concern is `--contain` and InfiniBand. Even without `--cleanenv`, I think `--contain` may force the container to use Ethernet instead of InfiniBand.