Rebuild GPU software for all supported combinations of CPU and CUDA compute capabilities #969

Closed
ocaisa wants to merge 4 commits into EESSI:2023.06-software.eessi.io from ocaisa:rebuild_gpu_packages

Conversation

@ocaisa
Member

@ocaisa ocaisa commented Mar 17, 2025

We're not going to be able to test all possible CPU/GPU combinations, but we need a general approach that lets us move forward by testing only a subset while still providing builds for more combinations. This PR is intended to put that workflow in place and to begin rebuilding all GPU packages to reflect the changes.

  • Target a specific set of CPU/compute-capability combinations
    • 7.0, 8.0, 9.0 for all CPU architectures
    • Device code built for a major-version CC will run on all minor versions of that major version (e.g., 8.0 device code on an 8.6-capable GPU), so allow for this fallback
    • Allow for a specific set of additional CPU/cc combinations for known hardware
    • Document these combinations in the EasyBuild hook and complain if they are not being respected
  • Modify EasyBuild hooks:
    • to reflect whether a module has been tested or not: check for an available device matching the target cc; if none exists, add an Lmod footer or module description explaining this
    • to fail when attempting to install a package that is not CUDA and has no CUDA dependency into an accel subdir (with advice about what to do)
    • to enforce compilation for device code (with ptx) by setting NVCC_PREPEND_FLAGS='-arch=sm_XX' for the build (this probably has fallout, so it should be considered a nice-to-have)
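
The fallback rule in the list above can be sketched in Python. This is only an illustration and not code from this PR: the names `parse_cc` and `pick_fallback_cc` are hypothetical, and the rule is simplified to "same major version, highest built minor version that does not exceed the device's".

```python
from typing import Optional, Sequence, Tuple

# Assumption: the baseline targets named in the PR description.
BUILT_CCS: Tuple[str, ...] = ("7.0", "8.0", "9.0")


def parse_cc(cc: str) -> Tuple[int, int]:
    """Split a compute capability such as '8.6' into (major, minor)."""
    major, minor = cc.split(".")
    return int(major), int(minor)


def pick_fallback_cc(device_cc: str, built_ccs: Sequence[str] = BUILT_CCS) -> Optional[str]:
    """Return the best built compute capability usable on the device, or None."""
    dev_major, dev_minor = parse_cc(device_cc)
    candidates = [
        cc for cc in built_ccs
        if parse_cc(cc)[0] == dev_major and parse_cc(cc)[1] <= dev_minor
    ]
    if not candidates:
        return None
    # Prefer the closest match, i.e. the highest minor version among the candidates.
    return max(candidates, key=parse_cc)


print(pick_fallback_cc("8.6"))  # same major version as the 8.0 build
print(pick_fallback_cc("7.5", built_ccs=("7.0", "7.5", "8.0")))  # exact match available
print(pick_fallback_cc("6.1"))  # no build shares major version 6
```

Keeping the set of built capabilities as data makes it easy to add the "additional CPU/cc combinations for known hardware" without touching the selection logic.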

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphirerapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

@eessi-bot-trz42

Instance trz42-GH200-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

@eessi-bot-toprichard

Instance rt-Grace-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

@ocaisa
Member Author

ocaisa commented Mar 17, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:
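
The "expanded format" lines above suggest that the bot maps short argument names to long ones. A minimal Python sketch of such a mapping, inferred only from the commands shown in this thread, could look as follows. The function name `expand_bot_command` and the alias table are assumptions and not the bot's actual code.

```python
# Aliases inferred from the bot output in this thread; other keys pass through unchanged.
ALIASES = {"repo": "repository", "arch": "architecture", "accel": "accelerator"}


def expand_bot_command(line: str) -> str:
    """Expand 'bot: build repo:X arch:Y ...' into the long form shown by the bot."""
    prefix = "bot:"
    if not line.startswith(prefix):
        raise ValueError("not a bot command: %r" % line)
    # First word after the prefix is the command, the rest are key:value arguments.
    command, *args = line[len(prefix):].split()
    expanded = []
    for arg in args:
        key, _, value = arg.partition(":")
        expanded.append("%s:%s" % (ALIASES.get(key, key), value))
    return " ".join([command] + expanded)


print(expand_bot_command(
    "bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws "
    "arch:x86_64/amd/zen2 accel:nvidia/cc80"
))
```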

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

    • no jobs were submitted

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50733

date job status comment
Mar 17 16:37:55 UTC 2025 submitted job id 50733 awaits release by job manager
Mar 17 16:38:56 UTC 2025 released job awaits launch by Slurm scheduler
Mar 17 16:45:02 UTC 2025 running job 50733 is running
Mar 17 17:27:54 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-50733.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742231250.tar.gz
size: 0 MiB (3706 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 17 17:27:54 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-50733.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
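
The check list in the bot comment above reads like a set of pattern searches over the Slurm output file. The Python sketch below shows one way such a check could be written. The patterns are copied from the comment, but the function name `check_build_log`, the "all checks must pass" rule, and the sample log are assumptions made for illustration, not the bot's implementation.

```python
import re

# Patterns copied from the bot comment; how the bot combines them is an assumption.
MUST_NOT_MATCH = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]
MUST_MATCH = ["No missing installations", ".tar.gz created!"]


def check_build_log(log_text: str) -> dict:
    """Map each check description to True (passed) or False (failed)."""
    results = {}
    for pattern in MUST_NOT_MATCH:
        found = re.search(re.escape(pattern), log_text) is not None
        results["no message matching " + pattern] = not found
    for pattern in MUST_MATCH:
        found = re.search(re.escape(pattern), log_text) is not None
        results["found message matching " + pattern] = found
    return results


# Made-up log excerpt, used only to exercise the function.
sample_log = "ERROR: something went wrong\n/tmp/example.tar.gz created!\n"
results = check_build_log(sample_log)
for description, passed in results.items():
    print("PASS" if passed else "FAIL", description)
print("SUCCESS" if all(results.values()) else "FAILURE")
```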

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc75 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50734

date job status comment
Mar 17 16:38:00 UTC 2025 submitted job id 50734 awaits release by job manager
Mar 17 16:38:54 UTC 2025 released job awaits launch by Slurm scheduler
Mar 17 16:43:59 UTC 2025 running job 50734 is running
Mar 17 16:48:08 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-50734.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742229971.tar.gz
size: 0 MiB (3705 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 17 16:48:08 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-50734.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@ocaisa ocaisa changed the title from "Rebuild GPU software to support all CPUs and all support CUDA compute capabilities" to "Rebuild GPU software to support all CPUs and all CUDA compute capabilities" Mar 17, 2025
@ocaisa
Member Author

ocaisa commented Mar 17, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

    • no jobs were submitted

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50735

date job status comment
Mar 17 18:34:05 UTC 2025 submitted job id 50735 awaits release by job manager
Mar 17 18:35:04 UTC 2025 released job awaits launch by Slurm scheduler
Mar 17 18:41:09 UTC 2025 running job 50735 is running
Mar 17 18:44:16 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-50735.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742236924.tar.gz
size: 0 MiB (3704 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 17 18:44:16 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-50735.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc75 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50736

date job status comment
Mar 17 18:34:10 UTC 2025 submitted job id 50736 awaits release by job manager
Mar 17 18:35:02 UTC 2025 released job awaits launch by Slurm scheduler
Mar 17 18:41:07 UTC 2025 running job 50736 is running
Mar 17 18:44:14 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-50736.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742236952.tar.gz
size: 0 MiB (3704 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 17 18:44:14 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-50736.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@ocaisa
Member Author

ocaisa commented Mar 17, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

    • no jobs were submitted

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50737

date job status comment
Mar 17 19:03:59 UTC 2025 submitted job id 50737 awaits release by job manager
Mar 17 19:04:22 UTC 2025 released job awaits launch by Slurm scheduler
Mar 17 19:05:26 UTC 2025 running job 50737 is running
Mar 17 21:48:09 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-50737.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742245284.tar.gz
size: 5342 MiB (5602332191 bytes)
entries: 16543
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
CUDA/12.1.1.lua
CUDA/12.4.0.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
ESPResSo/4.2.2-foss-2023a-CUDA-12.1.1.lua
LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1.lua
OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua
UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0.lua
UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
CUDA/12.1.1
CUDA/12.4.0
cuDNN/8.9.2.26-CUDA-12.1.1
ESPResSo/4.2.2-foss-2023a-CUDA-12.1.1
LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1
OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0
UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 17 21:48:09 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-50737.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot

eessi-bot Bot commented Mar 17, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc75 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/50738

date job status comment
Mar 17 19:04:04 UTC 2025 submitted job id 50738 awaits release by job manager
Mar 17 19:04:20 UTC 2025 released job awaits launch by Slurm scheduler
Mar 17 19:05:24 UTC 2025 running job 50738 is running
Mar 17 21:34:54 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-50738.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742244793.tar.gz
size: 5453 MiB (5718189739 bytes)
entries: 16639
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/modules/all
CUDA/12.1.1.lua
CUDA/12.4.0.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
ESPResSo/4.2.2-foss-2023a-CUDA-12.1.1.lua
LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1.lua
NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1.lua
NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0.lua
OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua
UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0.lua
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/software
CUDA/12.1.1
CUDA/12.4.0
cuDNN/8.9.2.26-CUDA-12.1.1
ESPResSo/4.2.2-foss-2023a-CUDA-12.1.1
LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1
NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1
NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0
OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1
UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 17 21:34:54 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-50738.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

Comment thread create_lmodsitepackage.py
Comment on lines +112 to +131
local function using_eessi_accel_stack (t)
    if not os.getenv("EESSI_SKIP_ACCELERATOR_WARNING") then
        local fullName = t.modFullName
        local moduleFilePath = t.fn
        -- Check if we are using an EESSI version 2023 accelerator stack by checking the moduleFilePath is
        -- a path that starts with /cvmfs/software.eessi.io/versions and contains accel/nvidia/ccNN
        if string.sub(moduleFilePath, 1, 33) == "/cvmfs/software.eessi.io/versions" then
            if string.find(moduleFilePath, "accel/nvidia/cc%d%d") then
                -- right now we print this for all cases, but eventually we should only
                -- print this for accelerators we do _not_ test
                local advice = fullName .. " has not been tested for " .. os.getenv("EESSI_SOFTWARE_SUBDIR")
                advice = advice .. " with " .. string.match(moduleFilePath, "accel/nvidia/cc%d%d")
                advice = advice .. " but is likely to work.\\n"
                advice = advice .. "(Silence this message by setting the environment variable "
                advice = advice .. "EESSI_SKIP_ACCELERATOR_WARNING)"
                LmodMessage(advice)
            end
        end
    end
end
Member Author

Should this be moved to an EasyBuild hook instead, since what we can test on may change over time?
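
For comparison, the same path check is easy to express in Python, the language EasyBuild hooks are written in. The sketch below only mirrors the detection logic of the Lua snippet. The function name `untested_accel_target` and the `TESTED_CCS` set are hypothetical, and how the result would be wired into an actual hook is left open.

```python
import re
from typing import Optional

CVMFS_PREFIX = "/cvmfs/software.eessi.io/versions"
# Same shape as the Lua pattern "accel/nvidia/cc%d%d": exactly two digits after "cc".
ACCEL_PATTERN = re.compile(r"accel/nvidia/cc(\d\d)")

# Hypothetical: compute capabilities for which test hardware is available.
TESTED_CCS = {"80"}


def untested_accel_target(path: str) -> Optional[str]:
    """Return e.g. 'accel/nvidia/cc75' if path lies in an untested EESSI accelerator stack."""
    if not path.startswith(CVMFS_PREFIX):
        return None
    match = ACCEL_PATTERN.search(path)
    if match is None or match.group(1) in TESTED_CCS:
        return None
    return match.group(0)


example = (
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/"
    "accel/nvidia/cc75/modules/all/CUDA/12.4.0.lua"
)
print(untested_accel_target(example))
```

Keeping the tested set in one place means only that data changes when new test hardware becomes available.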

@ocaisa
Member Author

ocaisa commented Mar 18, 2025

Not sure how this is possible, but EasyBuild is failing to apply the patch to the GROMACS sources:

== FAILED: Installation ended unsuccessfully (build directory: /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0): build failed (first 300 chars): Can't determine patch level for patch /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/EasyBuild/4.9.4/easybuild/easyconfigs/g/GROMACS/GROMACS-2023.1_set_omp_num_threads_env_for_ntomp_tests.patch from directory /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA- (took 12 mins 15 secs)

@ocaisa
Member Author

ocaisa commented Mar 18, 2025

I don't get it: if I do it manually, it works just fine.

@boegel
Contributor

boegel commented Mar 20, 2025

@ocaisa Can you clarify in the PR description why these rebuilds are necessary? What has changed to require us rebuilding all of this?

@boegel
Contributor

boegel commented Mar 20, 2025

> Not sure how this is possible, but EasyBuild is failing to apply the patch to the GROMACS sources:
>
> == FAILED: Installation ended unsuccessfully (build directory: /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0): build failed (first 300 chars): Can't determine patch level for patch /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/software/EasyBuild/4.9.4/easybuild/easyconfigs/g/GROMACS/GROMACS-2023.1_set_omp_num_threads_env_for_ntomp_tests.patch from directory /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA- (took 12 mins 15 secs)

I think it's because the "start" dir is wrong somehow, it should be /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0/gromacs-2024.4 rather than /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0?

@boegel
Contributor

boegel commented Mar 20, 2025

== 2025-03-17 20:37:47,651 filetools.py:461 DEBUG Unpacking /project/def-users/bot/shared/easybuild/sources/g/GROMACS/gromacs-2024.4.tar.gz in directory /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0
...
== 2025-03-17 20:37:47,651 run.py:222 DEBUG run_cmd: running cmd tar xzf /project/def-users/bot/shared/easybuild/sources/g/GROMACS/gromacs-2024.4.tar.gz (in /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0)
...
== 2025-03-17 20:37:48,806 filetools.py:1373 DEBUG Last dir list ['gromacs-2024.4', 'easybuild_obj']
== 2025-03-17 20:37:48,806 filetools.py:1374 DEBUG Possible new dir /tmp/bot/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0 found

The easybuild_obj directory is causing trouble: it breaks the detection of which directory was unpacked from the source tarball, so gromacs-2024.4 is not marked as the start dir as it should be.

@boegel
Contributor

boegel commented Mar 20, 2025

GROMACS is an iterated installation, and applying the patch fails in the 2nd iteration because the build directory is not fully removed when the 2nd iteration starts (the easybuild_obj directory from the 1st iteration stays in place).

But I don't see how this is only a problem now... Maybe we somehow break the cleanup through our hooks?
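
The failure mode described above can be reproduced with a simplified model of "descend into the unpacked directory only if it is the single candidate". This is a sketch for illustration, not EasyBuild's actual implementation, and `guess_start_dir` is a hypothetical name.

```python
import os
import tempfile


def guess_start_dir(build_dir: str) -> str:
    """Descend into build_dir only if it contains exactly one subdirectory."""
    subdirs = sorted(
        entry for entry in os.listdir(build_dir)
        if os.path.isdir(os.path.join(build_dir, entry))
    )
    if len(subdirs) == 1:
        return os.path.join(build_dir, subdirs[0])
    # Ambiguous: more than one candidate (or none), so stay in build_dir.
    return build_dir


with tempfile.TemporaryDirectory() as build_dir:
    # 1st iteration: only the unpacked sources are present.
    os.mkdir(os.path.join(build_dir, "gromacs-2024.4"))
    first = guess_start_dir(build_dir)
    # 2nd iteration: a leftover directory from the 1st iteration is still there.
    os.mkdir(os.path.join(build_dir, "easybuild_obj"))
    second = guess_start_dir(build_dir)
    print(os.path.basename(first))  # gromacs-2024.4
    print(second == build_dir)      # True
```

With the leftover directory in place the guess falls back to the build directory itself, which matches the directory named in the "Can't determine patch level" error.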

@ocaisa ocaisa changed the title from "Rebuild GPU software to support all CPUs and all CUDA compute capabilities" to "Rebuild GPU software to support all supported combinations of CPU and CUDA compute capabilities" Mar 20, 2025
@ocaisa ocaisa marked this pull request as draft March 20, 2025 10:01
@ocaisa
Member Author

ocaisa commented Mar 20, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75

@eessi-bot

eessi-bot Bot commented Mar 20, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

@eessi-bot

eessi-bot Bot commented Mar 20, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc80 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • received bot command build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen2 accel:nvidia/cc75 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
  • handling command build repository:eessi.io-2023.06-software instance:eessi-bot-mc-aws architecture:x86_64/amd/zen2 accelerator:nvidia/cc75 resulted in:

    • no jobs were submitted

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account ocaisa has NO permission to send commands to the bot

@eessi-bot

eessi-bot Bot commented Mar 20, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/51393

date job status comment
Mar 20 10:09:03 UTC 2025 submitted job id 51393 awaits release by job manager
Mar 20 10:09:31 UTC 2025 released job awaits launch by Slurm scheduler
Mar 20 10:15:43 UTC 2025 running job 51393 is running
Mar 20 12:18:14 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-51393.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742470395.tar.gz
size: 5076 MiB (5323445039 bytes)
entries: 11996
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
CUDA/12.1.1.lua
CUDA/12.4.0.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua
UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0.lua
UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
CUDA/12.1.1
CUDA/12.4.0
cuDNN/8.9.2.26-CUDA-12.1.1
OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0
UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 20 12:18:14 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-51393.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot

eessi-bot Bot commented Mar 20, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc75 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.03/pr_969/51394

date job status comment
Mar 20 10:09:07 UTC 2025 submitted job id 51394 awaits release by job manager
Mar 20 10:09:28 UTC 2025 released job awaits launch by Slurm scheduler
Mar 20 10:10:33 UTC 2025 running job 51394 is running
Mar 20 11:52:34 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-51394.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1742469302.tar.gz
size: 5187 MiB (5439704863 bytes)
entries: 12092
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/modules/all
CUDA/12.1.1.lua
CUDA/12.4.0.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1.lua
NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0.lua
OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua
UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0.lua
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua
UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75/software
CUDA/12.1.1
CUDA/12.4.0
cuDNN/8.9.2.26-CUDA-12.1.1
NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1
NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0
OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0
UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1
UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc75
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Mar 20 11:52:34 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-51394.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@ocaisa ocaisa changed the title from "Rebuild GPU software to support all supported combinations of CPU and CUDA compute capabilities" to "Rebuild GPU software for all supported combinations of CPU and CUDA compute capabilities" Mar 20, 2025
@laraPPr
Collaborator

laraPPr commented Jun 27, 2025

@ocaisa Can you split up and retarget this PR?

@ocaisa
Member Author

ocaisa commented Jun 30, 2025

I think we will just come back to doing this naturally in the near future

@ocaisa ocaisa closed this Jun 30, 2025
3 participants