Current regex checking gpu resource naming does not support NVIDIA MIGs #387

@dgruano

Description

Software Versions
$ snakemake --version
9.13.2
$ mamba list | grep "snakemake-executor-plugin-slurm"
snakemake-executor-plugin-slurm 1.9.2 pyhdfd78af_0 bioconda
snakemake-executor-plugin-slurm-jobstep 0.3.0 pyhdfd78af_0 bioconda
$ sinfo --version
slurm 24.11.5

Describe the bug
When a Multi-Instance GPU (MIG) device is requested as the GPU to allocate, the executor plugin raises a WorkflowError because the device name does not match the expected regex pattern. MIG device names contain a dot (e.g. gpu:nvidia_h100_nvl_1g.12gb:1), which is not among the characters the regex accepts. Switching the regex to gres_re = re.compile(r"^[a-zA-Z0-9_]+(:[a-zA-Z0-9_\.]+)?:\d+$") solves the issue. I don't know, however, whether this fix would have any unintended consequences.

Note: While I am not using the latest version of the plugin (2.0.2), the regex has not changed since #173 (9 months ago).

https://github.com/snakemake/snakemake-executor-plugin-slurm/blame/fb09d7fe7fe965daf1eae99afa2c67ad3836befc/snakemake_executor_plugin_slurm/utils.py#L179
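A minimal sketch of the problem and the proposed fix, runnable standalone. The `current_re` pattern here is an assumption for illustration (the proposed pattern without the escaped dot in the type group); the `proposed_re` is the pattern suggested above:

```python
import re

# Assumed current pattern: no "." allowed in the optional <type> segment
current_re = re.compile(r"^[a-zA-Z0-9_]+(:[a-zA-Z0-9_]+)?:\d+$")
# Proposed pattern from this issue: allow "." in <type> for MIG device names
proposed_re = re.compile(r"^[a-zA-Z0-9_]+(:[a-zA-Z0-9_\.]+)?:\d+$")

mig = "gpu:nvidia_h100_nvl_1g.12gb:1"  # MIG naming: <gpu>_<slices>g.<memory>gb

# The MIG name is rejected without the dot, accepted with it
print(bool(current_re.match(mig)))   # False
print(bool(proposed_re.match(mig)))  # True

# Standard GRES strings still match under the proposed pattern
print(bool(proposed_re.match("gpu:1")))        # True
print(bool(proposed_re.match("gpu:tesla:2")))  # True
```

Since the dot only appears inside the optional `<type>` group, plain `gpu:1` and `gpu:tesla:2` requests are unaffected by the change.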

Logs

(snakemake) [dgarcia@gemini01 paper]$ snakemake -s ./workflow/rules/gres.smk -p --verbose
Using workflow specific profile profiles/default for setting default command line arguments.
host: gemini01
Building DAG of jobs...
results/gres_bug_done: True 0
shared_storage_local_copies: True
remote_exec: False
Submitting maximum 100 job(s) over 1.0 second(s).
SLURM run ID: d5ded64e-cd8d-4968-ad89-68c2a9dc1985
Using shell: /usr/bin/bash
Provided remote nodes: 30
Job stats:
job         count
--------  -------
gres_bug        1
total           1

Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 30, '_job_count': 9223372036854775807}
Ready jobs: 1
Select jobs to execute...
Selected jobs: 1
Resources after job selection: {'_cores': 9223372036854775807, '_nodes': 30, '_job_count': 100}
Execute 1 jobs...

[Wed Dec 10 11:11:00 2025]
rule gres_bug:
    output: results/gres_bug_done
    jobid: 0
    reason: Missing output files: results/gres_bug_done
    resources: tmpdir=<TBD>, slurm_partition=gpu, gres=gpu:nvidia_h100_nvl_1g.12gb:1
Shell command: 
        echo "Testing gres resource allocation"
        touch results/gres_bug_done
        
No SLURM account given, trying to guess.
Unable to guess SLURM account. Trying to proceed without.
unlocking
removing lock
removing lock
removed all locks
Full Traceback (most recent call last):
  File "/home/dgarcia/miniforge3/envs/snakemake/lib/python3.13/site-packages/snakemake/cli.py", line 2187, in args_to_api
    dag_api.execute_workflow(
    ~~~~~~~~~~~~~~~~~~~~~~~~^
        executor=args.executor,
        ^^^^^^^^^^^^^^^^^^^^^^^
    ...<46 lines>...
        scheduler_settings=scheduler_settings,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/dgarcia/miniforge3/envs/snakemake/lib/python3.13/site-packages/snakemake/api.py", line 634, in execute_workflow
    workflow.execute(
    ~~~~~~~~~~~~~~~~^
        executor_plugin=executor_plugin,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
        updated_files=updated_files,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/dgarcia/miniforge3/envs/snakemake/lib/python3.13/site-packages/snakemake/workflow.py", line 1442, in execute
    raise e
  File "/home/dgarcia/miniforge3/envs/snakemake/lib/python3.13/site-packages/snakemake/workflow.py", line 1438, in execute
    success = self.scheduler.schedule()
  File "/home/dgarcia/miniforge3/envs/snakemake/lib/python3.13/site-packages/snakemake/scheduling/job_scheduler.py", line 389, in schedule
    raise e
  File "/home/dgarcia/miniforge3/envs/snakemake/lib/python3.13/site-packages/snakemake/scheduling/job_scheduler.py", line 367, in schedule
    self.run(runjobs)
    ~~~~~~~~^^^^^^^^^
  File "/home/dgarcia/miniforge3/envs/snakemake/lib/python3.13/site-packages/snakemake/scheduling/job_scheduler.py", line 496, in run
    executor.run_jobs(jobs)
    ~~~~~~~~~~~~~~~~~^^^^^^
  File "/home/dgarcia/miniforge3/envs/snakemake/lib/python3.13/site-packages/snakemake_interface_executor_plugins/executors/base.py", line 73, in run_jobs
    self.run_job(job)
    ~~~~~~~~~~~~^^^^^
  File "/home/dgarcia/miniforge3/envs/snakemake/lib/python3.13/site-packages/snakemake_executor_plugin_slurm/__init__.py", line 328, in run_job
    call += set_gres_string(job)
            ~~~~~~~~~~~~~~~^^^^^
  File "/home/dgarcia/miniforge3/envs/snakemake/lib/python3.13/site-packages/snakemake_executor_plugin_slurm/utils.py", line 96, in set_gres_string
    raise WorkflowError(
    ...<3 lines>...
    )
snakemake_interface_common.exceptions.WorkflowError: Invalid GRES format: gpu:nvidia_h100_nvl_1g.12gb:1. Expected format: '<name>:<number>' or '<name>:<type>:<number>' (e.g., 'gpu:1' or 'gpu:tesla:2')

WorkflowError:
Invalid GRES format: gpu:nvidia_h100_nvl_1g.12gb:1. Expected format: '<name>:<number>' or '<name>:<type>:<number>' (e.g., 'gpu:1' or 'gpu:tesla:2')

Minimal example

rule gres_bug:
    output:
        touch("results/gres_bug_done")
    resources:
        slurm_partition="gpu",
        gres="gpu:nvidia_h100_nvl_1g.12gb:1",
    shell:
        """
        echo "Testing gres resource allocation"
        touch {output}
        """

Additional context
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/mig-device-names.html
