Skip to content

fix: respect CPU affinity (SLURM cgroups) when auto-detecting num_threads#236

Merged
LeonHafner merged 2 commits into
ArcInstitute:mainfrom
LeonHafner:leonhafner/fix_cpu_count
May 15, 2026
Merged

fix: respect CPU affinity (SLURM cgroups) when auto-detecting num_threads#236
LeonHafner merged 2 commits into
ArcInstitute:mainfrom
LeonHafner:leonhafner/fix_cpu_count

Conversation

@LeonHafner
Copy link
Copy Markdown
Collaborator

Summary

  • Fixes #235: MetricsEvaluator crashed under SLURM (--cpus-per-task=N) when the host had more CPUs than the job allocation, because mp.cpu_count() returns host-wide CPUs and ignores cgroup/cpuset limits.
    The oversized thread count was forwarded to numba.set_num_threads() via pdex, raising ValueError: The number of threads must be between 1 and N.
  • Replaces mp.cpu_count() with an affinity-aware _available_cpus() helper using os.sched_getaffinity(0), with a mp.cpu_count() fallback for macOS/Windows (those platforms don't run SLURM and have no cgroup caps in practice).
  • Resolves the num_threads == -1 sentinel at the top of MetricsEvaluator.__init__ so the resolved value is used everywhere downstream rather than only at the _build_de_comparison call site.
  • No new dependency required — os.sched_getaffinity is stdlib and matches the same source numba uses internally to size its threadpool.

@LeonHafner LeonHafner self-assigned this May 7, 2026
@LeonHafner LeonHafner added the bug Something isn't working label May 7, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a helper function _available_cpus to accurately determine the number of available CPUs by respecting process affinity on Linux, falling back to the total CPU count on other platforms. The MetricsEvaluator class was updated to utilize this function when num_threads is set to -1. Feedback suggests enhancing the robustness of the CPU detection by handling potential OSError exceptions in restricted environments and ensuring a minimum return value of 1.

Comment on lines +28 to +31
try:
return len(os.sched_getaffinity(0))
except AttributeError:
return mp.cpu_count()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It is recommended to catch OSError in addition to AttributeError. In some restricted or containerized environments, os.sched_getaffinity might be present in the os module but blocked by security policies (e.g., seccomp), which would raise an OSError. Additionally, ensuring the returned value is at least 1 is a good defensive practice for downstream tools like Numba that require a positive thread count.

Suggested change
try:
return len(os.sched_getaffinity(0))
except AttributeError:
return mp.cpu_count()
try:
return max(1, len(os.sched_getaffinity(0)))
except (AttributeError, OSError):
return max(1, mp.cpu_count())

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's overengineering and doesn't really add much benefit for four new lines of code.

@LeonHafner
Copy link
Copy Markdown
Collaborator Author

Should we bump the version for this change, or just roll it in with the next version bump?

@LeonHafner
Copy link
Copy Markdown
Collaborator Author

Should we bump the version for this change, or just roll it in with the next version bump?

Since previous releases also mostly consisted of a single PR, I’ll also publish a cell-eval 0.7.2 release with this one.

@LeonHafner LeonHafner merged commit 44f9009 into ArcInstitute:main May 15, 2026
8 checks passed
@LeonHafner LeonHafner deleted the leonhafner/fix_cpu_count branch May 15, 2026 03:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MetricsEvaluator ignores CPU affinity (SLURM --cpus-per-task), causing numba ValueError

2 participants