Commit 0730a9a
fix: avoid exhausting Ray CPUs in Tuner by using PlacementGroupFactory with single head bundle (#191)
# Summary

`Tuner` was reserving 0.5 CPU per component via `PlacementGroupFactory`, exhausting Ray's default CPU pool (equal to the number of virtual cores) when running concurrent trials. This required manual workarounds such as `ray.init(num_cpus=100)`.

# Changes

- Changed `PlacementGroupFactory` from reserving 0.5 CPU per component to reserving only 0.5 CPU for the tune process (head bundle)
- Removed the component worker bundles to avoid exhausting Ray CPUs and to prevent Ray placement group cleanup issues
- Kept the TODO for future per-component resource specification

```python
# Before
ray.tune.PlacementGroupFactory(
    [{"CPU": 0.5}] + [{"CPU": 0.5}] * len(spec.args.components)
)

# After
ray.tune.PlacementGroupFactory(
    [{"CPU": 0.5}]  # Only head bundle for tune process
)
```

**Note**: Using `[{"CPU": 0.5}] + [{"CPU": 0}] * len(spec.args.components)` causes a Ray internal error (`KeyError: 'pop from an empty set'`) during placement group cleanup, because Ray filters out zero-value resources, creating empty bundles that trigger a bug in Ray's resource management.

<details>
<summary>Original prompt</summary>

> *This section details the original issue to be resolved.*
>
> <issue_title>bug: Running Tuner can exhaust Ray CPUs</issue_title>
> <issue_description>
> ### Summary
>
> Currently the `Tuner` reserves 0.5 CPU for the tune process and 0.5 CPU for each Component. Running a tune job with concurrency can easily exhaust the available Ray CPUs (which default to the number of virtual cores).
>
> A workaround is to manually call `ray.init` with a larger CPU count, e.g. `ray.init(num_cpus=100)`, to ensure the cluster contains enough CPUs. For now we could avoid this by instead reserving 0 CPUs per component.
>
> Eventually we should plan to allow users to specify (non-zero) resource requirements on each component, so that we can make proper use of resource allocation in Ray.
>
> ### Platform
>
> Linux 6.6.84.1-microsoft-standard-WSL2 x86_64 GNU/Linux
>
> ### Version
>
> 0.4.0
>
> ### Python version
>
> Python 3.12.11
> </issue_description>
</details>

- Fixes #190

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: toby-coleman <13170610+toby-coleman@users.noreply.github.com>
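The exhaustion described above is simple arithmetic: each trial reserved 0.5 CPU for the head bundle plus 0.5 CPU per component, multiplied by the number of concurrent trials. A minimal sketch of that math, using hypothetical numbers (8 components, 4 concurrent trials, an 8-core machine) that are not taken from the issue:

```python
def reserved_cpus(num_components: int, cpu_per_component: float) -> float:
    """CPUs one trial reserves: 0.5 for the tune head bundle plus
    cpu_per_component for each component bundle."""
    return 0.5 + cpu_per_component * num_components


# Hypothetical cluster/workload sizes for illustration only.
num_components, concurrency, available_cpus = 8, 4, 8

before = reserved_cpus(num_components, 0.5) * concurrency  # old behaviour
after = reserved_cpus(num_components, 0.0) * concurrency   # head bundle only

print(f"before: {before} CPUs reserved")  # 18.0 -> exceeds the 8-CPU pool
print(f"after: {after} CPUs reserved")    # 2.0 -> fits comfortably
```

With the old behaviour the reservation scales with both component count and concurrency, which is why an ordinary machine ran out of Ray CPUs; with only the head bundle it scales with concurrency alone.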
1 parent 49a35bc commit 0730a9a

1 file changed

Lines changed: 2 additions & 2 deletions

File tree

plugboard/tune/tune.py

```diff
@@ -226,9 +226,9 @@ def run(self, spec: ProcessSpec) -> ray.tune.Result | list[ray.tune.Result]:
         trainable_with_resources = ray.tune.with_resources(
             self._build_objective(required_classes, spec),
             ray.tune.PlacementGroupFactory(
-                # Reserve 0.5 CPU for the tune process and 0.5 CPU for each component in the Process
+                # Reserve 0.5 CPU for the tune process and 0 CPU for each component in the Process
                 # TODO: Implement better resource allocation based on Process requirements
-                [{"CPU": 0.5}] + [{"CPU": 0.5}] * len(spec.args.components),
+                [{"CPU": 0.5}],
             ),
         )
```
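The note about the `KeyError` can be illustrated without running Ray. Per the commit message, Ray filters zero-value resources out of each bundle, so a `{"CPU": 0}` bundle becomes an empty dict; the sketch below mimics only that filtering step and is not Ray's actual internal code:

```python
# Why [{"CPU": 0.5}] + [{"CPU": 0}] * n is problematic: dropping
# zero-valued resources from each bundle leaves empty bundles behind.
# This reproduces only the filtering, not Ray's cleanup logic.

bundles = [{"CPU": 0.5}] + [{"CPU": 0}] * 3

filtered = [
    {res: amt for res, amt in bundle.items() if amt > 0}
    for bundle in bundles
]

print(filtered)  # [{'CPU': 0.5}, {}, {}, {}]
empty = [b for b in filtered if not b]
print(len(empty))  # 3 empty bundles left over
```

Those empty bundles are the state that, per the commit message, trips the bug in Ray's placement group cleanup, which is why the fix drops the component bundles entirely rather than setting them to zero.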
