Commit 0730a9a
fix: avoid exhausting Ray CPUs in Tuner by using PlacementGroupFactory with single head bundle (#191)
# Summary

`Tuner` was reserving 0.5 CPU per component via `PlacementGroupFactory`, exhausting Ray's default CPU pool (equal to the number of virtual cores) when running concurrent trials. This required manual workarounds such as `ray.init(num_cpus=100)`.

# Changes

- Changed `PlacementGroupFactory` from reserving 0.5 CPU per component to reserving only 0.5 CPU for the tune process (head bundle)
- Removed the component worker bundles to avoid exhausting Ray CPUs and to prevent Ray placement group cleanup issues
- Kept the TODO for future per-component resource specification

```python
# Before
ray.tune.PlacementGroupFactory(
    [{"CPU": 0.5}] + [{"CPU": 0.5}] * len(spec.args.components)
)

# After
ray.tune.PlacementGroupFactory(
    [{"CPU": 0.5}]  # Only head bundle for tune process
)
```

**Note**: Using `[{"CPU": 0.5}] + [{"CPU": 0}] * len(spec.args.components)` causes a Ray internal error (`KeyError: 'pop from an empty set'`) during placement group cleanup, because Ray filters out zero-value resources, creating empty bundles that trigger a bug in Ray's resource management.

<details>
<summary>Original prompt</summary>

> *This section details the original issue to be resolved.*
>
> <issue_title>bug: Running Tuner can exhaust Ray CPUs</issue_title>
> <issue_description>
> ### Summary
>
> Currently the `Tuner` reserves 0.5 CPU for the tune process and 0.5 CPU for each Component. Running a tune job with concurrency can easily exhaust the available Ray CPUs (which default to the number of virtual cores).
>
> A workaround is to manually call `ray.init` with a larger CPU count, e.g. `ray.init(num_cpus=100)`, to ensure the cluster contains enough CPUs. For now we could avoid this by instead reserving 0 CPUs per component.
>
> Eventually we should plan to allow users to specify (non-zero) resource requirements on each component, so that we can make proper use of resource allocation in Ray.
>
> ### Platform
>
> Linux 6.6.84.1-microsoft-standard-WSL2 x86_64 GNU/Linux
>
> ### Version
>
> 0.4.0
>
> ### Python version
>
> Python 3.12.11
> </issue_description>
</details>

- Fixes #190

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: toby-coleman <13170610+toby-coleman@users.noreply.github.com>
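The exhaustion described above is simple arithmetic: each trial reserved 0.5 CPU for the head bundle plus 0.5 CPU per component, multiplied by the number of concurrent trials. A minimal sketch of that math, using hypothetical numbers (8 components, 4 concurrent trials, an 8-core machine) that are not taken from the issue:

```python
def reserved_cpus(num_components: int, cpu_per_component: float) -> float:
    """CPUs one trial reserves: 0.5 for the tune head bundle plus
    cpu_per_component for each component bundle."""
    return 0.5 + cpu_per_component * num_components


# Hypothetical cluster/workload sizes for illustration only.
num_components, concurrency, available_cpus = 8, 4, 8

before = reserved_cpus(num_components, 0.5) * concurrency  # old behaviour
after = reserved_cpus(num_components, 0.0) * concurrency   # head bundle only

print(f"before: {before} CPUs reserved")  # 18.0 -> exceeds the 8-CPU pool
print(f"after: {after} CPUs reserved")    # 2.0 -> fits comfortably
```

With the old behaviour the reservation scales with both component count and concurrency, which is why an ordinary machine ran out of Ray CPUs; with only the head bundle it scales with concurrency alone.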
1 parent 49a35bc commit 0730a9a

1 file changed

Lines changed: 2 additions & 2 deletions

File tree

plugboard/tune/tune.py

```diff
@@ -226,9 +226,9 @@ def run(self, spec: ProcessSpec) -> ray.tune.Result | list[ray.tune.Result]:
         trainable_with_resources = ray.tune.with_resources(
             self._build_objective(required_classes, spec),
             ray.tune.PlacementGroupFactory(
-                # Reserve 0.5 CPU for the tune process and 0.5 CPU for each component in the Process
+                # Reserve 0.5 CPU for the tune process and 0 CPU for each component in the Process
                 # TODO: Implement better resource allocation based on Process requirements
-                [{"CPU": 0.5}] + [{"CPU": 0.5}] * len(spec.args.components),
+                [{"CPU": 0.5}],
             ),
         )
```
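The note about the `KeyError` can be illustrated without running Ray. Per the commit message, Ray filters zero-value resources out of each bundle, so a `{"CPU": 0}` bundle becomes an empty dict; the sketch below mimics only that filtering step and is not Ray's actual internal code:

```python
# Why [{"CPU": 0.5}] + [{"CPU": 0}] * n is problematic: dropping
# zero-valued resources from each bundle leaves empty bundles behind.
# This reproduces only the filtering, not Ray's cleanup logic.

bundles = [{"CPU": 0.5}] + [{"CPU": 0}] * 3

filtered = [
    {res: amt for res, amt in bundle.items() if amt > 0}
    for bundle in bundles
]

print(filtered)  # [{'CPU': 0.5}, {}, {}, {}]
empty = [b for b in filtered if not b]
print(len(empty))  # 3 empty bundles left over
```

Those empty bundles are the state that, per the commit message, trips the bug in Ray's placement group cleanup, which is why the fix drops the component bundles entirely rather than setting them to zero.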
