Commit 0730a9a
fix: avoid exhausting Ray CPUs in Tuner by using PlacementGroupFactory with single head bundle (#191)
# Summary
`Tuner` reserved 0.5 CPU for the tune process plus 0.5 CPU per component
via `PlacementGroupFactory`. With concurrent trials this could exhaust
Ray's default CPU pool (the number of virtual cores), forcing workarounds
such as a manual `ray.init(num_cpus=100)`.
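As a rough illustration of the arithmetic behind the exhaustion, the sketch below computes how many trials fit in a CPU pool before and after this change. The helper names, the 8-core pool, and the 5-component count are hypothetical values chosen for the example; they are not taken from the codebase.

```python
def cpus_per_trial(n_components: int, per_component: float = 0.5, head: float = 0.5) -> float:
    """CPUs one trial reserves: a head bundle plus one bundle per component."""
    return head + per_component * n_components


def max_concurrent_trials(total_cpus: float, n_components: int, per_component: float = 0.5) -> int:
    """Number of trials whose reservations fit in the pool at the same time."""
    return int(total_cpus // cpus_per_trial(n_components, per_component))


# Hypothetical setup: 8 virtual cores, 5 components per trial.
total_cpus = 8
n_components = 5

# Before: 0.5 (head) + 5 * 0.5 (components) = 3.0 CPUs per trial.
before = max_concurrent_trials(total_cpus, n_components, per_component=0.5)

# After: only the 0.5 CPU head bundle is reserved, modelled here as 0 per component.
after = max_concurrent_trials(total_cpus, n_components, per_component=0.0)

print(before, after)
```

Under these assumed numbers the per-trial reservation drops from 3.0 CPUs to 0.5 CPU, so the same pool admits far more concurrent trials without raising `num_cpus`.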
# Changes
- Changed `PlacementGroupFactory` from reserving 0.5 CPU per component
to reserving only 0.5 CPU for the tune process (head bundle)
- Removed component worker bundles to avoid exhausting Ray CPUs and to
prevent Ray placement group cleanup issues
- Kept the TODO for future per-component resource specification
```python
# Before
ray.tune.PlacementGroupFactory(
[{"CPU": 0.5}] + [{"CPU": 0.5}] * len(spec.args.components)
)
# After
ray.tune.PlacementGroupFactory(
[{"CPU": 0.5}] # Only head bundle for tune process
)
```
**Note**: Using `[{"CPU": 0.5}] + [{"CPU": 0}] *
len(spec.args.components)` causes a Ray internal error (`KeyError: 'pop
from an empty set'`) during placement group cleanup because Ray filters
out 0-value resources, creating empty bundles that trigger a bug in
Ray's resource management.
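The note above can be pictured with a plain-Python sketch that does not touch Ray at all. The `drop_zero_resources` helper is a hypothetical stand-in for the filtering step described in the note, not Ray's actual code; it only shows why zero-valued bundles end up empty. The final lines show that the quoted message is the one Python raises when `set.pop()` is called on an empty set; where exactly Ray does that is not stated here.

```python
def drop_zero_resources(bundles):
    """Hypothetical stand-in: remove resources whose requested amount is 0."""
    return [{k: v for k, v in bundle.items() if v != 0} for bundle in bundles]


n_components = 3  # illustrative count, not from the codebase
bundles = [{"CPU": 0.5}] + [{"CPU": 0}] * n_components

filtered = drop_zero_resources(bundles)
empty_bundles = [b for b in filtered if not b]
print(filtered)            # the head bundle survives, the rest are empty dicts
print(len(empty_bundles))  # one empty bundle per component

# The error text in the note matches Python's own message for an empty set.
try:
    set().pop()
except KeyError as exc:
    message = exc.args[0]
    print(message)
```

Dropping the component bundles entirely, as this commit does, avoids creating those empty bundles in the first place.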
<!-- START COPILOT ORIGINAL PROMPT -->
<details>
<summary>Original prompt</summary>
>
> ----
>
> *This section details on the original issue you should resolve*
>
> <issue_title>bug: Running Tuner can exhaust Ray CPUs</issue_title>
> <issue_description>### Summary
>
> Currently the `Tuner` reserves 0.5 CPU for the tune process and 0.5
CPU for each Component. Running a tune job with concurrency can easily
exhaust the available Ray CPUs (which defaults to the number of virtual
cores).
>
> A workaround is to manually call `ray.init` with a larger CPU count,
e.g. `ray.init(num_cpus=100)` to ensure the cluster contains enough
CPUs. For now we could avoid having to do this by instead reserving `0`
CPUs per component.
>
> Eventually we should plan to allow users to specify (non-zero)
resource requirements on each component, so that we can make proper use
of resource allocation in Ray.
>
> ### Platform
>
> Linux 6.6.84.1-microsoft-standard-WSL2 x86_64 GNU/Linux
>
> ### Version
>
> 0.4.0
>
> ### Python version
>
> Python 3.12.11</issue_description>
>
> ## Comments on the Issue (you are @copilot in this section)
>
> <comments>
> </comments>
>
</details>
<!-- START COPILOT CODING AGENT SUFFIX -->
- Fixes #190
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: toby-coleman <13170610+toby-coleman@users.noreply.github.com>
1 parent 49a35bc commit 0730a9a
1 file changed
Lines changed: 2 additions & 2 deletions