Commit 083212d

RHOAIENG-50554: Make kueue optional for RayJob
1 parent fa52c07 commit 083212d

3 files changed

Lines changed: 174 additions & 167 deletions

Lines changed: 146 additions & 143 deletions
@@ -1,146 +1,149 @@
 {
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "9259e514",
-   "metadata": {},
-   "source": [
-    "# Submitting a RayJob CR\n",
-    "\n",
-    "In this notebook, we will go through the basics of using the SDK to:\n",
-    " * Define a RayCluster configuration\n",
-    " * Use this configuration alongside a RayJob definition\n",
-    " * Submit the RayJob, and allow Kuberay Operator to lifecycle the RayCluster for the RayJob"
-   ]
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "id": "9259e514",
+      "metadata": {},
+      "source": [
+        "# Submitting a RayJob CR\n",
+        "\n",
+        "In this notebook, we will go through the basics of using the SDK to:\n",
+        " * Define a RayCluster configuration\n",
+        " * Use this configuration alongside a RayJob definition\n",
+        " * Submit the RayJob, and allow Kuberay Operator to lifecycle the RayCluster for the RayJob"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "18136ea7",
+      "metadata": {},
+      "source": [
+        "## Defining and Submitting the RayJob\n",
+        "First, we'll need to import the relevant CodeFlare SDK packages. You can do this by executing the below cell."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "51e18292",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from codeflare_sdk import RayJob, ManagedClusterConfig"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "649c5911",
+      "metadata": {},
+      "source": [
+        "Run the below `oc login` command using your Token and Server URL. Ensure the command is prepended by `!` and not `%`. This will work when running both locally and within RHOAI."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "dc364888",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "!oc login --token=<your-token> --server=<your-server-url>"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "5581eca9",
+      "metadata": {},
+      "source": [
+        "Next we'll need to define the ManagedClusterConfig. Kuberay will use this to spin up a short-lived RayCluster that will only exist as long as the job"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "3094c60a",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "cluster_config = ManagedClusterConfig(\n",
+        " head_memory_requests=6,\n",
+        " head_memory_limits=8,\n",
+        " num_workers=2,\n",
+        " worker_cpu_requests=1,\n",
+        " worker_cpu_limits=1,\n",
+        " worker_memory_requests=4,\n",
+        " worker_memory_limits=6,\n",
+        " head_accelerators={'nvidia.com/gpu': 0},\n",
+        " worker_accelerators={'nvidia.com/gpu': 0},\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "02a2b32b",
+      "metadata": {},
+      "source": [
+        "Lastly we can pass the ManagedClusterConfig into the RayJob and submit it. You do not need to worry about tearing down the cluster when the job has completed, that is handled for you!"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "e905ccea",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "job = RayJob(\n",
+        " job_name=\"demo-rayjob\",\n",
+        " entrypoint=\"python -c 'print(\\\"Hello from RayJob!\\\")'\",\n",
+        " cluster_config=cluster_config,\n",
+        " namespace=\"your-namespace\",\n",
+        " # local_queue is optional. If omitted, the SDK will auto-detect a default\n",
+        " # Kueue LocalQueue. If Kueue is not installed, the job runs without it.\n",
+        " # local_queue=\"my-queue\",\n",
+        ")\n",
+        "\n",
+        "job.submit()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "f3612de2",
+      "metadata": {},
+      "source": [
+        "We can check the status of our job by executing the below cell. The status may appear as `unknown` for a time while the RayCluster spins up."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "96d92f93",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "job.status()"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "base",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.12.7"
+    }
   },
-  {
-   "cell_type": "markdown",
-   "id": "18136ea7",
-   "metadata": {},
-   "source": [
-    "## Defining and Submitting the RayJob\n",
-    "First, we'll need to import the relevant CodeFlare SDK packages. You can do this by executing the below cell."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "51e18292",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from codeflare_sdk import RayJob, ManagedClusterConfig"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "649c5911",
-   "metadata": {},
-   "source": [
-    "Run the below `oc login` command using your Token and Server URL. Ensure the command is prepended by `!` and not `%`. This will work when running both locally and within RHOAI."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "dc364888",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!oc login --token=<your-token> --server=<your-server-url>"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "5581eca9",
-   "metadata": {},
-   "source": [
-    "Next we'll need to define the ManagedClusterConfig. Kuberay will use this to spin up a short-lived RayCluster that will only exist as long as the job"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "3094c60a",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "cluster_config = ManagedClusterConfig(\n",
-    " head_memory_requests=6,\n",
-    " head_memory_limits=8,\n",
-    " num_workers=2,\n",
-    " worker_cpu_requests=1,\n",
-    " worker_cpu_limits=1,\n",
-    " worker_memory_requests=4,\n",
-    " worker_memory_limits=6,\n",
-    " head_accelerators={'nvidia.com/gpu': 0},\n",
-    " worker_accelerators={'nvidia.com/gpu': 0},\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "02a2b32b",
-   "metadata": {},
-   "source": [
-    "Lastly we can pass the ManagedClusterConfig into the RayJob and submit it. You do not need to worry about tearing down the cluster when the job has completed, that is handled for you!"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "e905ccea",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "job = RayJob(\n",
-    " job_name=\"demo-rayjob\",\n",
-    " entrypoint=\"python -c 'print(\\\"Hello from RayJob!\\\")'\",\n",
-    " cluster_config=cluster_config,\n",
-    " namespace=\"your-namespace\"\n",
-    ")\n",
-    "\n",
-    "job.submit()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "f3612de2",
-   "metadata": {},
-   "source": [
-    "We can check the status of our job by executing the below cell. The status may appear as `unknown` for a time while the RayCluster spins up."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "96d92f93",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "job.status()"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.11.11"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
+  "nbformat": 4,
+  "nbformat_minor": 5
 }

src/codeflare_sdk/ray/rayjobs/rayjob.py

Lines changed: 13 additions & 13 deletions
@@ -267,31 +267,31 @@ def _build_rayjob_cr(self) -> Dict[str, Any]:
             if self.local_queue:
                 labels["kueue.x-k8s.io/queue-name"] = self.local_queue
             else:
-                # Auto-detect default queue for new clusters
+                # Auto-detect default queue for new clusters.
+                # If no default queue is found (e.g. Kueue not installed),
+                # skip the label entirely so the job can run without Kueue.
+                # This matches the interactive Cluster behavior in build_ray_cluster.py.
                 default_queue = get_default_kueue_name(self.namespace)
                 if default_queue:
                     labels["kueue.x-k8s.io/queue-name"] = default_queue
                 else:
-                    # No default queue found, use "default" as fallback
-                    labels["kueue.x-k8s.io/queue-name"] = "default"
-                    logger.warning(
+                    logger.info(
                         f"No default Kueue LocalQueue found in namespace '{self.namespace}'. "
-                        f"Using 'default' as the queue name. If a LocalQueue named 'default' "
-                        f"does not exist, the RayJob submission will fail. "
-                        f"To fix this, please explicitly specify the 'local_queue' parameter."
+                        f"Submitting RayJob without Kueue queue management. "
+                        f"To use Kueue, specify the 'local_queue' parameter or "
+                        f"annotate a LocalQueue with 'kueue.x-k8s.io/default-queue: true'."
                     )
 
             if self.priority_class:
                 labels["kueue.x-k8s.io/priority-class"] = self.priority_class
 
-            # Apply labels to metadata
+            # Apply labels to metadata.
+            # We intentionally do NOT set suspend=true here. Kueue's mutating
+            # webhook will set it automatically when it sees the queue label.
+            # This way, if Kueue isn't installed, the label is harmless metadata
+            # and the job runs immediately without hanging.
             if labels:
                 rayjob_cr["metadata"]["labels"] = labels
-
-                # When using Kueue with lifecycled clusters, start with suspend=true
-                # Kueue will unsuspend the job once the workload is admitted
-                if labels.get("kueue.x-k8s.io/queue-name"):
-                    rayjob_cr["spec"]["suspend"] = True
         else:
             if self.local_queue or self.priority_class:
                 logger.warning(