|
4 | 4 |
|
5 | 5 | A Kubernetes scheduler plugin that places **pod groups** (and individual pods) |
6 | 6 | by matching them against a [Fluxion](https://github.com/flux-framework/flux-sched) |
7 | | -(flux-sched) resource graph built from the live cluster. |
| 7 | +(flux-sched) resource graph built from the live cluster. |
8 | 8 |
|
9 | 9 | This is an update from [flux-k8s](https://github.com/flux-framework/flux-k8s) |
10 | 10 | that uses the native PodGroup and optionally allows for scheduling |
11 | | -against **quantum resources** modeled in the same graph. I am also improving |
12 | | -the design by not requiring a sidecar for fluence - the plugin is built as one |
13 | | -container. |
| 11 | +against arbitrary resources such as **quantum resources** modeled in the same graph. |
| 12 | +I am also improving the design by not requiring a sidecar for fluence, and not |
| 13 | +requiring the `kubernetes-sigs/scheduler-plugins` dependency. We use native Gang |
| 14 | +scheduling provided by Kubernetes. |
14 | 15 |
|
15 | 16 | For quantum resource modeling, we start from the prototype proven out in |
16 | | -[fluxion-quantum](https://github.com/converged-computing/fluxion-quantum). |
17 | | -This design is an improvement upon the initial fluence because we drop |
18 | | -the `kubernetes-sigs/scheduler-plugins` dependency and use Kubernetes |
19 | | -**native gang scheduling** (the `PodGroup` API, `scheduling.k8s.io/v1alpha2`, |
20 | | -alpha in 1.35/1.36). |
| 17 | +[fluxion-quantum](https://github.com/converged-computing/fluxion-quantum). |
21 | 18 |
|
22 | 19 | ## How it works |
23 | 20 |
|
| 21 | +### Gang Scheduling |
| 22 | + |
24 | 23 | Gang semantics (all-or-nothing) come from the native `PodGroup` API. Fluence is |
25 | 24 | responsible only for **placement**: |
26 | 25 |
|
27 | 26 | 1. **Discover** — on startup fluence lists cluster nodes and turns their |
28 | 27 | cpu/memory/gpu capacity into a Fluxion JGF resource graph |
29 | | - (`pkg/cluster` + `pkg/jgf`). Quantum backends from a config file are injected |
30 | | - as `qpu` vertices under a `qgateway` (`AddQuantum`). |
| 28 | + (`pkg/cluster` + `pkg/jgf`). If a resources config is provided (via |
| 29 | + `FLUENCE_RESOURCES`), its entries (e.g. quantum backends) are injected as |
| 30 | + `qpu`/`qubit` vertices. With no config the graph is classical-only. |
31 | 31 | 2. **Match** — when the first pod of a group hits `PreFilter`, fluence builds a |
32 | | - Fluxion jobspec for the whole gang (`pkg/fluence.JobspecForGroup`), asks the |
| 32 | + Fluxion jobspec for the whole gang (`pkg/placement.JobspecForGroup`), asks the |
33 | 33 | matcher to allocate (`pkg/graph.FluxionGraph.MatchAllocateSpec`), and parses |
34 | | - the allocation into node names (`PlacementFromAllocation`). |
35 | | -3. **Place** — `Filter` then permits each pod only on its allocated node. |
36 | | - |
37 | | -For a **quantum** pod (one that requests `quantum.flux-framework.org/qpu`), the |
38 | | -match allocates a `qpu` vertex instead of cores; the allocated backend name |
39 | | -(e.g. `ibm_fez`) is what the workload submits to via |
40 | | -[qrmi-go](https://github.com/converged-computing/qrmi-go) (job mode on the IBM |
41 | | -open plan — see fluxion-quantum for that story). |
42 | | - |
43 | | -``` |
44 | | -nodes (kubectl get nodes) ─┐ |
45 | | - ├─► JGF resource graph ─► Fluxion match ─► node + backend placement |
46 | | -quantum-backends.yaml ─────┘ |
| 34 | + the allocation into node and backend names (`PlacementFromAllocation`). |
| 35 | +3. **Place** — `Filter` permits each pod only on its allocated node. (A |
| 36 | + quantum-only pod allocates a `qpu` but no node — the backend is a remote API |
| 37 | + any node can reach — so fluence imposes no node constraint in that case.) |
| 38 | +4. **Hand off** — for a quantum pod, `PreBind` records the allocated backend on |
| 39 | + the pod as the `fluence.flux-framework.org/backend` annotation. The mutating |
| 40 | + webhook (installed with the base) injects a downward-API env so the container |
| 41 | + reads it as `QRMI_BACKEND` with no boilerplate in the manifest. |
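
The hand-off in step 4 relies on the Kubernetes downward API, which can expose a pod annotation to a container as an environment variable. The fragment below is a minimal sketch of the kind of env entry the webhook could inject. Only the annotation key and the `QRMI_BACKEND` name come from the description above; the exact fragment fluence emits is an assumption.

```yaml
# Sketch only: an env entry of the shape a mutating webhook could add to each
# container. The real patch produced by fluence may differ.
env:
  - name: QRMI_BACKEND
    valueFrom:
      fieldRef:
        # Downward API reference to a single pod annotation.
        fieldPath: metadata.annotations['fluence.flux-framework.org/backend']
```

Because the kubelet resolves `fieldRef` values when it starts the container, an annotation written during `PreBind` (before the pod is bound to a node) should already be present by the time the variable is read.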
| 42 | + |
| 43 | +### Design Choices |
| 44 | + |
| 45 | +While quantum resources are the first target, we should be able to support |
| 46 | +any arbitrary resource in the graph. I decided that a pod can request a graph resource generically, |
| 47 | +e.g., `fluxion.flux-framework.org/<type>` (like `.../qpu: "1"`), and that becomes a jobspec count |
| 48 | +of `<type>`. To support this, we deploy a **device plugin** that advertises these virtual |
| 49 | +types on every node. We need this because of the in-tree `NodeResourcesFit` plugin: |
| 50 | +if no node advertises the requested resource, that filter rejects every node. Note that |
| 51 | +the device plugin advertises any resource type it sees added to the Fluxion resource graph, |
| 52 | +but is not otherwise involved with scheduling. Fluxion does the real matching. |
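
To make the generic request concrete, here is a hypothetical pod manifest that asks for one `qpu` through the `fluxion.flux-framework.org/<type>` convention. It is a sketch, not the repository's `examples/quantum-pod.yaml`: the pod name, image, and command are placeholders chosen for illustration.

```yaml
# Hypothetical manifest for illustration; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: qpu-request-sketch
spec:
  schedulerName: fluence          # opt in to the fluence scheduler
  restartPolicy: Never
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "echo requested one qpu"]
      resources:
        limits:
          # Extended resources are requested as whole numbers under limits;
          # per the text above, this becomes a jobspec count of one `qpu`.
          fluxion.flux-framework.org/qpu: "1"
```

Kubernetes treats the name as an extended resource, so it only passes the built-in fit check on nodes where the device plugin has advertised that resource.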
| 53 | + |
| 54 | +```console |
| 55 | +nodes (kubectl get nodes) ──┐ |
| 56 | + ├─► JGF resource graph ─► Fluxion match ─► node + backend placement |
| 57 | +fluence-resources ConfigMap ┘ |
47 | 58 | ``` |
48 | 59 |
|
| 60 | +I am also choosing to keep credentials and qrmi interactions on the level of the application. |
| 61 | +I am not comfortable with the design of an operator holding any kind of credential or being |
| 62 | +responsible for managing calls with qrmi in a multi-tenant environment. Finally, since |
| 63 | +there are (and will continue to be) a lot of environment variables that I do not want |
| 64 | +to ask the user to define, we have a webhook to handle this. The webhook injects an environment |
| 65 | +variable that reads a pod annotation, and a `PreBind` call sets that annotation to the allocated backend. |
| 66 | + |
49 | 67 | ## Build |
50 | 68 |
|
51 | | -The scheduler binary links flux-sched (the matcher) and, for quantum, QRMI: |
| 69 | +The scheduler binary links flux-sched (the matcher). It does **not** link QRMI — |
| 70 | +quantum job submission lives in a separate workload container |
| 71 | +([qrmi-sampler](https://github.com/converged-computing/qrmi-sampler)), not here. |
52 | 72 |
|
53 | 73 | ```bash |
54 | | -# If you want to debug inside the .devcontainer, use this one |
55 | | -make build # needs flux-sched at /opt/flux-sched and QRMI at /usr/local |
| 74 | +# Inside the .devcontainer (flux-sched at /opt/flux-sched): |
| 75 | +# builds bin/fluence (cgo+flux) + bin/fluence-deviceplugin + bin/fluence-webhook |
| 76 | +make build |
| 77 | +make test |
56 | 78 |
|
57 | | -# If you want to test outside (and build the docker image, this one) |
| 79 | +# Or build the container image (all three binaries): |
58 | 80 | make image |
59 | 81 | ``` |
60 | 82 |
|
61 | | -Pure-Go pieces (graph builder, discovery, jobspec, placement) need neither and |
62 | | -are covered by: |
| 83 | +## Deploy |
| 84 | + |
| 85 | +Create a development cluster on a Kubernetes release that supports native gang |
| 86 | +scheduling, with the feature gates enabled: |
63 | 87 |
|
64 | 88 | ```bash |
65 | | -make test |
| 89 | +kind create cluster --image kindest/node:v1.36.1 --config deploy/kind-config.yaml |
66 | 90 | ``` |
67 | 91 |
|
68 | | -## Deploy |
| 92 | +(See [installing kind](https://kind.sigs.k8s.io/docs/user/quick-start#installing-from-release-binaries).) |
| 93 | +The kind config turns on the `GangScheduling` and `GenericWorkload` feature gates |
| 94 | +and the `scheduling.k8s.io/v1alpha2` API group on the apiserver and scheduler. In |
| 95 | +the future these will likely be enabled by default. |
69 | 96 |
|
70 | | -Here is how I am creating a development cluster with a release of Kubernetes that will support |
71 | | -what we need: |
| 97 | +Load the image (built above) into the cluster: |
72 | 98 |
|
73 | 99 | ```bash |
74 | | -kind create cluster --image kindest/node:v1.36.1 --config deploy/kind-config.yaml |
| 100 | +kind load docker-image ghcr.io/converged-computing/fluence:latest |
75 | 101 | ``` |
76 | 102 |
|
77 | | -And if you [need to install kind](https://kind.sigs.k8s.io/docs/user/quick-start#installing-from-release-binaries). |
| 103 | +### 1. Gang Scheduling |
78 | 104 |
|
| 105 | +Install the **base** scheduler (this is all you need for classical scheduling — |
| 106 | +no device plugin, no quantum): |
79 | 107 |
|
80 | 108 | ```bash |
81 | | -# This creates the quantum backends yaml graph |
82 | | -kubectl create configmap fluence-quantum-backends --from-file=quantum-backends.yaml=config/quantum-backends.yaml -n kube-system |
| 109 | +kubectl apply -f deploy/fluence.yaml |
| 110 | +``` |
83 | 111 |
|
84 | | -# load docker image |
85 | | -kind load docker-image ghcr.io/converged-computing/fluence |
| 112 | +This installs the scheduler, its RBAC, and the mutating webhook. Pods opt in with |
| 113 | +`schedulerName: fluence`; a multi-pod gang adds a `scheduling.k8s.io/pod-group` |
| 114 | +label (a single pod is treated as a group of one and needs no label). |
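
As a rough illustration of the opt-in, the fragment below shows one member of a gang. It assumes the label value is the name of the PodGroup (`training` is borrowed from the example used later); the pod name, image, and command are placeholders, and the repository's `examples/podgroup.yaml` remains the authoritative version.

```yaml
# Sketch of a single gang member; values other than schedulerName and the
# label key are assumptions made for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: training-0
  labels:
    scheduling.k8s.io/pod-group: training   # assumed: the PodGroup name
spec:
  schedulerName: fluence
  restartPolicy: Never
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sleep", "3600"]
```

A standalone pod would drop the label and keep only `schedulerName: fluence`.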
86 | 115 |
|
87 | | -kubectl apply -f deploy/fluence.yaml # RBAC + scheduler in kube-system |
88 | | -kubectl apply -f examples/podgroup.yaml # a gang scheduled by fluence |
89 | | -``` |
| 116 | +## Testing |
| 117 | + |
| 118 | +### 1. Classical (a pod group) |
90 | 119 |
|
91 | | -This works by enabling the native gang feature on the cluster (kube-scheduler / API server), meaning |
92 | | -the `GangScheduling` and `GenericWorkload` feature gates and the `scheduling.k8s.io/v1alpha2` API group. |
93 | | -In the future these will likely be enabled by default. |
| 120 | +The base install is enough. Schedule a gang: |
94 | 121 |
|
95 | | -Pods opt in with `schedulerName: fluence` and a `scheduling.k8s.io/pod-group` label; group size can be set explicitly with |
96 | | -`fluence.flux-framework.org/group-size`. |
| 122 | +```bash |
| 123 | +kubectl apply -f examples/podgroup.yaml |
| 124 | +kubectl get pods -o wide |
| 125 | +kubectl get events --field-selector reason=Scheduled |
| 126 | +kubectl get podgroups.scheduling.k8s.io |
| 127 | +``` |
| 128 | +```console |
| 129 | +NAME POLICY WORKLOAD STATUS AGE |
| 130 | +training Gang <none> Scheduled 15s |
| 131 | +``` |
97 | 132 |
|
98 | | -Note that when you are developing / debugging a group deletion can hang because of finalizers. I do: |
| 133 | +Then clean up: |
99 | 134 |
|
100 | 135 | ```bash |
101 | 136 | kubectl patch podgroup training -n default --type=merge -p '{"metadata":{"finalizers":null}}' |
| 137 | +kubectl delete -f examples/podgroup.yaml |
102 | 138 | ``` |
103 | 139 |
|
104 | | -## Quantum |
| 140 | +### 2. Quantum |
105 | 141 |
|
106 | | -We can bing fluence up with quantum resources by pointing `FLUENCE_QUANTUM_CONFIG` at a backends file (see `config/quantum-backends.yaml`). |
107 | | -Those backends become schedulable `qpu` vertices; a pod requesting `quantum.flux-framework.org/qpu` will be matched to one, and the allocated backend is handed to the workload. |
| 142 | +Quantum needs the resources add-on, which supplies the `fluence-resources` |
| 143 | +ConfigMap (the single source of truth for which backends exist) **and** the |
| 144 | +device plugin that advertises them: |
| 145 | + |
| 146 | +```bash |
| 147 | +kubectl apply -f deploy/fluence-resources.yaml |
| 148 | +# The scheduler reads its resources config at startup, so restart it to pick up |
| 149 | +# the quantum vertices: |
| 150 | +kubectl rollout restart deployment/fluence -n kube-system |
| 151 | +``` |
| 152 | + |
| 153 | +Confirm the device plugin advertised the resources on the nodes: |
| 154 | + |
| 155 | +```bash |
| 156 | +kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable}{"\n"}{end}' \ |
| 157 | + | grep fluxion.flux-framework.org |
| 158 | +``` |
| 159 | +```console |
| 160 | +kind-control-plane {"cpu":"16","ephemeral-storage":"982292956Ki","fluxion.flux-framework.org/qpu":"1k","fluxion.flux-framework.org/qubit":"1k","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"61400748Ki","pods":"110"} |
| 161 | +kind-worker {"cpu":"16","ephemeral-storage":"982292956Ki","fluxion.flux-framework.org/qpu":"1k","fluxion.flux-framework.org/qubit":"1k","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"61400748Ki","pods":"110"} |
| 162 | +kind-worker2 {"cpu":"16","ephemeral-storage":"982292956Ki","fluxion.flux-framework.org/qpu":"1k","fluxion.flux-framework.org/qubit":"1k","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"61400748Ki","pods":"110"} |
| 163 | +``` |
| 164 | + |
| 165 | +Create the IBM credentials the **workload** uses to submit (in the namespace |
| 166 | +where the workload runs — the scheduler itself never needs them): |
| 167 | + |
| 168 | +```bash |
| 169 | +# If you don't have this yet |
| 170 | +curl -fsSL https://clis.cloud.ibm.com/install/linux | sudo sh |
| 171 | +ibmcloud login --apikey <key> |
| 172 | +# When prompted for a region, enter 12 for us-east |
| 173 | +``` |
| 174 | +```bash |
| 175 | +export IBM_CLOUD_TOKEN=<key> |
| 176 | +export IBM_CLOUD_CRN=$(ibmcloud resource service-instances --service-name quantum-computing --output json | jq -r '.[].crn') |
| 177 | +``` |
| 178 | + |
| 179 | +```bash |
| 180 | +kubectl create secret generic ibm-quantum -n default --from-literal=token="$IBM_CLOUD_TOKEN" --from-literal=crn="$IBM_CLOUD_CRN" |
| 181 | +``` |
| 182 | + |
| 183 | +Run a single quantum pod. It just requests `fluxion.flux-framework.org/qpu` — no |
| 184 | +group, and no hard-coded backend (the webhook + PreBind supply `QRMI_BACKEND`): |
| 185 | + |
| 186 | +```bash |
| 187 | +kubectl apply -f examples/quantum-pod.yaml |
| 188 | +kubectl get pod sampler -o wide |
| 189 | + |
| 190 | +# fluence's chosen backend, recorded as an annotation (the container sees it as QRMI_BACKEND): |
| 191 | +kubectl get pod sampler -o jsonpath='{.metadata.annotations.fluence\.flux-framework\.org/backend}{"\n"}' |
| 192 | +kubectl logs sampler |
| 193 | +``` |
| 194 | +```console |
| 195 | +kubectl logs sampler -f |
| 196 | +2026/06/06 19:04:38 submitting sampler job to ibm_marrakesh |
| 197 | +{"results": [{"data": {"c": {"samples": ["0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x1", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x1", "0x0", "0x0", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x0", "0x1", "0x0", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0"], "num_bits": 1}}, "metadata": {"circuit_metadata": {}}}], "metadata": {"execution": {"execution_spans": [[{"date": "2026-06-06T19:04:43.221657"}, {"date": 
"2026-06-06T19:04:44.372421"}, {"0": [[256], [0, 1], [0, 256]]}]]}, "version": 2}} |
| 198 | +2026/06/06 19:04:50 done: 2070 bytes from ibm_marrakesh |
| 199 | +``` |
| 200 | +Boum! |
| 201 | + |
| 202 | +### A note on deletion |
| 203 | + |
| 204 | +When developing/debugging, a PodGroup (or its pods) can hang on delete because of |
| 205 | +finalizers (the workload controller may not be running). Clear them with: |
| 206 | + |
| 207 | +```bash |
| 208 | +kubectl patch podgroup training -n default --type=merge -p '{"metadata":{"finalizers":null}}' |
| 209 | +``` |
108 | 210 |
|
109 | | -**under development** I am still thinking about how to make this request. -V |
| 211 | +Importantly, submission is **not** done by the scheduler — the workload container holds the |
| 212 | +user's credentials and submits via qrmi-go (job mode on the IBM open plan; see |
| 213 | +fluxion-quantum for that story). Fluence only schedules and hands off the backend. |
| 214 | +When we actually have control of local quantum devices this will be different. |
110 | 215 |
|
111 | 216 | ## License |
112 | 217 |
|
113 | 218 | HPCIC DevTools is distributed under the terms of the MIT license. |
114 | 219 | All new contributions must be made under this license. |
115 | 220 |
|
116 | | -See [LICENSE](https://github.com/converged-computing/cloud-select/blob/main/LICENSE), |
117 | | -[COPYRIGHT](https://github.com/converged-computing/cloud-select/blob/main/COPYRIGHT), and |
118 | | -[NOTICE](https://github.com/converged-computing/cloud-select/blob/main/NOTICE) for details. |
| 221 | +See [LICENSE](LICENSE), [COPYRIGHT](COPYRIGHT), and [NOTICE](NOTICE) for details. |
119 | 222 |
|
120 | | -SPDX-License-Identifier: (MIT) |
| 223 | +SPDX-License-Identifier: MIT |
121 | 224 |
|
122 | | -LLNL-CODE- 842614 |
| 225 | +LLNL-CODE-842614 |