Commit ce15230

feat: add deviceplugin, webhook
This adds full support for scheduling and running the quantum workload. We do this by allowing custom resource types to be added to fluxion, in this case the qpu. These are countable resources, so a node-level device plugin can simply answer "YES" for them. I am choosing this approach over DRA for the time being because I want the user to be able to define fluxion type resources in the resource spec and not have to define a DeviceClass or ResourceClaim, which I still find annoying and clunky. To handle the backend, a webhook adds an envar before pod creation that reads its value from an annotation, and that annotation is added in prebind. This was just tested with a quantum pod and vanilla podgroup and works great! Super cool.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
1 parent f61857a commit ce15230

20 files changed

Lines changed: 1290 additions & 206 deletions

Dockerfile

Lines changed: 7 additions & 7 deletions
@@ -1,11 +1,7 @@
+# Mr. Fluence!
 # Multi-stage build for the fluence scheduler.
-#
 # The scheduler binary cgo-links flux-sched (Fluxion) for resource matching.
-# It does NOT depend on QRMI — quantum job submission is a separate workload
-# (github.com/converged-computing/qrmi-sampler). So this image needs only
-# flux-sched, no Rust/QRMI. Mirrors the .devcontainer build.
 
-# ---------- builder ----------
 FROM fluxrm/flux-core:noble AS builder
 
 USER root
@@ -37,7 +33,9 @@ COPY . .
 RUN CGO_ENABLED=1 \
 CGO_CFLAGS="-I/opt/flux-sched" \
 CGO_LDFLAGS="-L/opt/flux-sched/resource -L/opt/flux-sched/resource/libjobspec -L/opt/flux-sched/resource/reapi/bindings -lresource -ljobspec_conv -lreapi_cli -lflux-idset -lstdc++ -lczmq -ljansson -lhwloc -lboost_system -lflux-hostlist -lboost_graph -lyaml-cpp" \
-go build -ldflags '-w' -o /bin/fluence ./cmd/fluence
+go build -ldflags '-w' -o /bin/fluence ./cmd/fluence && \
+CGO_ENABLED=0 go build -ldflags '-w' -o /bin/fluence-deviceplugin ./cmd/deviceplugin && \
+CGO_ENABLED=0 go build -ldflags '-w' -o /bin/fluence-webhook ./cmd/webhook
 
 FROM fluxrm/flux-core:noble AS runtime
 
@@ -55,4 +53,6 @@ COPY --from=builder /usr/lib/libjobspec_conv.so* /usr/lib/
 RUN ldconfig
 
 COPY --from=builder /bin/fluence /bin/fluence
-ENTRYPOINT ["/bin/fluence"]
+COPY --from=builder /bin/fluence-deviceplugin /bin/fluence-deviceplugin
+COPY --from=builder /bin/fluence-webhook /bin/fluence-webhook
+ENTRYPOINT ["/bin/fluence"]

Makefile

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,13 +18,16 @@ CGO_LDFLAGS = -L$(FLUX_SCHED_ROOT)/resource \
1818
-lflux-hostlist -lboost_graph -lyaml-cpp
1919

2020
.PHONY: build
21-
build: ## Build the fluence scheduler binary (needs flux-sched)
21+
build: ## Build all binaries (scheduler needs flux-sched; helpers are pure Go)
2222
CGO_ENABLED=1 CGO_CFLAGS="$(CGO_CFLAGS)" CGO_LDFLAGS="$(CGO_LDFLAGS)" \
2323
go build -o bin/fluence ./cmd/fluence
24+
CGO_ENABLED=0 go build -o bin/fluence-deviceplugin ./cmd/deviceplugin
25+
CGO_ENABLED=0 go build -o bin/fluence-webhook ./cmd/webhook
2426

2527
.PHONY: test
2628
test: ## Pure-Go unit tests (no flux, no k8s scheduler libs, no cluster)
27-
go test ./pkg/jgf/... ./pkg/cluster/... ./pkg/jobspec/... ./pkg/placement/... ./pkg/quantum/...
29+
go test ./pkg/jgf/... ./pkg/cluster/... ./pkg/jobspec/... ./pkg/placement/... \
30+
./pkg/quantum/... ./pkg/webhook/... ./pkg/deviceplugin/...
2831

2932
.PHONY: test-graph
3033
test-graph: ## Matcher tests (needs flux-sched)

README.md

Lines changed: 162 additions & 59 deletions
@@ -4,119 +4,222 @@
 
 A Kubernetes scheduler plugin that places **pod groups** (and individual pods)
 by matching them against a [Fluxion](https://github.com/flux-framework/flux-sched)
-(flux-sched) resource graph built from the live cluster.
+(flux-sched) resource graph built from the live cluster.
 
 This is an update from [flux-k8s](https://github.com/flux-framework/flux-k8s)
 that uses the native PodGroup and optionally allows for scheduling
-against **quantum resources** modeled in the same graph. I am also improving
-the design by not requiring a sidecar for fluence - the plugin is built as one
-container.
+against arbitrary resources such as **quantum resources** modeled in the same graph.
+I am also improving the design by not requiring a sidecar for fluence, and not
+requiring the `kubernetes-sigs/scheduler-plugins` dependency. We use native Gang
+scheduling provided by Kubernetes.
 
 For quantum resource modeling, we start from the prototype proven out in
-[fluxion-quantum](https://github.com/converged-computing/fluxion-quantum).
-This design is an improvement upon the initial fluence because we drop
-the `kubernetes-sigs/scheduler-plugins` dependency and use Kubernetes
-**native gang scheduling** (the `PodGroup` API, `scheduling.k8s.io/v1alpha2`,
-alpha in 1.35/1.36).
+[fluxion-quantum](https://github.com/converged-computing/fluxion-quantum).
 
 ## How it works
 
+### Gang Scheduling
+
 Gang semantics (all-or-nothing) come from the native `PodGroup` API. Fluence is
 responsible only for **placement**:
 
 1. **Discover** — on startup fluence lists cluster nodes and turns their
 cpu/memory/gpu capacity into a Fluxion JGF resource graph
-(`pkg/cluster` + `pkg/jgf`). Quantum backends from a config file are injected
-as `qpu` vertices under a `qgateway` (`AddQuantum`).
+(`pkg/cluster` + `pkg/jgf`). If a resources config is provided (via
+`FLUENCE_RESOURCES`), its entries (e.g. quantum backends) are injected as
+`qpu`/`qubit` vertices. With no config the graph is classical-only.
 2. **Match** — when the first pod of a group hits `PreFilter`, fluence builds a
-Fluxion jobspec for the whole gang (`pkg/fluence.JobspecForGroup`), asks the
+Fluxion jobspec for the whole gang (`pkg/placement.JobspecForGroup`), asks the
 matcher to allocate (`pkg/graph.FluxionGraph.MatchAllocateSpec`), and parses
-the allocation into node names (`PlacementFromAllocation`).
-3. **Place** — `Filter` then permits each pod only on its allocated node.
-
-For a **quantum** pod (one that requests `quantum.flux-framework.org/qpu`), the
-match allocates a `qpu` vertex instead of cores; the allocated backend name
-(e.g. `ibm_fez`) is what the workload submits to via
-[qrmi-go](https://github.com/converged-computing/qrmi-go) (job mode on the IBM
-open plan — see fluxion-quantum for that story).
-
-```
-nodes (kubectl get nodes) ─┐
-                           ├─► JGF resource graph ─► Fluxion match ─► node + backend placement
-quantum-backends.yaml ─────┘
+the allocation into node and backend names (`PlacementFromAllocation`).
+3. **Place** — `Filter` permits each pod only on its allocated node. (A
+quantum-only pod allocates a `qpu` but no node — the backend is a remote API
+any node can reach — so fluence imposes no node constraint in that case.)
+4. **Hand off** — for a quantum pod, `PreBind` records the allocated backend on
+the pod as the `fluence.flux-framework.org/backend` annotation. The mutating
+webhook (installed with the base) injects a downward-API env so the container
+reads it as `QRMI_BACKEND` with no boilerplate in the manifest.
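The Filter rule in step 3 can be sketched in a few lines of Go. This is a minimal illustration and not the fluence implementation: the `Placement` type and the `nodeAllowed` function are hypothetical names, and the real plugin works with the Kubernetes scheduler framework types rather than plain strings.

```go
package main

import "fmt"

// Placement is a hypothetical stand-in for a parsed Fluxion allocation:
// the node chosen for the pod (empty when only a remote backend was
// allocated) and the allocated backend name, if any.
type Placement struct {
	Node    string
	Backend string
}

// nodeAllowed applies the rule described in step 3: a pod may only run on
// its allocated node, unless the allocation carries no node at all
// (quantum-only), in which case any node is acceptable.
func nodeAllowed(p Placement, candidate string) bool {
	if p.Node == "" {
		return true
	}
	return p.Node == candidate
}

func main() {
	classical := Placement{Node: "kind-worker"}
	quantumOnly := Placement{Backend: "ibm_marrakesh"}

	fmt.Println(nodeAllowed(classical, "kind-worker"))    // true
	fmt.Println(nodeAllowed(classical, "kind-worker2"))   // false
	fmt.Println(nodeAllowed(quantumOnly, "kind-worker2")) // true
}
```

Treating an empty node as "no constraint" keeps the quantum-only case from needing a separate code path in this sketch.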
+
+### Design Choices
+
+While quantum resources are the first target, we should be able to support
+any arbitrary resource in the graph. I decided that a pod can request a graph resource generically,
+e.g., `fluxion.flux-framework.org/<type>` (like `.../qpu: "1"`), and that becomes a jobspec count
+of `<type>`. To support this, we deploy a **device plugin** that can advertise these virtual
+types on every node. We need to do this because of the in-tree `NodeResourcesFit` plugin.
+If we do not have the device plugin, no node advertises the resource and its fit check will not be satisfied. Note that
+this device plugin will return True for any resources it sees added to the Fluxion resource graph,
+but is not actually involved with scheduling. Fluxion does the real matching.
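To make the request-to-jobspec mapping concrete, here is a small Go sketch of how resource names with the `fluxion.flux-framework.org/` prefix could be turned into per-type counts. It rests on assumptions: `graphCounts` is a hypothetical helper, and quantities are simplified to plain integer strings, whereas Kubernetes represents them as `resource.Quantity` values.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resourcePrefix is the generic prefix named in the text above.
const resourcePrefix = "fluxion.flux-framework.org/"

// graphCounts (hypothetical) picks out requests that carry the fluxion
// prefix and returns a map of graph resource type to requested count.
// Other requests, such as cpu or memory, are ignored here.
func graphCounts(requests map[string]string) (map[string]int, error) {
	counts := map[string]int{}
	for name, qty := range requests {
		if !strings.HasPrefix(name, resourcePrefix) {
			continue
		}
		typ := strings.TrimPrefix(name, resourcePrefix)
		if typ == "" {
			return nil, fmt.Errorf("resource %q has no type", name)
		}
		n, err := strconv.Atoi(qty)
		if err != nil {
			return nil, fmt.Errorf("resource %q: %w", name, err)
		}
		if n < 1 {
			return nil, fmt.Errorf("resource %q: count must be positive", name)
		}
		counts[typ] = n
	}
	return counts, nil
}

func main() {
	requests := map[string]string{
		"cpu":                            "2",
		"fluxion.flux-framework.org/qpu": "1",
	}
	counts, err := graphCounts(requests)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(counts) // prints: map[qpu:1]
}
```

Because the type is taken from the resource name itself, a new graph resource would need no new code in this sketch, only a new name under the same prefix.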
+
+```console
+nodes (kubectl get nodes) ──┐
+                            ├─► JGF resource graph ─► Fluxion match ─► node + backend placement
+fluence-resources ConfigMap ┘
 ```
 
+I am also choosing to keep credentials and qrmi interactions on the level of the application.
+I am not comfortable with the design of an operator holding any kind of credential or being
+responsible for managing calls with qrmi in a multi-tenant environment. Finally, since
+there are (and will continue to be) a lot of environment variables that I do not want
+to make the user define, we have a webhook to handle this. The webhook injects an environment
+variable that reads from a pod annotation, and a PreBind call sets that annotation.
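The webhook half of that handoff can be sketched as the JSON Patch a mutating webhook might return. This is an illustration under assumptions, not the code in `pkg/webhook`: `patchOp`, `backendEnv`, and `envPatch` are hypothetical names, and a real webhook would wrap the patch in an AdmissionReview response. The annotation key and the `QRMI_BACKEND` name come from the README text; the `fieldPath` form is the standard downward-API syntax for reading one annotation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// backendAnnotation is the annotation PreBind is described as setting.
const backendAnnotation = "fluence.flux-framework.org/backend"

// patchOp is one RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// backendEnv builds an env entry whose value is resolved from the pod
// annotation through the downward API when the container starts.
func backendEnv() map[string]interface{} {
	return map[string]interface{}{
		"name": "QRMI_BACKEND",
		"valueFrom": map[string]interface{}{
			"fieldRef": map[string]interface{}{
				"fieldPath": fmt.Sprintf("metadata.annotations['%s']", backendAnnotation),
			},
		},
	}
}

// envPatch (hypothetical) returns the patch for one container. If the
// container has no env list yet, the list itself must be created, because
// appending with "/-" requires an existing array.
func envPatch(containerIndex int, hasEnv bool) []patchOp {
	base := fmt.Sprintf("/spec/containers/%d/env", containerIndex)
	if hasEnv {
		return []patchOp{{Op: "add", Path: base + "/-", Value: backendEnv()}}
	}
	return []patchOp{{Op: "add", Path: base, Value: []interface{}{backendEnv()}}}
}

func main() {
	out, err := json.MarshalIndent(envPatch(0, false), "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

The env entry only points at the annotation, so the webhook can run at pod creation, before the scheduler has picked a backend; the value is filled in later when the container starts.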
+
 ## Build
 
-The scheduler binary links flux-sched (the matcher) and, for quantum, QRMI:
+The scheduler binary links flux-sched (the matcher). It does **not** link QRMI —
+quantum job submission lives in a separate workload container
+([qrmi-sampler](https://github.com/converged-computing/qrmi-sampler)), not here.
 
 ```bash
-# If you want to debug inside the .devcontainer, use this one
-make build # needs flux-sched at /opt/flux-sched and QRMI at /usr/local
+# Inside the .devcontainer (flux-sched at /opt/flux-sched):
+# builds bin/fluence (cgo+flux) + bin/fluence-deviceplugin + bin/fluence-webhook
+make build
+make test
 
-# If you want to test outside (and build the docker image, this one)
+# Or build the container image (all three binaries):
 make image
 ```
 
-Pure-Go pieces (graph builder, discovery, jobspec, placement) need neither and
-are covered by:
+## Deploy
+
+Create a development cluster on a Kubernetes release that supports native gang
+scheduling, with the feature gates enabled:
 
 ```bash
-make test
+kind create cluster --image kindest/node:v1.36.1 --config deploy/kind-config.yaml
 ```
 
-## Deploy
+(See [installing kind](https://kind.sigs.k8s.io/docs/user/quick-start#installing-from-release-binaries).)
+The kind config turns on the `GangScheduling` and `GenericWorkload` feature gates
+and the `scheduling.k8s.io/v1alpha2` API group on the apiserver and scheduler. In
+the future these will likely be enabled by default.
 
-Here is how I am creating a development cluster with a release of Kubernetes that will support
-what we need:
+Load the image (built above) into the cluster:
 
 ```bash
-kind create cluster --image kindest/node:v1.36.1 --config deploy/kind-config.yaml
+kind load docker-image ghcr.io/converged-computing/fluence:latest
 ```
 
-And if you [need to install kind](https://kind.sigs.k8s.io/docs/user/quick-start#installing-from-release-binaries).
+### 1. Gang Scheduling
 
+Install the **base** scheduler (this is all you need for classical scheduling —
+no device plugin, no quantum):
 
 ```bash
-# This creates the quantum backends yaml graph
-kubectl create configmap fluence-quantum-backends --from-file=quantum-backends.yaml=config/quantum-backends.yaml -n kube-system
+kubectl apply -f deploy/fluence.yaml
+```
 
-# load docker image
-kind load docker-image ghcr.io/converged-computing/fluence
+This installs the scheduler, its RBAC, and the mutating webhook. Pods opt in with
+`schedulerName: fluence`; a multi-pod gang adds a `scheduling.k8s.io/pod-group`
+label (a single pod is treated as a group of one and needs no label).
 
-kubectl apply -f deploy/fluence.yaml # RBAC + scheduler in kube-system
-kubectl apply -f examples/podgroup.yaml # a gang scheduled by fluence
-```
+## Testing
+
+### 1. Classical (a pod group)
 
-This works by enabling the native gang feature on the cluster (kube-scheduler / API server), meaning
-the `GangScheduling` and `GenericWorkload` feature gates and the `scheduling.k8s.io/v1alpha2` API group.
-In the future these will likely be enabled by default.
+The base install is enough. Schedule a gang:
 
-Pods opt in with `schedulerName: fluence` and a `scheduling.k8s.io/pod-group` label; group size can be set explicitly with
-`fluence.flux-framework.org/group-size`.
+```bash
+kubectl apply -f examples/podgroup.yaml
+kubectl get pods -o wide
+kubectl get events --field-selector reason=Scheduled
+kubectl get podgroups.scheduling.k8s.io
+```
+```console
+NAME       POLICY   WORKLOAD   STATUS      AGE
+training   Gang     <none>     Scheduled   15s
+```
 
-Note that when you are developing / debugging a group deletion can hang because of finalizers. I do:
+And clean up:
 
 ```bash
 kubectl patch podgroup training -n default --type=merge -p '{"metadata":{"finalizers":null}}'
+kubectl delete -f examples/podgroup.yaml
 ```
 
-## Quantum
+### 2. Quantum
 
-We can bring fluence up with quantum resources by pointing `FLUENCE_QUANTUM_CONFIG` at a backends file (see `config/quantum-backends.yaml`).
-Those backends become schedulable `qpu` vertices; a pod requesting `quantum.flux-framework.org/qpu` will be matched to one, and the allocated backend is handed to the workload.
+Quantum needs the resources add-on, which supplies the `fluence-resources`
+ConfigMap (the single source of truth for which backends exist) **and** the
+device plugin that advertises them:
+
+```bash
+kubectl apply -f deploy/fluence-resources.yaml
+# The scheduler reads its resources config at startup, so restart it to pick up
+# the quantum vertices:
+kubectl rollout restart deployment/fluence -n kube-system
+```
+
+Confirm the device plugin advertised the resources on the nodes:
+
+```bash
+kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable}{"\n"}{end}' \
+| grep fluxion.flux-framework.org
+```
+```console
+kind-control-plane {"cpu":"16","ephemeral-storage":"982292956Ki","fluxion.flux-framework.org/qpu":"1k","fluxion.flux-framework.org/qubit":"1k","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"61400748Ki","pods":"110"}
+kind-worker {"cpu":"16","ephemeral-storage":"982292956Ki","fluxion.flux-framework.org/qpu":"1k","fluxion.flux-framework.org/qubit":"1k","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"61400748Ki","pods":"110"}
+kind-worker2 {"cpu":"16","ephemeral-storage":"982292956Ki","fluxion.flux-framework.org/qpu":"1k","fluxion.flux-framework.org/qubit":"1k","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"61400748Ki","pods":"110"}
+```
+
+Create the IBM credentials the **workload** uses to submit (in the namespace
+where the workload runs — the scheduler itself never needs them):
+
+```bash
+# If you don't have this yet
+curl -fsSL https://clis.cloud.ibm.com/install/linux | sudo sh
+ibmcloud login --apikey <key>
+# 12 for us-east
+```
+```bash
+export IBM_CLOUD_TOKEN=<key>
+export IBM_CLOUD_CRN=$(ibmcloud resource service-instances --service-name quantum-computing --output json | jq -r '.[] | {name: .name, crn: .crn}' | jq -r .crn)
+```
+
+```bash
+kubectl create secret generic ibm-quantum -n default --from-literal=token="$IBM_CLOUD_TOKEN" --from-literal=crn="$IBM_CLOUD_CRN"
+```
+
+Run a single quantum pod. It just requests `fluxion.flux-framework.org/qpu` — no
+group, and no hard-coded backend (the webhook + PreBind supply `QRMI_BACKEND`):
+
+```bash
+kubectl apply -f examples/quantum-pod.yaml
+kubectl get pod sampler -o wide
+
+# fluence's chosen backend, recorded as an annotation (the container reads it as QRMI_BACKEND):
+kubectl get pod sampler -o jsonpath='{.metadata.annotations.fluence\.flux-framework\.org/backend}{"\n"}'
+kubectl logs sampler
+```
+```console
+kubectl logs sampler -f
+2026/06/06 19:04:38 submitting sampler job to ibm_marrakesh
197+
{"results": [{"data": {"c": {"samples": ["0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x1", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x1", "0x0", "0x0", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x0", "0x1", "0x0", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0"], "num_bits": 1}}, "metadata": {"circuit_metadata": {}}}], "metadata": {"execution": {"execution_spans": [[{"date": "2026-06-06T19:04:43.221657"}, {"date": 
"2026-06-06T19:04:44.372421"}, {"0": [[256], [0, 1], [0, 256]]}]]}, "version": 2}}
+2026/06/06 19:04:50 done: 2070 bytes from ibm_marrakesh
+```
+Boum!
+
+### A note on deletion
+
+When developing/debugging, a PodGroup (or its pods) can hang on delete because of
+finalizers (the workload controller may not be running). Clear them with:
+
+```bash
+kubectl patch podgroup training -n default --type=merge -p '{"metadata":{"finalizers":null}}'
+```
 
-**under development** I am still thinking about how to make this request. -V
+Importantly, submission is **not** done by the scheduler — the workload container holds the
+user's credentials and submits via qrmi-go (job mode on the IBM open plan; see
+fluxion-quantum for that story). Fluence only schedules and hands off the backend.
+When we actually have control of local quantum devices this will be different.
 
 ## License
 
 HPCIC DevTools is distributed under the terms of the MIT license.
 All new contributions must be made under this license.
 
-See [LICENSE](https://github.com/converged-computing/cloud-select/blob/main/LICENSE),
-[COPYRIGHT](https://github.com/converged-computing/cloud-select/blob/main/COPYRIGHT), and
-[NOTICE](https://github.com/converged-computing/cloud-select/blob/main/NOTICE) for details.
+See [LICENSE](LICENSE), [COPYRIGHT](COPYRIGHT), and [NOTICE](NOTICE) for details.
 
-SPDX-License-Identifier: (MIT)
+SPDX-License-Identifier: MIT
 
-LLNL-CODE- 842614
+LLNL-CODE-842614
