Merged
14 changes: 7 additions & 7 deletions Dockerfile
@@ -1,11 +1,7 @@
# Mr. Fluence!
# Multi-stage build for the fluence scheduler.
#
# The scheduler binary cgo-links flux-sched (Fluxion) for resource matching.
# It does NOT depend on QRMI — quantum job submission is a separate workload
# (github.com/converged-computing/qrmi-sampler). So this image needs only
# flux-sched, no Rust/QRMI. Mirrors the .devcontainer build.

# ---------- builder ----------
FROM fluxrm/flux-core:noble AS builder

USER root
@@ -37,7 +33,9 @@ COPY . .
RUN CGO_ENABLED=1 \
CGO_CFLAGS="-I/opt/flux-sched" \
CGO_LDFLAGS="-L/opt/flux-sched/resource -L/opt/flux-sched/resource/libjobspec -L/opt/flux-sched/resource/reapi/bindings -lresource -ljobspec_conv -lreapi_cli -lflux-idset -lstdc++ -lczmq -ljansson -lhwloc -lboost_system -lflux-hostlist -lboost_graph -lyaml-cpp" \
go build -ldflags '-w' -o /bin/fluence ./cmd/fluence
go build -ldflags '-w' -o /bin/fluence ./cmd/fluence && \
CGO_ENABLED=0 go build -ldflags '-w' -o /bin/fluence-deviceplugin ./cmd/deviceplugin && \
CGO_ENABLED=0 go build -ldflags '-w' -o /bin/fluence-webhook ./cmd/webhook

FROM fluxrm/flux-core:noble AS runtime

@@ -55,4 +53,6 @@ COPY --from=builder /usr/lib/libjobspec_conv.so* /usr/lib/
RUN ldconfig

COPY --from=builder /bin/fluence /bin/fluence
ENTRYPOINT ["/bin/fluence"]
COPY --from=builder /bin/fluence-deviceplugin /bin/fluence-deviceplugin
COPY --from=builder /bin/fluence-webhook /bin/fluence-webhook
ENTRYPOINT ["/bin/fluence"]
7 changes: 5 additions & 2 deletions Makefile
@@ -18,13 +18,16 @@ CGO_LDFLAGS = -L$(FLUX_SCHED_ROOT)/resource \
-lflux-hostlist -lboost_graph -lyaml-cpp

.PHONY: build
build: ## Build the fluence scheduler binary (needs flux-sched)
build: ## Build all binaries (scheduler needs flux-sched; helpers are pure Go)
CGO_ENABLED=1 CGO_CFLAGS="$(CGO_CFLAGS)" CGO_LDFLAGS="$(CGO_LDFLAGS)" \
go build -o bin/fluence ./cmd/fluence
CGO_ENABLED=0 go build -o bin/fluence-deviceplugin ./cmd/deviceplugin
CGO_ENABLED=0 go build -o bin/fluence-webhook ./cmd/webhook

.PHONY: test
test: ## Pure-Go unit tests (no flux, no k8s scheduler libs, no cluster)
go test ./pkg/jgf/... ./pkg/cluster/... ./pkg/jobspec/... ./pkg/placement/... ./pkg/quantum/...
go test ./pkg/jgf/... ./pkg/cluster/... ./pkg/jobspec/... ./pkg/placement/... \
./pkg/quantum/... ./pkg/webhook/... ./pkg/deviceplugin/...

.PHONY: test-graph
test-graph: ## Matcher tests (needs flux-sched)
221 changes: 162 additions & 59 deletions README.md
@@ -4,119 +4,222 @@

A Kubernetes scheduler plugin that places **pod groups** (and individual pods)
by matching them against a [Fluxion](https://github.com/flux-framework/flux-sched)
(flux-sched) resource graph built from the live cluster.
(flux-sched) resource graph built from the live cluster.

This is an update from [flux-k8s](https://github.com/flux-framework/flux-k8s)
that uses the native PodGroup and optionally allows for scheduling
against **quantum resources** modeled in the same graph. I am also improving
the design by not requiring a sidecar for fluence - the plugin is built as one
container.
against arbitrary resources such as **quantum resources** modeled in the same graph.
I am also improving the design by requiring neither a sidecar for fluence nor
the `kubernetes-sigs/scheduler-plugins` dependency. We use the native gang
scheduling provided by Kubernetes.

For quantum resource modeling, we start from the prototype proven out in
[fluxion-quantum](https://github.com/converged-computing/fluxion-quantum).
This design is an improvement upon the initial fluence because we drop
the `kubernetes-sigs/scheduler-plugins` dependency and use Kubernetes
**native gang scheduling** (the `PodGroup` API, `scheduling.k8s.io/v1alpha2`,
alpha in 1.35/1.36).
[fluxion-quantum](https://github.com/converged-computing/fluxion-quantum).

## How it works

### Gang Scheduling

Gang semantics (all-or-nothing) come from the native `PodGroup` API. Fluence is
responsible only for **placement**:

1. **Discover** — on startup fluence lists cluster nodes and turns their
cpu/memory/gpu capacity into a Fluxion JGF resource graph
(`pkg/cluster` + `pkg/jgf`). Quantum backends from a config file are injected
as `qpu` vertices under a `qgateway` (`AddQuantum`).
(`pkg/cluster` + `pkg/jgf`). If a resources config is provided (via
`FLUENCE_RESOURCES`), its entries (e.g. quantum backends) are injected as
`qpu`/`qubit` vertices. With no config the graph is classical-only.
2. **Match** — when the first pod of a group hits `PreFilter`, fluence builds a
Fluxion jobspec for the whole gang (`pkg/fluence.JobspecForGroup`), asks the
Fluxion jobspec for the whole gang (`pkg/placement.JobspecForGroup`), asks the
matcher to allocate (`pkg/graph.FluxionGraph.MatchAllocateSpec`), and parses
the allocation into node names (`PlacementFromAllocation`).
3. **Place** — `Filter` then permits each pod only on its allocated node.
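
The **Place** step can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the `Placement` type and `Permits` function are hypothetical names (the real types in `pkg/placement` may look different), and the node and backend names are placeholders. It shows a pod being permitted only on its allocated node, with no node constraint when the allocation names no node (as for a quantum-only pod).

```go
package main

import "fmt"

// Placement is a hypothetical record of one pod's allocation; the real
// type in pkg/placement may look different.
type Placement struct {
	Node    string // allocated node name; empty if the allocation names no node
	Backend string // allocated backend name, if any
}

// Permits reports whether the pod may run on the candidate node.
// With no allocated node there is no node constraint.
func Permits(p Placement, candidate string) bool {
	if p.Node == "" {
		return true
	}
	return p.Node == candidate
}

func main() {
	classical := Placement{Node: "kind-worker"}
	quantum := Placement{Backend: "ibm_fez"}
	for _, node := range []string{"kind-worker", "kind-worker2"} {
		fmt.Printf("%s: classical=%v quantum=%v\n",
			node, Permits(classical, node), Permits(quantum, node))
	}
}
```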

For a **quantum** pod (one that requests `quantum.flux-framework.org/qpu`), the
match allocates a `qpu` vertex instead of cores; the allocated backend name
(e.g. `ibm_fez`) is what the workload submits to via
[qrmi-go](https://github.com/converged-computing/qrmi-go) (job mode on the IBM
open plan — see fluxion-quantum for that story).

```
nodes (kubectl get nodes) ─┐
├─► JGF resource graph ─► Fluxion match ─► node + backend placement
quantum-backends.yaml ─────┘
the allocation into node and backend names (`PlacementFromAllocation`).
3. **Place** — `Filter` permits each pod only on its allocated node. (A
quantum-only pod allocates a `qpu` but no node — the backend is a remote API
any node can reach — so fluence imposes no node constraint in that case.)
4. **Hand off** — for a quantum pod, `PreBind` records the allocated backend on
the pod as the `fluence.flux-framework.org/backend` annotation. The mutating
webhook (installed with the base) injects a downward-API env so the container
reads it as `QRMI_BACKEND` with no boilerplate in the manifest.

### Design Choices

While quantum resources are the first target, we should be able to support
any arbitrary resource in the graph. A pod requests a graph resource generically
as `fluxion.flux-framework.org/<type>` (e.g. `.../qpu: "1"`), and that becomes a jobspec count
of `<type>`. To support this, we deploy a **device plugin** that advertises these virtual
types on every node. This is needed because the in-tree `NodeResourcesFit` plugin
rejects any node that does not advertise a requested resource. Note that
the device plugin advertises any resource type added to the Fluxion resource graph
but is not otherwise involved in scheduling. Fluxion does the real matching.

```console
nodes (kubectl get nodes) ──┐
├─► JGF resource graph ─► Fluxion match ─► node + backend placement
fluence-resources ConfigMap ┘
```

I am also choosing to keep credentials and QRMI interactions at the level of the application.
I am not comfortable with a design where an operator holds any kind of credential or is
responsible for managing calls to QRMI in a multi-tenant environment. Finally, since
there are (and will continue to be) a lot of environment variables that I do not want
to ask the user to define, we have a webhook to handle this: the webhook injects an
environment variable that reads a pod annotation, and a `PreBind` call sets that annotation.

## Build

The scheduler binary links flux-sched (the matcher) and, for quantum, QRMI:
The scheduler binary links flux-sched (the matcher). It does **not** link QRMI —
quantum job submission lives in a separate workload container
([qrmi-sampler](https://github.com/converged-computing/qrmi-sampler)), not here.

```bash
# If you want to debug inside the .devcontainer, use this one
make build # needs flux-sched at /opt/flux-sched and QRMI at /usr/local
# Inside the .devcontainer (flux-sched at /opt/flux-sched):
# builds bin/fluence (cgo+flux) + bin/fluence-deviceplugin + bin/fluence-webhook
make build
make test

# If you want to test outside (and build the docker image, this one)
# Or build the container image (all three binaries):
make image
```

Pure-Go pieces (graph builder, discovery, jobspec, placement) need neither and
are covered by:
## Deploy

Create a development cluster on a Kubernetes release that supports native gang
scheduling, with the feature gates enabled:

```bash
make test
kind create cluster --image kindest/node:v1.36.1 --config deploy/kind-config.yaml
```

## Deploy
(See [installing kind](https://kind.sigs.k8s.io/docs/user/quick-start#installing-from-release-binaries).)
The kind config turns on the `GangScheduling` and `GenericWorkload` feature gates
and the `scheduling.k8s.io/v1alpha2` API group on the apiserver and scheduler. In
the future these will likely be enabled by default.

Here is how I am creating a development cluster with a release of Kubernetes that will support
what we need:
Load the image (built above) into the cluster:

```bash
kind create cluster --image kindest/node:v1.36.1 --config deploy/kind-config.yaml
kind load docker-image ghcr.io/converged-computing/fluence:latest
```

And if you [need to install kind](https://kind.sigs.k8s.io/docs/user/quick-start#installing-from-release-binaries).
### 1. Gang Scheduling

Install the **base** scheduler (this is all you need for classical scheduling —
no device plugin, no quantum):

```bash
# This creates the quantum backends yaml graph
kubectl create configmap fluence-quantum-backends --from-file=quantum-backends.yaml=config/quantum-backends.yaml -n kube-system
kubectl apply -f deploy/fluence.yaml
```

# load docker image
kind load docker-image ghcr.io/converged-computing/fluence
This installs the scheduler, its RBAC, and the mutating webhook. Pods opt in with
`schedulerName: fluence`; a multi-pod gang adds a `scheduling.k8s.io/pod-group`
label (a single pod is treated as a group of one and needs no label).

kubectl apply -f deploy/fluence.yaml # RBAC + scheduler in kube-system
kubectl apply -f examples/podgroup.yaml # a gang scheduled by fluence
```
## Testing

### 1. Classical (a pod group)

This works by enabling the native gang feature on the cluster (kube-scheduler / API server), meaning
the `GangScheduling` and `GenericWorkload` feature gates and the `scheduling.k8s.io/v1alpha2` API group.
In the future these will likely be enabled by default.
The base install is enough. Schedule a gang:

Pods opt in with `schedulerName: fluence` and a `scheduling.k8s.io/pod-group` label; group size can be set explicitly with
`fluence.flux-framework.org/group-size`.
```bash
kubectl apply -f examples/podgroup.yaml
kubectl get pods -o wide
kubectl get events --field-selector reason=Scheduled
kubectl get podgroups.scheduling.k8s.io
```
```console
NAME POLICY WORKLOAD STATUS AGE
training Gang <none> Scheduled 15s
```

Note that when you are developing / debugging a group deletion can hang because of finalizers. I do:
And clean up:

```bash
kubectl patch podgroup training -n default --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete -f examples/podgroup.yaml
```

## Quantum
### 2. Quantum

We can bing fluence up with quantum resources by pointing `FLUENCE_QUANTUM_CONFIG` at a backends file (see `config/quantum-backends.yaml`).
Those backends become schedulable `qpu` vertices; a pod requesting `quantum.flux-framework.org/qpu` will be matched to one, and the allocated backend is handed to the workload.
Quantum needs the resources add-on, which supplies the `fluence-resources`
ConfigMap (the single source of truth for which backends exist) **and** the
device plugin that advertises them:

```bash
kubectl apply -f deploy/fluence-resources.yaml
# The scheduler reads its resources config at startup, so restart it to pick up
# the quantum vertices:
kubectl rollout restart deployment/fluence -n kube-system
```

Confirm the device plugin advertised the resources on the nodes:

```bash
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable}{"\n"}{end}' \
| grep fluxion.flux-framework.org
```
```console
kind-control-plane {"cpu":"16","ephemeral-storage":"982292956Ki","fluxion.flux-framework.org/qpu":"1k","fluxion.flux-framework.org/qubit":"1k","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"61400748Ki","pods":"110"}
kind-worker {"cpu":"16","ephemeral-storage":"982292956Ki","fluxion.flux-framework.org/qpu":"1k","fluxion.flux-framework.org/qubit":"1k","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"61400748Ki","pods":"110"}
kind-worker2 {"cpu":"16","ephemeral-storage":"982292956Ki","fluxion.flux-framework.org/qpu":"1k","fluxion.flux-framework.org/qubit":"1k","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"61400748Ki","pods":"110"}
```
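
The advertised names above share the `fluxion.flux-framework.org/` prefix, and a pod's request for `.../<type>` becomes a jobspec count of `<type>`. Here is a minimal sketch of that mapping, assuming requests arrive as a plain map of resource name to integer count. `GraphCounts` is a hypothetical helper for illustration, not a function from this repository.

```go
package main

import (
	"fmt"
	"strings"
)

// graphPrefix is the resource-name prefix used in this README.
const graphPrefix = "fluxion.flux-framework.org/"

// GraphCounts picks out requests named fluxion.flux-framework.org/<type>
// and returns a map of <type> to requested count. Requests without the
// prefix (cpu, memory, ...) are ignored here.
func GraphCounts(requests map[string]int64) map[string]int64 {
	counts := map[string]int64{}
	for name, n := range requests {
		if !strings.HasPrefix(name, graphPrefix) {
			continue
		}
		if t := strings.TrimPrefix(name, graphPrefix); t != "" {
			counts[t] += n
		}
	}
	return counts
}

func main() {
	requests := map[string]int64{
		"cpu":                            2,
		"fluxion.flux-framework.org/qpu": 1,
	}
	fmt.Println(GraphCounts(requests))
}
```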

Create the IBM credentials the **workload** uses to submit (in the namespace
where the workload runs — the scheduler itself never needs them):

```bash
# If you don't have this yet
curl -fsSL https://clis.cloud.ibm.com/install/linux | sudo sh
ibmcloud login --apikey <key>
# 12 for us-east
```
```bash
export IBM_CLOUD_TOKEN=<key>
export IBM_CLOUD_CRN=$(ibmcloud resource service-instances --service-name quantum-computing --output json | jq -r '.[].crn')
```

```bash
kubectl create secret generic ibm-quantum -n default --from-literal=token="$IBM_CLOUD_TOKEN" --from-literal=crn="$IBM_CLOUD_CRN"
```

Run a single quantum pod. It just requests `fluxion.flux-framework.org/qpu` — no
group, and no hard-coded backend (the webhook + PreBind supply `QRMI_BACKEND`):

```bash
kubectl apply -f examples/quantum-pod.yaml
kubectl get pod sampler -o wide

# fluence's chosen backend, recorded as an annotation (and injected as QRMI_BACKEND):
kubectl get pod sampler -o jsonpath='{.metadata.annotations.fluence\.flux-framework\.org/backend}{"\n"}'
kubectl logs sampler
```
```console
kubectl logs sampler -f
2026/06/06 19:04:38 submitting sampler job to ibm_marrakesh
{"results": [{"data": {"c": {"samples": ["0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x1", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x1", "0x0", "0x0", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x0", "0x0", "0x0", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x0", "0x1", "0x0", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x1", "0x0", "0x1", "0x0", "0x0", "0x0", "0x1", "0x0", "0x0", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0", "0x0", "0x1", "0x1", "0x1", "0x0", "0x1", "0x1", "0x1", "0x1", "0x1", "0x1", "0x0", "0x0", "0x0", "0x0"], "num_bits": 1}}, "metadata": {"circuit_metadata": {}}}], "metadata": {"execution": {"execution_spans": [[{"date": "2026-06-06T19:04:43.221657"}, {"date": 
"2026-06-06T19:04:44.372421"}, {"0": [[256], [0, 1], [0, 256]]}]]}, "version": 2}}
2026/06/06 19:04:50 done: 2070 bytes from ibm_marrakesh
```
Boum!
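
The hand-off that makes this work is a downward-API environment variable reading the `fluence.flux-framework.org/backend` annotation. Below is a rough sketch of the env entry a webhook could add to each container. The struct types are minimal stand-ins for the Kubernetes `EnvVar` types so the example has no external dependencies, and `BackendEnv` is a hypothetical name. The actual webhook in `pkg/webhook` may be written differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal stand-ins for the Kubernetes EnvVar types, so the sketch has
// no dependency on k8s.io/api.
type FieldRef struct {
	FieldPath string `json:"fieldPath"`
}

type EnvVarSource struct {
	FieldRef FieldRef `json:"fieldRef"`
}

type EnvVar struct {
	Name      string       `json:"name"`
	ValueFrom EnvVarSource `json:"valueFrom"`
}

// BackendEnv returns an env entry that exposes the backend annotation
// to the container as QRMI_BACKEND via the downward API.
func BackendEnv() EnvVar {
	return EnvVar{
		Name: "QRMI_BACKEND",
		ValueFrom: EnvVarSource{FieldRef: FieldRef{
			FieldPath: "metadata.annotations['fluence.flux-framework.org/backend']",
		}},
	}
}

func main() {
	out, err := json.MarshalIndent(BackendEnv(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

As far as I understand, downward-API environment variables are resolved when the container starts, so an annotation written at `PreBind` is already on the pod by then.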

### A note on deletion

When developing/debugging, a PodGroup (or its pods) can hang on delete because of
finalizers (the workload controller may not be running). Clear them with:

```bash
kubectl patch podgroup training -n default --type=merge -p '{"metadata":{"finalizers":null}}'
```

**under development** I am still thinking about how to make this request. -V
Importantly, submission is **not** done by the scheduler — the workload container holds the
user's credentials and submits via qrmi-go (job mode on the IBM open plan; see
fluxion-quantum for that story). Fluence only schedules and hands off the backend.
When we actually have control of local quantum devices this will be different.

## License

HPCIC DevTools is distributed under the terms of the MIT license.
All new contributions must be made under this license.

See [LICENSE](https://github.com/converged-computing/cloud-select/blob/main/LICENSE),
[COPYRIGHT](https://github.com/converged-computing/cloud-select/blob/main/COPYRIGHT), and
[NOTICE](https://github.com/converged-computing/cloud-select/blob/main/NOTICE) for details.
See [LICENSE](LICENSE), [COPYRIGHT](COPYRIGHT), and [NOTICE](NOTICE) for details.

SPDX-License-Identifier: (MIT)
SPDX-License-Identifier: MIT

LLNL-CODE- 842614
LLNL-CODE-842614