Release 26q1 #61

Merged
byako merged 64 commits into main from release-26q1 on Apr 17, 2026
Conversation

Contributor

@byako byako commented Apr 17, 2026

No description provided.

oxxenix and others added 30 commits April 17, 2026 16:35
* GPU: add e2e tests

* add changes

* add code review changes
Use AddDetectedDevicesToCDIRegistry from Gaudi

Signed-off-by: Hyeogju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Add standardized pcieRoot for Gaudi.
Add pciBusID for GPU.
Add deprecation comments.

Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
xpu-smi reports only updates against a baseline in which every status
is "OK", whereas DeviceInfo's healthStatus starts out empty. Set every
status to "OK" when the DeviceInfo is first created.

Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
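
The baseline initialization described in the commit above could look roughly like the following. This is a minimal sketch, not the driver's code: the category names, `newBaselineHealth`, and `applyUpdate` are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// healthCategories is a hypothetical list of health categories.
// The category names used by the real driver may differ.
var healthCategories = []string{"memory", "power", "temperature"}

// newBaselineHealth returns a status map with every category set to "OK",
// so that later partial updates are applied on top of a complete baseline
// instead of an empty map.
func newBaselineHealth() map[string]string {
	status := make(map[string]string, len(healthCategories))
	for _, category := range healthCategories {
		status[category] = "OK"
	}
	return status
}

// applyUpdate overlays a partial update onto the current status.
func applyUpdate(status, update map[string]string) {
	for category, value := range update {
		status[category] = value
	}
}

func main() {
	status := newBaselineHealth()
	// Only the changed category arrives in an update.
	applyUpdate(status, map[string]string{"temperature": "Critical"})
	fmt.Println(status["memory"], status["temperature"])
}
```

With the baseline in place, a category that never receives an update still reads "OK" rather than an empty string.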
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
* Add resource health to pod status

* Split GPU updateHealth into distinct operations

---------
Signed-off-by: Oksana Baranova <oksana.baranova@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Co-authored-by: Alexey Fomenko <alexey.fomenko@intel.com>
* remove sync logic and add AddDetectedDevicesToCDIRegistry

Signed-off-by: Oksana Baranova <oksana.baranova@intel.com>
Signed-off-by: Oksana Baranova <oksana.baranova@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
* Fix two issues in example

* Deployment should have apiVersion: apps/v1
* The CEL expression shouldn't end with ||

* Add support for Gaudi 3 HL-338

Add 1da3:1063 device ID for HL-338; this is a PCIe form factor
version of Gaudi 3.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
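
A device ID addition like the one above usually amounts to a new entry in a lookup table. The sketch below assumes a simple map keyed by the PCI device ID; `gaudiModels`, `gaudiVendorID`, and `modelName` are hypothetical names, and only the `1da3:1063` pair and the HL-338 name come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// gaudiVendorID is the vendor part of the "1da3:1063" ID from the commit message.
const gaudiVendorID = "1da3"

// gaudiModels maps PCI device IDs to model names. Only the HL-338 entry
// is taken from the commit message; a real table would hold more entries.
var gaudiModels = map[string]string{
	"1063": "Gaudi 3 HL-338",
}

// modelName resolves a "vendor:device" PCI ID string to a model name.
// It reports false for malformed input, other vendors, or unknown devices.
func modelName(pciID string) (string, bool) {
	vendor, device, ok := strings.Cut(strings.ToLower(pciID), ":")
	if !ok || vendor != gaudiVendorID {
		return "", false
	}
	name, found := gaudiModels[device]
	return name, found
}

func main() {
	if name, ok := modelName("1da3:1063"); ok {
		fmt.Println("detected:", name)
	}
}
```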
This restores HLML usage and separates the kubelet gaudi plugin into its
own module with a different binary license (GPL-2.0-or-later) from the
rest of the code (Apache-2.0). The vendored sources are included in the
container image to comply with the GPL license obligations. HLML sources
are bundled inside the container image as well, and HLML is built from
sources.

Co-authored-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
This commit changes the following:

* The GPU, QAT, and Gaudi kustomizations drop support for OpenShift 4.20;
  the normal ways of deploying workloads to OpenShift clusters are
  either through helm charts or by means of an operator.
  OpenShift 4.21 fully supports DRA, while OpenShift 4.20 hides it
  behind a feature gate.
* README.md has been updated to document that the kustomization
  requires OpenShift 4.21 or newer.
* The Helm charts for GPU, QAT, and Gaudi introduce an additional
  value, openshift.version, with the default 421.
* When installing on OpenShift 4.20 (openshift.version=420)
  the old resources API (v1beta1) will be used. When installing
  on OpenShift 4.21+ (no value passed or openshift.version=421)
  the new resources API (v1) will be used.
* README.md has been updated to reflect that OpenShift 4.20
  requires openshift.version=420 to be set.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
Values from Helm charts are floats; they need to be converted to int
before they are compared.

Signed-off-by: David Weinehall <david.weinehall@intel.com>
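
The float behaviour noted above can be reproduced outside Helm. The sketch below is an illustration under assumptions, not the chart's template code: it uses Go's `encoding/json`, which decodes untyped numbers to `float64`, as a stand-in for how chart values arrive, and the helper names `openshiftVersion` and `resourceAPIVersion` are hypothetical. The 420/421 to v1beta1/v1 mapping follows the earlier commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// openshiftVersion extracts openshift.version from untyped values.
// Untyped numbers decode as float64, so the value is converted to int
// before any comparison.
func openshiftVersion(raw []byte) (int, error) {
	var values map[string]any
	if err := json.Unmarshal(raw, &values); err != nil {
		return 0, err
	}
	section, ok := values["openshift"].(map[string]any)
	if !ok {
		return 0, fmt.Errorf("missing openshift section")
	}
	version, ok := section["version"].(float64)
	if !ok {
		return 0, fmt.Errorf("openshift.version is not a number")
	}
	return int(version), nil
}

// resourceAPIVersion picks the resources API version for a given
// openshift.version, as described in the commit message.
func resourceAPIVersion(version int) string {
	if version >= 421 {
		return "v1"
	}
	return "v1beta1"
}

func main() {
	for _, raw := range []string{
		`{"openshift": {"version": 420}}`,
		`{"openshift": {"version": 421}}`,
	} {
		v, err := openshiftVersion([]byte(raw))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("openshift.version=%d -> resources API %s\n", v, resourceAPIVersion(v))
	}
}
```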
* Use semverCompare instead of integer comparison
* Fix kustomization to reflect removed kustomizations

Signed-off-by: David Weinehall <david.weinehall@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
byako and others added 28 commits April 17, 2026 17:00
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
* Gaudi: Use ubuntu24.04 libhlml package

Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>

* Gaudi: ignore event timeout result

Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>

* Gaudi: prevent deadlock in healthcare

Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>

---------

Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Using the udev tool from nri-plugins, monitor udev events for the
xe and i915 drivers and update the resourceSlice accordingly.

Introduce a CurrentDriver field, which can be used to taint a device
that is not bound to a supported driver.

Add unit tests for the changes.

Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
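
How a CurrentDriver field might feed a taint decision is sketched below. This is an assumption-laden illustration: the `DeviceInfo` shape, `supportedDrivers`, and `needsTaint` are hypothetical, and only the field name CurrentDriver and the xe and i915 driver names come from the commit message.

```go
package main

import "fmt"

// DeviceInfo is a hypothetical, trimmed-down device record.
// Only the CurrentDriver field name is taken from the commit message.
type DeviceInfo struct {
	UID           string
	CurrentDriver string // kernel driver the device is bound to, "" if unbound
}

// supportedDrivers lists the drivers named in the commit message.
var supportedDrivers = map[string]bool{"xe": true, "i915": true}

// needsTaint reports whether a device should be tainted because it is
// not bound to one of the supported drivers.
func needsTaint(d DeviceInfo) bool {
	return !supportedDrivers[d.CurrentDriver]
}

func main() {
	devices := []DeviceInfo{
		{UID: "gpu-0", CurrentDriver: "xe"},
		{UID: "gpu-1", CurrentDriver: "vfio-pci"},
		{UID: "gpu-2"},
	}
	for _, d := range devices {
		fmt.Printf("%s driver=%q taint=%v\n", d.UID, d.CurrentDriver, needsTaint(d))
	}
}
```

In this sketch, a udev event that changes the bound driver would update CurrentDriver and the taint decision would be re-evaluated.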
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
- add xpumanager/xpumd dependency
- connect to xpumd and listen for device details and health updates
- publish initial slice without memory
- read memory from DRM device if in privileged mode

Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Wrap PreparedClaims into checkpoint structure.
Stop using string as UID; switch to the original types.UID.

Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Remove unnecessary parts of fakesysfs for MEI
Keep the previous commit just in case

Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
* Update GPU helm chart to v0.10.0 and fix nodeFeatureRules handling.

- Enable health monitoring by default, add ignore-health-warning flag for custom xpumd rules.
- Make privileged mode configurable via values.
- Mount xpumd socket into plugin container.
- Fix nodeSelector and NodeFeatureRule condition.

Signed-off-by: Oksana Baranova <oksana.baranova@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
This commit fixes several issues by:
- properly replacing main package in cmd/kubelet-gaudi-plugin/go.mod
- adding CGO_ENABLED=1 for Gaudi tests
- adding habanalabs-firmware-tools into Dockerfile.gaudi-test
  to replace the outdated pkg/fakehlml/hlml.h with new GPL-licensed one
- making hack/fake_hlml and pkg/fakehlml use /usr/include/habanalabs
- adding standalone gaudi-test make target and aliasing test target
  to run it as well as gpu-and-qat-test

Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
@byako byako merged commit 320ecda into main Apr 17, 2026
4 checks passed
5 participants