Release 26q1#61
Merged
* GPU: add e2e tests
* add changes
* add code review changes
Use AddDetectedDevicesToCDIRegistry from Gaudi
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Add standardized pcieRoot for Gaudi. Add pciBusID for GPU. Add deprecation comments. Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
xpu-smi reports only health updates, relative to a baseline in which every status is "OK", whereas a newly created DeviceInfo has an empty healthStatus. Set every healthStatus entry to "OK" when the DeviceInfo is first created.
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
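The baseline behaviour described above can be sketched as follows. This is a minimal illustration, not the driver's code: the category names and the helpers `newHealthStatus` and `applyUpdate` are hypothetical, and only the idea of presetting every entry to "OK" comes from the commit message.

```go
package main

import "fmt"

// healthCategories is a hypothetical list of health categories; the real
// names reported by xpu-smi are not given in the commit message.
var healthCategories = []string{"memory", "power", "temperature"}

// newHealthStatus returns a status map with every category preset to "OK",
// so that a device with no updates yet reads as healthy instead of empty.
func newHealthStatus() map[string]string {
	status := make(map[string]string, len(healthCategories))
	for _, category := range healthCategories {
		status[category] = "OK"
	}
	return status
}

// applyUpdate overwrites only the categories present in the update and
// leaves all other categories at their previous value.
func applyUpdate(status, update map[string]string) {
	for category, value := range update {
		status[category] = value
	}
}

func main() {
	status := newHealthStatus()
	applyUpdate(status, map[string]string{"temperature": "Critical"})
	fmt.Println(status["memory"], status["temperature"])
}
```

Because updates only ever touch the categories they mention, starting from an all-"OK" map keeps the stored state aligned with the baseline that the update stream assumes.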
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
* Add resource health to pod status
* Split GPU updateHealth into distinct operations
---------
Signed-off-by: Oksana Baranova <oksana.baranova@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Co-authored-by: Alexey Fomenko <alexey.fomenko@intel.com>
* remove sync logic and add AddDetectedDevicesToCDIRegistry Signed-off-by: Oksana Baranova <oksana.baranova@intel.com>
Signed-off-by: Oksana Baranova <oksana.baranova@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
* Fix two issues in example
  * Deployment should have apiVersion: apps/v1
  * The CEL expression shouldn't end with ||
* Add support for Gaudi 3 HL-338
  Add 1da3:1063 device ID for HL-338; this is a PCIe form factor version of Gaudi 3.
Signed-off-by: David Weinehall <david.weinehall@intel.com>
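Adding a device ID typically amounts to extending a lookup table. The sketch below assumes a hypothetical `knownModels` map and `modelForPCIID` helper; only the `1da3:1063` to HL-338 entry is taken from the commit message, and the real driver may organise its table differently.

```go
package main

import (
	"fmt"
	"strings"
)

// knownModels maps "vendor:device" PCI IDs to model names. Only the HL-338
// entry comes from the commit message; the table layout is an assumption.
var knownModels = map[string]string{
	"1da3:1063": "HL-338",
}

// modelForPCIID normalises the ID (trims whitespace, lowercases hex digits)
// and reports whether the device is a known model.
func modelForPCIID(id string) (string, bool) {
	model, ok := knownModels[strings.ToLower(strings.TrimSpace(id))]
	return model, ok
}

func main() {
	if model, ok := modelForPCIID("1DA3:1063"); ok {
		fmt.Println("detected", model)
	}
}
```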
This restores HLML usage and separates the kubelet gaudi plugin into its own module with a different binary license (GPL-2.0-or-later) from the rest of the code (Apache-2.0). The vendored sources are included in the container image to comply with the GPL license obligations. HLML sources are bundled inside the container image as well, and HLML is built from sources.
Co-authored-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
This commit changes the following:
* GPU, QAT, and Gaudi kustomizations drop support for OpenShift 4.20; the normal ways of deploying workloads to OpenShift clusters are either through Helm charts or by means of an operator. OpenShift 4.21 fully supports DRA, while OpenShift 4.20 hides it behind a feature gate.
* README.md has been updated to document that the kustomization requires OpenShift 4.21 or newer.
* The Helm charts for GPU, QAT, and Gaudi introduce an additional value, openshift.version, with the default 421.
* When installing on OpenShift 4.20 (openshift.version=420) the old resources API (v1beta1) will be used. When installing on OpenShift 4.21+ (no value passed or openshift.version=421) the new resources API (v1) will be used.
* README.md has been updated to reflect that OpenShift 4.20 requires openshift.version=420 to be set.
Signed-off-by: David Weinehall <david.weinehall@intel.com>
Numeric values from Helm charts are floats; they need to be converted to int before they are compared.
Signed-off-by: David Weinehall <david.weinehall@intel.com>
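The same pitfall can be reproduced in plain Go, where `encoding/json` decodes untyped numbers as `float64`. The sketch below is an analogy rather than the chart's template logic: the function name `resourceAPIVersion` is hypothetical, the JSON input stands in for Helm values, and the `resource.k8s.io` group prefix is an assumption added for illustration (the commit message names only v1beta1 and v1).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resourceAPIVersion picks the resource API version from decoded values.
// A missing openshift.version falls back to the default of 421.
func resourceAPIVersion(values []byte) (string, error) {
	var parsed struct {
		OpenShift struct {
			Version interface{} `json:"version"`
		} `json:"openshift"`
	}
	if err := json.Unmarshal(values, &parsed); err != nil {
		return "", err
	}

	version := 421 // default when no value is passed
	switch n := parsed.OpenShift.Version.(type) {
	case nil:
		// keep the default
	case float64:
		// untyped numbers arrive as float64; convert before comparing
		version = int(n)
	default:
		return "", fmt.Errorf("openshift.version must be a number, got %T", n)
	}

	if version >= 421 {
		return "resource.k8s.io/v1", nil
	}
	return "resource.k8s.io/v1beta1", nil
}

func main() {
	api, err := resourceAPIVersion([]byte(`{"openshift":{"version":420}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(api)
}
```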
* Use semverCompare instead of integer comparison
* Fix kustomization to reflect removed kustomizations
Signed-off-by: David Weinehall <david.weinehall@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
* Gaudi: Use ubuntu24.04 libhlml package
  Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
* Gaudi: ignore event timeout result
  Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
* Gaudi: prevent deadlock in healthcare
  Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
---------
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Using nri-plugins' udev tool, monitor udev events for the xe and i915 drivers and update the ResourceSlice accordingly. Introduce a CurrentDriver field, which can be used to taint a device when it is not bound to one of the supported drivers. Add unit tests for the changes.
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
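The taint decision based on CurrentDriver might look roughly like the sketch below. The `device` struct and `shouldTaint` helper are hypothetical names for illustration; only the CurrentDriver field and the xe and i915 driver names come from the commit message.

```go
package main

import "fmt"

// supportedDrivers lists the kernel drivers named in the commit message.
var supportedDrivers = map[string]bool{
	"xe":   true,
	"i915": true,
}

// device is a hypothetical, trimmed-down stand-in for the driver's device
// record; only CurrentDriver is taken from the commit message.
type device struct {
	UID           string
	CurrentDriver string // empty when no driver is bound
}

// shouldTaint reports whether the device is bound to something other than
// a supported driver, including the case where it is not bound at all.
func shouldTaint(d device) bool {
	return !supportedDrivers[d.CurrentDriver]
}

func main() {
	devices := []device{
		{UID: "gpu-0", CurrentDriver: "xe"},
		{UID: "gpu-1", CurrentDriver: "vfio-pci"},
		{UID: "gpu-2"},
	}
	for _, d := range devices {
		fmt.Printf("%s taint=%t\n", d.UID, shouldTaint(d))
	}
}
```

A udev event handler would update CurrentDriver on bind and unbind events and then re-evaluate this check before republishing the slice.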
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
- add xpumanager/xpumd dependency
- connect to xpumd and listen for device details and health updates
- publish initial slice without memory
- read memory from DRM device if in privileged mode
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Wrap PreparedClaims in a checkpoint structure. Stop using string as the UID type and switch to the original types.UID.
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Remove unnecessary parts of fakesysfs for MEI. Keep the previous commit just in case.
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
* Update GPU helm chart to v0.10.0 and fix nodeFeatureRules handling.
  - Enable health monitoring by default, add ignore-health-warning flag for custom xpumd rules.
  - Make privileged mode configurable via values.
  - Mount xpumd socket into plugin container.
  - Fix nodeSelector and NodeFeatureRule condition.
Signed-off-by: Oksana Baranova <oksana.baranova@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
This commit fixes several issues:
- properly replacing main package in cmd/kubelet-gaudi-plugin/go.mod
- adding CGO_ENABLED=1 for Gaudi tests
- adding habanalabs-firmware-tools into Dockerfile.gaudi-test to replace the outdated pkg/fakehlml/hlml.h with new GPL-licensed one
- making hack/fake_hlml and pkg/fakehlml use /usr/include/habanalabs
- adding standalone gaudi-test make target and aliasing test target to run it as well as gpu-and-qat-test
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
uniemimu approved these changes on Apr 17, 2026
No description provided.