
fix: decode pod device annotations with init containers #104

Open: FouoF wants to merge 1 commit into Project-HAMi:main from FouoF:fix-decode-init


Conversation

@FouoF

@FouoF FouoF commented Jun 26, 2026

Contributor

HAMi encodes vgpu-devices-allocated entries for init containers and regular containers in order; align decoding and container mapping with that index layout so GPU usage shows correctly in the WebUI.

fix Project-HAMi/HAMi#1932

Summary by CodeRabbit

  • Bug Fixes

    • Corrected container device alignment for pods that include init containers, so device assignments now match the actual container order.
    • Improved device decoding for GPU-related resources to handle the full container sequence consistently, reducing mismatches in reported device allocation.
    • Preserved empty device entries where appropriate, ensuring container-device results stay in sync with pod structure.
  • Tests

    • Added coverage for pods with init containers to verify device decoding and priority assignment.

@hami-robot

hami-robot Bot commented Jun 26, 2026


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: FouoF

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Nimbus318

Collaborator

Could you please rebase this PR against latest main?

Please keep the MLU decode call from #100:

DecodeMLUContainerDevices(fmt.Sprintf("%s_%s", str, instance), nodeName)

nodeName is already passed separately, so it should not be appended into the formatted string.

HAMi encodes vgpu-devices-allocated entries for init containers and
regular containers in order; align decoding and container mapping with
that index layout so GPU usage shows correctly in the WebUI.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>
@FouoF FouoF force-pushed the fix-decode-init branch from aa2cfb7 to 7c3dc77 on June 29, 2026 03:32
@coderabbitai

coderabbitai Bot commented Jun 29, 2026


Walkthrough

Fixes container device index misalignment for pods with InitContainers. Introduces getContainerPriority, podContainerCount, and an initContainerOffset so that device decoding loops and fetchContainerInfo correctly map device entries using init+regular container counts rather than only regular container counts.
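The counting and ordering idea in the walkthrough can be sketched as follows. This is a minimal illustration, not the PR's code: `container`, `podSpec`, `containerCount`, and `orderedPriorities` are hypothetical stand-ins for the Kubernetes pod types and for the podContainerCount and GetContainerPriorities helpers, and the `Priority` field is assumed to be already resolved to a string.

```go
package main

import "fmt"

// container and podSpec are hypothetical stand-ins for the Kubernetes
// types; the real helpers operate on a full pod object.
type container struct {
	Name     string
	Priority string // assumed to be pre-resolved for this sketch
}

type podSpec struct {
	InitContainers []container
	Containers     []container
}

// containerCount mirrors the idea behind podContainerCount: every init
// container and every regular container owns one device slot.
func containerCount(p podSpec) int {
	return len(p.InitContainers) + len(p.Containers)
}

// orderedPriorities lists priorities in slot order: init containers
// first, then regular containers.
func orderedPriorities(p podSpec) []string {
	out := make([]string, 0, containerCount(p))
	for _, c := range p.InitContainers {
		out = append(out, c.Priority)
	}
	for _, c := range p.Containers {
		out = append(out, c.Priority)
	}
	return out
}

func main() {
	p := podSpec{
		InitContainers: []container{{Name: "init", Priority: ""}},
		Containers:     []container{{Name: "main", Priority: "1"}},
	}
	fmt.Println(containerCount(p), orderedPriorities(p))
}
```

With one init container and one regular container, the regular container's priority sits at index 1 rather than index 0, which is the layout the decoding loops have to respect.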

InitContainer device alignment

| Layer / File(s) | Summary |
|---|---|
| Priority and count helpers<br>server/internal/provider/util/util.go | Adds getContainerPriority helper, refactors GetContainerPriorities to include init containers before regular containers, and introduces podContainerCount returning len(InitContainers) + len(Containers). |
| DecodePodDevices loop bounds<br>server/internal/provider/util/util.go | Updates NvidiaGPUDevice, HygonGPUDevice, and Metax/MetaxSGPUDevice decoding loops to use podContainerCount(pod) for bounds; removes empty-decode skip guards for Hygon and Metax device types. |
| fetchContainerInfo offset alignment<br>server/internal/data/pod.go | Adds initContainerOffset and deviceIdx computation in the pod.Spec.Containers loop; gates bizContainerDevices access on bounds check; uses derived containerDevices for ContainerDevices and Priority fields. |
| Test: InitContainer decoding<br>server/internal/provider/util/util_test.go | TestDecodePodDevicesWithInitContainers constructs a pod with one init and one main container, calls DecodePodDevices, and asserts two NVIDIA slots where the init slot is empty and the main slot holds the expected device UUID and Priority. |
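Why empty slots must be preserved can be shown with a simplified decoder. This is only a sketch under stated assumptions: it assumes `;` separates container slots and `:` separates devices within a slot, the real annotation encoding may differ, and `decodeSlots` is a hypothetical name rather than a function from the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// decodeSlots splits an annotation value into exactly count per-container
// slots. ASSUMPTION: ";" separates containers and ":" separates devices;
// the real HAMi encoding may differ.
func decodeSlots(annotation string, count int) [][]string {
	parts := strings.Split(annotation, ";")
	slots := make([][]string, count)
	for i := 0; i < count; i++ {
		// An empty slot is kept, not skipped, so later indices stay aligned.
		slots[i] = []string{}
		if i >= len(parts) {
			continue
		}
		for _, dev := range strings.Split(parts[i], ":") {
			if dev != "" {
				slots[i] = append(slots[i], dev)
			}
		}
	}
	return slots
}

func main() {
	// The init container has no devices; the main container has one.
	slots := decodeSlots(";GPU-example-uuid,NVIDIA,1024,50:;", 2)
	fmt.Println(len(slots), len(slots[0]), len(slots[1]))
}
```

If the empty first slot were dropped instead, the main container's device would shift to index 0 and be attributed to the wrong container.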

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 Hop hop, the index was off by one,
Init containers left in the cold, no fun.
Now offsets align where devices must go,
Priority counted from top to toe.
The rabbit checked bounds—no panic today,
All containers mapped in the right array! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the main change: decoding pod device annotations with init containers. |
| Linked Issues check | ✅ Passed | The changes align device decoding and indexing with init-container ordering, which addresses #1932's malformed annotation layouts. |
| Out of Scope Changes check | ✅ Passed | All changes are focused on pod-device decoding and accompanying tests; no unrelated scope is evident. |

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
server/internal/provider/util/util_test.go (1)

353-403: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add coverage for fetchContainerInfo too.

This test locks down DecodePodDevices, but the user-facing fix also depends on server/internal/data/pod.go applying the same init-container offset when it maps decoded slots back onto pod.Spec.Containers. A direct test for podRepo.fetchContainerInfo (or the caller that populates biz.PodInfo.Ctrs) would catch regressions there that still pass this utility-level test.
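The mapping step that such a test would exercise can be sketched in isolation. The function below is a hypothetical simplification, not the body of fetchContainerInfo: `assignDevices` and its parameters are illustrative names, and devices are reduced to plain strings.

```go
package main

import "fmt"

// assignDevices maps decoded slots onto regular containers by skipping the
// first initCount slots. A container whose slot index is out of range gets
// nil instead of causing an index panic.
func assignDevices(initCount int, ctrNames []string, slots [][]string) map[string][]string {
	out := make(map[string][]string, len(ctrNames))
	for i, name := range ctrNames {
		idx := initCount + i // offset past the init-container slots
		if idx < len(slots) {
			out[name] = slots[idx]
		} else {
			out[name] = nil
		}
	}
	return out
}

func main() {
	// Slot 0 belongs to the init container, slot 1 to the main container.
	slots := [][]string{{}, {"GPU-example-uuid"}}
	got := assignDevices(1, []string{"main"}, slots)
	fmt.Println(got["main"])
}
```

A regression test along these lines would assert that the main container receives the device from slot 1, and that an offset pointing past the decoded slots yields no devices rather than a panic.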

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/provider/util/util_test.go` around lines 353 - 403, Add test
coverage for the init-container offset in the higher-level pod mapping path, not
just DecodePodDevices. Introduce a focused test for fetchContainerInfo (or the
caller that builds biz.PodInfo.Ctrs) in server/internal/data/pod.go, using a pod
with init containers and one main container, and assert the decoded device slot
is mapped onto the main container after the offset is applied. Keep the existing
util test, but ensure the new test exercises the container-index translation
logic in fetchContainerInfo so regressions there are caught.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2a4f6f80-f0af-41d1-a07f-ae181d300c6d

📥 Commits

Reviewing files that changed from the base of the PR and between 8f42445 and 7c3dc77.

📒 Files selected for processing (3)
  • server/internal/data/pod.go
  • server/internal/provider/util/util.go
  • server/internal/provider/util/util_test.go



Development

Successfully merging this pull request may close these issues.

hami.io/vgpu-devices-allocated inexplicably gained an extra entry, so hami-webui cannot recognize it

2 participants