Skip to content

fix: resolve index out of range panic in fetchContainerInfo for Ascend devices#95

Open
cygnushan wants to merge 1 commit into
Project-HAMi:mainfrom
cygnushan:fix-ascend-fetchcontainerinfo-panic
Open

fix: resolve index out of range panic in fetchContainerInfo for Ascend devices#95
cygnushan wants to merge 1 commit into
Project-HAMi:mainfrom
cygnushan:fix-ascend-fetchcontainerinfo-panic

Conversation

@cygnushan

@cygnushan cygnushan commented May 19, 2026

Copy link
Copy Markdown

Description

This PR fixes the runtime error: index out of range [1] with length 1 panic that occurs in fetchContainerInfo when scanning Pods using Ascend310P / AscendGPU devices.

Previous PR #35 fixed the same symptom for NVIDIA/Hygon/Metax paths, but the Ascend branch in DecodePodDevices was missed and still lacks the container-index boundary check.

Root Cause

  1. DecodePodDevices (Ascend path): strings.Split(str, OnePodMultiContainerSplitSymbol) may yield more segments than actual containers in the pod spec, causing PodSingleDevice (i.e. []ContainerDevices) to be longer than len(pod.Spec.Containers).
  2. fetchContainerInfo: The old code blindly copied decoded devices into a slice and then indexed it with the container loop variable, triggering the panic when lengths mismatch.

Changes

  • server/internal/data/pod.go: Pre-allocate bizContainerDevices based on len(pod.Spec.Containers) and merge per-container devices with index guard (i < numContainers). Also fixes device data loss for multi-device-type pods by appending instead of overwriting.
  • server/internal/provider/util/util.go: Add i >= len(pod.Spec.Containers) { break } guard and empty-segment handling (s == "") to the AscendGPUDevice / Ascend310PGPUDevice branch, aligning it with NVIDIA/Hygon/Metax behavior.

Related

Summary by CodeRabbit

  • Bug Fixes
    • Improved pod device decoding to keep per-container device assignments correctly aligned, especially when allocation data includes multiple containers.
    • Preserves empty allocation segments as placeholders to prevent container order mismatches during parsing.
    • More robust handling for truncated, malformed, or empty device allocation annotations, avoiding incorrect device placement.
  • Tests
    • Added coverage for Ascend device decoding, including multi-container, truncation, placeholder behavior, and malformed/empty annotation scenarios.

@hami-robot

hami-robot Bot commented May 19, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cygnushan

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot

hami-robot Bot commented May 19, 2026

Copy link
Copy Markdown

Welcome @cygnushan! It looks like this is your first PR to Project-HAMi/HAMi-WebUI 🎉

@hami-robot hami-robot Bot added the size/S label May 19, 2026
@cygnushan cygnushan force-pushed the fix-ascend-fetchcontainerinfo-panic branch from 7a7afe4 to 5e78963 Compare May 19, 2026 08:51
@hami-robot hami-robot Bot added size/L and removed size/S labels May 19, 2026
@cygnushan cygnushan force-pushed the fix-ascend-fetchcontainerinfo-panic branch from 5e78963 to d1cd2c1 Compare May 19, 2026 08:54
@cygnushan

Copy link
Copy Markdown
Author

Hi @archlitchi ,
This PR has been open for about two weeks now. When you have a chance, could someone please take a look and provide feedback?
Thanks in advance!

@cygnushan

Copy link
Copy Markdown
Author

@coderabbitai Please review thie PR!

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

@cygnushan I'll review this PR now.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 04f58b99-95ce-44ea-8004-077d718121f5

📥 Commits

Reviewing files that changed from the base of the PR and between d1cd2c1 and 9593759.

📒 Files selected for processing (3)
  • server/internal/data/pod.go
  • server/internal/provider/util/util.go
  • server/internal/provider/util/util_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • server/internal/data/pod.go
  • server/internal/provider/util/util_test.go
  • server/internal/provider/util/util.go

📝 Walkthrough

Walkthrough

Fixes an Ascend pod device decoding and merge path that could misalign container devices and trigger index-out-of-range panics. The Ascend decoder now preserves container positions, fetchContainerInfo merges by container index, and new table-driven tests cover the Ascend cases.

Changes

Ascend device decode and merge

Layer / File(s) Summary
Ascend decode bounds and placeholders
server/internal/provider/util/util.go, server/internal/provider/util/util_test.go
Ascend/NPU decoding now stops at the pod container count and preserves empty segment positions with ContainerDevices{}; new table-driven tests cover truncation, placeholders, malformed annotations, and empty annotations.
Container device merge update
server/internal/data/pod.go
fetchContainerInfo preallocates container device storage, merges decoded devices by container index, skips failed copy attempts with a warning, and returns early when no decoded device types are present.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I hopped through pods both big and small,
And kept each device in its stall.
No more bounds that bite or break—
Just tidy slots for bunny’s sake,
With Ascend paths steady through it all.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main fix: preventing an index-out-of-range panic in fetchContainerInfo for Ascend devices.
Linked Issues check ✅ Passed The changes add Ascend container-count bounds checks and related handling, which addresses issue #94's panic root cause.
Out of Scope Changes check ✅ Passed The added preallocation, empty-segment handling, and tests all support the Ascend panic fix and appear in scope.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
server/internal/provider/util/util_test.go (2)

356-357: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Test mutates the package-global SupportDevices map.

Registering Ascend keys directly into the shared SupportDevices (same as Test_DecodePodDevices does for NVIDIA) leaks state across tests in this package and can cause order-dependent failures. Prefer a t.Cleanup to delete the injected keys, or operate on a local copy.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/provider/util/util_test.go` around lines 356 - 357, The test
is mutating the shared package-global SupportDevices map by inserting Ascend
entries, which can leak state into other tests. Update the relevant test in
util_test.go to avoid permanent changes to SupportDevices by either using a
local copy inside the test or registering a t.Cleanup that deletes the injected
Ascend keys after the test finishes. Keep the fix near the test logic that adds
the Ascend and Ascend310P entries so the state is restored consistently.

366-366: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Unused struct field wantErrMsg. It is declared but never asserted in the test loop. Drop it or wire it into an error-message assertion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/provider/util/util_test.go` at line 366, The test case struct
in util_test.go has an unused wantErrMsg field that is never checked in the test
loop. Either remove wantErrMsg from the table-driven test cases or update the
test logic to assert the returned error message against it, using the existing
test loop and error handling in the util_test.go test function.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/internal/data/pod.go`:
- Around line 119-127: The loop in Pod device aggregation ignores the result of
copier.Copy, so a failed copy leaves bizPds empty and silently drops that device
set. Update the logic in the pod device conversion path (the loop over pdevices
in the Pod data handling code) to check the copier.Copy return value, and either
log the failure with enough context or return it so the caller can handle it;
keep the rest of the biz.PodSingleDevice iteration unchanged.

In `@server/internal/provider/util/util_test.go`:
- Around line 467-481: Update the DecodePodDevices test case for the Ascend310P
malformed-device annotation so it matches the current parser behavior: the
bad-format string should not be treated as an error. In the existing
table-driven test near the Ascend310P case, change the expectation to wantErr
false and assert that PodDevices contains an empty "Ascend310P" entry rather
than an empty result.

---

Nitpick comments:
In `@server/internal/provider/util/util_test.go`:
- Around line 356-357: The test is mutating the shared package-global
SupportDevices map by inserting Ascend entries, which can leak state into other
tests. Update the relevant test in util_test.go to avoid permanent changes to
SupportDevices by either using a local copy inside the test or registering a
t.Cleanup that deletes the injected Ascend keys after the test finishes. Keep
the fix near the test logic that adds the Ascend and Ascend310P entries so the
state is restored consistently.
- Line 366: The test case struct in util_test.go has an unused wantErrMsg field
that is never checked in the test loop. Either remove wantErrMsg from the
table-driven test cases or update the test logic to assert the returned error
message against it, using the existing test loop and error handling in the
util_test.go test function.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 17674e22-330e-44c3-867a-0c3dd36263e9

📥 Commits

Reviewing files that changed from the base of the PR and between f60dbed and d1cd2c1.

📒 Files selected for processing (3)
  • server/internal/data/pod.go
  • server/internal/provider/util/util.go
  • server/internal/provider/util/util_test.go

Comment thread server/internal/data/pod.go
Comment thread server/internal/provider/util/util_test.go
…d devices

- Pre-allocate bizContainerDevices based on actual container count
  to prevent out-of-bounds access when annotation device count
  differs from pod container count
- Merge devices from all device types per container instead of
  overwriting, which previously caused device data loss for
  multi-device-type pods
- Add container index boundary check in DecodePodDevices Ascend
  branch to align with NVIDIA/Hygon/Metax device handling

Fixes Project-HAMi#94

Signed-off-by: handong <cygnushan@yunify.com>
@cygnushan cygnushan force-pushed the fix-ascend-fetchcontainerinfo-panic branch from d1cd2c1 to 9593759 Compare June 30, 2026 09:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Ascend310P Pod 触发 fetchContainerInfo 数组越界 panic,已有 PR #35 未覆盖昇腾路径

2 participants