fix: correct Ascend GPU vcore allocation based on memory ratio by peachest · Pull Request #107 · Project-HAMi/HAMi-WebUI

peachest · 2026-06-26T11:40:32Z

Problem

For Ascend GPU, Usedcores is a static value (25, 50, etc.) that reflects a fixed mapping to memory size rather than the actual core allocation ratio on the device. Setting HamiContainerVcoreAllocated to this raw value produces misleading metrics.

Fix

Before setting HamiContainerVcoreAllocated for Ascend devices, recalculate core as the percentage of device memory allocated to this container relative to total device memory:

perc = memory / deviceMemSize
core = int32(100 * perc)

This matches the intended semantics of VcoreAllocated — the share of the device allocated to the container.

Changes

server/internal/exporter/exporter.go — 8 insertions

Ascend-specific core recalculation before HamiContainerVcoreAllocated set

Summary by CodeRabbit

Bug Fixes
- Improved resource reporting for Ascend GPU containers so allocation values now better reflect memory-based usage when available.
- Made container metric calculations more accurate and consistent for GPU-backed workloads.

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

hami-robot · 2026-06-26T11:40:36Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: peachest

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-26T11:40:45Z

📝 Walkthrough

Walkthrough

GenerateContainerMetrics now recalculates core for Ascend GPU containers from allocated memory when device memory total is available.

Changes

Ascend core calculation

Layer / File(s)	Summary
Ascend memory-based core `server/internal/exporter/exporter.go`	`GenerateContainerMetrics` uses a memory-derived `core` value for Ascend GPU containers when `deviceMemTotal` returns a positive size.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes

Poem

A bunny hopped through metrics bright,
And found Ascend cores in a new light.
🐰 Memory in, neat numbers bloom,
Less guesswork now in exporter room,
Hop hop — the cores feel just right.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly matches the main change: recalculating Ascend GPU vcore allocation from memory ratio.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/internal/exporter/exporter.go`:
- Around line 389-392: The Ascend core calculation in the exporter currently
falls back to the stale cd.Usedcores value when deviceMemTotal fails or returns
an invalid size, which reintroduces the misleading metric. Update the logic
around the deviceMemTotal call in exporter.go so that this path does not publish
the old core value for Ascend; instead, skip setting
hami_container_vcore_allocated for that device or emit a neutral value and log
the failure with enough context to identify the device/provider.
- Around line 386-393: The Ascend core recalculation in exporter metrics now
differs from the container API, so `ContainerReply.AllocatedCores` in
`container.go` can report a different value than the metrics path for the same
workload. Update the API flow in the container service to apply the same Ascend
semantics used in `exporter.go` when building `ContainerReply.AllocatedCores`,
or otherwise make the divergence explicit and documented so both surfaces stay
consistent.
- Around line 388-390: Cache the result of device memory lookup in the exporter
hot path so it is computed once per device instead of once per container. In the
nested loop inside the exporter logic, move the s.deviceMemTotal call out of the
container loop and reuse the returned deviceMemSize for all matching containers
on the same device. Keep the existing AscendGPUDevice check and memory
percentage calculation, but ensure the cached value is scoped to the outer
device iteration and only reused when valid.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5cbbf314-871b-430b-b424-21c3ea5329cf

📥 Commits

Reviewing files that changed from the base of the PR and between 8f42445 and 6cf09ab.

📒 Files selected for processing (1)

server/internal/exporter/exporter.go

coderabbitai · 2026-06-26T11:42:50Z

+			// For Ascend GPU, Usedcores is a 1-100 value that doesn't reflect actual core allocation.
+			// Recalculate core as the percentage of device memory allocated to this container.
+			if provider == biz.AscendGPUDevice {
+				if deviceMemSize, err := s.deviceMemTotal(ctx, provider, device.Id); err == nil && deviceMemSize > 0 {
+					perc := float32(memory) / deviceMemSize
+					core = int32(float32(100) * perc)
+				}
+			}


🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Keep the container API aligned with the new Ascend core semantics.

This fixes exporter metrics only; server/internal/service/container.go:74-113 still builds ContainerReply.AllocatedCores by summing containerDevice.Usedcores. After this change, Ascend “allocated cores” can differ between the metrics endpoint and the container API for the same workload. If those surfaces are meant to represent the same concept, update the API path as well or explicitly document the divergence.

🧰 Tools

🪛 ast-grep (0.44.0)

[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values.
Context: int32(float32(100) * perc)
Note: [CWE-190] Integer Overflow or Wraparound.

(integer-overflow-narrowing-conversion-go)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/exporter/exporter.go` around lines 386 - 393, The Ascend core recalculation in exporter metrics now differs from the container API, so `ContainerReply.AllocatedCores` in `container.go` can report a different value than the metrics path for the same workload. Update the API flow in the container service to apply the same Ascend semantics used in `exporter.go` when building `ContainerReply.AllocatedCores`, or otherwise make the divergence explicit and documented so both surfaces stay consistent.

coderabbitai · 2026-06-26T11:42:50Z

+			if provider == biz.AscendGPUDevice {
+				if deviceMemSize, err := s.deviceMemTotal(ctx, provider, device.Id); err == nil && deviceMemSize > 0 {
+					perc := float32(memory) / deviceMemSize


🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win

Cache deviceMemTotal per device instead of querying it per container.

This branch sits inside the deviceInfos × containers nested loop, so every Ascend container on the same device triggers the same Prometheus lookup again. Because total memory is device-scoped and invariant here, this adds unnecessary external calls on a hot path and can inflate scrape latency under load. Compute/cache deviceMemSize once per outer device iteration and reuse it for all matching containers.

🧰 Tools

🪛 ast-grep (0.44.0)

[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values.
Context: int32(float32(100) * perc)
Note: [CWE-190] Integer Overflow or Wraparound.

(integer-overflow-narrowing-conversion-go)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/exporter/exporter.go` around lines 388 - 390, Cache the result of device memory lookup in the exporter hot path so it is computed once per device instead of once per container. In the nested loop inside the exporter logic, move the s.deviceMemTotal call out of the container loop and reuse the returned deviceMemSize for all matching containers on the same device. Keep the existing AscendGPUDevice check and memory percentage calculation, but ensure the cached value is scoped to the outer device iteration and only reused when valid.

coderabbitai · 2026-06-26T11:42:50Z

+				if deviceMemSize, err := s.deviceMemTotal(ctx, provider, device.Id); err == nil && deviceMemSize > 0 {
+					perc := float32(memory) / deviceMemSize
+					core = int32(float32(100) * perc)
+				}


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't silently fall back to the stale Ascend core value on query failure.

If deviceMemTotal errors or returns <= 0, core stays at cd.Usedcores — the exact value this PR says is misleading for Ascend. That means transient metric backend failures will quietly reintroduce incorrect hami_container_vcore_allocated data. Prefer skipping this metric for Ascend when total memory is unavailable, or emitting a neutral value with an error log, rather than publishing the old semantics.

🧰 Tools

🪛 ast-grep (0.44.0)

[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values.
Context: int32(float32(100) * perc)
Note: [CWE-190] Integer Overflow or Wraparound.

(integer-overflow-narrowing-conversion-go)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/exporter/exporter.go` around lines 389 - 392, The Ascend core calculation in the exporter currently falls back to the stale cd.Usedcores value when deviceMemTotal fails or returns an invalid size, which reintroduces the misleading metric. Update the logic around the deviceMemTotal call in exporter.go so that this path does not publish the old core value for Ascend; instead, skip setting hami_container_vcore_allocated for that device or emit a neutral value and log the failure with enough context to identify the device/provider.

fix: correct Ascend GPU vcore allocation based on memory ratio

6cf09ab

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

hami-robot Bot added the dco-signoff: yes label Jun 26, 2026

hami-robot Bot added the size/XS label Jun 26, 2026

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: correct Ascend GPU vcore allocation based on memory ratio#107

fix: correct Ascend GPU vcore allocation based on memory ratio#107
peachest wants to merge 1 commit into
Project-HAMi:mainfrom
peachest:fix-ascend-vcore-allocation

peachest commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

hami-robot Bot commented Jun 26, 2026

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

peachest commented Jun 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Changes

Summary by CodeRabbit

Uh oh!

hami-robot Bot commented Jun 26, 2026

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

peachest commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading