fix: correct Ascend GPU vcore allocation based on memory ratio#107
fix: correct Ascend GPU vcore allocation based on memory ratio#107peachest wants to merge 1 commit into
Conversation
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: peachest The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthrough
ChangesAscend core calculation
Estimated code review effort🎯 2 (Simple) | ⏱️ ~5 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/internal/exporter/exporter.go`:
- Around line 389-392: The Ascend core calculation in the exporter currently
falls back to the stale cd.Usedcores value when deviceMemTotal fails or returns
an invalid size, which reintroduces the misleading metric. Update the logic
around the deviceMemTotal call in exporter.go so that this path does not publish
the old core value for Ascend; instead, skip setting
hami_container_vcore_allocated for that device or emit a neutral value and log
the failure with enough context to identify the device/provider.
- Around line 386-393: The Ascend core recalculation in exporter metrics now
differs from the container API, so `ContainerReply.AllocatedCores` in
`container.go` can report a different value than the metrics path for the same
workload. Update the API flow in the container service to apply the same Ascend
semantics used in `exporter.go` when building `ContainerReply.AllocatedCores`,
or otherwise make the divergence explicit and documented so both surfaces stay
consistent.
- Around line 388-390: Cache the result of device memory lookup in the exporter
hot path so it is computed once per device instead of once per container. In the
nested loop inside the exporter logic, move the s.deviceMemTotal call out of the
container loop and reuse the returned deviceMemSize for all matching containers
on the same device. Keep the existing AscendGPUDevice check and memory
percentage calculation, but ensure the cached value is scoped to the outer
device iteration and only reused when valid.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 5cbbf314-871b-430b-b424-21c3ea5329cf
📒 Files selected for processing (1)
server/internal/exporter/exporter.go
| // For Ascend GPU, Usedcores is a 1-100 value that doesn't reflect actual core allocation. | ||
| // Recalculate core as the percentage of device memory allocated to this container. | ||
| if provider == biz.AscendGPUDevice { | ||
| if deviceMemSize, err := s.deviceMemTotal(ctx, provider, device.Id); err == nil && deviceMemSize > 0 { | ||
| perc := float32(memory) / deviceMemSize | ||
| core = int32(float32(100) * perc) | ||
| } | ||
| } |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift
Keep the container API aligned with the new Ascend core semantics.
This fixes exporter metrics only; server/internal/service/container.go:74-113 still builds ContainerReply.AllocatedCores by summing containerDevice.Usedcores. After this change, Ascend “allocated cores” can differ between the metrics endpoint and the container API for the same workload. If those surfaces are meant to represent the same concept, update the API path as well or explicitly document the divergence.
🧰 Tools
🪛 ast-grep (0.44.0)
[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values.
Context: int32(float32(100) * perc)
Note: [CWE-190] Integer Overflow or Wraparound.
(integer-overflow-narrowing-conversion-go)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@server/internal/exporter/exporter.go` around lines 386 - 393, The Ascend core
recalculation in exporter metrics now differs from the container API, so
`ContainerReply.AllocatedCores` in `container.go` can report a different value
than the metrics path for the same workload. Update the API flow in the
container service to apply the same Ascend semantics used in `exporter.go` when
building `ContainerReply.AllocatedCores`, or otherwise make the divergence
explicit and documented so both surfaces stay consistent.
| if provider == biz.AscendGPUDevice { | ||
| if deviceMemSize, err := s.deviceMemTotal(ctx, provider, device.Id); err == nil && deviceMemSize > 0 { | ||
| perc := float32(memory) / deviceMemSize |
There was a problem hiding this comment.
🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win
Cache deviceMemTotal per device instead of querying it per container.
This branch sits inside the deviceInfos × containers nested loop, so every Ascend container on the same device triggers the same Prometheus lookup again. Because total memory is device-scoped and invariant here, this adds unnecessary external calls on a hot path and can inflate scrape latency under load. Compute/cache deviceMemSize once per outer device iteration and reuse it for all matching containers.
🧰 Tools
🪛 ast-grep (0.44.0)
[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values.
Context: int32(float32(100) * perc)
Note: [CWE-190] Integer Overflow or Wraparound.
(integer-overflow-narrowing-conversion-go)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@server/internal/exporter/exporter.go` around lines 388 - 390, Cache the
result of device memory lookup in the exporter hot path so it is computed once
per device instead of once per container. In the nested loop inside the exporter
logic, move the s.deviceMemTotal call out of the container loop and reuse the
returned deviceMemSize for all matching containers on the same device. Keep the
existing AscendGPUDevice check and memory percentage calculation, but ensure the
cached value is scoped to the outer device iteration and only reused when valid.
| if deviceMemSize, err := s.deviceMemTotal(ctx, provider, device.Id); err == nil && deviceMemSize > 0 { | ||
| perc := float32(memory) / deviceMemSize | ||
| core = int32(float32(100) * perc) | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Don't silently fall back to the stale Ascend core value on query failure.
If deviceMemTotal errors or returns <= 0, core stays at cd.Usedcores — the exact value this PR says is misleading for Ascend. That means transient metric backend failures will quietly reintroduce incorrect hami_container_vcore_allocated data. Prefer skipping this metric for Ascend when total memory is unavailable, or emitting a neutral value with an error log, rather than publishing the old semantics.
🧰 Tools
🪛 ast-grep (0.44.0)
[warning] 390-390: Narrowing a non-constant integer to a smaller fixed-width type (int8/int16/int32, uint8/uint16/uint32) can silently overflow or wrap, yielding negative or truncated values that are dangerous in size, length, or index logic. Validate the source value is within the target type's range before converting (e.g. bounds-check, or use a checked helper), and avoid narrowing untrusted or len()/parsed values.
Context: int32(float32(100) * perc)
Note: [CWE-190] Integer Overflow or Wraparound.
(integer-overflow-narrowing-conversion-go)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@server/internal/exporter/exporter.go` around lines 389 - 392, The Ascend core
calculation in the exporter currently falls back to the stale cd.Usedcores value
when deviceMemTotal fails or returns an invalid size, which reintroduces the
misleading metric. Update the logic around the deviceMemTotal call in
exporter.go so that this path does not publish the old core value for Ascend;
instead, skip setting hami_container_vcore_allocated for that device or emit a
neutral value and log the failure with enough context to identify the
device/provider.
Problem
For Ascend GPU,
Usedcoresis a static value (25, 50, etc.) that reflects a fixed mapping to memory size rather than the actual core allocation ratio on the device. SettingHamiContainerVcoreAllocatedto this raw value produces misleading metrics.Fix
Before setting
HamiContainerVcoreAllocatedfor Ascend devices, recalculatecoreas the percentage of device memory allocated to this container relative to total device memory:This matches the intended semantics of
VcoreAllocated— the share of the device allocated to the container.Changes
server/internal/exporter/exporter.go— 8 insertionsHamiContainerVcoreAllocatedsetSummary by CodeRabbit