feat: add DCMI process-level memory tracking and container metrics monitor #87

Open
maverick123123 wants to merge 1 commit into Project-HAMi:main from maverick123123:feat/dcmi-process-memory-monitor

@maverick123123

Summary

Add Prometheus metrics monitoring with DCMI-based process-level memory tracking for Ascend NPU containers.

Changes

Go (ascend-device-plugin):

  • Add Prometheus metrics monitor (internal/monitor/) with DCMI host + per-container vGPU metrics
  • Per-device container memory tracking via ReadMemoryByDevice (HBM per NPU device)
  • Per-container shared memory setup in Allocate (hami-vnpu-shmem mount + NPU_LOCAL_SHM_PATH env)
  • Terminal pod cleanup for stale container shmem directories
  • Fix gRPC NewClient passthrough resolver for Unix sockets
  • Support multiple device UUIDs per container (TP=2 multi-device)
  • Use DCMI raw values for host_gpu_memory metrics

Verified on Ascend 310P:

  • vLLM TP=2 dual-card inference with DCMI process memory tracking
  • Two pods running simultaneously with per-device, per-pod metrics

@hami-robot

hami-robot Bot commented Jun 17, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: maverick123123
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot

hami-robot Bot commented Jun 17, 2026

Contributor

Welcome @maverick123123! It looks like this is your first PR to Project-HAMi/ascend-device-plugin 🎉

@hami-robot hami-robot Bot added the size/XL label Jun 17, 2026
…nitor

- Add Prometheus metrics monitor (internal/monitor/) with DCMI host + per-container metrics
- Add per-device container memory tracking via ReadMemoryByDevice
- Add per-container shared memory setup in Allocate (hami-vnpu-shmem mount + NPU_LOCAL_SHM_PATH)
- Add terminal pod cleanup for stale container shmem directories
- Fix gRPC NewClient passthrough resolver for Unix sockets
- Support multiple device UUIDs per container (TP=2 multi-device)
- Use DCMI raw values for host_gpu_memory (remove shmem overlay)

Signed-off-by: maverick123123 <yuming.wu@dynamia.ai>

maverick123123 force-pushed the feat/dcmi-process-memory-monitor branch 2 times, most recently from bebf4ea to 7ca13d8 on June 17, 2026 10:27