Hami update complete #1
…n are used, as workers are expected to be in waiting already, so there is no need to wait longer. Signed-off-by: m84396954 <mehryar.abbasi1@h-partners.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mehryar72. The full list of commands accepted by this bot can be found here. Details: needs approval from an approver in each of these files. Approvers can indicate their approval by writing…
The branch was force-pushed from 9ece591 to aa1ee1e.
Pull request overview
This PR extends the NPU limiter/virtualization stack with a control plane (gRPC over a Unix socket) to dynamically update priority at runtime and to report per-device utilization/memory, and adds scripts/docs to run vLLM and multi-process load tests under HAMI/VNPU.
Changes:
- Add a gRPC UDS server to the limiter daemon for `SetPriority` and `GetUtilization` (with reflection), plus utilization and memory reporting internals.
- Update worker/manager shared-memory handling for multi-device scenarios (per-physical-device local shmem naming, device-change handling).
- Add runnable scripts (vLLM interactive + multi-process torch test) and expanded docs/README for debugging and gRPC usage.
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 9 comments.
Summary per file:
| File | Description |
|---|---|
| script/mab/run_vllm_interactive.sh | New script to launch limiter + vLLM in a container and run an interactive streaming benchmark client. |
| script/mab/run_hami_mab_multi.sh | New script to launch limiter and run multiple NPU worker processes for load testing. |
| script/mab/run_hami_mab.sh | Exposes limiter gRPC UDS path and makes shared dir configurable via env override. |
| script/mab/multi_process_test.py | New multi-process PyTorch NPU inference loop used by the multi runner. |
| script/mab/clean_up.sh | Extends cleanup to stop new container name pattern and remove additional logs/registry files. |
| docs/limiter-debug-analysis.md | New debugging/analysis doc describing limiter “no progress” behavior and intended improvements. |
| crates/limiter/src/worker.rs | Refactors SchedulerClient init/lifecycle for multi-device support and device-change-safe teardown. |
| crates/limiter/src/shmem.rs | Adds try_open_shmem and introduces per-physical-device local SHM naming helper. |
| crates/limiter/src/reporter.rs | New utilization reporter thread that tracks baton ownership and computes rolling utilization. |
| crates/limiter/src/memory_report.rs | New memory metrics helper (quota-based when enforced, runtime query otherwise). |
| crates/limiter/src/manager.rs | Makes priority dynamically readable (atomic) and tweaks scheduling constants/behavior. |
| crates/limiter/src/main.rs | Adds gRPC control server over UDS and spawns per-device manager + reporter threads. |
| crates/limiter/src/lib.rs | Exposes new reporter and memory_report modules. |
| crates/limiter/src/externed_api.rs | Adds missing runtime externs used for device detection and resource destruction. |
| crates/limiter/build.rs | Adds tonic-build proto compilation + descriptor set generation. |
| crates/limiter/Cargo.toml | Adds tonic/tokio/prost and build-dependency for proto compilation. |
| crates/hook/src/hook.rs | Adds rtSetDevice hook + HCCL stream-op hooks; replaces lazy_static-based limiter singleton. |
| README.md | Documents gRPC dynamic priority and utilization/memory reporting usage via grpcurl. |
| .gitignore | Adds a repository-wide ignore list (including a *.proto rule). |
In `crates/limiter/build.rs`:

```rust
let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
tonic_build::configure()
    .file_descriptor_set_path(out_dir.join("limiter_descriptor.bin"))
    .compile_protos(&["proto/limiter.proto"], &["proto"])?;
```
tonic_build::compile_protos is configured to compile proto/limiter.proto, but there is no crates/limiter/proto/ directory (and no limiter.proto) in this branch, so the crate will not build. Add the missing proto file/directory (and ensure it’s committed), or update compile_protos to point at the correct existing location.
Suggested change:

```diff
 let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
+let manifest_dir = PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap());
+let proto_dir = manifest_dir.join("../../proto");
+let limiter_proto = proto_dir.join("limiter.proto");
 tonic_build::configure()
     .file_descriptor_set_path(out_dir.join("limiter_descriptor.bin"))
-    .compile_protos(&["proto/limiter.proto"], &["proto"])?;
+    .compile_protos(&[limiter_proto], &[proto_dir])?;
```
In `crates/limiter/src/manager.rs`:

```rust
if !saw_progress && start_time.elapsed() > Duration::from_micros(50) {
    debug!("[Manager] --- no progress for 5ms, enter STATE_MEASURING early");
    break;
```
The no-progress cutoff is now Duration::from_micros(50) (50µs) but the log message still says "5ms". 50µs is orders of magnitude smaller than the previous behavior and is likely to trigger false "no progress" early exits under normal kernel launch latency. Consider restoring a ms-scale threshold (or making it configurable via an env var as described in the docs) and update the log message to match the actual unit/value.
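For reference, a minimal sketch of the env-driven threshold this comment suggests, assuming the `NPU_NO_PROGRESS_MS` name from the debug doc; the helper name, default, and commented call site are illustrative, not the crate's actual code:

```rust
use std::time::Duration;

// Resolve the no-progress cutoff once at startup. NPU_NO_PROGRESS_MS is the
// variable described in docs/limiter-debug-analysis.md; the 5 ms default and
// this helper's name are assumptions for illustration.
fn no_progress_cutoff() -> Duration {
    std::env::var("NPU_NO_PROGRESS_MS")
        .ok()
        .and_then(|v| v.trim().parse::<u64>().ok())
        .map(Duration::from_millis)
        .unwrap_or(Duration::from_millis(5))
}

// Illustrative call site mirroring the hunk above, with the log reflecting
// the actual configured value:
// if !saw_progress && start_time.elapsed() > cutoff {
//     debug!("[Manager] --- no progress for {:?}, enter STATE_MEASURING early", cutoff);
//     break;
// }
```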
```rust
if new_priority <= 0.0 {
    return Err(Status::invalid_argument("Priority must be > 0"));
}

self.priority_atomic.store(new_priority.to_bits(), Ordering::Relaxed);
log::info!("Updated priority via gRPC to: {}", new_priority);
```
set_priority only checks new_priority <= 0.0. This accepts NaN / +inf (because NaN <= 0.0 is false), which would then be stored into the shared atomic and can break token math (clamp, casts, rounding). Validate new_priority.is_finite() (and optionally enforce a reasonable upper bound) before storing it.
Suggested change:

```diff
-if new_priority <= 0.0 {
-    return Err(Status::invalid_argument("Priority must be > 0"));
-}
+if !new_priority.is_finite() || new_priority <= 0.0 {
+    return Err(Status::invalid_argument("Priority must be a finite value > 0"));
+}
 self.priority_atomic.store(new_priority.to_bits(), Ordering::Relaxed);
 log::info!("Updated priority via gRPC to: {}", new_priority);
```
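As a standalone illustration (not project code) of why the extra check matters: `NaN <= 0.0` evaluates to false, so the original guard lets NaN through, while `is_finite()` rejects it.

```rust
fn main() {
    let bad = f64::NAN;
    assert!(!(bad <= 0.0));              // NaN passes the original guard
    assert!(!bad.is_finite());           // rejected by the tightened guard
    assert!(!f64::INFINITY.is_finite()); // +inf rejected as well

    let ok = 1.5_f64;
    assert!(ok.is_finite() && ok > 0.0);
    // to_bits/from_bits round-trips the accepted value for the atomic store.
    assert_eq!(f64::from_bits(ok.to_bits()), ok);
}
```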
In `script/mab/run_vllm_interactive.sh`:

```bash
# --host-port <int>      Host port for vLLM (default: 9000+id)
# --tokens <int>         Max tokens for each run (default: 2000)
# --label <string>       Label shown in benchmark output
# --image <string>       Container image id/tag (default set below)
# --ready-timeout <int>  Seconds to wait for vLLM readiness (default: 45)
#
# Controls inside the benchmark:
#   s/b : start or restart generation
#   p   : pause current generation and show averages
#   q   : quit (also tears down the container)

usage() {
  cat <<EOF
Usage: $0 --id <ID> --model <MODEL_PATH> --rtv <LIST> --tp <N> --mem-quota <MB> --priority <PERCENT> [options]

Required:
  --id <ID>             Unique instance id (also sets default ports)
  --model <PATH>        Host path to model directory
  --rtv <LIST>          ASCEND_RT_VISIBLE_DEVICES list (e.g. 0,1)
  --tp <N>              Tensor parallel size
  --mem-quota <MB>      NPU_MEM_QUOTA value (MB)
  --priority <PERCENT>  NPU_PRIORITY / share percent

Optional:
  --host-port <PORT>    Host port for vLLM (default: 9000+ID)
  --tokens <N>          Max tokens per run (default: 2000)
  --label <TEXT>        Label shown in benchmark output
  --image <TAG>         Container image (default: registry-cbu.huawei.com/ascend/vllm-ascend:v0.10.1rc1)
  --ready-timeout <S>   Seconds to wait for vLLM to be ready (default: 45)
  -h|--help             Show this help
```
The help text says --host-port defaults to 9000+ID and --ready-timeout defaults to 45, but the actual defaults are HOST_VLLM_PORT=$((9600 + ID)) and READY_TIMEOUT=300. Update the usage/comments (or the code) so the documented defaults match runtime behavior.
```bash
export NPU_PRIORITY="$NPU_PRIORITY"
export NPU_MEM_QUOTA="$NPU_MEM_QUOTA"
export NPU_GLOBAL_SHM_PATH="$GLOBAL_SHM_PATH"
export NPU_LOCAL_SHM_NAME="vnpu_local_session_${ID}"
```
NPU_LOCAL_SHM_NAME is used as the base name for shm_open in shmem::shm_setup::{create_shmem,open_shmem}. On Linux, shm_open expects names to start with /; using vnpu_local_session_${ID} may fail with EINVAL. Consider prefixing the value with / here (or normalizing inside local_shmem_name_for / create_shmem).
Suggested change:

```diff
-export NPU_LOCAL_SHM_NAME="vnpu_local_session_${ID}"
+export NPU_LOCAL_SHM_NAME="/vnpu_local_session_${ID}"
```
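Alternatively, the normalization could live on the Rust side so any value of `NPU_LOCAL_SHM_NAME` is accepted. A sketch under that assumption; the per-device suffix and helper bodies are illustrative, not the crate's actual naming scheme:

```rust
// Ensure the shm_open name has exactly one leading '/', as Linux expects.
fn normalize_shm_name(raw: &str) -> String {
    format!("/{}", raw.trim_start_matches('/'))
}

// Illustrative per-device name helper; the "_dev{N}" suffix is an assumption,
// not the actual local_shmem_name_for implementation.
fn local_shmem_name_for(device: u32) -> String {
    let base = std::env::var("NPU_LOCAL_SHM_NAME")
        .unwrap_or_else(|_| "vnpu_local_session".to_string());
    normalize_shm_name(&format!("{base}_dev{device}"))
}
```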
In `docs/limiter-debug-analysis.md`:

```markdown
## Implemented Improvements

1. **Manager identifier in logs** – All log lines include `[Manager #N]` where N is the global slot index.
2. **Configurable no-progress threshold** – Set `NPU_NO_PROGRESS_MS` (default 5). For vLLM, try e.g. `NPU_NO_PROGRESS_MS=50`.
3. **Tiered log levels** – Use `RUST_LOG` to control verbosity:
```
This doc says the no-progress threshold is configurable via NPU_NO_PROGRESS_MS (default 5), but the current code in crates/limiter/src/manager.rs uses a hard-coded Duration::from_micros(50) and doesn’t read NPU_NO_PROGRESS_MS. Either implement the env var as documented, or update this section to match actual behavior.
```rust
// Map each device's local shmem for quota / used reads (managers created it above).
let local_shmems: Vec<&'static LocalContainerShmem> = devices
    .iter()
    .map(|dev| {
        let path = local_shmem_name_for(*dev);
        loop {
            if let Some(ptr) = shm_setup::try_open_shmem::<LocalContainerShmem>(&path) {
                break unsafe { &*(ptr as *const LocalContainerShmem) };
            }
            thread::sleep(Duration::from_millis(20));
        }
```
The loop that maps local_shmems retries forever until try_open_shmem succeeds. If a manager thread panics or SHM creation fails, the daemon will hang at startup with no error. Add a timeout / max retries (and surface an actionable error) so failures don’t turn into an infinite wait.
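A sketch of a bounded-retry variant, assuming the existing `shm_setup::try_open_shmem` signature (returning `Option<&'static mut T>`); the retry budget, helper name, and error message are illustrative:

```rust
use std::{thread, time::Duration};

// Give up after roughly 5 s instead of spinning forever; values are illustrative.
const OPEN_RETRIES: usize = 250;
const OPEN_POLL: Duration = Duration::from_millis(20);

fn open_local_shmem_or_err(path: &str) -> Result<*const LocalContainerShmem, String> {
    for _ in 0..OPEN_RETRIES {
        if let Some(ptr) = shm_setup::try_open_shmem::<LocalContainerShmem>(path) {
            return Ok(ptr as *const LocalContainerShmem);
        }
        thread::sleep(OPEN_POLL);
    }
    Err(format!(
        "timed out opening local shmem '{path}'; did the manager thread for this device fail to start?"
    ))
}
```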
```bash
sudo rm -f ${HOST_PROJECT_DIR}/inst*_manager.log
sudo rm -f ${HOST_PROJECT_DIR}/inst*_app.log
sudo rm -f ${HOST_PROJECT_DIR}/vllm_*.log
sudo rm -f ${HOST_PROJECT_DIR}/limiter_*.log
```
clean_up.sh removes inst*_app.log, but the new multi-process runner writes ${LOG_PREFIX}_apps.log (plural) via tee. Add that pattern (and consider quoting the paths) so the cleanup script actually removes logs produced by the new scripts.
Suggested change:

```diff
-sudo rm -f ${HOST_PROJECT_DIR}/inst*_manager.log
-sudo rm -f ${HOST_PROJECT_DIR}/inst*_app.log
-sudo rm -f ${HOST_PROJECT_DIR}/vllm_*.log
-sudo rm -f ${HOST_PROJECT_DIR}/limiter_*.log
+sudo rm -f "${HOST_PROJECT_DIR}"/inst*_manager.log
+sudo rm -f "${HOST_PROJECT_DIR}"/inst*_app.log
+sudo rm -f "${HOST_PROJECT_DIR}"/inst*_apps.log
+sudo rm -f "${HOST_PROJECT_DIR}"/vllm_*.log
+sudo rm -f "${HOST_PROJECT_DIR}"/limiter_*.log
```
Signed-off-by: m84396954 <mehryar.abbasi1@h-partners.com>
… used. Signed-off-by: m84396954 <mehryar.abbasi1@h-partners.com>
The branch was force-pushed from a877601 to 51d29f4.
Pull request overview
Copilot reviewed 19 out of 20 changed files in this pull request and generated 8 comments.
In `crates/hook/src/hook.rs`:

```rust
#![allow(non_snake_case)]
use lazy_static::lazy_static;
use limiter::worker::SchedulerClient;
use once_cell;
```
use once_cell; appears unused (the file uses once_cell::sync::Lazy directly). If the crate has #![deny(warnings)] in CI, this will fail the build; otherwise it’s still noise. Remove the unused import.
Suggested change:

```diff
-use once_cell;
```
```rust
local.memory_limit = AtomicU64::new(memory_limit_bytes);
local.memory_used = AtomicU64::new(0);
```
local.memory_limit = AtomicU64::new(...) and local.memory_used = AtomicU64::new(...) replace the AtomicU64 objects via plain (non-atomic) writes. If any other thread reads these fields with atomic ops (e.g., gRPC GetUtilization calling memory_metrics) concurrently, this is a data race/UB under Rust's memory model. Prefer initializing/updating these fields via store(...) on the existing atomics instead of assigning new AtomicU64 values.
Suggested change:

```diff
-local.memory_limit = AtomicU64::new(memory_limit_bytes);
-local.memory_used = AtomicU64::new(0);
+local.memory_limit.store(memory_limit_bytes, Ordering::Relaxed);
+local.memory_used.store(0, Ordering::Relaxed);
```
```rust
            thread::sleep(OPEN_LOCAL_SHMEM_POLL);
        };

let local_path = local_shmem_name();
// 1. Create SHM
let global_reg = shm_setup::open_global_registry::<GlobalRegistry>(&global_path);
let local_shm = shm_setup::create_shmem::<LocalContainerShmem>(local_path.as_str());

local_shmems.push(unsafe { &*(ptr as *const LocalContainerShmem) });
}
```
try_open_shmem returns &'static mut LocalContainerShmem, but the result is later cast to &LocalContainerShmem and stored. Creating a shared reference while the original &mut exists (even briefly) is undefined behavior. Prefer using raw pointers for the mapping (or change try_open_shmem to return a raw pointer/shared ref) so you never materialize an &mut when only read access is needed.
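A sketch of the raw-pointer alternative, assuming the existing helper is a thin `shm_open` + `mmap` wrapper and that the crate already depends on `libc` (both assumptions; the function name is illustrative):

```rust
use std::ffi::CString;

/// Open an existing POSIX shared-memory object and return a raw pointer,
/// so callers decide whether to form `&T` or `&mut T` (never both at once).
pub fn try_open_shmem_ptr<T>(name: &str) -> Option<*mut T> {
    let c_name = CString::new(name).ok()?;
    let fd = unsafe { libc::shm_open(c_name.as_ptr(), libc::O_RDWR, 0o666) };
    if fd < 0 {
        return None; // not created yet (or inaccessible); caller may retry
    }
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            std::mem::size_of::<T>(),
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            fd,
            0,
        )
    };
    unsafe { libc::close(fd) };
    if ptr == libc::MAP_FAILED {
        None
    } else {
        Some(ptr.cast::<T>())
    }
}
```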
```rust
let raw = std::env::var("ASCEND_RT_VISIBLE_DEVICES").unwrap_or_else(|_| "0".to_string());
let mut set = BTreeSet::new();
for v in raw.split(',').filter_map(|s| s.trim().parse::<u32>().ok()) {
    set.insert(v);
}
```
Using a BTreeSet here deduplicates and sorts ASCEND_RT_VISIBLE_DEVICES, which can change the visible-device ordering. If the runtime uses env order for logical device indices, this can make idx as i32 (used later for rtSetDevice in memory reporting) refer to a different physical device than the returned device_id. Preserve the original env order instead of sorting.
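A sketch of an order-preserving parse (the function name is illustrative):

```rust
use std::collections::HashSet;

// Keep the first occurrence of each device id in env order; only drop repeats.
fn visible_devices_in_env_order() -> Vec<u32> {
    let raw = std::env::var("ASCEND_RT_VISIBLE_DEVICES").unwrap_or_else(|_| "0".to_string());
    let mut seen = HashSet::new();
    raw.split(',')
        .filter_map(|s| s.trim().parse::<u32>().ok())
        .filter(|d| seen.insert(*d))
        .collect()
}
```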
```rust
// 1. Open shared memory. `open_global_registry` returns a
// `&'static mut GlobalRegistry`; the reporter needs a shared
// `&'static GlobalRegistry` view of the same mapping, and the
// manager (further below) consumes the same shared reference.
// All fields are atomics, so aliasing as `&` is sound.
let global_reg_mut: &'static mut GlobalRegistry =
    shm_setup::open_global_registry::<GlobalRegistry>(&global_path);
let global_reg: &'static GlobalRegistry =
    unsafe { &*(global_reg_mut as *const GlobalRegistry) };
let local_shm = shm_setup::create_shmem::<LocalContainerShmem>(local_path.as_str());

// 2. Start the utilization reporter. It locates its slot by
// matching on pid, which is written during `ContainerManager::new`
// just below, so a brief retry inside the reporter is expected.
reporter.start(
    global_reg,
    unsafe { &*(local_shm as *const LocalContainerShmem) },
    pid_for_thread,
);

// 3. Initialize the manager and enter its scheduling loop.
let mut manager =
    ContainerManager::new(global_reg, local_shm, pid_for_thread as i32, p_atomic);
```
The current pattern keeps an &'static mut GlobalRegistry (global_reg_mut) alive while also creating an &'static GlobalRegistry pointing at the same mapping. Holding & and &mut aliases to the same memory is undefined behavior in Rust, even if you only access atomics. Consider changing open_global_registry/create_shmem call sites to work with raw pointers (or return a shared reference from the API) so you never have overlapping &mut + & lifetimes for the same region.
Suggested change:

```diff
-// 1. Open shared memory. `open_global_registry` returns a
-// `&'static mut GlobalRegistry`; the reporter needs a shared
-// `&'static GlobalRegistry` view of the same mapping, and the
-// manager (further below) consumes the same shared reference.
-// All fields are atomics, so aliasing as `&` is sound.
-let global_reg_mut: &'static mut GlobalRegistry =
-    shm_setup::open_global_registry::<GlobalRegistry>(&global_path);
-let global_reg: &'static GlobalRegistry =
-    unsafe { &*(global_reg_mut as *const GlobalRegistry) };
-let local_shm = shm_setup::create_shmem::<LocalContainerShmem>(local_path.as_str());
+// 1. Open shared memory, but convert the returned mutable
+// references to raw pointers immediately so we do not keep
+// overlapping `&mut` and `&` bindings alive for the same
+// mapped region.
+let global_reg_ptr =
+    shm_setup::open_global_registry::<GlobalRegistry>(&global_path)
+        as *mut GlobalRegistry;
+let local_shm_ptr =
+    shm_setup::create_shmem::<LocalContainerShmem>(local_path.as_str())
+        as *mut LocalContainerShmem;
 // 2. Start the utilization reporter. It locates its slot by
 // matching on pid, which is written during `ContainerManager::new`
 // just below, so a brief retry inside the reporter is expected.
 reporter.start(
-    global_reg,
-    unsafe { &*(local_shm as *const LocalContainerShmem) },
+    unsafe { &*global_reg_ptr },
+    unsafe { &*local_shm_ptr },
     pid_for_thread,
 );
 // 3. Initialize the manager and enter its scheduling loop.
-let mut manager =
-    ContainerManager::new(global_reg, local_shm, pid_for_thread as i32, p_atomic);
+let mut manager = unsafe {
+    ContainerManager::new(
+        &*global_reg_ptr,
+        &mut *local_shm_ptr,
+        pid_for_thread as i32,
+        p_atomic,
+    )
+};
```
```rust
let local_shm = shm_setup::create_shmem::<LocalContainerShmem>(local_path.as_str());

// 2. Start the utilization reporter. It locates its slot by
// matching on pid, which is written during `ContainerManager::new`
// just below, so a brief retry inside the reporter is expected.
reporter.start(
    global_reg,
    unsafe { &*(local_shm as *const LocalContainerShmem) },
    pid_for_thread,
);

// 3. Initialize the manager and enter its scheduling loop.
let mut manager =
    ContainerManager::new(global_reg, local_shm, pid_for_thread as i32, p_atomic);
```
local_shm is an &'static mut LocalContainerShmem, but this passes a shared &LocalContainerShmem to the reporter via pointer cast while the mutable reference is still live. Holding & and &mut aliases to the same region is undefined behavior in Rust. Consider using raw pointers for the mapping (or changing the shmem APIs) so you never have overlapping mutable/shared references.
Suggested change:

```diff
-let local_shm = shm_setup::create_shmem::<LocalContainerShmem>(local_path.as_str());
+// Use separate mappings for the reporter and manager so we never
+// create overlapping `&LocalContainerShmem` / `&mut LocalContainerShmem`
+// references to the same Rust object.
+let local_shm_for_manager =
+    shm_setup::create_shmem::<LocalContainerShmem>(local_path.as_str());
+let local_shm_for_reporter =
+    shm_setup::open_shmem::<LocalContainerShmem>(local_path.as_str());
 // 2. Start the utilization reporter. It locates its slot by
 // matching on pid, which is written during `ContainerManager::new`
 // just below, so a brief retry inside the reporter is expected.
-reporter.start(
-    global_reg,
-    unsafe { &*(local_shm as *const LocalContainerShmem) },
-    pid_for_thread,
-);
+reporter.start(global_reg, local_shm_for_reporter, pid_for_thread);
 // 3. Initialize the manager and enter its scheduling loop.
-let mut manager =
-    ContainerManager::new(global_reg, local_shm, pid_for_thread as i32, p_atomic);
+let mut manager = ContainerManager::new(
+    global_reg,
+    local_shm_for_manager,
+    pid_for_thread as i32,
+    p_atomic,
+);
```
```rust
// Soften permissions so host users can access it without sudo
if let Ok(mut perms) = std::fs::metadata(&uds_path).map(|m| m.permissions()) {
    perms.set_mode(0o777);
```
The UDS socket is chmod'd to 0o777, making it world-writable. That allows any local user on the host to change container priority and query utilization if they can access the mount path. Consider defaulting to a less permissive mode (e.g., 0o660) and/or gating the relaxation behind an env var so operators can choose the risk level.
Suggested change:

```diff
-// Soften permissions so host users can access it without sudo
-if let Ok(mut perms) = std::fs::metadata(&uds_path).map(|m| m.permissions()) {
-    perms.set_mode(0o777);
+// Default to owner/group access; allow world-writable only as an explicit opt-in.
+let uds_mode = match std::env::var("NPU_LIMITER_UDS_WORLD_WRITABLE") {
+    Ok(value) if value == "1" || value.eq_ignore_ascii_case("true") => 0o777,
+    _ => 0o660,
+};
+if let Ok(mut perms) = std::fs::metadata(&uds_path).map(|m| m.permissions()) {
+    perms.set_mode(uds_mode);
```
```bash
  -v $HOST_PROJECT_DIR:$CONTAINER_PROJECT_DIR \
  -v /mnt/nvme0/LLMs/:/models \
  --name "$CONTAINER_NAME" \
  -e LD_PRELOAD=$LIBRARY_PATH \
```
This sets LD_PRELOAD=$LIBRARY_PATH at the container level, so the limiter daemon itself (started later) will run under the hook library. That can cause the manager process to initialize SchedulerClient unexpectedly (e.g., via rtSetDevice / rtMemGetInfoEx in gRPC memory reporting) and register as a worker in local shmem. Recommend removing container-wide LD_PRELOAD and only setting it for the workload processes.
Suggested change:

```diff
-  -e LD_PRELOAD=$LIBRARY_PATH \
```
This branch holds features such as dynamic NPU priority changes and utilization reporting.