Add structured stats reporting and GPU memory tracking to Qwen3.5 MoE runner #19190
digantdesai wants to merge 6 commits into gh/digantdesai/53/base from
Conversation
… runner Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table. This commit was authored with the assistance of Claude Code. [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19190
Note: Links to docs will display an error until the docs builds have been completed.
❌ 6 New Failures, 2 Cancelled Jobs, 1 Unrelated Failure
As of commit 87c9947 with merge base cb4e5ae.
NEW FAILURES - The following jobs have failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
…Qwen3.5 MoE runner" Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table. This commit was authored with the assistance of Claude Code. [ghstack-poisoned]
… runner Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table. This commit was authored with the assistance of Claude Code. ghstack-source-id: c6f86d3 Pull Request resolved: #19190
…Qwen3.5 MoE runner" Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table. This commit was authored with the assistance of Claude Code. [ghstack-poisoned]
… runner Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table. This commit was authored with the assistance of Claude Code. ghstack-source-id: b04dd37 Pull Request resolved: #19190
…Qwen3.5 MoE runner" Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table. This commit was authored with the assistance of Claude Code. [ghstack-poisoned]
… runner Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table. This commit was authored with the assistance of Claude Code. ghstack-source-id: 1c45a1e Pull Request resolved: #19190
…Qwen3.5 MoE runner" Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table. This commit was authored with the assistance of Claude Code. [ghstack-poisoned]
… runner Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table. This commit was authored with the assistance of Claude Code. ghstack-source-id: fea9eb8 Pull Request resolved: #19190
def _set_batched_moe(model, enabled, moe_moe_moe_activation_dtype="bf16"):
def _set_batched_moe(model, enabled, moe_moe_moe_moe_activation_dtype="bf16"):
why another extra MoE lol
// GPU memory: before load
{
  size_t free = 0, total = 0;
  if (cudaMemGetInfo(&free, &total) == cudaSuccess) {
Right now none of the CUDA calls are guarded by the EXECUTORCH_BUILD_CUDA macro; I think this will crash when the MPS or Metal backend uses this runner.
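A minimal sketch of how the memory query could be fenced off so non-CUDA builds still compile and run, assuming EXECUTORCH_BUILD_CUDA is the build flag the comment refers to; the report_gpu_memory helper name is hypothetical, not part of this PR:

#ifdef EXECUTORCH_BUILD_CUDA
#include <cuda_runtime.h>
#endif
#include <cstdio>

// Hypothetical helper: query and print GPU memory only when the runner is
// built with CUDA support; otherwise compile to a no-op.
static void report_gpu_memory(const char* tag) {
#ifdef EXECUTORCH_BUILD_CUDA
  size_t free_bytes = 0, total_bytes = 0;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
    printf(
        "GPU memory (%s): %.1f MiB used / %.1f MiB total\n",
        tag,
        (double)(total_bytes - free_bytes) / (1024.0 * 1024.0),
        (double)total_bytes / (1024.0 * 1024.0));
  }
#else
  (void)tag; // No CUDA runtime (e.g. Metal/MPS builds): do nothing.
#endif
}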
double decode_ms =
    (double)(stats.inference_end_ms - stats.prompt_eval_end_ms);
printf(
    "Prefill: %" PRId64 " tokens in %.1f ms (%.1f tok/s)\n",
why do we want to print prefill twice?
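For reference, a rough sketch of computing the prefill and decode rates once each from the timestamps used above; only prompt_eval_end_ms and inference_end_ms appear in the diff, so the other field names and the StatsLike struct are illustrative stand-ins, not the PR's actual code:

#include <cinttypes>
#include <cstdio>

// Illustrative stand-in for the subset of llm::Stats this math needs.
struct StatsLike {
  int64_t prompt_eval_start_ms;
  int64_t prompt_eval_end_ms; // prefill finished
  int64_t inference_end_ms;   // decode finished
  int64_t num_prompt_tokens;
  int64_t num_generated_tokens;
};

void print_throughput(const StatsLike& stats) {
  double prefill_ms =
      (double)(stats.prompt_eval_end_ms - stats.prompt_eval_start_ms);
  double decode_ms =
      (double)(stats.inference_end_ms - stats.prompt_eval_end_ms);
  printf(
      "Prefill: %" PRId64 " tokens in %.1f ms (%.1f tok/s)\n",
      stats.num_prompt_tokens,
      prefill_ms,
      prefill_ms > 0 ? 1000.0 * stats.num_prompt_tokens / prefill_ms : 0.0);
  printf(
      "Decode:  %" PRId64 " tokens in %.1f ms (%.1f tok/s)\n",
      stats.num_generated_tokens,
      decode_ms,
      decode_ms > 0 ? 1000.0 * stats.num_generated_tokens / decode_ms : 0.0);
}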
…Qwen3.5 MoE runner" Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table. This commit was authored with the assistance of Claude Code. [ghstack-poisoned]
… runner Runner now uses llm::Stats with proper timestamps for model load, prefill, decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h print_report format: PyTorchObserver JSON line plus human-readable table. This commit was authored with the assistance of Claude Code. ghstack-source-id: 9227519 Pull Request resolved: #19190
Stack from ghstack (oldest at bottom):
Runner now uses llm::Stats with proper timestamps for model load, prefill,
decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h
print_report format: PyTorchObserver JSON line plus human-readable table.
This commit was authored with the assistance of Claude Code.
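As a rough illustration of the flow this summary describes (not the PR's exact code), here is a sketch of a runner populating the Stats timestamps and then printing the report; prompt_eval_end_ms and inference_end_ms come from the diff above, while the remaining field names, the include path, and the now_ms() helper are assumptions to be checked against stats.h:

#include <chrono>
#include <cstdint>
#include <executorch/extension/llm/runner/stats.h> // assumed header location

using executorch::extension::llm::Stats;

// Stand-in millisecond clock; the real runner may use a helper from stats.h.
static int64_t now_ms() {
  return std::chrono::duration_cast<std::chrono::milliseconds>(
             std::chrono::steady_clock::now().time_since_epoch())
      .count();
}

void run_with_stats(/* model path, prompt, ... */) {
  Stats stats;

  stats.model_load_start_ms = now_ms();
  // ... load the exported MoE model ...
  stats.model_load_end_ms = now_ms();

  stats.inference_start_ms = now_ms();
  // ... prefill the prompt ...
  stats.prompt_eval_end_ms = now_ms();
  // ... decode loop, one token per step ...
  stats.inference_end_ms = now_ms();

  stats.num_prompt_tokens = /* prompt length */ 0;
  stats.num_generated_tokens = /* generated length */ 0;

  // Emits the PyTorchObserver JSON line followed by the human-readable table.
  print_report(stats);
}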