
Add structured stats reporting and GPU memory tracking to Qwen3.5 MoE runner #19190

Open
digantdesai wants to merge 6 commits into gh/digantdesai/53/base from gh/digantdesai/53/head

Conversation

Contributor

@digantdesai digantdesai commented Apr 28, 2026

Stack from ghstack (oldest at bottom):

Runner now uses llm::Stats with proper timestamps for model load, prefill,
decode, and GPU memory (via cudaMemGetInfo). Output matches stats.h
print_report format: PyTorchObserver JSON line plus human-readable table.

This commit was authored with the assistance of Claude Code.
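
For readers unfamiliar with the flow, here is a minimal sketch of what the description implies, using stand-in types: the field and helper names (time_in_ms, model_load_start_ms, and so on) are assumptions modeled on llm::Stats conventions, not copied from this PR's diff.

#include <chrono>
#include <cinttypes>
#include <cstdio>

// Stand-in for llm::Stats; field names are assumptions for this sketch.
struct Stats {
  int64_t model_load_start_ms = 0, model_load_end_ms = 0;
  int64_t inference_start_ms = 0, prompt_eval_end_ms = 0, inference_end_ms = 0;
};

// Wall-clock milliseconds, standing in for the runner's timer helper.
static int64_t time_in_ms() {
  return std::chrono::duration_cast<std::chrono::milliseconds>(
             std::chrono::steady_clock::now().time_since_epoch())
      .count();
}

int main() {
  Stats stats;
  stats.model_load_start_ms = time_in_ms();
  // ... load the .pte program here ...
  stats.model_load_end_ms = time_in_ms();

  stats.inference_start_ms = time_in_ms();
  // ... prefill the prompt here ...
  stats.prompt_eval_end_ms = time_in_ms();
  // ... decode tokens here ...
  stats.inference_end_ms = time_in_ms();

  // In the runner, llm::print_report(stats) would emit the PyTorchObserver
  // JSON line plus the human-readable table at this point.
  printf(
      "model load took %" PRId64 " ms\n",
      stats.model_load_end_ms - stats.model_load_start_ms);
  return 0;
}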


pytorch-bot Bot commented Apr 28, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19190

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 2 Cancelled Jobs, 1 Unrelated Failure

As of commit 87c9947 with merge base cb4e5ae:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Apr 28, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

digantdesai added a commit that referenced this pull request Apr 28, 2026 (ghstack-source-id: c6f86d3; Pull Request resolved: #19190)
digantdesai added a commit that referenced this pull request Apr 28, 2026 (ghstack-source-id: b04dd37; Pull Request resolved: #19190)
digantdesai added a commit that referenced this pull request Apr 28, 2026 (ghstack-source-id: 1c45a1e; Pull Request resolved: #19190)
digantdesai added a commit that referenced this pull request Apr 29, 2026 (ghstack-source-id: fea9eb8; Pull Request resolved: #19190)
Contributor

@Gasoonjia Gasoonjia left a comment

Thanks for adding so much detail! My major concern is that it may crash in a non-CUDA scenario. Added CI labels to confirm.

Comment thread: examples/models/qwen3_5_moe/export.py (Outdated)

-def _set_batched_moe(model, enabled, moe_moe_moe_activation_dtype="bf16"):
+def _set_batched_moe(model, enabled, moe_moe_moe_moe_activation_dtype="bf16"):
Contributor

why another extra MoE lol

// GPU memory: before load
{
  size_t free = 0, total = 0;
  if (cudaMemGetInfo(&free, &total) == cudaSuccess) {
Contributor

Right now none of the CUDA functions are under the EXECUTORCH_BUILD_CUDA macro; I think it will crash when the MPS or Metal backend uses this runner.
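
A minimal sketch of the guard being suggested, so non-CUDA builds compile the query out entirely; the helper name and the zero fallback are assumptions, and only cudaMemGetInfo and the EXECUTORCH_BUILD_CUDA macro come from the thread above.

#include <cstddef>
#ifdef EXECUTORCH_BUILD_CUDA
#include <cuda_runtime.h>
#endif

// Report free/total device memory in bytes, or 0/0 when CUDA is not built
// in, so MPS/Metal builds neither crash nor fail to link.
static void query_gpu_memory(size_t& free_bytes, size_t& total_bytes) {
  free_bytes = 0;
  total_bytes = 0;
#ifdef EXECUTORCH_BUILD_CUDA
  if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
    free_bytes = 0;
    total_bytes = 0;
  }
#endif
}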

double decode_ms =
    (double)(stats.inference_end_ms - stats.prompt_eval_end_ms);
printf(
    "Prefill: %" PRId64 " tokens in %.1f ms (%.1f tok/s)\n",
Contributor

why do we want to print prefill twice?
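
For reference, a sketch of what printing each phase once might look like; inference_end_ms and prompt_eval_end_ms are quoted in the snippet above, while the other field names and the stand-in struct are assumptions.

#include <cinttypes>
#include <cstdio>

// Stand-in for llm::Stats with only the fields this sketch needs.
struct Stats {
  int64_t inference_start_ms, prompt_eval_end_ms, inference_end_ms;
  int64_t num_prompt_tokens, num_generated_tokens;
};

// Prefill runs from inference_start_ms to prompt_eval_end_ms; decode runs
// from there to inference_end_ms. Each line is printed exactly once.
void print_phase_rates(const Stats& stats) {
  double prefill_ms =
      (double)(stats.prompt_eval_end_ms - stats.inference_start_ms);
  double decode_ms =
      (double)(stats.inference_end_ms - stats.prompt_eval_end_ms);
  printf(
      "Prefill: %" PRId64 " tokens in %.1f ms (%.1f tok/s)\n",
      stats.num_prompt_tokens,
      prefill_ms,
      prefill_ms > 0 ? stats.num_prompt_tokens * 1000.0 / prefill_ms : 0.0);
  printf(
      "Decode:  %" PRId64 " tokens in %.1f ms (%.1f tok/s)\n",
      stats.num_generated_tokens,
      decode_ms,
      decode_ms > 0 ? stats.num_generated_tokens * 1000.0 / decode_ms : 0.0);
}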

digantdesai added a commit that referenced this pull request Apr 29, 2026 (ghstack-source-id: 9227519; Pull Request resolved: #19190)

Labels

ciflow/cuda, ciflow/metal, ciflow/mlx, CLA Signed (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)

Projects

None yet


2 participants