Add simple timer metrics for trainer profiling#425

Draft
mgoin wants to merge 1 commit into main from simple-profiling

Member

@mgoin mgoin commented Apr 14, 2026

Commands:

python prepare_data.py --model Qwen/Qwen3-8B --data sharegpt --max-samples 2000 --seq-length 4096

# vLLM on GPU 2
CUDA_VISIBLE_DEVICES=2 python scripts/launch_vllm.py Qwen/Qwen3-8B \
    --hidden-states-path ./profile_run/hidden_states \
    -- --port 8000 --max-model-len 5120 --gpu-memory-utilization 0.85

# Trainer on GPU 3
CUDA_VISIBLE_DEVICES=3 python scripts/train.py \
    --verifier-name-or-path Qwen/Qwen3-8B \
    --data-path ./profile_run/data \
    --hidden-states-path ./profile_run/hidden_states \
    --vllm-endpoint http://localhost:8000/v1 \
    --save-path ./profile_run/checkpoint \
    --draft-vocab-size 32000 \
    --epochs 1 \
    --total-seq-len 4096 \
    --logger tensorboard \
    --log-dir ./profile_run/tb \
    --run-name qwen3_8b_profile \
    --log-interval 10 \
    --no-resume-from-checkpoint

Per-window profile output:

[win 0, warmup]  profile/fetch_ms=508.218 profile/h2d_ms=2.558  profile/fwd_ms=754.997 profile/bwd_ms=24.885 profile/opt_ms=26.411
profile/step_ms=1.32e+03 profile/tokens_per_s=3.11e+03 profile/fetch_frac=0.386 profile/gpu_util_pct=5.176
[win 1]          profile/fetch_ms=1.837   profile/h2d_ms=2.485  profile/fwd_ms=34.157  profile/bwd_ms=19.875 profile/opt_ms=5.641
profile/step_ms=63.996   profile/tokens_per_s=6.40e+04 profile/fetch_frac=0.029 profile/gpu_util_pct=90.000
[win 2]          profile/fetch_ms=1.912   profile/h2d_ms=3.463  profile/fwd_ms=13.503  profile/bwd_ms=20.270 profile/opt_ms=5.666
profile/step_ms=44.813   profile/tokens_per_s=9.14e+04 profile/fetch_frac=0.043 profile/gpu_util_pct=91.000
[win 3]          profile/fetch_ms=3.176   profile/h2d_ms=4.275  profile/fwd_ms=15.440  profile/bwd_ms=20.543 profile/opt_ms=5.927
profile/step_ms=49.361   profile/tokens_per_s=8.30e+04 profile/fetch_frac=0.064 profile/gpu_util_pct=80.500
[win 4]          profile/fetch_ms=2.408   profile/h2d_ms=2.492  profile/fwd_ms=12.996  profile/bwd_ms=19.709 profile/opt_ms=6.430
profile/step_ms=44.035   profile/tokens_per_s=9.30e+04 profile/fetch_frac=0.055 profile/gpu_util_pct=81.500
[win 5]          profile/fetch_ms=18.445  profile/h2d_ms=3.857  profile/fwd_ms=14.518  profile/bwd_ms=19.791 profile/opt_ms=6.570
profile/step_ms=63.180   profile/tokens_per_s=6.48e+04 profile/fetch_frac=0.292 profile/gpu_util_pct=75.500
[win 6]          profile/fetch_ms=53.687  profile/h2d_ms=2.460  profile/fwd_ms=12.457  profile/bwd_ms=19.482 profile/opt_ms=5.628
profile/step_ms=93.714   profile/tokens_per_s=4.37e+04 profile/fetch_frac=0.573 profile/gpu_util_pct=45.500
[win 7]          profile/fetch_ms=149.371 profile/h2d_ms=3.350  profile/fwd_ms=12.768  profile/bwd_ms=21.318 profile/opt_ms=5.646
profile/step_ms=192.453  profile/tokens_per_s=2.13e+04 profile/fetch_frac=0.776 profile/gpu_util_pct=23.875
[win 8]          profile/fetch_ms=143.065 profile/h2d_ms=2.433  profile/fwd_ms=14.576  profile/bwd_ms=22.914 profile/opt_ms=5.915
profile/step_ms=188.903  profile/tokens_per_s=2.17e+04 profile/fetch_frac=0.757 profile/gpu_util_pct=22.286
[win 9]          profile/fetch_ms=143.723 profile/h2d_ms=2.480  profile/fwd_ms=12.542  profile/bwd_ms=19.724 profile/opt_ms=5.636
profile/step_ms=184.105  profile/tokens_per_s=2.22e+04 profile/fetch_frac=0.781 profile/gpu_util_pct=26.000
[win 10]         profile/fetch_ms=2.424   profile/h2d_ms=2.452  profile/fwd_ms=13.355  profile/bwd_ms=19.999 profile/opt_ms=6.401
profile/step_ms=44.630   profile/tokens_per_s=9.18e+04 profile/fetch_frac=0.054 profile/gpu_util_pct=84.000
[win 11]         profile/fetch_ms=65.846  profile/h2d_ms=2.463  profile/fwd_ms=13.213  profile/bwd_ms=20.802 profile/opt_ms=6.507
profile/step_ms=108.831  profile/tokens_per_s=3.76e+04 profile/fetch_frac=0.605 profile/gpu_util_pct=46.000
[win 12]         profile/fetch_ms=87.614  profile/h2d_ms=3.186  profile/fwd_ms=16.729  profile/bwd_ms=21.317 profile/opt_ms=6.307
profile/step_ms=135.153  profile/tokens_per_s=3.03e+04 profile/fetch_frac=0.648 profile/gpu_util_pct=17.600
[win 13]         profile/fetch_ms=96.599  profile/h2d_ms=2.488  profile/fwd_ms=14.576  profile/bwd_ms=21.311 profile/opt_ms=5.698
profile/step_ms=140.671  profile/tokens_per_s=2.91e+04 profile/fetch_frac=0.687 profile/gpu_util_pct=29.333

Sustained throughput:
{'sustained_tokens_per_s': 40816.877346239264,
'sustained_fetch_frac':   0.5533778186932062,
'sustained_step_ms':      100.35064576975515,
'sustained_steps':        139.0}
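
The derived columns in the output above appear to follow directly from the five per-phase timers. The sketch below is a reading of the logged numbers, not the PR's actual implementation. It assumes that `step_ms` is the sum of the five phase timings, that `fetch_frac` is `fetch_ms / step_ms`, and that every step processes 4096 tokens (matching `--total-seq-len 4096`). The function name `derive_metrics` and the constant `TOKENS_PER_STEP` are hypothetical.

```python
# Assumption: one step consumes --total-seq-len tokens (4096 in this run).
TOKENS_PER_STEP = 4096


def derive_metrics(fetch_ms, h2d_ms, fwd_ms, bwd_ms, opt_ms,
                   tokens_per_step=TOKENS_PER_STEP):
    """Recompute the derived profile columns from the per-phase timers.

    Hypothetical helper: infers the relationships from the logged values,
    it is not taken from the trainer code.
    """
    step_ms = fetch_ms + h2d_ms + fwd_ms + bwd_ms + opt_ms
    return {
        "step_ms": step_ms,
        "tokens_per_s": tokens_per_step / (step_ms / 1000.0),
        "fetch_frac": fetch_ms / step_ms,
    }


# Window 7 from the log: a fetch-bound window.
win7 = derive_metrics(149.371, 3.350, 12.768, 21.318, 5.646)
print(win7)  # logged: step_ms=192.453, tokens_per_s=2.13e+04, fetch_frac=0.776

# Window 1 from the log: a compute-bound window.
win1 = derive_metrics(1.837, 2.485, 34.157, 19.875, 5.641)
print(win1)  # logged: step_ms=63.996, tokens_per_s=6.40e+04, fetch_frac=0.029

# The sustained figures are consistent with the same relationship.
sustained_tokens_per_s = TOKENS_PER_STEP / (100.35064576975515 / 1000.0)
print(sustained_tokens_per_s)  # logged: 40816.877346239264
```

Under these assumptions, the windows where throughput drops (6 to 9 and 11 to 13) are the ones where `fetch_ms` dominates the step, while `fwd_ms`, `bwd_ms`, and `opt_ms` stay roughly constant.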

Signed-off-by: mgoin <mgoin64@gmail.com>
Contributor

coderabbitai Bot commented Apr 14, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5e501a85-0b0c-497a-aaf5-e94898121d61

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



mergify Bot commented Apr 14, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md
