[Perf Regression] 32 config(s) regressed @ 6619cc7d #830

@github-actions

Performance Regression Detected

Commit: 6619cc7d
Run: https://github.com/ROCm/ATOM/actions/runs/26050231461
Date: 2026-05-19T05:00:10.609891+00:00

Regressed Configurations

| Model | ISL/OSL | Conc | Tput (cur) | Tput (base) | Tput Δ% | TPOT (cur) | TPOT (base) | TPOT Δ% |
|-------|---------|------|------------|-------------|---------|------------|-------------|---------|
| DeepSeek-R1-0528 | 1024/1024 | 8 | 580.9 | 679.2 | -14.5% | 13.33 | 11.41 | 16.9% |
| DeepSeek-R1-0528 | 8192/1024 | 4 | 312.7 | 330.3 | -5.3% | 12.14 | 11.49 | 5.7% |
| DeepSeek-R1-0528 | 8192/1024 | 8 | 569.9 | 587.1 | -2.9% | 13.36 | 13.00 | 2.7% |
| DeepSeek-R1-0528 MTP3 | 1024/1024 | 4 | 405.1 | 567.5 | -28.6% | 9.35 | 6.70 | 39.5% |
| DeepSeek-R1-0528 MTP3 | 1024/1024 | 16 | 1081.2 | 1440.9 | -25.0% | 14.36 | 10.48 | 37.1% |
| DeepSeek-R1-0528 MTP3 | 1024/1024 | 32 | 2091.8 | 2465.0 | -15.1% | 14.08 | 12.36 | 13.9% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 4 | 479.6 | 485.6 | -1.2% | 7.50 | 7.53 | -0.5% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 16 | 995.2 | 1117.1 | -10.9% | 11.82 | 13.15 | -10.2% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 32 | 1487.7 | 1575.8 | -5.6% | 19.44 | 16.94 | 14.7% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 64 | 2025.8 | 2155.2 | -6.0% | 27.93 | 27.34 | 2.2% |
| DeepSeek-R1-0528-MXFP4 | 1024/1024 | 16 | 1127.5 | 1137.3 | -0.9% | 13.58 | 13.55 | 0.2% |
| DeepSeek-R1-0528-MXFP4 | 1024/1024 | 32 | 1716.0 | 1758.4 | -2.4% | 17.97 | 17.55 | 2.4% |
| DeepSeek-R1-0528-MXFP4 MTP3 | 1024/1024 | 4 | 617.1 | 633.6 | -2.6% | 5.88 | 6.07 | -3.2% |
| DeepSeek-R1-0528-MXFP4 MTP3 | 1024/1024 | 16 | 1543.0 | 1571.3 | -1.8% | 9.76 | 9.70 | 0.6% |
| DeepSeek-R1-0528-MXFP4 MTP3 | 1024/1024 | 32 | 2328.8 | 2026.0 | 14.9% | 12.95 | 15.15 | -14.6% |
| DeepSeek-V4-Pro | 1024/1024 | 4 | 170.1 | 175.3 | -3.0% | 21.70 | 21.86 | -0.7% |
| DeepSeek-V4-Pro | 1024/1024 | 8 | 336.0 | 331.3 | 1.4% | 22.84 | 23.36 | -2.2% |
| DeepSeek-V4-Pro | 1024/1024 | 64 | 1545.3 | 1535.7 | 0.6% | 39.55 | 40.00 | -1.1% |
| DeepSeek-V4-Pro | 1024/1024 | 128 | 2441.5 | 2393.8 | 2.0% | 50.31 | 51.44 | -2.2% |
| DeepSeek-V4-Pro | 8192/1024 | 64 | 1005.9 | 995.8 | 1.0% | 59.83 | 60.88 | -1.7% |
| DeepSeek-V4-Pro | 8192/1024 | 128 | 1297.2 | 1284.1 | 1.0% | 91.74 | 93.24 | -1.6% |
| GLM-5-FP8 | 8192/1024 | 32 | 754.5 | 756.9 | -0.3% | 40.01 | 40.01 | 0.0% |
| Kimi-K2.5-MXFP4 | 1024/1024 | 16 | 1091.1 | 1103.0 | -1.1% | 14.16 | 14.09 | 0.5% |
| Kimi-K2.5-MXFP4 | 1024/1024 | 256 | 4959.6 | 4960.6 | -0.0% | 49.54 | 49.82 | -0.6% |
| MiniMax-M2.7-MXFP4 | 1024/1024 | 4 | 380.1 | 389.1 | -2.3% | 10.12 | 9.89 | 2.3% |
| MiniMax-M2.7-MXFP4 | 1024/1024 | 16 | 1050.0 | 1039.5 | 1.0% | 14.79 | 14.96 | -1.2% |
| MiniMax-M2.7-MXFP4 | 1024/1024 | 32 | 1642.0 | 1620.3 | 1.3% | 18.82 | 19.18 | -1.9% |
| Qwen3.5-397B-A17B-FP8 | 1024/1024 | 64 | 2818.3 | 2857.6 | -1.4% | 21.92 | 21.64 | 1.3% |
| Qwen3.5-397B-A17B-FP8 | 8192/1024 | 4 | 409.9 | 423.2 | -3.1% | 9.23 | 8.96 | 3.0% |
| Qwen3.5-397B-A17B-FP8 | 8192/1024 | 8 | 708.6 | 731.4 | -3.1% | 10.64 | 10.40 | 2.4% |
| Qwen3.5-397B-A17B-MXFP4 | 8192/1024 | 64 | 2326.7 | 2355.1 | -1.2% | 25.81 | 25.72 | 0.3% |
| gpt-oss-120b | 1024/1024 | 16 | 2641.3 | 2621.1 | 0.8% | 5.71 | 5.84 | -2.3% |
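
The Δ% columns appear to be the relative change of the current value against the baseline. The sketch below recomputes them for the first row under that assumption; `pct_change` is a hypothetical helper, not part of the reporting workflow. Because the table shows rounded inputs, a recomputed value can differ from the reported one by about 0.1 percentage points.

```python
def pct_change(cur: float, base: float) -> float:
    """Relative change of cur versus base, in percent (assumed definition of Δ%)."""
    return (cur - base) / base * 100.0


# First row of the table: DeepSeek-R1-0528, ISL/OSL 1024/1024, concurrency 8.
tput_delta = pct_change(580.9, 679.2)   # throughput: lower is worse
tpot_delta = pct_change(13.33, 11.41)   # TPOT: higher is worse

print(f"Tput {tput_delta:+.1f}%  TPOT {tpot_delta:+.1f}%")
```

Note the opposite sign conventions: a regression shows up as a negative throughput Δ% and a positive TPOT Δ%.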

Performance Summary

# Trace Performance Summary

**File:** `DeepSeek-R1-0528_ts_20260519_054632_204.pt.trace.json.gz`

## Prefill

| # | Label | Duration |
|---|-------|----------|
| 0 | `prefill[bs=2 tok=901 ctx=[991, 886]]` | 87.24 ms |
| 1 | `prefill[bs=6 tok=5577 ctx=[1014, 866, 922]...+3]` | 87.64 ms |
| 2 | `prefill[bs=1 tok=840 ctx=840]` | 87.61 ms |
| 3 | `prefill[bs=1 tok=855 ctx=855]` | 83.61 ms |
| 4 | `prefill[bs=1 tok=906 ctx=906]` | 87.18 ms |
| 5 | `prefill[bs=1 tok=889 ctx=889]` | 82.08 ms |
| 6 | `prefill[bs=2 tok=1866 ctx=[907, 959]]` | 85.97 ms |
| 7 | `prefill[bs=1 tok=877 ctx=877]` | 86.78 ms |
| 8 | `prefill[bs=1 tok=1012 ctx=1012]` | 80.91 ms |

**Total prefill:** 769.02 ms
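
As a quick arithmetic check, the nine durations in the Prefill table can be summed and averaged. The list below is copied by hand from the table; the variable names are illustrative only.

```python
# Prefill durations from the table above, in milliseconds (rows 0 to 8).
prefill_ms = [87.24, 87.64, 87.61, 83.61, 87.18, 82.08, 85.97, 86.78, 80.91]

total_ms = sum(prefill_ms)
mean_ms = total_ms / len(prefill_ms)

print(f"total = {total_ms:.2f} ms, mean = {mean_ms:.2f} ms over {len(prefill_ms)} steps")
```

The sum agrees with the reported total of 769.02 ms.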

## Decode

- **Iterations:** 2006
- **Mean:** 898.6 us
- **Min:** 723.8 us
- **Max:** 7.57 ms
- **Total:** 1802.49 ms
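
The decode figures can be cross-checked against each other. The sketch below uses only the numbers listed above; it derives the mean implied by total and iteration count, and the ratio of the slowest iteration to the mean, which gives a rough sense of how far the outlier sits from typical steps.

```python
# Decode statistics as reported above.
iterations = 2006
mean_us = 898.6
max_ms = 7.57
total_ms = 1802.49

# Mean implied by total / iterations, converted from ms to us.
implied_mean_us = total_ms * 1000.0 / iterations

# How many times slower the worst iteration is than the mean.
max_over_mean = (max_ms * 1000.0) / mean_us

print(f"implied mean = {implied_mean_us:.1f} us (reported {mean_us} us)")
print(f"max / mean   = {max_over_mean:.1f}x")
```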

Profiler Traces

Download from workflow artifacts.
Open them in Perfetto UI or at chrome://tracing in Chrome for analysis.

Next Steps

  1. Download profiler-analysis-26050231461 artifact
  2. Open trace files in Perfetto UI
  3. Compare kernel durations against previous traces
  4. Identify bottleneck changes
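
Steps 3 and 4 can also be done in a script rather than by eye. The following is a minimal sketch, assuming the traces are gzip-compressed Chrome Trace Event JSON in which GPU kernels are complete events (`"ph": "X"`) with `"cat": "kernel"` and a `"dur"` in microseconds. The helper names and the synthetic events in the demo are hypothetical and are not taken from the actual artifact.

```python
import gzip
import json
from collections import defaultdict


def load_events(path):
    """Read a gzip-compressed Chrome trace and return its list of events."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    # The format allows an object with "traceEvents" or a bare event array.
    return data["traceEvents"] if isinstance(data, dict) else data


def total_durations(events, category="kernel"):
    """Sum 'dur' (microseconds) per event name over complete events of one category."""
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X" and ev.get("cat") == category:
            totals[ev["name"]] += ev.get("dur", 0.0)
    return dict(totals)


def compare(base, cur):
    """Return (name, base_us, cur_us, delta_us) rows, largest slowdown first."""
    rows = []
    for name in set(base) | set(cur):
        b, c = base.get(name, 0.0), cur.get(name, 0.0)
        rows.append((name, b, c, c - b))
    return sorted(rows, key=lambda r: (-r[3], r[0]))


# Demo on made-up events; real use would call load_events() on two trace files.
base_events = [
    {"ph": "X", "cat": "kernel", "name": "gemm_a", "dur": 100.0},
    {"ph": "X", "cat": "kernel", "name": "attn_b", "dur": 50.0},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "dur": 999.0},
]
cur_events = [
    {"ph": "X", "cat": "kernel", "name": "gemm_a", "dur": 140.0},
    {"ph": "X", "cat": "kernel", "name": "attn_b", "dur": 45.0},
]

report = compare(total_durations(base_events), total_durations(cur_events))
for name, b, c, d in report:
    print(f"{name:10s} base={b:8.1f} us  cur={c:8.1f} us  delta={d:+8.1f} us")
```

Sorting by absolute delta rather than percentage keeps the kernels that dominate wall time at the top; if the category filter finds nothing in a given trace, the `category` argument is the first thing to adjust.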
