
optimize group_norm for ASCEND_NPU#1154

Merged
Tcc0403 merged 1 commit into linkedin:main from sunyi0505:main
Apr 13, 2026

Conversation

@sunyi0505
Contributor

@sunyi0505 sunyi0505 commented Mar 20, 2026

Summary

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@sunyi0505 sunyi0505 changed the title from "[WIP] optimize group_norm for ASCEND_NPU" to "[WIP] optimize group_norm_backward for ASCEND_NPU" Mar 20, 2026
@sunyi0505 sunyi0505 marked this pull request as draft March 20, 2026 08:01
@sunyi0505
Contributor Author

sunyi0505 commented Mar 20, 2026

1. Main Optimization Points and Analysis

1.1 Forward Pass: Single Kernel → Two‑Kernel Split

Original Version

  • _group_norm_forward_kernel performs two passes in one kernel:

  • First pass iterates over all columns to compute mean and variance for each batch‑group;

  • Second pass iterates again over all columns to perform normalization and affine transformation.

Optimized Version

  • Forward pass split into two independent kernels:

  • _group_norm_forward_stats_kernel: only computes mean and variance (statistics);

  • _group_norm_forward_affine_kernel: performs normalization and affine transformation based on the statistics.

Why This Optimization

  • Reduces register pressure: The original kernel must hold many intermediate variables (mean, variance, column offsets, etc.) simultaneously, which can easily exceed register capacity and cause spilling to memory. After splitting, each kernel has a smaller set of live variables, allowing larger BLOCK_SIZE and higher compute density.

  • Improves data reuse: After the statistics kernel writes to global memory, the affine kernel can read them multiple times. For later fusion opportunities (e.g., sharing statistics with other operators), the split intermediate results are easier to reuse.

  • Simplifies scheduling: Each kernel has a more regular grid shape (first dimension = groups, second = batch blocks), which is easier to tune.
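The two-stage decomposition can be modeled in NumPy as a reference (an illustrative sketch only — the function names loosely mirror the kernel names, but this is not the Triton implementation):

```python
import numpy as np

def group_norm_stats(x, num_groups, eps=1e-5):
    # Stage 1 (cf. _group_norm_forward_stats_kernel): per (batch, group)
    # mean and inverse standard deviation over all elements of the group.
    n = x.shape[0]
    xg = x.reshape(n, num_groups, -1)
    mean = xg.mean(axis=2)
    rstd = 1.0 / np.sqrt(xg.var(axis=2) + eps)
    return mean, rstd

def group_norm_affine(x, mean, rstd, weight, bias, num_groups):
    # Stage 2 (cf. _group_norm_forward_affine_kernel): normalize with the
    # precomputed statistics, then apply the per-channel affine transform.
    n, c = x.shape[0], x.shape[1]
    xg = x.reshape(n, num_groups, -1)
    y = ((xg - mean[:, :, None]) * rstd[:, :, None]).reshape(x.shape)
    shape = (1, c) + (1,) * (x.ndim - 2)
    return y * weight.reshape(shape) + bias.reshape(shape)
```

The split makes mean/rstd first-class outputs in global memory, which is also what allows the backward pass and potential fused consumers to reuse them.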

1.2 Adaptive Block Size Selection

Original Version

  • Uses fixed heuristics: BLOCK_SIZE_N = min(128, next_power_of_2(hidden_size)), BLOCK_SIZE_M determined by compute_default_tiling_strategy (based on UB capacity estimation).

Optimized Version

  • Introduces multi‑level adaptive functions:

  • _select_reduce_block_size: returns 128/256/512/1024 based on hidden_size.

  • _group_norm_forward_stats_block_size: considers alignment requirements (32‑byte alignment) and maximum fusion size.

  • _group_norm_forward_affine_spatial_block_size: selects a power of two between 32 and 256 according to the spatial dimension inside a channel.

  • _group_norm_forward_affine_channel_block_size: dynamically computes tile size along channel dimension based on number of channels and block_h.

  • _group_norm_forward_launch_config: dynamically adjusts batch‑dimension tile size and grid size based on NPU core count and oversubscription factor (GRID_OVERSUB_FACTOR = 8).

Why This Optimization

  • Adapts to performance inflection points of different hidden_size: The reduction dimension of GroupNorm (hidden_size) varies widely (from dozens to thousands). Fixed tile sizes waste parallelism for small sizes and cause non‑coalesced memory access for large sizes. Adaptive strategies make the operator near‑optimal for different shapes.

  • Satisfies hardware alignment constraints: NPUs have alignment requirements for vectorized memory access (e.g., 32‑byte alignment). _group_norm_forward_stats_block_size explicitly forces block_h to be no less than required, avoiding inefficient unaligned access.

  • Balances load and core utilization: _group_norm_forward_launch_config uses an oversubscription factor (8× core count) to fully utilize all NPU cores even when the number of tasks is insufficient, while avoiding excessive scheduling overhead from too‑small tiles.
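A plain-Python sketch of the selection logic follows. The cut-over thresholds are hypothetical — the description only specifies the possible return values (128/256/512/1024) and the 32-byte alignment constraint:

```python
def select_reduce_block_size(hidden_size: int) -> int:
    # Hypothetical thresholds: the PR states only that the helper returns
    # 128/256/512/1024 depending on hidden_size.
    if hidden_size <= 512:
        return 128
    if hidden_size <= 2048:
        return 256
    if hidden_size <= 8192:
        return 512
    return 1024

def align_block_h(block_h: int, dtype_bytes: int, alignment_bytes: int = 32) -> int:
    # Enforce the 32-byte vector-access alignment discussed above by
    # flooring the tile at one aligned vector's worth of elements.
    return max(block_h, alignment_bytes // dtype_bytes)
```

For fp16 inputs (2 bytes), the alignment floor is 16 elements; for fp32 it is 8, so small blocks get rounded up before launch.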

1.3 Small‑Size Fusion Path (Single‑task Kernel)

  • In the optimized version, when hidden_size <= MAX_FUSED_FORWARD_SIZE (4096), the _group_norm_forward_single_task_kernel is enabled. In this kernel, each program handles one complete batch‑group (i.e., one row) and completes both statistics and affine transformation inside a single kernel.

Why This Optimization

  • When hidden_size is small, the overhead of launching two separate kernels (parameter passing, reading and writing statistics through global memory) can outweigh the benefits. The single‑task kernel merges the two passes into one kernel, uses a larger BLOCK_H (up to the next power of two of hidden_size), and reduces the number of loop iterations and memory round‑trips.

  • In the single‑task kernel, each program loads W/B independently without cross‑thread synchronization; the code path is shorter and more friendly to small problems.
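The dispatch decision reduces to a threshold check; a minimal sketch (constant name and value taken from the description above, surrounding logic illustrative):

```python
MAX_FUSED_FORWARD_SIZE = 4096  # threshold quoted in the description above

def choose_forward_path(hidden_size: int) -> str:
    # Small rows: one fused pass avoids writing mean/rstd to global memory
    # and paying a second launch; large rows: the two-kernel split keeps
    # register pressure low and tiles the reduction.
    if hidden_size <= MAX_FUSED_FORWARD_SIZE:
        return "single_task"
    return "two_kernel"
```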

1.4 Parameter Gradient Computation: Atomic Operations → Explicit Partial Reduction

Original Version

  • Fast path (when SINGLE_CHANNEL_TILE is true): uses tl.atomic_add inside the backward kernel to accumulate partial sums of dW/dB into a global scratch buffer.

  • Slow path: falls back to host‑side dense reshape and direct summation to avoid atomic contention.

Optimized Version

  • The backward kernel _group_norm_backward_dx_dwdb_kernel no longer uses atomic operations. Instead, it writes the partial sums for each (batch_block, group, channel) into independent arrays DW_partial / DB_partial (shape [num_partial_rows, num_channels]).

  • A separate kernel _group_norm_reduce_param_grads_kernel is launched to reduce these partial sums.

Why This Optimization

  • Eliminates non‑determinism and performance bottlenecks of atomic operations: Atomic operations under high contention (e.g., small channels_per_group but large batch) cause severe serialization and numerical non‑determinism. The optimized version converts parallel writes into conflict‑free independent writes via partial sum matrices, and finally accumulates with an efficient reduction kernel.

  • Better suited for NPU architecture: NPUs typically have weaker atomic operation performance than GPUs and limited support for floating‑point atomics. Explicit reduction moves the contention to the reduction stage, where larger tiles and vectorized memory access can be used.

  • Preserves precision: The reduction kernel uses fp32 accumulation and has no order‑dependence from atomics, resulting in stronger determinism.
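A NumPy model of the atomic-free scheme: each batch block owns one row of the partial-sum matrices, so writes never collide, and a second stage plays the role of _group_norm_reduce_param_grads_kernel (shapes and names are illustrative, not the kernel's actual layout):

```python
import numpy as np

def param_grads_partial(dy, x_hat, num_batch_blocks):
    # dy, x_hat: (batch, channels, spatial). Each batch block writes its own
    # partial row -- conflict-free, so no atomics are needed.
    channels = dy.shape[1]
    blocks = np.array_split(np.arange(dy.shape[0]), num_batch_blocks)
    dw_partial = np.zeros((num_batch_blocks, channels), dtype=np.float32)
    db_partial = np.zeros((num_batch_blocks, channels), dtype=np.float32)
    for i, idx in enumerate(blocks):
        dw_partial[i] = (dy[idx] * x_hat[idx]).sum(axis=(0, 2))
        db_partial[i] = dy[idx].sum(axis=(0, 2))
    # Final fp32 reduction over the partial rows: a fixed summation order,
    # so the result is deterministic (stand-in for the reduction kernel).
    return dw_partial.sum(axis=0), db_partial.sum(axis=0)
```

Because each partial row is written by exactly one producer and the final reduction always sums rows in the same order, the result is bit-for-bit reproducible, unlike atomic accumulation.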

1.5 Parallel Strategy Adjustment

Original Version

  • Uses persistent program mode: a kernel launches a fixed number of programs (num_cores), and each program loops over multiple row tiles (block_m). This mode attempts to reduce launch overhead but increases programming complexity and adapts poorly to dynamic tile sizes.

Optimized Version

  • Returns to traditional Triton grid launch: Forward statistics kernel uses a 2D grid (num_groups, stats_grid_batch), Affine kernel uses a 2D grid (num_groups * channel_blocks, affine_grid_batch). Each program handles one fixed tile with no internal loop.

  • Dynamically adjusts block size in each dimension via _group_norm_forward_launch_config, making the total number of programs close to num_cores * GRID_OVERSUB_FACTOR.

Why This Optimization

  • Simplifies debugging and maintenance: Kernels without persistent loops have clearer logic and are easier to analyze for performance bottlenecks.

  • Better load balancing: When batch_size is not divisible by BLOCK_BATCH, the traditional grid mode naturally handles the trailing incomplete blocks, while persistent mode requires extra boundary checks.

  • Adapts to NPU scheduler: NPU thread schedulers are more friendly to static grids; dynamic loops may introduce extra branch overhead.
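The oversubscription idea can be sketched as follows. GRID_OVERSUB_FACTOR comes from the description; the tiling arithmetic is an illustrative guess, not the PR's actual _group_norm_forward_launch_config:

```python
import math

GRID_OVERSUB_FACTOR = 8  # target ~8x programs per core, per the description

def forward_launch_config(batch_size: int, num_groups: int, num_cores: int):
    # Choose a batch tile so that the 2D grid (num_groups, grid_batch)
    # lands near num_cores * GRID_OVERSUB_FACTOR total programs.
    target_batch_programs = max(1, num_cores * GRID_OVERSUB_FACTOR // num_groups)
    block_batch = max(1, math.ceil(batch_size / target_batch_programs))
    grid_batch = math.ceil(batch_size / block_batch)
    return block_batch, (num_groups, grid_batch)
```

With 40 cores and 8 groups, a batch of 1000 yields a 25-row tile and an (8, 40) grid: 320 programs, exactly 8x the core count; a tiny batch simply degenerates to one row per program.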

Benchmark results for the forward and backward passes, plus unit‑test output, are attached as screenshots (images omitted here).

@sunyi0505 sunyi0505 changed the title from "[WIP] optimize group_norm_backward for ASCEND_NPU" to "[WIP] optimize group_normfor ASCEND_NPU" Mar 24, 2026
@sunyi0505 sunyi0505 force-pushed the main branch 7 times, most recently from c9d43b2 to f9b96e4 on April 3, 2026 03:09
@sunyi0505 sunyi0505 changed the title from "[WIP] optimize group_normfor ASCEND_NPU" to "optimize group_normfor ASCEND_NPU" Apr 3, 2026
@sunyi0505 sunyi0505 marked this pull request as ready for review April 3, 2026 03:12
@sunyi0505 sunyi0505 force-pushed the main branch 3 times, most recently from a1b9562 to 3719207 on April 7, 2026 08:38
@sunyi0505
Contributor Author

@Tcc0403 , could you help review my code?

@sunyi0505 sunyi0505 changed the title from "optimize group_normfor ASCEND_NPU" to "optimize group_norm for ASCEND_NPU" Apr 7, 2026
@sunyi0505 sunyi0505 force-pushed the main branch 4 times, most recently from 8872b43 to 99b4e13 on April 10, 2026 01:14
Review comment thread on src/liger_kernel/ops/backends/_ascend/ops/group_norm.py (outdated)
Collaborator

@Tcc0403 Tcc0403 left a comment


LGTM

@Tcc0403 Tcc0403 enabled auto-merge April 13, 2026 14:48
@Tcc0403 Tcc0403 added this pull request to the merge queue Apr 13, 2026
Merged via the queue into linkedin:main with commit 76a0821 Apr 13, 2026
5 of 7 checks passed
