
optimize group_norm for ASCEND_NPU#1154

Merged
Tcc0403 merged 1 commit into linkedin:main from sunyi0505:main
Apr 13, 2026

Conversation

@sunyi0505
Contributor

@sunyi0505 sunyi0505 commented Mar 20, 2026

Summary

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@sunyi0505 sunyi0505 changed the title from "[WIP] optimize group_norm for ASCEND_NPU" to "[WIP] optimize group_norm_backward for ASCEND_NPU" Mar 20, 2026
@sunyi0505 sunyi0505 marked this pull request as draft March 20, 2026 08:01
@sunyi0505
Contributor Author

sunyi0505 commented Mar 20, 2026

1. Main Optimization Points and Analysis

1.1 Forward Pass: Single Kernel → Two‑Kernel Split

Original Version

  • _group_norm_forward_kernel performs two passes in one kernel:

  • First pass iterates over all columns to compute mean and variance for each batch‑group;

  • Second pass iterates again over all columns to perform normalization and affine transformation.

Optimized Version

  • Forward pass split into two independent kernels:

  • _group_norm_forward_stats_kernel: only computes mean and variance (statistics);

  • _group_norm_forward_affine_kernel: performs normalization and affine transformation based on the statistics.

Why This Optimization

  • Reduces register pressure: The original kernel must hold many intermediate variables (mean, variance, column offsets, etc.) simultaneously, which can easily exceed register capacity and cause spilling to memory. After splitting, each kernel has a smaller set of live variables, allowing larger BLOCK_SIZE and higher compute density.

  • Improves data reuse: After the statistics kernel writes to global memory, the affine kernel can read them multiple times. For later fusion opportunities (e.g., sharing statistics with other operators), the split intermediate results are easier to reuse.

  • Simplifies scheduling: Each kernel has a more regular grid shape (first dimension = groups, second = batch blocks), which is easier to tune.
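The two-stage decomposition can be modeled in NumPy as a reference (an illustrative sketch only — the function names loosely mirror the kernel names, but this is not the Triton implementation):

```python
import numpy as np

def group_norm_stats(x, num_groups, eps=1e-5):
    # Stage 1 (cf. _group_norm_forward_stats_kernel): per (batch, group)
    # mean and inverse standard deviation over all elements of the group.
    n = x.shape[0]
    xg = x.reshape(n, num_groups, -1)
    mean = xg.mean(axis=2)
    rstd = 1.0 / np.sqrt(xg.var(axis=2) + eps)
    return mean, rstd

def group_norm_affine(x, mean, rstd, weight, bias, num_groups):
    # Stage 2 (cf. _group_norm_forward_affine_kernel): normalize with the
    # precomputed statistics, then apply the per-channel affine transform.
    n, c = x.shape[0], x.shape[1]
    xg = x.reshape(n, num_groups, -1)
    y = ((xg - mean[:, :, None]) * rstd[:, :, None]).reshape(x.shape)
    shape = (1, c) + (1,) * (x.ndim - 2)
    return y * weight.reshape(shape) + bias.reshape(shape)
```

The split makes mean/rstd first-class outputs in global memory, which is also what allows the backward pass and potential fused consumers to reuse them.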

1.2 Adaptive Block Size Selection

Original Version

  • Uses fixed heuristics: BLOCK_SIZE_N = min(128, next_power_of_2(hidden_size)), BLOCK_SIZE_M determined by compute_default_tiling_strategy (based on UB capacity estimation).

Optimized Version

  • Introduces multi‑level adaptive functions:

  • _select_reduce_block_size: returns 128/256/512/1024 based on hidden_size.

  • _group_norm_forward_stats_block_size: considers alignment requirements (32‑byte alignment) and maximum fusion size.

  • _group_norm_forward_affine_spatial_block_size: selects a power of two between 32 and 256 according to the spatial dimension inside a channel.

  • _group_norm_forward_affine_channel_block_size: dynamically computes tile size along channel dimension based on number of channels and block_h.

  • _group_norm_forward_launch_config: dynamically adjusts batch‑dimension tile size and grid size based on NPU core count and oversubscription factor (GRID_OVERSUB_FACTOR = 8).

Why This Optimization

  • Adapts to performance inflection points of different hidden_size: The reduction dimension of GroupNorm (hidden_size) varies widely (from dozens to thousands). Fixed tile sizes waste parallelism for small sizes and cause non‑coalesced memory access for large sizes. Adaptive strategies make the operator near‑optimal for different shapes.

  • Satisfies hardware alignment constraints: NPUs have alignment requirements for vectorized memory access (e.g., 32‑byte alignment). _group_norm_forward_stats_block_size explicitly forces block_h to be no less than required, avoiding inefficient unaligned access.

  • Balances load and core utilization: _group_norm_forward_launch_config uses an oversubscription factor (8× core count) to fully utilize all NPU cores even when the number of tasks is insufficient, while avoiding excessive scheduling overhead from too‑small tiles.
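A plain-Python sketch of the selection logic follows. The cut-over thresholds are hypothetical — the description only specifies the possible return values (128/256/512/1024) and the 32-byte alignment constraint:

```python
def select_reduce_block_size(hidden_size: int) -> int:
    # Hypothetical thresholds: the PR states only that the helper returns
    # 128/256/512/1024 depending on hidden_size.
    if hidden_size <= 512:
        return 128
    if hidden_size <= 2048:
        return 256
    if hidden_size <= 8192:
        return 512
    return 1024

def align_block_h(block_h: int, dtype_bytes: int, alignment_bytes: int = 32) -> int:
    # Enforce the 32-byte vector-access alignment discussed above by
    # flooring the tile at one aligned vector's worth of elements.
    return max(block_h, alignment_bytes // dtype_bytes)
```

For fp16 inputs (2 bytes), the alignment floor is 16 elements; for fp32 it is 8, so small blocks get rounded up before launch.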

1.3 Small‑Size Fusion Path (Single‑task Kernel)

  • In the optimized version, when hidden_size <= MAX_FUSED_FORWARD_SIZE (4096), the _group_norm_forward_single_task_kernel is enabled. In this kernel, each program handles one complete batch‑group (i.e., one row) and completes both statistics and affine transformation inside a single kernel.

Why This Optimization

  • When hidden_size is small, the overhead of launching two separate kernels (parameter passing, reading and writing statistics through global memory) can outweigh the benefits. The single‑task kernel merges the two passes into one kernel, uses a larger BLOCK_H (up to the next power of two of hidden_size), and reduces the number of loop iterations and memory round‑trips.

  • In the single‑task kernel, each program loads W/B independently without cross‑thread synchronization; the code path is shorter and more friendly to small problems.
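The dispatch decision reduces to a threshold check; a minimal sketch (constant name and value taken from the description above, surrounding logic illustrative):

```python
MAX_FUSED_FORWARD_SIZE = 4096  # threshold quoted in the description above

def choose_forward_path(hidden_size: int) -> str:
    # Small rows: one fused pass avoids writing mean/rstd to global memory
    # and paying a second launch; large rows: the two-kernel split keeps
    # register pressure low and tiles the reduction.
    if hidden_size <= MAX_FUSED_FORWARD_SIZE:
        return "single_task"
    return "two_kernel"
```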

1.4 Parameter Gradient Computation: Atomic Operations → Explicit Partial Reduction

Original Version

  • Fast path (when SINGLE_CHANNEL_TILE is true): uses tl.atomic_add inside the backward kernel to accumulate partial sums of dW/dB into a global scratch buffer.

  • Slow path: falls back to host‑side dense reshape and direct summation to avoid atomic contention.

Optimized Version

  • The backward kernel _group_norm_backward_dx_dwdb_kernel no longer uses atomic operations. Instead, it writes the partial sums for each (batch_block, group, channel) into independent arrays DW_partial / DB_partial (shape [num_partial_rows, num_channels]).

  • A separate kernel _group_norm_reduce_param_grads_kernel is launched to reduce these partial sums.

Why This Optimization

  • Eliminates non‑determinism and performance bottlenecks of atomic operations: Atomic operations under high contention (e.g., small channels_per_group but large batch) cause severe serialization and numerical non‑determinism. The optimized version converts parallel writes into conflict‑free independent writes via partial sum matrices, and finally accumulates with an efficient reduction kernel.

  • Better suited for NPU architecture: NPUs typically have weaker atomic operation performance than GPUs and limited support for floating‑point atomics. Explicit reduction moves the contention to the reduction stage, where larger tiles and vectorized memory access can be used.

  • Preserves precision: The reduction kernel uses fp32 accumulation and has no order‑dependence from atomics, resulting in stronger determinism.
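A NumPy model of the atomic-free scheme: each batch block owns one row of the partial-sum matrices, so writes never collide, and a second stage plays the role of _group_norm_reduce_param_grads_kernel (shapes and names are illustrative, not the kernel's actual layout):

```python
import numpy as np

def param_grads_partial(dy, x_hat, num_batch_blocks):
    # dy, x_hat: (batch, channels, spatial). Each batch block writes its own
    # partial row -- conflict-free, so no atomics are needed.
    channels = dy.shape[1]
    blocks = np.array_split(np.arange(dy.shape[0]), num_batch_blocks)
    dw_partial = np.zeros((num_batch_blocks, channels), dtype=np.float32)
    db_partial = np.zeros((num_batch_blocks, channels), dtype=np.float32)
    for i, idx in enumerate(blocks):
        dw_partial[i] = (dy[idx] * x_hat[idx]).sum(axis=(0, 2))
        db_partial[i] = dy[idx].sum(axis=(0, 2))
    # Final fp32 reduction over the partial rows: a fixed summation order,
    # so the result is deterministic (stand-in for the reduction kernel).
    return dw_partial.sum(axis=0), db_partial.sum(axis=0)
```

Because each partial row is written by exactly one producer and the final reduction always sums rows in the same order, the result is bit-for-bit reproducible, unlike atomic accumulation.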

1.5 Parallel Strategy Adjustment

Original Version

  • Uses persistent program mode: a kernel launches a fixed number of programs (num_cores), and each program loops over multiple row tiles (block_m). This mode attempts to reduce launch overhead but increases programming complexity and adapts poorly to dynamic tile sizes.

Optimized Version

  • Returns to traditional Triton grid launch: Forward statistics kernel uses a 2D grid (num_groups, stats_grid_batch), Affine kernel uses a 2D grid (num_groups * channel_blocks, affine_grid_batch). Each program handles one fixed tile with no internal loop.

  • Dynamically adjusts block size in each dimension via _group_norm_forward_launch_config, making the total number of programs close to num_cores * GRID_OVERSUB_FACTOR.

Why This Optimization

  • Simplifies debugging and maintenance: Kernels without persistent loops have clearer logic and are easier to analyze for performance bottlenecks.

  • Better load balancing: When batch_size is not divisible by BLOCK_BATCH, the traditional grid mode naturally handles the trailing incomplete blocks, while persistent mode requires extra boundary checks.

  • Adapts to NPU scheduler: NPU thread schedulers are more friendly to static grids; dynamic loops may introduce extra branch overhead.
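The oversubscription idea can be sketched as follows. GRID_OVERSUB_FACTOR comes from the description; the tiling arithmetic is an illustrative guess, not the PR's actual _group_norm_forward_launch_config:

```python
import math

GRID_OVERSUB_FACTOR = 8  # target ~8x programs per core, per the description

def forward_launch_config(batch_size: int, num_groups: int, num_cores: int):
    # Choose a batch tile so that the 2D grid (num_groups, grid_batch)
    # lands near num_cores * GRID_OVERSUB_FACTOR total programs.
    target_batch_programs = max(1, num_cores * GRID_OVERSUB_FACTOR // num_groups)
    block_batch = max(1, math.ceil(batch_size / target_batch_programs))
    grid_batch = math.ceil(batch_size / block_batch)
    return block_batch, (num_groups, grid_batch)
```

With 40 cores and 8 groups, a batch of 1000 yields a 25-row tile and an (8, 40) grid: 320 programs, exactly 8x the core count; a tiny batch simply degenerates to one row per program.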

Benchmark results for the forward and backward passes, plus unit‑test output, are attached as screenshots (images omitted here).

@sunyi0505 sunyi0505 changed the title from "[WIP] optimize group_norm_backward for ASCEND_NPU" to "[WIP] optimize group_normfor ASCEND_NPU" Mar 24, 2026
@sunyi0505 sunyi0505 force-pushed the main branch 7 times, most recently from c9d43b2 to f9b96e4 on April 3, 2026 03:09
@sunyi0505 sunyi0505 changed the title from "[WIP] optimize group_normfor ASCEND_NPU" to "optimize group_normfor ASCEND_NPU" Apr 3, 2026
@sunyi0505 sunyi0505 marked this pull request as ready for review April 3, 2026 03:12
@sunyi0505 sunyi0505 force-pushed the main branch 3 times, most recently from a1b9562 to 3719207 on April 7, 2026 08:38
@sunyi0505
Contributor Author

@Tcc0403 , could you help review my code?

@sunyi0505 sunyi0505 changed the title from "optimize group_normfor ASCEND_NPU" to "optimize group_norm for ASCEND_NPU" Apr 7, 2026
@sunyi0505 sunyi0505 force-pushed the main branch 4 times, most recently from 8872b43 to 99b4e13 on April 10, 2026 01:14
Review comment thread on src/liger_kernel/ops/backends/_ascend/ops/group_norm.py (outdated)
Collaborator

@Tcc0403 Tcc0403 left a comment


LGTM

@Tcc0403 Tcc0403 enabled auto-merge April 13, 2026 14:48
@Tcc0403 Tcc0403 added this pull request to the merge queue Apr 13, 2026
Merged via the queue into linkedin:main with commit 76a0821 Apr 13, 2026
5 of 7 checks passed
