[NPU]Improvement performence for grpo_loss by UserChen666 · Pull Request #1174 · linkedin/Liger-Kernel

UserChen666 · 2026-03-31T08:50:56Z

Summary

This PR implements Triton kernel-level performance optimizations for the grpo_loss operator.

Details

Block Size (BLOCK_N) & Memory Coefficient Tuning
Unified the default tiling by raising BLOCK_N from 2048 to 4096 in all kernels. Each loop iteration now processes a larger tile, which reduces loop trip counts and kernel launch overhead.
Lowered the memory multipliers used for tiling so that tile sizes make better use of the NPU's 192 KB Unified Buffer (UB):
Softmax: 6.0 → 3.0
Forward: 10.0 → 4.0
Backward: 12.0 → 8.0
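
To make the effect of the larger tile concrete, here is a minimal sketch in plain Python (not the PR's Triton code). The vocabulary size of 128000, the 4-byte element size, and the helper names `num_tiles` and `max_block_n` are assumptions chosen for illustration. The budget formula is only one plausible way a memory multiplier could bound a tile; the actual tiling logic in Liger-Kernel may differ.

```python
def num_tiles(n_cols: int, block_n: int) -> int:
    """Number of loop iterations needed to cover n_cols in tiles of block_n."""
    return (n_cols + block_n - 1) // block_n


def max_block_n(ub_bytes: int, dtype_bytes: int, multiplier: float) -> int:
    """Largest power-of-two tile whose estimated footprint fits in the buffer.

    Hypothetical model: footprint = block_n * dtype_bytes * multiplier.
    """
    limit = int(ub_bytes // (dtype_bytes * multiplier))
    if limit < 1:
        raise ValueError("buffer too small for even one element")
    return 1 << (limit.bit_length() - 1)


VOCAB = 128_000        # assumed vocabulary size, not taken from the PR
UB_BYTES = 192 * 1024  # 192 KB Unified Buffer, as stated in the PR

tiles_old = num_tiles(VOCAB, 2048)
tiles_new = num_tiles(VOCAB, 4096)
print("loop iterations:", tiles_old, "->", tiles_new)

# Old and new multipliers from the PR description, with 4-byte elements assumed.
for name, old, new in [("softmax", 6.0, 3.0), ("forward", 10.0, 4.0), ("backward", 12.0, 8.0)]:
    print(name, max_block_n(UB_BYTES, 4, old), "->", max_block_n(UB_BYTES, 4, new))
```

Under this model, doubling BLOCK_N roughly halves the trip count (63 versus 32 iterations for the assumed vocabulary), and a smaller multiplier never shrinks the tile that is allowed to fit.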
Computation Instruction Optimization (Reduce Divisions, Improve Instruction Efficiency)
Precomputed inv_temp = 1.0 / TEMPERATURE once and multiply by it inside the loops, replacing repeated in-loop divisions and reducing floating-point latency.
Simplified the backward gradient expression to dlogp = dlogp * dloss * inv_temp, replacing the original chained division and lowering the number of floating-point operations.
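
The reciprocal trick can be sketched in plain Python as follows. This is an illustration rather than the kernel code: the function names, the sample values, and the form of the "chained division" (`dlogp * dloss / temperature`) are assumptions, since the PR description does not show the original expression.

```python
import math


def scale_by_division(logits, temperature):
    """Divide every element by the temperature (one division per element)."""
    return [x / temperature for x in logits]


def scale_by_reciprocal(logits, temperature):
    """Compute the reciprocal once, then multiply (one division in total)."""
    inv_temp = 1.0 / temperature
    return [x * inv_temp for x in logits]


temperature = 0.7
logits = [2.0, -1.5, 0.25, 4.0]
divided = scale_by_division(logits, temperature)
multiplied = scale_by_reciprocal(logits, temperature)

# Backward pass: assumed original form versus the simplified form from the PR.
dlogp, dloss = 0.3, -1.2
inv_temp = 1.0 / temperature
grad_chained = dlogp * dloss / temperature
grad_simplified = dlogp * dloss * inv_temp

forward_matches = all(
    math.isclose(a, b, rel_tol=1e-12) for a, b in zip(divided, multiplied)
)
backward_matches = math.isclose(grad_chained, grad_simplified, rel_tol=1e-12)
print(forward_matches, backward_matches)
```

The two forms agree to within floating-point rounding but are not guaranteed to be bitwise identical, which is why the comparison uses a relative tolerance.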
Loop & Compilation Optimizations
Changed inner kernel loops from range to tl.static_range, giving the compiler an explicit loop-unrolling hint that improves instruction scheduling and pipeline efficiency.
Cast INPUT_IDS to int32 explicitly to avoid implicit type-conversion overhead on the NPU.
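
`tl.static_range` is Triton's iterator for loops whose bounds are compile-time constants, and it guides the compiler to unroll them. A Triton kernel cannot run without the target hardware, so the plain-Python analogy below only illustrates what unrolling means: a loop with a known trip count of 4 and its hand-unrolled equivalent compute the same result. The function names and the tile layout are hypothetical.

```python
def sum_tiles_loop(values, block_n, n_tiles):
    """Rolled form: the trip count is an ordinary runtime value."""
    total = 0.0
    for t in range(n_tiles):
        start = t * block_n
        total += sum(values[start:start + block_n])
    return total


def sum_tiles_unrolled_4(values, block_n):
    """Unrolled form for a trip count known to be 4: no loop counter or branch."""
    total = 0.0
    total += sum(values[0 * block_n:1 * block_n])
    total += sum(values[1 * block_n:2 * block_n])
    total += sum(values[2 * block_n:3 * block_n])
    total += sum(values[3 * block_n:4 * block_n])
    return total


values = [float(i) for i in range(16)]
rolled = sum_tiles_loop(values, block_n=4, n_tiles=4)
unrolled = sum_tiles_unrolled_4(values, block_n=4)
print(rolled == unrolled)
```

The unrolled body exposes four independent tile loads to the scheduler at once, which is the kind of freedom the PR description attributes to the change.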
Masking & Memory Access Optimization
Reused a single cols_mask variable as the memory-access mask across loads and stores, avoiding redundant mask computation and improving memory access throughput.
Simplified the gradient calculation to (cols_idx - probs) * dlogp in place of tl.where branching, minimizing branching overhead.
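
The branch-free gradient can be checked with a small plain-Python sketch. It assumes that `cols_idx` denotes the 0/1 indicator of "this column is the target token", which is how the gradient of a log-softmax entry is usually written; the function names and sample values are hypothetical and this is not the kernel code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def dlogits_branching(probs, target, dlogp):
    """Select-style form: pick (1 - p) at the target column and -p elsewhere."""
    return [((1.0 - p) if c == target else -p) * dlogp for c, p in enumerate(probs)]


def dlogits_branch_free(probs, target, dlogp):
    """Arithmetic form: (indicator - p) * dlogp, with no per-element select."""
    return [(float(c == target) - p) * dlogp for c, p in enumerate(probs)]


probs = softmax([2.0, -1.5, 0.25, 4.0])
grad_branching = dlogits_branching(probs, target=2, dlogp=0.5)
grad_branch_free = dlogits_branch_free(probs, target=2, dlogp=0.5)
print(grad_branching == grad_branch_free)
```

Both forms produce the same values because the indicator is exactly 1.0 at the target column and 0.0 elsewhere; the entries also sum to approximately zero, as expected for a softmax gradient.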

Testing Done

Hardware Type: Atlas A800I
run make test to ensure correctness
run make checkstyle to ensure code style
run make test-convergence to ensure convergence

UserChen666 · 2026-04-07T08:27:27Z

@Tcc0403 please review, thank you.

UserChen666 · 2026-04-10T06:36:04Z

@Tcc0403

Tcc0403

LGTM

UserChen666 added 2 commits March 31, 2026 16:34

improve performence for grpo_loss

c8ee36f

improve performence for grpo_loss

c8c9497

UserChen666 changed the title ~~Improvement performence for grpo_loss~~ 【NPU】Improvement performence for grpo_loss Apr 1, 2026

UserChen666 added 2 commits April 2, 2026 14:52

improve performence for grpo_loss

7253920

improve performence for grpo_loss

1d6bf01

zheliuyu mentioned this pull request Apr 14, 2026

[NPU Roadmap, Updated to 2026-Q2] NPU support for Liger-Kernel #969

Open

36 tasks

UserChen666 changed the title ~~【NPU】Improvement performence for grpo_loss~~ [NPU]Improvement performence for grpo_loss Apr 14, 2026

Tcc0403 approved these changes Apr 16, 2026

Tcc0403 added this pull request to the merge queue Apr 16, 2026

Merged via the queue into linkedin:main with commit fcaae50 Apr 16, 2026
5 of 7 checks passed

Tcc0403 merged 4 commits into linkedin:main from UserChen666:improvement
