[NPU] Add NPU support for multi_token_attention by lowdy1 · Pull Request #1122 · linkedin/Liger-Kernel

lowdy1 · 2026-03-03T07:38:02Z

Summary

The current MTA kernel suffers from UB overflow issues and suboptimal benchmark performance. This PR introduces an NPU-optimized implementation of MTA to address these limitations.

The new implementation:

Fuses causal masking with Softmax and Sparsemax (both forward and backward)
Uses row-wise 1D processing to improve memory efficiency
Introduces a 3-axis masking kernel with UB-aware BLOCK_SIZE estimation for matrix masking

In terms of performance:

The fused softmax kernel shows lower performance than the benchmark mainly because of softmax kernel.

The sparsemax variant demonstrates better performance than the benchmark.

Testing Done

Tested passed with
python benchmark/scripts/benchmark_multi_token_attention.py
python benchmark/scripts/benchmark_sparse_multi_token_attention.py
pytest -v test/transformers/test_multi_token_attention.py

Hardware Type: Atlas 800I A2

run make test to ensure correctness
run make checkstyle to ensure code style
run make test-convergence to ensure convergence

lowdy1 · 2026-03-27T03:43:49Z

sparse multi token attention(full)

********** Benchmark Data **********
[
  {
    "kernel_name": "sparse_multi_token_attention",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      0.7001199722290039,
      0.7727699875831604,
      0.7904900312423706,
      1.1531000137329102,
      3.8688600063323975
    ],
    "y_values_20": [
      0.6914240121841431,
      0.7227479815483093,
      0.785315990447998,
      1.152340054512024,
      3.866948127746582
    ],
    "y_values_80": [
      0.7167999744415283,
      0.7844640016555786,
      0.8020159602165222,
      1.1549040079116821,
      3.870512008666992
    ],
    "timestamp": "2026-03-27 03:40:52",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.float32\"}",
    "liger_version": "0.7.0"
  },
  {
    "kernel_name": "sparse_multi_token_attention",
    "kernel_provider": "torch",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      1.3974699974060059,
      1.3797800540924072,
      1.3436000347137451,
      1.3312599658966064,
      1.8758900165557861
    ],
    "y_values_20": [
      1.362380027770996,
      1.3544399738311768,
      1.3287479877471924,
      1.3134479522705078,
      1.8714799880981445
    ],
    "y_values_80": [
      1.414720058441162,
      1.4019360542297363,
      1.3595600128173828,
      1.353887915611267,
      1.8824399709701538
    ],
    "timestamp": "2026-03-27 03:40:53",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.float32\"}",
    "liger_version": "0.7.0"
  },
  {
    "kernel_name": "sparse_multi_token_attention",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      2.1086699962615967,
      2.099519968032837,
      2.0379199981689453,
      2.0469799041748047,
      5.994679927825928
    ],
    "y_values_20": [
      2.095531940460205,
      2.0890519618988037,
      2.0155279636383057,
      2.0349600315093994,
      5.992884159088135
    ],
    "y_values_80": [
      2.1241400241851807,
      2.1201000213623047,
      2.1170639991760254,
      2.0623438358306885,
      5.998091697692871
    ],
    "timestamp": "2026-03-27 03:40:55",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.float32\"}",
    "liger_version": "0.7.0"
  },
  {
    "kernel_name": "sparse_multi_token_attention",
    "kernel_provider": "torch",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      2.57886004447937,
      2.6841700077056885,
      2.678040027618408,
      4.13923978805542,
      10.076720237731934
    ],
    "y_values_20": [
      2.5452640056610107,
      2.677191972732544,
      2.6581079959869385,
      4.136340141296387,
      10.042584419250488
    ],
    "y_values_80": [
      2.597368001937866,
      2.705887794494629,
      2.7474639415740967,
      4.165679931640625,
      10.09320068359375
    ],
    "timestamp": "2026-03-27 03:40:57",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.float32\"}",
    "liger_version": "0.7.0"
  },
  {
    "kernel_name": "sparse_multi_token_attention",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      1.1525399684906006,
      1.142680048942566,
      1.13850998878479,
      1.1672300100326538,
      2.1818599700927734
    ],
    "y_values_20": [
      1.1415480375289917,
      1.1331239938735962,
      1.1294159889221191,
      1.1575520038604736,
      2.1802518367767334
    ],
    "y_values_80": [
      1.164080023765564,
      1.1606600284576416,
      1.1540279388427734,
      1.1784119606018066,
      2.183243989944458
    ],
    "timestamp": "2026-03-27 03:40:58",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.float32\"}",
    "liger_version": "0.7.0"
  },
  {
    "kernel_name": "sparse_multi_token_attention",
    "kernel_provider": "torch",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      1.1083999872207642,
      1.0780400037765503,
      1.0452499389648438,
      2.4088399410247803,
      8.217260360717773
    ],
    "y_values_20": [
      1.08486807346344,
      1.0519800186157227,
      1.0308640003204346,
      2.3989601135253906,
      8.199560165405273
    ],
    "y_values_80": [
      1.1248040199279785,
      1.0946840047836304,
      1.0615719556808472,
      2.418260097503662,
      8.295040130615234
    ],
    "timestamp": "2026-03-27 03:40:58",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.float32\"}",
    "liger_version": "0.7.0"
  }
]
**************************************
     BENCHMARKING MEMORY for SPARSE_MULTI_TOKEN_ATTENTION
**************************************
********** Benchmark Data **********
[
  {
    "kernel_name": "sparse_multi_token_attention",
    "kernel_provider": "liger",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      0.259765625,
      1.009765625,
      4.009765625,
      16.01025390625,
      64.01025390625
    ],
    "y_values_20": [
      0.259765625,
      1.009765625,
      4.009765625,
      16.01025390625,
      64.01025390625
    ],
    "y_values_80": [
      0.259765625,
      1.009765625,
      4.009765625,
      16.01025390625,
      64.01025390625
    ],
    "timestamp": "2026-03-27 03:40:59",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.float32\"}",
    "liger_version": "0.7.0"
  },
  {
    "kernel_name": "sparse_multi_token_attention",
    "kernel_provider": "torch",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      16.39306640625,
      17.52734375,
      22.05712890625,
      40.16357421875,
      112.56396484375
    ],
    "y_values_20": [
      16.39306640625,
      17.52734375,
      22.05712890625,
      40.16357421875,
      112.56396484375
    ],
    "y_values_80": [
      16.39306640625,
      17.52734375,
      22.05712890625,
      40.16357421875,
      112.56396484375
    ],
    "timestamp": "2026-03-27 03:40:59",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.float32\"}",
    "liger_version": "0.7.0"
  }
]

lowdy1 · 2026-03-27T03:46:13Z

softmax multi token attention

********** Benchmark Data **********
[
  {
    "kernel_name": "multi_token_attention",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      0.4847800135612488,
      0.4905399978160858,
      0.49355998635292053,
      0.5263800024986267,
      1.1110199689865112
    ],
    "y_values_20": [
      0.47667598724365234,
      0.48531201481819153,
      0.48580798506736755,
      0.518779993057251,
      1.1100200414657593
    ],
    "y_values_80": [
      0.4986119866371155,
      0.5014039874076843,
      0.5075839757919312,
      0.5366600155830383,
      1.1125600337982178
    ],
    "timestamp": "2026-03-28 06:25:45",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.7.0"
  },
  {
    "kernel_name": "multi_token_attention",
    "kernel_provider": "torch",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      0.6377099752426147,
      0.624779999256134,
      0.5950800180435181,
      0.5192599892616272,
      0.5347099900245667
    ],
    "y_values_20": [
      0.6250680088996887,
      0.5962679982185364,
      0.5858799815177917,
      0.5134959816932678,
      0.530019998550415
    ],
    "y_values_80": [
      0.6466799974441528,
      0.640720009803772,
      0.6036199927330017,
      0.524336040019989,
      0.5396199822425842
    ],
    "timestamp": "2026-03-28 06:25:46",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.7.0"
  },
  {
    "kernel_name": "multi_token_attention",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      1.6847400665283203,
      1.674399971961975,
      1.67576003074646,
      1.6934800148010254,
      3.1811299324035645
    ],
    "y_values_20": [
      1.6702200174331665,
      1.6640959978103638,
      1.6667560338974,
      1.677623987197876,
      3.1786320209503174
    ],
    "y_values_80": [
      1.6958400011062622,
      1.686244010925293,
      1.6899720430374146,
      1.7132160663604736,
      3.1833560466766357
    ],
    "timestamp": "2026-03-28 06:25:48",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.7.0"
  },
  {
    "kernel_name": "multi_token_attention",
    "kernel_provider": "torch",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      1.1896799802780151,
      1.1836199760437012,
      1.1693899631500244,
      0.995389997959137,
      1.0084400177001953
    ],
    "y_values_20": [
      1.1696200370788574,
      1.1602200269699097,
      1.153548002243042,
      0.9869880080223083,
      0.9992960095405579
    ],
    "y_values_80": [
      1.202620029449463,
      1.2052799463272095,
      1.180299997329712,
      1.0087920427322388,
      1.0189520120620728
    ],
    "timestamp": "2026-03-28 06:25:50",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.7.0"
  },
  {
    "kernel_name": "multi_token_attention",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      1.0227000713348389,
      1.0199300050735474,
      1.0222599506378174,
      1.1267399787902832,
      2.102799892425537
    ],
    "y_values_20": [
      1.0133600234985352,
      1.0102959871292114,
      1.0094200372695923,
      1.1195279359817505,
      2.100559949874878
    ],
    "y_values_80": [
      1.0360159873962402,
      1.0325120687484741,
      1.037019968032837,
      1.14109206199646,
      2.1052401065826416
    ],
    "timestamp": "2026-03-28 06:25:50",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.7.0"
  },
  {
    "kernel_name": "multi_token_attention",
    "kernel_provider": "torch",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "L",
    "x_label": "sequence length",
    "x_values": [
      32,
      64,
      128,
      256,
      512
    ],
    "y_values_50": [
      0.43494001030921936,
      0.42746999859809875,
      0.42092999815940857,
      0.32152000069618225,
      0.3873099982738495
    ],
    "y_values_20": [
      0.42366400361061096,
      0.41152000427246094,
      0.41095200181007385,
      0.3117559850215912,
      0.3843599855899811
    ],
    "y_values_80": [
      0.4484039843082428,
      0.4438599944114685,
      0.4334079921245575,
      0.3313400149345398,
      0.390720009803772
    ],
    "timestamp": "2026-03-28 06:25:51",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"B\": 2, \"C_in\": 4, \"C_out\": 4, \"K\": 3, \"groups\": 1, \"bias\": true, \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.7.0"
  }
]

lowdy1 changed the title ~~[NPU]Add NPU support for multi_token_attention~~ [NPU] Add NPU support for multi_token_attention Mar 3, 2026

lowdy1 added 2 commits March 27, 2026 02:12

add mta for NPU

8fdd559

--amend

87928a4

lowdy1 force-pushed the mta_npu branch from 7c9ff55 to 4cde098 Compare March 27, 2026 03:25

optimize rowwise

bbd247b

lowdy1 force-pushed the mta_npu branch from 4cde098 to bbd247b Compare March 27, 2026 03:51

use online softmax

9d5afc4

lowdy1 force-pushed the mta_npu branch from 393915f to 297546c Compare March 30, 2026 02:28

softmax forward 2 passes

1a70d04

lowdy1 force-pushed the mta_npu branch from 297546c to 1a70d04 Compare March 30, 2026 02:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NPU] Add NPU support for multi_token_attention#1122

[NPU] Add NPU support for multi_token_attention#1122
lowdy1 wants to merge 5 commits intolinkedin:mainfrom
lowdy1:mta_npu

lowdy1 commented Mar 3, 2026 •

edited

Loading

Uh oh!

lowdy1 commented Mar 27, 2026 •

edited

Loading

Uh oh!

lowdy1 commented Mar 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lowdy1 commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing Done

Uh oh!

lowdy1 commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lowdy1 commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

lowdy1 commented Mar 3, 2026 •

edited

Loading

lowdy1 commented Mar 27, 2026 •

edited

Loading

lowdy1 commented Mar 27, 2026 •

edited

Loading