
Apply some random warpspeed tunings #8352

Merged
bernhardmgruber merged 1 commit into NVIDIA:main from bernhardmgruber:tune_warpspeed
Apr 14, 2026

Conversation

@bernhardmgruber bernhardmgruber commented Apr 9, 2026

Works towards: #8348. These tunings are not the result of serious tuning; I was just testing the random tuning search to prove that we are ready for real tuning.

cub.bench.scan.exclusive.sum.warpspeed.base on B200:

## [0] NVIDIA B200

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      I64      |      2^16      |   9.485 us |       7.39% |  11.264 us |       0.84% |    1.779 us |  18.76% |   SLOW   |
|   I8    |      I64      |      2^20      |  12.832 us |       7.97% |  13.283 us |       3.96% |    0.451 us |   3.52% |   SAME   |
|   I8    |      I64      |      2^24      |  32.877 us |       3.49% |  28.341 us |       3.49% |   -4.536 us | -13.80% |   FAST   |
|   I8    |      I64      |      2^28      | 260.388 us |       0.38% | 207.747 us |       0.60% |  -52.641 us | -20.22% |   FAST   |
|   I8    |      I64      |      2^32      |   3.949 ms |       0.05% |   3.123 ms |       0.27% | -826.017 us | -20.92% |   FAST   |
|   I16   |      I64      |      2^16      |  11.209 us |       0.74% |  11.210 us |       0.63% |    0.001 us |   0.01% |   SAME   |
|   I16   |      I64      |      2^20      |  13.103 us |       4.64% |  12.509 us |       7.98% |   -0.595 us |  -4.54% |   SAME   |
|   I16   |      I64      |      2^24      |  30.882 us |       5.62% |  28.447 us |       3.77% |   -2.435 us |  -7.88% |   FAST   |
|   I16   |      I64      |      2^28      | 248.879 us |       0.61% | 211.693 us |       0.60% |  -37.186 us | -14.94% |   FAST   |
|   I16   |      I64      |      2^32      |   3.743 ms |       0.18% |   3.195 ms |       1.08% | -547.339 us | -14.62% |   FAST   |
|   I32   |      I64      |      2^16      |   9.832 us |       9.57% |  10.203 us |      10.05% |    0.371 us |   3.77% |   SAME   |
|   I32   |      I64      |      2^20      |  13.531 us |       5.06% |  13.477 us |       5.19% |   -0.054 us |  -0.40% |   SAME   |
|   I32   |      I64      |      2^24      |  37.159 us |       4.95% |  37.201 us |       5.08% |    0.043 us |   0.11% |   SAME   |
|   I32   |      I64      |      2^28      | 320.292 us |       0.62% | 320.278 us |       0.57% |   -0.014 us |  -0.00% |   SAME   |
|   I32   |      I64      |      2^32      |   5.087 ms |       1.24% |   5.091 ms |       1.16% |    4.720 us |   0.09% |   SAME   |
|   I64   |      I64      |      2^16      |  11.357 us |       3.96% |  11.752 us |       7.34% |    0.395 us |   3.48% |   SAME   |
|   I64   |      I64      |      2^20      |  15.355 us |       1.69% |  15.500 us |       3.44% |    0.145 us |   0.94% |   SAME   |
|   I64   |      I64      |      2^24      |  58.876 us |       1.70% |  59.171 us |       1.81% |    0.295 us |   0.50% |   SAME   |
|   I64   |      I64      |      2^28      | 694.527 us |       0.17% | 694.738 us |       0.16% |    0.211 us |   0.03% |   SAME   |
|   I64   |      I64      |      2^32      |  11.253 ms |       0.71% |  11.250 ms |       0.82% |   -3.704 us |  -0.03% |   SAME   |
|  I128   |      I64      |      2^16      |  12.987 us |       5.64% |  12.420 us |       8.16% |   -0.567 us |  -4.36% |   SAME   |
|  I128   |      I64      |      2^20      |  23.506 us |       1.93% |  23.504 us |       1.53% |   -0.002 us |  -0.01% |   SAME   |
|  I128   |      I64      |      2^24      | 171.297 us |       0.64% | 171.405 us |       0.62% |    0.107 us |   0.06% |   SAME   |
|  I128   |      I64      |      2^28      |   2.511 ms |       0.10% |   2.511 ms |       0.10% |   -0.186 us |  -0.01% |   SAME   |
|  I128   |      I64      |      2^32      |  40.023 ms |       0.02% |  40.022 ms |       0.02% |   -1.429 us |  -0.00% |   SAME   |
|   F32   |      I64      |      2^16      |  10.354 us |      10.64% |  10.203 us |      10.12% |   -0.151 us |  -1.45% |   SAME   |
|   F32   |      I64      |      2^20      |  14.010 us |       7.01% |  13.993 us |       7.08% |   -0.017 us |  -0.12% |   SAME   |
|   F32   |      I64      |      2^24      |  39.498 us |       4.42% |  39.362 us |       4.24% |   -0.137 us |  -0.35% |   SAME   |
|   F32   |      I64      |      2^28      | 345.532 us |       0.50% | 345.276 us |       0.49% |   -0.256 us |  -0.07% |   SAME   |
|   F32   |      I64      |      2^32      |   5.239 ms |       0.49% |   5.239 ms |       0.56% |    0.235 us |   0.00% |   SAME   |
|   F64   |      I64      |      2^16      |  10.928 us |       7.10% |  10.605 us |       9.03% |   -0.323 us |  -2.95% |   SAME   |
|   F64   |      I64      |      2^20      |  15.370 us |       1.09% |  15.346 us |       1.61% |   -0.024 us |  -0.16% |   SAME   |
|   F64   |      I64      |      2^24      |  63.947 us |       1.60% |  63.637 us |       1.69% |   -0.310 us |  -0.49% |   SAME   |
|   F64   |      I64      |      2^28      | 785.046 us |       0.12% | 784.668 us |       0.14% |   -0.378 us |  -0.05% |   SAME   |
|   F64   |      I64      |      2^32      |  12.338 ms |       0.01% |  12.338 ms |       0.01% |    0.050 us |   0.00% |   SAME   |
|   C32   |      I64      |      2^16      |  10.804 us |       7.83% |  10.517 us |       9.41% |   -0.286 us |  -2.65% |   SAME   |
|   C32   |      I64      |      2^20      |  15.296 us |       2.31% |  15.277 us |       2.68% |   -0.019 us |  -0.12% |   SAME   |
|   C32   |      I64      |      2^24      |  59.842 us |       1.75% |  59.816 us |       1.91% |   -0.026 us |  -0.04% |   SAME   |
|   C32   |      I64      |      2^28      | 682.702 us |       0.17% | 682.236 us |       0.17% |   -0.466 us |  -0.07% |   SAME   |
|   C32   |      I64      |      2^32      |  10.659 ms |       0.02% |  10.660 ms |       0.02% |    0.158 us |   0.00% |   SAME   |
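
For reading the table above: the Diff and %Diff columns are consistent with Cmp Time minus Ref Time, taken relative to Ref Time. The following is a minimal Python sketch of that arithmetic. The function name `compare` is hypothetical, and the status rule (SAME when the absolute %Diff does not exceed the smaller of the two noise values) is an assumption that happens to match every row shown here, not a confirmed description of the comparison tool.

```python
def compare(ref_time_us, cmp_time_us, ref_noise_pct, cmp_noise_pct):
    """Return (diff_us, pct_diff, status) for one benchmark row.

    Hypothetical helper, not part of any benchmark tooling.
    """
    # Positive diff means the compared build is slower than the reference.
    diff_us = cmp_time_us - ref_time_us
    pct_diff = 100.0 * diff_us / ref_time_us

    # ASSUMPTION: a change counts only if it exceeds the smaller noise value.
    threshold = min(ref_noise_pct, cmp_noise_pct)
    if abs(pct_diff) <= threshold:
        status = "SAME"
    elif diff_us < 0:
        status = "FAST"
    else:
        status = "SLOW"
    return diff_us, pct_diff, status


if __name__ == "__main__":
    # Rows taken from the table: I8 at 2^16, 2^20, and 2^28 elements.
    for row in [(9.485, 11.264, 7.39, 0.84),
                (12.832, 13.283, 7.97, 3.96),
                (260.388, 207.747, 0.38, 0.60)]:
        diff, pct, status = compare(*row)
        print(f"{diff:9.3f} us  {pct:7.2f}%  {status}")
```

Under this assumed rule, the I8 2^20 row stays SAME even at 3.52% because the compared run's noise is 3.96%.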

The regression for the smallest I8 problem size (2^16 elements, 18.76% slower) is a bit sad, but not an issue, since warpspeed scan has not shipped publicly yet. Since this PR only changes tunings, it's not a regression in the kernel's code.

I8 and I16 absolute results after this PR:

| T{ct} | OffsetT{ct} |   Elements{io}    |    Size     | Samples |  CPU Time  |  Noise  |  GPU Time  | Noise  |  Elem/s  | GlobalMem BW | BWUtil |
|-------|-------------|-------------------|-------------|---------|------------|---------|------------|--------|----------|--------------|--------|
|    I8 |         I64 |      2^16 = 65536 |  64.000 KiB |    398x |  49.706 us | 416.15% |  11.264 us |  0.84% |   5.818G |  11.636 GB/s |  0.15% |
|    I8 |         I64 |    2^20 = 1048576 |   1.000 MiB |    372x |  41.686 us |   3.69% |  13.283 us |  3.96% |  78.943G | 157.885 GB/s |  2.06% |
|    I8 |         I64 |   2^24 = 16777216 |  16.000 MiB |    354x |  57.191 us |   2.91% |  28.341 us |  3.49% | 591.985G |   1.184 TB/s | 15.43% |
|    I8 |         I64 |  2^28 = 268435456 | 256.000 MiB |    396x | 237.714 us |   0.71% | 207.747 us |  0.60% |   1.292T |   2.584 TB/s | 33.68% |
|    I8 |         I64 | 2^32 = 4294967296 |   4.000 GiB |    370x |   3.155 ms |   0.26% |   3.123 ms |  0.27% |   1.375T |   2.750 TB/s | 35.85% |
|   I16 |         I64 |      2^16 = 65536 | 128.000 KiB |    382x |  42.885 us |   3.69% |  11.210 us |  0.63% |   5.846G |  23.385 GB/s |  0.30% |
|   I16 |         I64 |    2^20 = 1048576 |   2.000 MiB |    380x |  44.050 us |   4.26% |  12.509 us |  7.98% |  83.829G | 335.315 GB/s |  4.37% |
|   I16 |         I64 |   2^24 = 16777216 |  32.000 MiB |    360x |  59.403 us |   2.54% |  28.447 us |  3.77% | 589.765G |   2.359 TB/s | 30.75% |
|   I16 |         I64 |  2^28 = 268435456 | 512.000 MiB |    498x | 242.576 us |   0.86% | 211.693 us |  0.60% |   1.268T |   5.072 TB/s | 66.11% |
|   I16 |         I64 | 2^32 = 4294967296 |   8.000 GiB |    750x |   3.226 ms |   1.07% |   3.195 ms |  1.08% |   1.344T |   5.377 TB/s | 70.08% |
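
The derived columns in the table above can be reproduced from the element count, element size, and GPU time. The sketch below assumes that each element is read once and written once (which matches the Elem/s to GlobalMem BW ratio in every row) and back-computes the peak bandwidth implied by BWUtil. The function names are hypothetical, and the implied peak is only an inference from the table, not a stated hardware figure.

```python
def throughput(elements, elem_bytes, gpu_time_s):
    """Return (elements per second, bytes per second) for one row.

    Hypothetical helper. ASSUMPTION: global memory traffic is one read
    plus one write per element.
    """
    elem_per_s = elements / gpu_time_s
    bytes_per_s = 2 * elem_bytes * elem_per_s
    return elem_per_s, bytes_per_s


def implied_peak(bytes_per_s, bw_util_pct):
    """Peak bandwidth implied by a BWUtil percentage (an inference only)."""
    return bytes_per_s / (bw_util_pct / 100.0)


if __name__ == "__main__":
    # I8, 2^28 elements, 207.747 us GPU time.
    eps, bps = throughput(2**28, 1, 207.747e-6)
    print(f"I8  2^28: {eps / 1e12:.3f} T elem/s, {bps / 1e12:.3f} TB/s")

    # I16, 2^32 elements, 3.195 ms GPU time, 70.08% BWUtil.
    eps, bps = throughput(2**32, 2, 3.195e-3)
    peak = implied_peak(bps, 70.08)
    print(f"I16 2^32: {eps / 1e12:.3f} T elem/s, {bps / 1e12:.3f} TB/s, "
          f"implied peak {peak / 1e12:.2f} TB/s")
```

Because I8 and I16 reach nearly the same Elem/s at large sizes, the I16 rows move about twice the bytes per second, which is why their BWUtil is roughly double.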

This makes it 3x faster than the old scan implementation.

@github-project-automation github-project-automation bot moved this to Todo in CCCL Apr 9, 2026
@bernhardmgruber bernhardmgruber requested a review from a team as a code owner April 9, 2026 22:40
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Apr 9, 2026
@github-actions

🥳 CI Workflow Results

🟩 Finished in 1h 44m: Pass: 100%/269 | Total: 8d 18h | Max: 1h 25m | Hits: 70%/177111

@bernhardmgruber bernhardmgruber merged commit 4a45c18 into NVIDIA:main Apr 14, 2026
289 of 293 checks passed
@bernhardmgruber bernhardmgruber deleted the tune_warpspeed branch April 14, 2026 15:47