⚡ Thunderbolt: softmax — Single FMA Range Reduction by bugparty · Pull Request #40 · bugparty/cpu_math_kernels_pri

bugparty · 2026-05-22T20:08:49Z

💡 What: Added softmax_v6, which uses a new exp256_ps_v3 evaluation function. It replaces the two-part range reduction r = x - n * ln2_hi - n * ln2_lo with a single _mm256_fnmadd_ps.

🎯 Why: Splitting ln(2) into high and low parts for the range reduction costs an extra instruction on the dependency chain and extra registers. Softmax subtracts the maximum array element from every input before exponentiating, so exp256 only sees shifted, non-positive arguments, and the small precision loss from a single-constant reduction has a negligible effect on the softmax output.
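A back-of-the-envelope bound on what the single constant costs, offered as an estimate rather than a measurement from this PR. It assumes the arguments reaching exp256 lie in roughly [-87.3, 0] after the max subtraction, so that n lies in [-126, 0], and it writes c for the binary32 value nearest to ln(2). The symbols c and δ_r are introduced here for illustration only.

```latex
\begin{aligned}
c &= \mathrm{fl}_{32}(\ln 2) = 0.693147182464599609375, \qquad c - \ln 2 \approx 1.905 \times 10^{-9} \\
|\delta_r| &= |n|\,(c - \ln 2) \le 126 \times 1.905 \times 10^{-9} \approx 2.4 \times 10^{-7} \\
\frac{\left| e^{\,r + \delta_r} - e^{\,r} \right|}{e^{\,r}} &\approx |\delta_r| \approx 2.4 \times 10^{-7} \ll 10^{-4}
\end{aligned}
```

The estimate ignores the single rounding of the FMA result and the polynomial approximation error, both of which are present with or without the high/low split.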

🏗️ How: exp256_ps_v3, introduced alongside softmax_v6, computes r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x); in place of the two chained fnmadd_ps operations in exp256_ps_v2.
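The difference between the two reductions can be sketched in scalar C++, with std::fma standing in for one lane of _mm256_fnmadd_ps. The high/low constants below are the well-known Cephes-style split and are an assumption; the values actually used by exp256_ps_v2 are not shown in this PR. The function names are hypothetical.

```cpp
#include <cmath>

// Assumed Cephes-style split of ln(2); not taken from exp256_ps_v2.
constexpr float kLn2Hi = 0.693359375f;
constexpr float kLn2Lo = -2.12194440e-4f;

// Single constant, as quoted in the PR description for exp256_ps_v3.
constexpr float kLn2 = 0.6931471805599453f;

// Round x / ln(2) to the nearest integer, kept as a float.
inline float nearest_multiple(float x) {
    return std::nearbyint(x * 1.44269504088896341f);
}

// Two chained fused negate-multiply-adds: r = (x - n*hi) - n*lo.
// The second operation cannot start until the first has finished.
inline float reduce_two_step(float x, float n) {
    const float r = std::fma(-n, kLn2Hi, x);
    return std::fma(-n, kLn2Lo, r);
}

// One fused negate-multiply-add: r = x - n*ln2.
inline float reduce_single_fma(float x, float n) {
    return std::fma(-n, kLn2, x);
}

// Largest absolute gap between the two reductions on a uniform grid
// over [lo, hi]; steps must be at least 2.
inline float max_reduction_gap(float lo, float hi, int steps) {
    float worst = 0.0f;
    for (int i = 0; i < steps; ++i) {
        const float x = lo + (hi - lo) * static_cast<float>(i) /
                                 static_cast<float>(steps - 1);
        const float n = nearest_multiple(x);
        const float gap =
            std::fabs(reduce_two_step(x, n) - reduce_single_fma(x, n));
        if (gap > worst) worst = gap;
    }
    return worst;
}
```

Calling max_reduction_gap(-80.0f, 0.0f, 801) gives a quick sense of how far apart the two reduced arguments can drift over the range softmax feeds into exp.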

📊 Impact: Benchmarks show throughput on 1M-element arrays (N=1048576) rising from 3.73 GFLOP/s to 3.97 GFLOP/s, a gain of about 6.4%. Tests comparing the output against the reference implementation show no value deviating by more than the 1e-4 tolerance.

🖥️ Tested on: Linux sandbox, AVX2-capable Haswell-microarchitecture CPU.

🔬 How to reproduce:

DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter 'softmax' --iters 500 --warmup 50 --sizes 1048576

PR created automatically by Jules for task 17482068372732358033 started by @bugparty

Summary by CodeRabbit

New Features
- Added an optimized softmax implementation that improves computational throughput through enhanced exponential calculations and refined range reduction techniques.
Tests
- Added comprehensive tests validating the new implementation against reference implementations, including probability normalization verification.
Documentation
- Added development notes documenting the optimization approach, including detailed performance analysis and correctness considerations.

…ingle FMA

Added `softmax_v6` utilizing `exp256_ps_v3`, which combines the `r = x - n * ln(2)` range reduction step into a single `_mm256_fnmadd_ps` instruction. Precision loss from avoiding the high/low split is acceptable within the `1e-4` tolerance due to softmax's shift-invariance. This reduces instruction latency and dependency chain size on the critical path. Measured an improvement from 3.73 GFLOP/s to 3.97 GFLOP/s for N=1048576 arrays.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

google-labs-jules · 2026-05-22T20:08:50Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-05-22T20:09:01Z

📝 Walkthrough


A new AVX2 softmax kernel (softmax_v6) is introduced with optimized range reduction via a single FMA constant, paired with a new exp256_ps_v3 helper, comprehensive tests validating numerical correctness, and benchmark registration for performance measurement.

Changes

Softmax v6 optimization

| Layer / File(s) | Summary |
| --- | --- |
| Optimization documentation `.jules/thunderbolt.md` | Journal entry describing the softmax optimization strategy combining range-reduction constants into a single FMA, with benchmark evidence and usage checklist for shift-invariant domains. |
| Core softmax optimization `ml_kernels/include/ml_kernels/softmax.h` | `exp256_ps_v3` performs exponential computation via single-FMA range reduction (`cvtps_epi32` rounding, `FNMADD` with ln(2), Horner-style FMA polynomial chain). `softmax_v6` processes 32 elements per loop iteration using four independent accumulators for max and sum reduction, scalar tail via `std::exp`, and normalization by reciprocal multiplication. |
| Testing and benchmarking `ml_kernels/src/test_naive_ops.cpp`, `ml_kernels/src/kernel_bench.cpp` | `test_softmax_v6()` validates output correctness against `softmax_naive` and verifies probability sum ≈ 1. `SoftmaxV6Benchmark` subclass registers the new kernel in the benchmark suite for performance comparison. |
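The stages attributed to `exp256_ps_v3` (rounding, single-FMA reduction, Horner chain) can be modeled one lane at a time in portable C++. This is a sketch under assumptions, not the kernel's code: the degree-6 Taylor coefficients and the `std::ldexp` scaling step are illustrative choices, and `exp_model` and `max_rel_error` are hypothetical names.

```cpp
#include <cmath>

// Scalar model of one lane. Coefficients are plain Taylor terms, assumed
// for illustration; the kernel's actual polynomial is not shown in this PR.
inline float exp_model(float x) {
    // 1. Round x / ln(2) to the nearest integer. cvtps_epi32 also rounds
    //    to nearest under the default MXCSR rounding mode.
    const float n = std::nearbyint(x * 1.44269504088896341f);

    // 2. Single-FMA range reduction: r = x - n * ln(2), |r| <= about 0.35.
    const float r = std::fma(-n, 0.6931471805599453f, x);

    // 3. Horner evaluation of 1 + r + r^2/2! + ... + r^6/6! with FMAs.
    float p = 1.0f / 720.0f;
    p = std::fma(p, r, 1.0f / 120.0f);
    p = std::fma(p, r, 1.0f / 24.0f);
    p = std::fma(p, r, 1.0f / 6.0f);
    p = std::fma(p, r, 0.5f);
    p = std::fma(p, r, 1.0f);
    p = std::fma(p, r, 1.0f);

    // 4. Scale by 2^n. std::ldexp is the portable stand-in for building
    //    2^n directly in the float exponent field.
    return std::ldexp(p, static_cast<int>(n));
}

// Largest relative error against a double-precision std::exp on a uniform
// grid over [lo, hi]; steps must be at least 2.
inline double max_rel_error(float lo, float hi, int steps) {
    double worst = 0.0;
    for (int i = 0; i < steps; ++i) {
        const float x = lo + (hi - lo) * static_cast<float>(i) /
                                 static_cast<float>(steps - 1);
        const double ref = std::exp(static_cast<double>(x));
        const double err =
            std::fabs(static_cast<double>(exp_model(x)) - ref) / ref;
        if (err > worst) worst = err;
    }
    return worst;
}
```

With these assumed coefficients, `max_rel_error(-80.0f, 0.0f, 1000)` reports how closely the model tracks `std::exp` over most of the range a max-shifted softmax produces.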

Sequence Diagram

sequenceDiagram
  participant Input as Input buffer
  participant VectorMax as Vector max<br/>(4 accumulators)
  participant Reduce_Max as reduce_max
  participant VectorExp as exp256_ps_v3<br/>(4 accumulators)
  participant Reduce_Sum as reduce_sum
  participant ScalarExp as std::exp<br/>(tail)
  participant Normalize as Normalize<br/>(reciprocal · output)
  participant Output as Output buffer
  Input->>VectorMax: load 32 elements per iteration
  VectorMax->>Reduce_Max: reduce four accumulators
  Input->>VectorExp: compute exp for 32 elements per iteration
  VectorExp->>Output: store exponent results
  VectorExp->>Reduce_Sum: accumulate four sums
  Reduce_Sum->>ScalarExp: reduce to scalar, process tail
  ScalarExp->>Normalize: final sum
  Normalize->>Output: multiply by 1/sum
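The three passes in the diagram can be sketched in scalar C++. This is a structural model only, under the assumption that each of the four accumulators stands in for one 8-lane AVX2 register, so the main loop here consumes 4 elements where softmax_v6 consumes 32. It uses std::exp throughout instead of exp256_ps_v3, and the names softmax_model and total are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

inline std::vector<float> softmax_model(const std::vector<float>& in) {
    const std::size_t n = in.size();
    std::vector<float> out(n);
    if (n == 0) return out;

    // Pass 1: running max in four independent accumulators, then reduce.
    float m[4];
    std::fill(m, m + 4, -std::numeric_limits<float>::infinity());
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (int k = 0; k < 4; ++k) m[k] = std::max(m[k], in[i + k]);
    float max_val = std::max(std::max(m[0], m[1]), std::max(m[2], m[3]));
    for (; i < n; ++i) max_val = std::max(max_val, in[i]);  // tail

    // Pass 2: store exp(x - max) and accumulate four partial sums.
    float s[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (i = 0; i + 4 <= n; i += 4) {
        for (int k = 0; k < 4; ++k) {
            out[i + k] = std::exp(in[i + k] - max_val);
            s[k] += out[i + k];
        }
    }
    float sum = (s[0] + s[1]) + (s[2] + s[3]);
    for (; i < n; ++i) {  // scalar tail
        out[i] = std::exp(in[i] - max_val);
        sum += out[i];
    }

    // Pass 3: one reciprocal, then a multiply per element.
    const float inv_sum = 1.0f / sum;
    for (float& v : out) v *= inv_sum;
    return out;
}

// Sum of all elements, used to check that the output is normalized.
inline float total(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}
```

Keeping the accumulators independent means consecutive max or add operations do not wait on each other, which is the usual motivation for unrolling a reduction this way.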

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

bugparty/cpu_math_kernels_pri#31: Introduces softmax_v5 with similar exp256_ps_v* range-reduction pattern using cvtps_epi32 and FMA-based polynomial approximation in the same softmax header.
bugparty/cpu_math_kernels_pri#7: Adds foundational softmax test infrastructure using softmax_naive as reference, which is reused in this PR's correctness validation.

Poem

🐰 A softmax hops with FMA grace,
Four accumulators keep the pace,
Exp256 whispers through the lane,
One constant merges, reduced strain,
Probabilities bloom, normalized and bright! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title clearly and concisely summarizes the main change: introducing single FMA range reduction for the softmax kernel (v6). It directly corresponds to the primary optimization described in the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

🧹 Nitpick comments (1)

ml_kernels/src/test_naive_ops.cpp (1)
184-211: ⚡ Quick win

Consider adding edge-case coverage for scalar tail and 8-element remainder paths.

The test input has exactly 32 elements, which exercises only the main 32-element unrolled loop. The 8-element loop (lines 235-240 in softmax.h) and scalar tail (lines 243-247) are not exercised. Adding inputs with sizes like 33 or 41 elements would improve coverage.

This matches the existing test pattern for v3–v5, so it's optional to address now.
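Which input sizes reach which code path can be worked out with integer arithmetic. The sketch below assumes the kernel runs the 32-element loop, then the 8-element loop, then the scalar tail, in that order, as the comment above describes; `PathCounts` and `count_paths` are hypothetical names, not part of the repository.

```cpp
#include <cstddef>

// Work handled by each stage for an input of n elements.
struct PathCounts {
    std::size_t main32;  // iterations of the 32-element unrolled loop
    std::size_t rem8;    // iterations of the 8-element loop
    std::size_t tail;    // elements left for the scalar tail
};

inline PathCounts count_paths(std::size_t n) {
    return PathCounts{n / 32, (n % 32) / 8, n % 8};
}
```

Under that assumption, N = 32 touches only the main loop, N = 33 adds the scalar tail but skips the 8-element loop, and N = 41 reaches all three stages, so a size such as 40 or 41 is needed to cover the 8-element path.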
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 184 - 211, The test
test_softmax_v6 only covers a 32-element input and misses the 8-element
remainder and scalar tail paths in softmax_v6; add additional cases in
test_softmax_v6 (or new tests) that call ml_kernels::softmax_v6 with input sizes
that exercise the 8-element remainder and scalar tail (e.g., 33 and 41 elements)
and validate output equality with ml_kernels::softmax_naive and that the result
sums to 1.0f, similar to the existing assertions.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 92d9ca68-3b58-4119-a8a6-36ad72541248

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and ebe219f.

📒 Files selected for processing (4)

.jules/thunderbolt.md
ml_kernels/include/ml_kernels/softmax.h
ml_kernels/src/kernel_bench.cpp
ml_kernels/src/test_naive_ops.cpp

coderabbitai Bot reviewed May 22, 2026

bugparty wants to merge 1 commit into main from thunderbolt-softmax-single-fma-17482068372732358033
