⚡ Thunderbolt: softmax — Single FMA Range Reduction#40

Open
bugparty wants to merge 1 commit into main from thunderbolt-softmax-single-fma-17482068372732358033

@bugparty (Owner) commented May 22, 2026

💡 What: Added softmax_v6, which uses a new exp256_ps_v3 evaluation function. It replaces the two-part range reduction r = x - n * ln2_hi - n * ln2_lo with a single _mm256_fnmadd_ps.

🎯 Why: Splitting ln 2 into high and low parts for the range reduction costs extra execution cycles and registers. Because the softmax kernel subtracts the maximum array element from every input before exponentiating, it only ever evaluates exp256 on shifted, non-positive arguments, and the small accuracy loss in exp256 produces negligible changes in the softmax output.
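The shift-invariance argument can be written out explicitly. This is a sketch of the reasoning, not an error analysis of the actual kernel: it ignores rounding in the sum and in the final multiplication, and the bound $\varepsilon$ on the relative error of the approximate exponential is an assumed quantity. With $m = \max_j x_j$:

$$
\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}, \qquad x_j - m \le 0 .
$$

If the approximate exponential returns $\tilde e_j = e^{x_j - m}(1 + \varepsilon_j)$ with $|\varepsilon_j| \le \varepsilon < 1$, then the computed output $\tilde y_i = \tilde e_i / \sum_j \tilde e_j$ satisfies

$$
\frac{1 - \varepsilon}{1 + \varepsilon} \;\le\; \frac{\tilde y_i}{y_i} \;\le\; \frac{1 + \varepsilon}{1 - \varepsilon},
$$

so the relative error of each softmax output is about $2\varepsilon$ to first order. Since every output is at most 1, the absolute error is bounded by the same quantity.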

🏗️ How: exp256_ps_v3 was introduced alongside softmax_v6. It computes r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x), replacing the sequence of two dependent fnmadd_ps operations in exp256_ps_v2.
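To make the difference between the two reductions concrete, here is a minimal scalar sketch. It models one lane of `_mm256_fnmadd_ps` with `std::fma`. The `ln2_hi`/`ln2_lo` values are the widely used Cephes split of ln 2 and are an assumption, since the PR does not show the constants used by exp256_ps_v2. The helper names (`fnmadd`, `reduce_n`, `reduce_two_step`, `reduce_single`, `max_reduction_gap`) are hypothetical.

```cpp
#include <cmath>

// Scalar model of one lane of _mm256_fnmadd_ps(a, b, c): -(a * b) + c,
// computed with a single rounding.
inline float fnmadd(float a, float b, float c) { return std::fma(-a, b, c); }

// n = round(x / ln 2). std::nearbyint rounds to nearest-even in the default
// rounding mode, which models _mm256_cvtps_epi32 followed by a convert back.
inline float reduce_n(float x) { return std::nearbyint(x * 1.44269504088896341f); }

// Two-constant reduction: two FMAs, the second depending on the first.
// ASSUMPTION: Cephes-style hi/lo split; the PR's actual constants are not shown.
inline float reduce_two_step(float x) {
    const float ln2_hi = 0.693359375f;
    const float ln2_lo = -2.12194440e-4f;
    const float n = reduce_n(x);
    const float r = fnmadd(n, ln2_hi, x);
    return fnmadd(n, ln2_lo, r);
}

// Single-constant reduction as described for exp256_ps_v3: one FMA.
inline float reduce_single(float x) {
    const float ln2 = 0.6931471805599453f;
    const float n = reduce_n(x);
    return fnmadd(n, ln2, x);
}

// Largest absolute difference between the two reductions on [lo, hi].
inline float max_reduction_gap(float lo, float hi, int steps) {
    float worst = 0.0f;
    for (int i = 0; i <= steps; ++i) {
        const float x = lo + (hi - lo) * static_cast<float>(i) / static_cast<float>(steps);
        worst = std::fmax(worst, std::fabs(reduce_two_step(x) - reduce_single(x)));
    }
    return worst;
}
```

Calling `max_reduction_gap(-87.0f, 0.0f, 100000)` scans the non-positive range that a max-subtracted softmax feeds into the exponential. In this model the gap grows with |n| because the rounding error of the single float constant is multiplied by n, which is why the trade-off is only attractive when the inputs are bounded as they are here.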

📊 Impact: Benchmarks show throughput on 1M-element arrays (N=1048576) rising from 3.73 GFLOP/s to 3.97 GFLOP/s, a gain of roughly 6.4%. Correctness tests against the reference implementation show no output deviating by more than the 1e-4 tolerance.

🖥️ Tested on: Linux Sandbox, AVX2 compatible Haswell microarchitecture runtime.

🔬 How to reproduce:

DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter 'softmax' --iters 500 --warmup 50 --sizes 1048576

PR created automatically by Jules for task 17482068372732358033 started by @bugparty

Summary by CodeRabbit

  • New Features

    • Added an optimized softmax implementation that improves computational throughput through enhanced exponential calculations and refined range reduction techniques.
  • Tests

    • Added comprehensive tests validating the new implementation against reference implementations, including probability normalization verification.
  • Documentation

    • Added development notes documenting the optimization approach, including detailed performance analysis and correctness considerations.

…ingle FMA

Added `softmax_v6` utilizing `exp256_ps_v3` which combines the `r = x - n * ln(2)`
range reduction step into a single `_mm256_fnmadd_ps` instruction. Precision loss
from avoiding the high/low split is acceptable within `1e-4` precision tolerance
due to softmax's shift-invariance. This reduces instruction latency and dependency
chain size on the critical path.

Measured an improvement from 3.73 GFLOP/s to 3.97 GFLOP/s for N=1048576 arrays.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules (Contributor)

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai coderabbitai Bot commented May 22, 2026

📝 Walkthrough

A new AVX2 softmax kernel (softmax_v6) is introduced with optimized range reduction via a single FMA constant, paired with a new exp256_ps_v3 helper, comprehensive tests validating numerical correctness, and benchmark registration for performance measurement.

Changes

Softmax v6 optimization

| Layer / File(s) | Summary |
| --- | --- |
| Optimization documentation: `.jules/thunderbolt.md` | Journal entry describing the softmax optimization strategy combining range-reduction constants into a single FMA, with benchmark evidence and usage checklist for shift-invariant domains. |
| Core softmax optimization: `ml_kernels/include/ml_kernels/softmax.h` | `exp256_ps_v3` performs exponential computation via single-FMA range reduction (cvtps_epi32 rounding, FNMADD with ln(2), Horner-style FMA polynomial chain). `softmax_v6` processes 32 elements per loop iteration using four independent accumulators for max and sum reduction, scalar tail via `std::exp`, and normalization by reciprocal multiplication. |
| Testing and benchmarking: `ml_kernels/src/test_naive_ops.cpp`, `ml_kernels/src/kernel_bench.cpp` | `test_softmax_v6()` validates output correctness against `softmax_naive` and verifies probability sum ≈ 1. `SoftmaxV6Benchmark` subclass registers the new kernel in the benchmark suite for performance comparison. |
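The three stages listed for `exp256_ps_v3` (round to nearest integer, single-FMA reduction, Horner-style FMA chain) can be illustrated with a one-lane scalar model. This is a sketch under stated assumptions, not the PR's code: the polynomial here uses plain Taylor coefficients 1/k! up to degree 6, and `std::ldexp` stands in for however the kernel builds 2^n. The name `exp_model` is hypothetical.

```cpp
#include <cmath>

// One-lane scalar model of an exp256_ps_v3-style evaluation.
inline float exp_model(float x) {
    const float log2e = 1.44269504088896341f;
    const float ln2   = 0.6931471805599453f;

    // Stage 1: n = round(x / ln 2), modelling _mm256_cvtps_epi32.
    const float n = std::nearbyint(x * log2e);

    // Stage 2: single-FMA range reduction r = x - n * ln2,
    // modelling _mm256_fnmadd_ps(n, ln2, x). |r| is about ln(2)/2 at most.
    const float r = std::fma(-n, ln2, x);

    // Stage 3: Horner-style FMA chain approximating e^r.
    // ASSUMPTION: Taylor coefficients; the PR's polynomial is not shown.
    float p = 1.0f / 720.0f;
    p = std::fma(p, r, 1.0f / 120.0f);
    p = std::fma(p, r, 1.0f / 24.0f);
    p = std::fma(p, r, 1.0f / 6.0f);
    p = std::fma(p, r, 0.5f);
    p = std::fma(p, r, 1.0f);
    p = std::fma(p, r, 1.0f);

    // Scale by 2^n. A SIMD kernel would typically assemble 2^n from integer
    // exponent bits; std::ldexp is used here only for clarity.
    return std::ldexp(p, static_cast<int>(n));
}
```

Each `std::fma` in the chain depends on the previous one, so the polynomial is a serial dependency chain per lane. That is the context in which removing one FMA from the reduction step shortens the critical path.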

Sequence Diagram

sequenceDiagram
  participant Input as Input buffer
  participant VectorMax as Vector max<br/>(4 accumulators)
  participant Reduce_Max as reduce_max
  participant VectorExp as exp256_ps_v3<br/>(4 accumulators)
  participant Reduce_Sum as reduce_sum
  participant ScalarExp as std::exp<br/>(tail)
  participant Normalize as Normalize<br/>(reciprocal · output)
  participant Output as Output buffer
  Input->>VectorMax: load 32 elements per iteration
  VectorMax->>Reduce_Max: reduce four accumulators
  Input->>VectorExp: compute exp for 32 elements per iteration
  VectorExp->>Output: store exponent results
  VectorExp->>Reduce_Sum: accumulate four sums
  Reduce_Sum->>ScalarExp: reduce to scalar, process tail
  ScalarExp->>Normalize: final sum
  Normalize->>Output: multiply by 1/sum
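The pipeline in the diagram can be sketched in scalar C++ as follows. This is a simplified model, not the `softmax_v6` source: each group of 8 floats stands in for one `__m256` register, `std::exp` stands in for `exp256_ps_v3`, and the separate 8-element remainder loop mentioned in the review is folded into the scalar tail. The name `softmax_blocked` is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

inline void softmax_blocked(const float* in, float* out, std::size_t n) {
    if (n == 0) return;
    constexpr std::size_t kLanes = 8;            // floats per modelled __m256
    constexpr std::size_t kBlock = 4 * kLanes;   // 32 elements per iteration
    const std::size_t main_end = n - n % kBlock;

    // Pass 1: running max in four independent accumulators, then reduce.
    float m[4] = {in[0], in[0], in[0], in[0]};
    for (std::size_t i = 0; i < main_end; i += kBlock)
        for (std::size_t a = 0; a < 4; ++a)
            for (std::size_t l = 0; l < kLanes; ++l)
                m[a] = std::max(m[a], in[i + kLanes * a + l]);
    float max_val = std::max(std::max(m[0], m[1]), std::max(m[2], m[3]));
    for (std::size_t i = main_end; i < n; ++i) max_val = std::max(max_val, in[i]);

    // Pass 2: store exp(x - max) and accumulate four independent partial sums.
    float s[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < main_end; i += kBlock)
        for (std::size_t a = 0; a < 4; ++a)
            for (std::size_t l = 0; l < kLanes; ++l) {
                const std::size_t idx = i + kLanes * a + l;
                const float e = std::exp(in[idx] - max_val);
                out[idx] = e;
                s[a] += e;
            }
    float sum = (s[0] + s[1]) + (s[2] + s[3]);

    // Scalar tail for the elements the 32-wide loop did not cover.
    for (std::size_t i = main_end; i < n; ++i) {
        const float e = std::exp(in[i] - max_val);
        out[i] = e;
        sum += e;
    }

    // Pass 3: normalize by multiplying with the reciprocal of the sum.
    const float inv = 1.0f / sum;
    for (std::size_t i = 0; i < n; ++i) out[i] *= inv;
}
```

Keeping four accumulators means consecutive max or add operations do not wait on each other, which is the usual reason for this kind of unrolling. The partial results are only combined once, after the main loop.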

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • bugparty/cpu_math_kernels_pri#31: Introduces softmax_v5 with similar exp256_ps_v* range-reduction pattern using cvtps_epi32 and FMA-based polynomial approximation in the same softmax header.
  • bugparty/cpu_math_kernels_pri#7: Adds foundational softmax test infrastructure using softmax_naive as reference, which is reused in this PR's correctness validation.

Poem

🐰 A softmax hops with FMA grace,
Four accumulators keep the pace,
Exp256 whispers through the lane,
One constant merges, reduced strain,
Probabilities bloom, normalized and bright! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title clearly and concisely summarizes the main change: introducing single FMA range reduction for the softmax kernel (v6). It directly corresponds to the primary optimization described in the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
ml_kernels/src/test_naive_ops.cpp (1)

184-211: ⚡ Quick win

Consider adding edge-case coverage for scalar tail and 8-element remainder paths.

The test input has exactly 32 elements, which exercises only the main 32-element unrolled loop. The 8-element loop (lines 235-240 in softmax.h) and scalar tail (lines 243-247) are not exercised. Adding inputs with sizes like 33 or 41 elements would improve coverage.

This matches the existing test pattern for v3–v5, so it's optional to address now.
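A test along the lines suggested above might look like the following sketch. It is self-contained, so it uses a local reference softmax rather than `ml_kernels::softmax_naive`, and it assumes the kernel under test has the signature `void(const float*, float*, std::size_t)`, which the PR does not confirm. The names `softmax_reference`, `matches_reference`, and `covers_tail_paths` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference softmax with max subtraction (stand-in for softmax_naive).
inline void softmax_reference(const float* in, float* out, std::size_t n) {
    if (n == 0) return;
    float max_val = in[0];
    for (std::size_t i = 1; i < n; ++i) max_val = std::fmax(max_val, in[i]);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::exp(in[i] - max_val);
        sum += out[i];
    }
    for (std::size_t i = 0; i < n; ++i) out[i] /= sum;
}

// Runs `kernel` on n deterministic inputs, then checks every output against
// the reference within `tol` and checks that the outputs sum to about 1.
template <typename Kernel>
bool matches_reference(Kernel kernel, std::size_t n, float tol = 1e-4f) {
    std::vector<float> in(n), got(n), want(n);
    for (std::size_t i = 0; i < n; ++i)
        in[i] = 0.25f * static_cast<float>(i % 13) - 1.5f;
    kernel(in.data(), got.data(), n);
    softmax_reference(in.data(), want.data(), n);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::fabs(got[i] - want[i]) > tol) return false;
        sum += got[i];
    }
    return std::fabs(sum - 1.0f) <= tol;
}

// 32: main loop only. 33: main loop plus scalar tail.
// 41: main loop, one 8-element step, and scalar tail.
template <typename Kernel>
bool covers_tail_paths(Kernel kernel) {
    for (std::size_t n : {std::size_t{32}, std::size_t{33}, std::size_t{41}})
        if (!matches_reference(kernel, n)) return false;
    return true;
}
```

With the real kernel available, the call would be something like `covers_tail_paths(ml_kernels::softmax_v6)`, assuming the signature matches. The 1e-4 tolerance mirrors the limit quoted in the PR description.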

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 184 - 211, The test
test_softmax_v6 only covers a 32-element input and misses the 8-element
remainder and scalar tail paths in softmax_v6; add additional cases in
test_softmax_v6 (or new tests) that call ml_kernels::softmax_v6 with input sizes
that exercise the 8-element remainder and scalar tail (e.g., 33 and 41 elements)
and validate output equality with ml_kernels::softmax_naive and that the result
sums to 1.0f, similar to the existing assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 92d9ca68-3b58-4119-a8a6-36ad72541248

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and ebe219f.

📒 Files selected for processing (4)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/softmax.h
  • ml_kernels/src/kernel_bench.cpp
  • ml_kernels/src/test_naive_ops.cpp
