minutes_to_complete: 15
who_is_this_for: This is an introductory topic for developers interested in implementing the exponential function and optimizing it. The Scalable Vector Extension (SVE), introduced with the Armv8-A architecture, includes a dedicated instruction, FEXPA. Although initially not supported in SME, the FEXPA instruction has been made available in Scalable Matrix Extension (SME) version 2.2.

learning_objectives:
- Implement the exponential function using SVE intrinsics
- Optimize the function with FEXPA

prerequisites:
- Access to an [AWS Graviton4, Google Axion, or Azure Cobalt 100 virtual machine from a cloud service provider](/learning-paths/servers-and-cloud-computing/csp/).
- Some familiarity with SIMD programming and SVE intrinsics.

author:
- Alexandre Romana

further_reading:
- resource:
title: Arm Optimized Routines
link: https://github.com/ARM-software/optimized-routines
type: website
- resource:
title: Scalable Vector Extensions documentation
link: https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions
layout: learningpathall
---

## Conclusion
The SVE FEXPA instruction can speed up the computation of the exponential function by implementing table lookup and bit manipulation. The exponential function is the core of the Softmax function, which, with the shift toward generative AI, has become a critical component of modern neural network architectures.

An implementation of the exponential function based on FEXPA can achieve a given target precision with a polynomial of lower degree than alternative implementations require. Moreover, SME support for FEXPA lets you embed the exponential approximation directly into the matrix computation path, which translates into:
- Fewer instructions (no back-and-forth to scalar/SVE code)
- Potentially higher aggregate throughput (more exponentials per cycle)
- Lower power & bandwidth (data being kept in the SME engine)
- Cleaner fusion with GEMM/GEMV workloads

All of which makes exponential-heavy workloads significantly faster on Arm CPUs.

In SVE, Arm introduced an instruction called FEXPA: the Floating-Point Exponential Accelerator.

Let’s segment the fraction part of the IEEE 754 floating-point representation into several sub-fields (Index, Exp, and Remaining bits) with respective lengths of _Idxb_, _Expb_, and _Remb_ bits.

| IEEE 754 precision | Idxb | Expb | Remb |
|-------------------------|------|------|------|
| Half precision (FP16) | 5 | 5 | 0 |
| Single precision (FP32) | 6 | 8 | 9 |
$$ e^x = 2^m \times T[j] \times (1 + p(r)) $$
With a table of size 2^L, the evaluation interval for the approximation polynomial is narrowed by a factor of 2^L. This reduction leads to improved accuracy for a given polynomial degree due to the narrower approximation range. Alternatively, for a given accuracy target, the degree of the polynomial—and hence its computational complexity—can be reduced.

## Exponential implementation with FEXPA

FEXPA can be used to rapidly perform the table lookup. With this instruction a degree-2 polynomial is sufficient to obtain the same accuracy as the degree-4 polynomial implementation from the previous section.

### Add the FEXPA implementation

Open your `exp_sve.c` file and add the following function after the `exp_sve()` function:

```C
// SVE exponential implementation with FEXPA (degree-2 polynomial)
void exp_sve_fexpa(float *x, float *y, size_t n) {
// FEXPA-specific coeffs
const float c0_fexpa = 1.000003695487976f; // 0x1.00003ep0
const float c1_fexpa = 0.5000003576278687f; // 0x1.00000cp-1
const float shift_fexpa = 196735.0f; // 0x1.803f8p17f

size_t i = 0;

const svfloat32_t ln2lo_vec = svdup_f32(ln2_lo);

while (i < n) {
const svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
svfloat32_t x_vec = svld1(pg, &x[i]);

/* Compute k as round(x/ln2) using shift = 1.5*2^(23-6) + 127 */
svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x_vec, svdup_f32(shift_fexpa));
svfloat32_t k = svsub_x(pg, z, svdup_f32(shift_fexpa));

/* Compute r as x - k*ln2 with Cody and Waite */
svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x_vec);
r = svmls_lane_f32(r, k, ln2lo_vec, 0);

/* Compute the scaling factor 2^k using FEXPA */
svfloat32_t scale = svexpa(svreinterpret_u32_f32(z));

/* Compute poly(r) = exp(r) - 1 (degree-2 polynomial) */
svfloat32_t p01 = svmla_x(pg, svdup_f32(c0_fexpa), r, svdup_f32(c1_fexpa)); // c0 + c1 * r
svfloat32_t poly = svmul_x(pg, r, p01); // r * (c0 + c1 * r)

/* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */
svfloat32_t result = svmla_f32_x(pg, scale, poly, scale);

svst1(pg, &y[i], result);
i += svcntw();
}
}
```

{{% notice Arm Optimized Routines %}}
This implementation can be found in [ARM Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/ba35b32/math/aarch64/sve/sv_expf_inline.h).
{{% /notice %}}


Now register this new implementation in the `implementations` array in `main()`. Find this section:

```C
exp_impl_t implementations[] = {
{"Baseline (expf)", exp_baseline},
{"SVE (degree-4 poly)", exp_sve},
// Add more implementations here as you develop them
};
```

Add your FEXPA implementation to the array:

```C
exp_impl_t implementations[] = {
{"Baseline (expf)", exp_baseline},
{"SVE (degree-4 poly)", exp_sve},
{"SVE+FEXPA (degree-2)", exp_sve_fexpa},
};
```

## Compile and compare

Recompile the program:

```bash
gcc -O3 -march=armv8-a+sve exp_sve.c -o exp_sve -lm
```

Run the benchmark:

```bash
./exp_sve
```

The output shows the final comparison:

```output
Performance Results:
Implementation Time (sec) Speedup vs Baseline
------------- ----------- -------------------
Baseline (expf) 0.002462 1.00x
SVE (degree-4 poly) 0.000578 4.26x
SVE+FEXPA (degree-2) 0.000414 5.95x
```

## Results analysis

The benchmark shows the performance progression:

1. **SVE with degree-4 polynomial**: Provides roughly a 4x speedup through vectorization
2. **SVE with FEXPA and degree-2 polynomial**: Achieves a further ~1.4x improvement on top of that

The FEXPA instruction delivers this improvement by:
- Replacing manual bit manipulation with a single hardware instruction (`svexpa()`)
- Enabling a simpler polynomial (degree-2 instead of degree-4) while maintaining accuracy

Both SVE implementations maintain comparable accuracy (errors in the 10^-9 to 10^-10 range), demonstrating that specialized hardware instructions can significantly improve performance without sacrificing precision.