minutes_to_complete: 15
who_is_this_for: This is an introductory topic for developers interested in implementing the exponential function and optimizing it. The Scalable Vector Extension (SVE), introduced with the Armv8-A architecture, includes a dedicated instruction, FEXPA. Although initially not supported in SME, the FEXPA instruction has been made available in Scalable Matrix Extension (SME) version 2.2.

learning_objectives:
- Implement the exponential function using SVE intrinsics
- Optimize the function with FEXPA

prerequisites:
- Access to an [AWS Graviton4, Google Axion, or Azure Cobalt 100 virtual machine from a cloud service provider](/learning-paths/servers-and-cloud-computing/csp/).
- Some familiarity with SIMD programming and SVE intrinsics.

author:
- Alexandre Romana

further_reading:
- resource:
title: Arm Optimized Routines
link: https://github.com/ARM-software/optimized-routines
type: website
- resource:
title: Scalable Vector Extensions documentation
link: https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions
layout: learningpathall
---

## Conclusion
The SVE FEXPA instruction can speed up the computation of the exponential function by implementing table lookup and bit manipulation. The exponential function is the core of the Softmax function, which, with the shift toward generative AI, has become a critical component of modern neural network architectures.

An implementation of the exponential function based on FEXPA can achieve a given target precision with a polynomial of lower degree than alternative implementations require. Moreover, SME support for FEXPA lets you embed the exponential approximation directly into the matrix computation path, which translates into:
- Fewer instructions (no back-and-forth to scalar/SVE code)
- Potentially higher aggregate throughput (more exponentials per cycle)
- Lower power & bandwidth (data being kept in the SME engine)
- Cleaner fusion with GEMM/GEMV workloads

All of which makes exponential-heavy workloads significantly faster on Arm CPUs.

In SVE, Arm introduced an instruction called FEXPA: the Floating-Point Exponential Accelerator.

Let’s segment the fraction part of the IEEE 754 floating-point representation into several sub-fields (Index, Exp, and Remaining bits) with respective lengths of _Idxb_, _Expb_, and _Remb_ bits.

| IEEE 754 precision | Idxb | Expb | Remb |
|-------------------------|------|------|------|
| Half precision (FP16) | 5 | 5 | 0 |
| Single precision (FP32) | 6 | 8 | 9 |
$$ e^x = 2^m \times T[j] \times (1 + p(r)) $$
With a table of size 2^L, the evaluation interval for the approximation polynomial is narrowed by a factor of 2^L. This reduction leads to improved accuracy for a given polynomial degree due to the narrower approximation range. Alternatively, for a given accuracy target, the degree of the polynomial—and hence its computational complexity—can be reduced.

## Exponential implementation with FEXPA

FEXPA can be used to rapidly perform the table lookup. With this instruction a degree-2 polynomial is sufficient to obtain the same accuracy as the degree-4 polynomial implementation from the previous section.

### Add the FEXPA implementation

Open your `exp_sve.c` file and add the following function after the `exp_sve()` function:

```C
// SVE exponential implementation with FEXPA (degree-2 polynomial)
void exp_sve_fexpa(float *x, float *y, size_t n) {
// FEXPA-specific coeffs
const float c0_fexpa = 1.000003695487976f; // 0x1.00003ep0
const float c1_fexpa = 0.5000003576278687f; // 0x1.00000cp-1
const float shift_fexpa = 196735.0f; // 0x1.803f8p17f

size_t i = 0;

const svfloat32_t ln2lo_vec = svdup_f32(ln2_lo);

while (i < n) {
const svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
svfloat32_t x_vec = svld1(pg, &x[i]);

/* Compute k as round(x/ln2) using shift = 1.5*2^(23-6) + 127 */
svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x_vec, svdup_f32(shift_fexpa));
svfloat32_t k = svsub_x(pg, z, svdup_f32(shift_fexpa));

/* Compute r as x - k*ln2 with Cody and Waite */
svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x_vec);
r = svmls_lane_f32(r, k, ln2lo_vec, 0);

/* Compute the scaling factor 2^k using FEXPA */
svfloat32_t scale = svexpa(svreinterpret_u32_f32(z));

/* Compute poly(r) = exp(r) - 1 (degree-2 polynomial) */
svfloat32_t p01 = svmla_x(pg, svdup_f32(c0_fexpa), r, svdup_f32(c1_fexpa)); // c0 + c1 * r
svfloat32_t poly = svmul_x(pg, r, p01); // r * (c0 + c1 * r)

/* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */
svfloat32_t result = svmla_f32_x(pg, scale, poly, scale);

svst1(pg, &y[i], result);
i += svcntw();
}
}
```

{{% notice Arm Optimized Routines %}}
This implementation can be found in [ARM Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/ba35b32/math/aarch64/sve/sv_expf_inline.h).
{{% /notice %}}


Now register this new implementation in the `implementations` array in `main()`. Find this section:

```C
exp_impl_t implementations[] = {
{"Baseline (expf)", exp_baseline},
{"SVE (degree-4 poly)", exp_sve},
// Add more implementations here as you develop them
};
```

Add your FEXPA implementation to the array:

```C
exp_impl_t implementations[] = {
{"Baseline (expf)", exp_baseline},
{"SVE (degree-4 poly)", exp_sve},
{"SVE+FEXPA (degree-2)", exp_sve_fexpa},
};
```

## Compile and compare

Recompile the program:

```bash
gcc -O3 -march=armv8-a+sve exp_sve.c -o exp_sve -lm
```

Run the benchmark:

```bash
./exp_sve
```

The output shows the final comparison:

```output
Performance Results:
Implementation Time (sec) Speedup vs Baseline
------------- ----------- -------------------
Baseline (expf) 0.002462 1.00x
SVE (degree-4 poly) 0.000578 4.26x
SVE+FEXPA (degree-2) 0.000414 5.95x
```

## Results analysis

The benchmark shows the performance progression:

1. **SVE with degree-4 polynomial**: Provides roughly a 4x speedup through vectorization
2. **SVE with FEXPA and degree-2 polynomial**: Achieves a further ~1.4x improvement on top of that

The FEXPA instruction delivers this improvement by:
- Replacing manual bit manipulation with a single hardware instruction (`svexpa()`)
- Enabling a simpler polynomial (degree-2 instead of degree-4) while maintaining accuracy

Both SVE implementations maintain comparable accuracy (errors in the 10^-9 to 10^-10 range), demonstrating that specialized hardware instructions can significantly improve performance without sacrificing precision.