From 932e78b4064dcff48161c4d2085c3e19ce0e1a44 Mon Sep 17 00:00:00 2001 From: Annie Tallund Date: Wed, 17 Dec 2025 11:41:05 -0800 Subject: [PATCH 1/5] Technical review of FEXPA LP --- .../fexpa/_index.md | 6 +- .../fexpa/fexpa.md | 115 +++++++-- .../fexpa/implementation.md | 229 ++++++++++++++++-- .../fexpa/theory.md | 1 + 4 files changed, 310 insertions(+), 41 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md b/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md index 4f64d1d84f..2daed651d0 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md @@ -10,11 +10,11 @@ minutes_to_complete: 15 who_is_this_for: This is an introductory topic for developers interested in implementing the exponential function and optimizing it. The Scalable Vector Extension (SVE), introduced with the Armv8-A architecture, includes a dedicated instruction, FEXPA. Although initially not supported in SME, the FEXPA instruction has been made available in Scalable Matrix Extension (SME) version 2.2. learning_objectives: - - Implementing with SVE intrinsics the exponential function - - Optimizing it with FEXPA + - Implement the exponential function using SVE intrinsics + - Optimize the function with FEXPA prerequisites: - - An AArch64 computer running Linux or macOS. You can use cloud instances, refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/) for a list of cloud service providers. + - Access to an [AWS Graviton4, Google Axion, or Azure Cobalt 100 virtual machine from a cloud service provider](/learning-paths/servers-and-cloud-computing/csp/). - Some familiarity with SIMD programming and SVE intrinsics. 
author: diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md index 0de96ebb24..b4d84508da 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md @@ -10,7 +10,7 @@ layout: learningpathall Arm introduced in SVE an instruction called FEXPA: the Floating Point Exponential Accelerator. -Let’s segment the IEEE754 floating-point representation fraction part into several sub-fields (Index, Exp and Remaining bits) with respective length of Idxb, Expb and Remb bits. +Let’s segment the IEEE754 floating-point representation fraction part into several sub-fields (Index, Exp and Remaining bits) with respective length of _Idxb_, _Expb_ and _Remb_ bits. | IEEE754 precision | Idxb | Expb | Remb | |-------------------------|------|------|------| @@ -18,7 +18,7 @@ Let’s segment the IEEE754 floating-point representation fraction part into sev | Single precision (FP32) | 6 | 8 | 9 | | Double precision (FP64) | 6 | 11 | 35 | -The FEXPA instruction can be described for any real number x ∈ [2^(Remb + Expb) + 1; 2^(Remb + Expb) + 2^Expb - 1) as: +The FEXPA instruction can be described for any real number x ∈ [2^(remBits + expBits) + 1; 2^(remBits + expBits) + 2^expBits - 1) as: $$FEXPA(x)=2^{x-constant}$$ @@ -44,27 +44,106 @@ $$ e^x = 2^m \times T[j] \times (1 + p(r)) $$ With a table of size 2^L, the evaluation interval for the approximation polynomial is narrowed by a factor of 2^L. This reduction leads to improved accuracy for a given polynomial degree due to the narrower approximation range. Alternatively, for a given accuracy target, the degree of the polynomial—and hence its computational complexity—can be reduced. -## Exponential implementation wth FEXPA -FEXPA can be used to rapidly perform the table lookup. 
With this instruction a degree-2 polynomial is sufficient to obtain the same accuracy of the implementation we have seen before: +## Exponential implementation with FEXPA + +FEXPA can be used to rapidly perform the table lookup. With this instruction a degree-2 polynomial is sufficient to obtain the same accuracy as the degree-4 polynomial implementation from the previous section. + +### Add the FEXPA implementation + +Open your `exp_sve.c` file and add the following function after the `exp_sve()` function: ```C -svfloat32_t lane_consts = svld1rq(pg, ln2_lo); // Load only ln2_lo +// SVE exponential implementation with FEXPA (degree-2 polynomial) +void exp_sve_fexpa(float *x, float *y, size_t n) { + float constants[1] = {ln2_lo}; + size_t i = 0; + + svbool_t pg = svptrue_b32(); + + while (i < n) { + pg = svwhilelt_b32((uint64_t)i, (uint64_t)n); + svfloat32_t x_vec = svld1(pg, &x[i]); + + svfloat32_t lane_consts = svld1rq(pg, constants); + + /* Compute k as round(x/ln2) using shift = 1.5*2^(23-6) + 127 */ + svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x_vec, shift); + svfloat32_t k = svsub_x(pg, z, shift); + + /* Compute r as x - k*ln2 with Cody and Waite */ + svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x_vec); + r = svmls_lane(r, k, lane_consts, 0); + + /* Compute the scaling factor 2^k using FEXPA */ + svfloat32_t scale = svexpa(svreinterpret_u32_f32(z)); + + /* Compute poly(r) = exp(r) - 1 (degree-2 polynomial) */ + svfloat32_t p01 = svmla_x(pg, svdup_f32(c0), r, svdup_f32(c1)); // c0 + c1 * r + svfloat32_t poly = svmul_x(pg, r, p01); // r * (c0 + c1 * r) + + /* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */ + svfloat32_t result = svmla_f32_x(pg, scale, poly, scale); + + svst1(pg, &y[i], result); + i += svcntw(); + } +} +``` -/* Compute k as round(x/ln2) using shift = 1.5*2^(23-6) + 127 */ -svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x, shift); -svfloat32_t k = svsub_x(pg, z, shift); +Now register this new implementation in the `implementations` 
array in `main()`. Find this section: -/* Compute r as x - k*ln2 with Cody and Waite */ -svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x); - r = svmls_lane(r, k, lane_consts, 0); +```C + exp_impl_t implementations[] = { + {"Baseline (expf)", exp_baseline}, + {"SVE (degree-4 poly)", exp_sve}, + // Add more implementations here as you develop them + }; +``` -/* Compute the scaling factor 2^k */ -svfloat32_t scale = svexpa(svreinterpret_u32(z)); +Add your FEXPA implementation to the array: -/* Compute poly(r) = exp(r) - 1 (2nd degree polynomial) */ -svfloat32_t p01 = svmla_x (pg, svdup_f32(c0), r, svdup_f32(c1)); // c0 + c1 * r -svfloat32_t poly = svmul_x (pg, r, p01); // r c0 + c1 * r^2 +```C + exp_impl_t implementations[] = { + {"Baseline (expf)", exp_baseline}, + {"SVE (degree-4 poly)", exp_sve}, + {"SVE+FEXPA (degree-2)", exp_sve_fexpa}, + }; +``` + +## Compile and compare -/* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */ -svfloat32_t result = svmla_f32_x(pg, scale, poly, scale); +Recompile the program: + +```bash +gcc -O3 -march=armv8-a+sve exp_sve.c -o exp_sve -lm ``` + +Run the benchmark: + +```bash +./exp_sve +``` + +The output shows the final comparison: + +```output +Performance Results: +Implementation Time (sec) Speedup vs Baseline +------------- ----------- ------------------- +Baseline (expf) 0.004523 1.00x +SVE (degree-4 poly) 0.002187 4.36x +SVE+FEXPA (degree-2) 0.001453 6.09x +``` + +## Results analysis + +The benchmark shows the performance progression: + +1. **SVE with degree-4 polynomial**: Provides up to 4x speedup through vectorization +2. 
**SVE with FEXPA and degree-2 polynomial**: Achieves an additional 1-2x improvement
+
+The FEXPA instruction delivers this improvement by:
+- Replacing manual bit manipulation with a single hardware instruction (`svexpa()`)
+- Enabling a simpler polynomial (degree-2 instead of degree-4) while maintaining accuracy
+
+Both SVE implementations maintain comparable accuracy (errors in the 10^-9 to 10^-10 range), demonstrating that specialized hardware instructions can significantly improve performance without sacrificing precision.
diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md b/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md
index 016a4705b8..62d5cc70b2 100644
--- a/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md
+++ b/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md
@@ -6,31 +6,220 @@ weight: 3
 layout: learningpathall
 ---
 
-## First implementation
-Given what we said in the previous chapters, the exponential function can be implemented with SVE intrinsics in the following way:
+## Implement the exponential function
+
+Based on the theory covered in the previous section, you can implement the exponential function using SVE intrinsics with polynomial approximation. This Learning Path was tested using an AWS Graviton4 instance type `r8g.medium`.
+
+## Set up your environment
+
+To run the example, you will need `gcc`.
+
+```bash
+sudo apt update
+sudo apt -y install gcc
+```
+
+Create a new file named `exp_sve.c` with the following implementation:
 
 ```C
-svfloat32_t lane_consts = svld1rq(pg, constants); // Load ln2_lo, c0, c2, c4 in register
+#include <stdio.h>
+#include <stdlib.h>
+#include <math.h>
+#include <time.h>
+#include <arm_sve.h>
+
+// Constants for exponential computation
+static const float ln2_hi = 0.693359375f;
+static const float ln2_lo = -2.12194440e-4f;
+static const float inv_ln2 = 1.4426950216f;
+static const float shift = 12582912.0f; // 1.5*2^23 + 127
+static const float c0 = 0.999999995f;
+static const float c1 = 1.00000000f;
+static const float c2 = 0.499991506f;
+static const float c3 = 0.166676521f;
+static const float c4 = 0.0416637054f;
+
+// Baseline exponential using standard library
+void exp_baseline(float *x, float *y, size_t n) {
+    for (size_t i = 0; i < n; i++) {
+        y[i] = expf(x[i]);
+    }
+}
+
+// SVE exponential implementation
+void exp_sve(float *x, float *y, size_t n) {
+    float constants[4] = {ln2_lo, c0, c2, c4};
+    size_t i = 0;
+
+    svbool_t pg = svptrue_b32();
+
+    while (i < n) {
+        pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
+        svfloat32_t x_vec = svld1(pg, &x[i]);
+
+        svfloat32_t lane_consts = svld1rq(pg, constants);
+
+        /* Compute k as round(x/ln2) using shift = 1.5*2^23 + 127 */
+        svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x_vec, shift);
+        svfloat32_t k = svsub_x(pg, z, shift);
+
+        /* Compute r as x - k*ln2 with Cody and Waite */
+        svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x_vec);
+        r = svmls_lane(r, k, lane_consts, 0);
+
+        /* Compute the scaling factor 2^k */
+        svfloat32_t scale = svreinterpret_f32_u32(
+            svlsl_n_u32_m(pg, svreinterpret_u32_f32(z), 23));
+
+        /* Compute poly(r) = exp(r) - 1 */
+        svfloat32_t p12 = svmla_lane(svdup_f32(c1), r, lane_consts, 2); // c1 + c2 * r
+        svfloat32_t p34 = svmla_lane(svdup_f32(c3), r, lane_consts, 3); // c3 + c4 * r
+        svfloat32_t r2 = svmul_x(pg, r, r);
+        svfloat32_t p14 = svmla_x(pg, p12, p34, r2); // c1 + c2 * r + c3 * r^2 + c4 * r^3
+
svfloat32_t p0 = svmul_lane(r, lane_consts, 1); // c0 * r + svfloat32_t poly = svmla_x(pg, p0, r2, p14); // c0 * r + c1 * r^2 + c2 * r^3 + c3 * r^4 + c4 * r^5 + + /* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */ + svfloat32_t result = svmla_f32_x(pg, scale, poly, scale); + + svst1(pg, &y[i], result); + i += svcntw(); + } +} -/* Compute k as round(x/ln2) using shift = 1.5*2^23 + 127 */ -svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x, shift); -svfloat32_t k = svsub_x(pg, z, shift); +// Benchmark function +double benchmark(void (*func)(float*, float*, size_t), + float *input, float *output, size_t n, int iterations) { + struct timespec start, end; + + clock_gettime(CLOCK_MONOTONIC, &start); + for (int i = 0; i < iterations; i++) { + func(input, output, n); + } + clock_gettime(CLOCK_MONOTONIC, &end); + + double elapsed = (end.tv_sec - start.tv_sec) + + (end.tv_nsec - start.tv_nsec) / 1e9; + return elapsed / iterations; +} -/* Compute r as x - k*ln2 with Cody and Waite */ -svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x); - r = svmls_lane(r, k, lane_consts, 0); +// Structure to hold implementation info +typedef struct { + const char *name; + void (*func)(float*, float*, size_t); +} exp_impl_t; -/* Compute the scaling factor 2^k */ -svfloat32_t scale = svreinterpret_f32_u32(svlsl_n_u32_m(pg, svreinterpret_u32_f32(z), 23)); +int main() { + const size_t n = 1000000; + const int iterations = 100; + + // List all available implementations here + // Add new implementations to this array to automatically benchmark them + exp_impl_t implementations[] = { + {"Baseline (expf)", exp_baseline}, + {"SVE (degree-4 poly)", exp_sve}, + // Add more implementations here as you develop them + }; + int num_impls = sizeof(implementations) / sizeof(implementations[0]); + + // Allocate aligned memory + float *input = aligned_alloc(64, n * sizeof(float)); + float **outputs = malloc(num_impls * sizeof(float*)); + for (int i = 0; i < num_impls; i++) { + outputs[i] = aligned_alloc(64, 
n * sizeof(float)); + } + + // Initialize input with test values + for (size_t i = 0; i < n; i++) { + input[i] = -5.0f + (10.0f * i) / n; // Range: [-5, 5] + } + + printf("Benchmarking exponential function with %zu elements...\n", n); + printf("Running %d iterations for accuracy\n\n", iterations); + + // Benchmark all implementations + double *times = malloc(num_impls * sizeof(double)); + for (int i = 0; i < num_impls; i++) { + times[i] = benchmark(implementations[i].func, input, outputs[i], n, iterations); + } + + // Verify accuracy + printf("Sample accuracy check (first 5 values):\n"); + for (int i = 0; i < 5; i++) { + printf(" x=%.2f: ", input[i]); + for (int j = 0; j < num_impls; j++) { + if (j > 0) { + float error = fabsf(outputs[j][i] - outputs[0][i]); + printf("%s=%.6f (error=%.2e)", + implementations[j].name, outputs[j][i], error); + } else { + printf("%s=%.6f", implementations[j].name, outputs[j][i]); + } + if (j < num_impls - 1) printf(", "); + } + printf("\n"); + } + + // Display performance results + printf("\nPerformance Results:\n"); + printf("%-25s %12s %15s\n", "Implementation", "Time (sec)", "Speedup vs Baseline"); + printf("%-25s %12s %15s\n", "-------------", "-----------", "-------------------"); + for (int i = 0; i < num_impls; i++) { + double speedup = times[0] / times[i]; + printf("%-25s %12.6f %15.2fx\n", + implementations[i].name, times[i], speedup); + } + + // Cleanup + free(input); + for (int i = 0; i < num_impls; i++) { + free(outputs[i]); + } + free(outputs); + free(times); + + return 0; +} +``` + +This implementation includes the core exponential function logic with SVE intrinsics, along with benchmarking code to measure the performance difference. 
+ +## Compile and run the benchmark + +Compile the program with SVE support enabled: + +```bash +gcc -O3 -march=armv8-a+sve exp_sve.c -o exp_sve -lm +``` + +Run the benchmark to see the performance characteristics: -/* Compute poly(r) = exp(r) - 1 */ -svfloat32_t p12 = svmla_lane(svdup_f32(c1), r, lane_consts, 2); // c1 + c2 * r -svfloat32_t p34 = svmla_lane(svdup_f32(c3), r, lane_consts, 3); // c3 + c4 * r -svfloat32_t r2 = svmul_x(pg, r, r); -svfloat32_t p14 = svmla_x(pg, p12, p34, r2); // c1 + c2 * r + c3 * r^2 + c4 * r^3 -svfloat32_t p0 = svmul_lane(r, lane_consts, 1); // c0 * r -svfloat32_t poly = svmla_x(pg, p0, r2, p14); // c0 * r + c1 * r^2 + c2 * r^3 + c3 * r^4 + c4 * r^5 +```bash +./exp_sve +``` + +The output is similar to: + +```output +Benchmarking exponential function with 1000000 elements... +Running 100 iterations for accuracy + +Sample accuracy check (first 5 values): + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=-4638380449307299824564292406489382912.000000 (error=4.64e+36) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=-4638419429563256842618388430112948224.000000 (error=4.64e+36) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=-4638458409819213860672484453736513536.000000 (error=4.64e+36) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=-4638497706987820935783930851535880192.000000 (error=4.64e+36) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=-4638537004156428010895377249335246848.000000 (error=4.64e+36) -/* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */ -svfloat32_t result = svmla_f32_x(pg, scale, poly, scale); +Performance Results: +Implementation Time (sec) Speedup vs Baseline +------------- ----------- ------------------- +Baseline (expf) 0.002485 1.00x +SVE (degree-4 poly) 0.000570 4.36x ``` + +The benchmark demonstrates the performance benefit of using SVE intrinsics for vectorized exponential computation. 
You should see a noticeable speedup compared to the standard library's scalar `expf()` function, with typical speedups ranging from 1.5x to 4x depending on your system's SVE vector length and memory bandwidth.
+
+The accuracy check confirms that the polynomial approximation maintains high precision, with errors typically in the range of 10^-9 to 10^-10 for single-precision floating-point values.
+
+Continue to the next section to dive into the FEXPA intrinsic implementation, which provides a further performance uplift.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/theory.md b/content/learning-paths/servers-and-cloud-computing/fexpa/theory.md
index 935edc55b7..269aab1bd3 100644
--- a/content/learning-paths/servers-and-cloud-computing/fexpa/theory.md
+++ b/content/learning-paths/servers-and-cloud-computing/fexpa/theory.md
@@ -60,3 +60,4 @@ As previously discussed, this value can be approximated by computing a 32-bit in
 
 $$i = 2^{23} \times x⁄ln2 + 2^{23} \times bias = a \times x + b $$
 
+Continue to the next section to build a C implementation of the exponential function.
\ No newline at end of file From 85d695b990d176085843322d19905835493b2dab Mon Sep 17 00:00:00 2001 From: Annie Tallund Date: Wed, 17 Dec 2025 15:05:56 -0800 Subject: [PATCH 2/5] Fix SVE exp scaling and FEXPA path correctness - Build 2^k via integer exponent bits - Eliminate NaNs and restore numerical correctness --- .../fexpa/fexpa.md | 32 ++++++++----------- .../fexpa/implementation.md | 23 +++++++------ 2 files changed, 27 insertions(+), 28 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md index b4d84508da..2bc44168de 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md @@ -55,35 +55,29 @@ Open your `exp_sve.c` file and add the following function after the `exp_sve()` ```C // SVE exponential implementation with FEXPA (degree-2 polynomial) void exp_sve_fexpa(float *x, float *y, size_t n) { + const float shift_fexpa = 196735.0f; float constants[1] = {ln2_lo}; size_t i = 0; - + svbool_t pg = svptrue_b32(); - + while (i < n) { pg = svwhilelt_b32((uint64_t)i, (uint64_t)n); svfloat32_t x_vec = svld1(pg, &x[i]); - svfloat32_t lane_consts = svld1rq(pg, constants); - /* Compute k as round(x/ln2) using shift = 1.5*2^(23-6) + 127 */ - svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x_vec, shift); - svfloat32_t k = svsub_x(pg, z, shift); + svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x_vec, svdup_f32(shift_fexpa)); + svfloat32_t k = svsub_x(pg, z, svdup_f32(shift_fexpa)); - /* Compute r as x - k*ln2 with Cody and Waite */ svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x_vec); - r = svmls_lane(r, k, lane_consts, 0); + r = svmls_lane_f32(r, k, lane_consts, 0); - /* Compute the scaling factor 2^k using FEXPA */ svfloat32_t scale = svexpa(svreinterpret_u32_f32(z)); - /* Compute poly(r) = exp(r) - 1 (degree-2 polynomial) */ - svfloat32_t p01 = svmla_x(pg, svdup_f32(c0), 
r, svdup_f32(c1)); // c0 + c1 * r - svfloat32_t poly = svmul_x(pg, r, p01); // r * (c0 + c1 * r) + svfloat32_t p01 = svmla_x(pg, svdup_f32(c0), r, svdup_f32(c1)); + svfloat32_t poly = svmul_x(pg, r, p01); - /* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */ svfloat32_t result = svmla_f32_x(pg, scale, poly, scale); - svst1(pg, &y[i], result); i += svcntw(); } @@ -128,11 +122,11 @@ The output shows the final comparison: ```output Performance Results: -Implementation Time (sec) Speedup vs Baseline -------------- ----------- ------------------- -Baseline (expf) 0.004523 1.00x -SVE (degree-4 poly) 0.002187 4.36x -SVE+FEXPA (degree-2) 0.001453 6.09x +Implementation Time (sec) Speedup vs Baseline +------------- ----------- ------------------- +Baseline (expf) 0.002458 1.00x +SVE (degree-4 poly) 0.000640 3.84x +SVE+FEXPA (degree-2) 0.000414 5.94x ``` ## Results analysis diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md b/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md index 62d5cc70b2..f2231259a5 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md @@ -68,8 +68,13 @@ void exp_sve(float *x, float *y, size_t n) { r = svmls_lane(r, k, lane_consts, 0); /* Compute the scaling factor 2^k */ - svfloat32_t scale = svreinterpret_f32_u32( - svlsl_n_u32_m(pg, svreinterpret_u32_f32(z), 23)); + svint32_t ki = svcvt_s32_f32_x(pg, k); // float k -> int + svint32_t ei = svadd_n_s32_x(pg, ki, 127); // add exponent bias + svuint32_t ebits = svlsl_n_u32_x( + pg, + svreinterpret_u32_s32(ei), + 23); + svfloat32_t scale = svreinterpret_f32_u32(ebits); /* Compute poly(r) = exp(r) - 1 */ svfloat32_t p12 = svmla_lane(svdup_f32(c1), r, lane_consts, 2); // c1 + c2 * r @@ -205,17 +210,17 @@ Benchmarking exponential function with 1000000 elements... 
Running 100 iterations for accuracy Sample accuracy check (first 5 values): - x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=-4638380449307299824564292406489382912.000000 (error=4.64e+36) - x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=-4638419429563256842618388430112948224.000000 (error=4.64e+36) - x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=-4638458409819213860672484453736513536.000000 (error=4.64e+36) - x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=-4638497706987820935783930851535880192.000000 (error=4.64e+36) - x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=-4638537004156428010895377249335246848.000000 (error=4.64e+36) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006815 (error=7.75e-05) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006816 (error=7.75e-05) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006816 (error=7.75e-05) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006816 (error=7.75e-05) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006816 (error=7.75e-05) Performance Results: Implementation Time (sec) Speedup vs Baseline ------------- ----------- ------------------- -Baseline (expf) 0.002485 1.00x -SVE (degree-4 poly) 0.000570 4.36x +Baseline (expf) 0.002496 1.00x +SVE (degree-4 poly) 0.000633 3.94x ``` The benchmark demonstrates the performance benefit of using SVE intrinsics for vectorized exponential computation. You should see a noticeable speedup compared to the standard library's scalar `expf()` function, with typical speedups ranging from 1.5x to 4x depending on your system's SVE vector length and memory bandwidth. 
From 946d3249b7d36186662bfcd579d183349a140bf0 Mon Sep 17 00:00:00 2001
From: Annie Tallund
Date: Thu, 18 Dec 2025 16:17:03 -0800
Subject: [PATCH 3/5] Address feedback in FEXPA LP

---
 .../fexpa/_index.md | 4 ++
 .../fexpa/conclusion.md | 6 +--
 .../fexpa/fexpa.md | 33 +++++++-----
 .../fexpa/implementation.md | 51 ++++++++++---------
 4 files changed, 55 insertions(+), 39 deletions(-)

diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md b/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md
index 2daed651d0..afb2605be6 100644
--- a/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md
+++ b/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md
@@ -23,6 +23,10 @@ author:
 - Alexandre Romana
 
 further_reading:
+  - resource:
+      title: Arm Optimized Routines
+      link: https://github.com/ARM-software/optimized-routines
+      type: website
   - resource:
       title: Scalable Vector Extensions documentation
       link: https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions
diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/conclusion.md b/content/learning-paths/servers-and-cloud-computing/fexpa/conclusion.md
index 626f2ea3d1..9f62a198aa 100644
--- a/content/learning-paths/servers-and-cloud-computing/fexpa/conclusion.md
+++ b/content/learning-paths/servers-and-cloud-computing/fexpa/conclusion.md
@@ -7,12 +7,12 @@ layout: learningpathall
 ---
 ## Conclusion
-The SVE2 FEXPA instruction can speed-up the computation of the exponential function by implementing Look-Up and bit manipulation.
+The SVE FEXPA instruction can speed up the computation of exponential functions by implementing table lookup and bit manipulation.
 The exponential function is the core of the Softmax function that, with the shift toward Generative AI, has become a critical component of modern neural network architectures.
-In conclusion, SME-support for FEXPA lets you embed the expensive exponential approximation directly into the matrix computation path. That translates into:
+An implementation of the exponential function based on FEXPA can achieve a specified target precision using a polynomial of lower degree than that required by alternative implementations. Moreover, SME support for FEXPA lets you embed the exponential approximation directly into the matrix computation path, which translates into:
 - Fewer instructions (no back-and-forth to scalar/SVE code)
 - Potentially higher aggregate throughput (more exponentials per cycle)
-- Lower power & bandwidth (data being kept in SME engine)
+- Lower power & bandwidth (data being kept in the SME engine)
 - Cleaner fusion with GEMM/GEMV workloads
 All of which makes all exponential heavy workloads significantly faster on ARM CPUs.
diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md
index 2bc44168de..211f1ebd27 100644
--- a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md
+++ b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md
@@ -10,15 +10,15 @@ layout: learningpathall
 
 Arm introduced in SVE an instruction called FEXPA: the Floating Point Exponential Accelerator.
 
-Let’s segment the IEEE754 floating-point representation fraction part into several sub-fields (Index, Exp and Remaining bits) with respective length of _Idxb_, _Expb_ and _Remb_ bits.
+Let’s segment the IEEE 754 floating-point representation fraction part into several sub-fields (Index, Exp and Remaining bits) with respective lengths of _Idxb_, _Expb_ and _Remb_ bits.
-| IEEE754 precision | Idxb | Expb | Remb | +| IEEE 754 precision | Idxb | Expb | Remb | |-------------------------|------|------|------| | Half precision (FP16) | 5 | 5 | 0 | | Single precision (FP32) | 6 | 8 | 9 | | Double precision (FP64) | 6 | 11 | 35 | -The FEXPA instruction can be described for any real number x ∈ [2^(remBits + expBits) + 1; 2^(remBits + expBits) + 2^expBits - 1) as: +The FEXPA instruction can be described for any real number x ∈ [2^(Remb + Expb) + 1; 2^(Remb + Expb) + 2^Expb - 1) as: $$FEXPA(x)=2^{x-constant}$$ @@ -55,16 +55,20 @@ Open your `exp_sve.c` file and add the following function after the `exp_sve()` ```C // SVE exponential implementation with FEXPA (degree-2 polynomial) void exp_sve_fexpa(float *x, float *y, size_t n) { - const float shift_fexpa = 196735.0f; - float constants[1] = {ln2_lo}; + // FEXPA-specific coeffs + const float c0_fexpa = 1.000003695487976f; // 0x1.00003ep0 + const float c1_fexpa = 0.5000003576278687f; // 0x1.00000cp-1 + const float shift_fexpa = 196735.0f; // 1.5*2^(23-6) + 127 + + float constants[1] = { ln2_lo }; size_t i = 0; - svbool_t pg = svptrue_b32(); + const svbool_t p_all = svptrue_b32(); + svfloat32_t lane_consts = svld1rq(p_all, constants); // load once while (i < n) { - pg = svwhilelt_b32((uint64_t)i, (uint64_t)n); + const svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n); svfloat32_t x_vec = svld1(pg, &x[i]); - svfloat32_t lane_consts = svld1rq(pg, constants); svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x_vec, svdup_f32(shift_fexpa)); svfloat32_t k = svsub_x(pg, z, svdup_f32(shift_fexpa)); @@ -74,7 +78,7 @@ void exp_sve_fexpa(float *x, float *y, size_t n) { svfloat32_t scale = svexpa(svreinterpret_u32_f32(z)); - svfloat32_t p01 = svmla_x(pg, svdup_f32(c0), r, svdup_f32(c1)); + svfloat32_t p01 = svmla_x(pg, svdup_f32(c0_fexpa), r, svdup_f32(c1_fexpa)); svfloat32_t poly = svmul_x(pg, r, p01); svfloat32_t result = svmla_f32_x(pg, scale, poly, scale); @@ -84,6 +88,11 @@ void exp_sve_fexpa(float 
*x, float *y, size_t n) { } ``` +{{% notice Note %}} +This implementation can be found in [ARM Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/ba35b32/math/aarch64/sve/sv_expf_inline.h). +{{% /notice %}} + + Now register this new implementation in the `implementations` array in `main()`. Find this section: ```C @@ -124,9 +133,9 @@ The output shows the final comparison: Performance Results: Implementation Time (sec) Speedup vs Baseline ------------- ----------- ------------------- -Baseline (expf) 0.002458 1.00x -SVE (degree-4 poly) 0.000640 3.84x -SVE+FEXPA (degree-2) 0.000414 5.94x +Baseline (expf) 0.002462 1.00x +SVE (degree-4 poly) 0.000578 4.26x +SVE+FEXPA (degree-2) 0.000414 5.95x ``` ## Results analysis diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md b/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md index f2231259a5..1f529d5b42 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md @@ -28,16 +28,20 @@ Create a new file named `exp_sve.c` with the following implementation: #include #include -// Constants for exponential computation -static const float ln2_hi = 0.693359375f; -static const float ln2_lo = -2.12194440e-4f; -static const float inv_ln2 = 1.4426950216f; -static const float shift = 12582912.0f; // 1.5*2^23 + 127 -static const float c0 = 0.999999995f; -static const float c1 = 1.00000000f; -static const float c2 = 0.499991506f; -static const float c3 = 0.166676521f; -static const float c4 = 0.0416637054f; +static const float c0 = 0.9999997019767761f; // 0x1.fffff6p-1 +static const float c1 = 0.4999915063381195f; // 0x1.fffdc6p-2 +static const float c2 = 0.16667652130126953f; // 0x1.555a8p-3 +static const float c3 = 0.04189782217144966f; // 0x1.573a1ap-5 +static const float c4 = 0.008289290592074394f; // 0x1.0f9f9cp-7 + +// Range reduction constants 
+static const float ln2_val = 0.6931471824645996f; // 0x1.62e43p-1 +static const float ln2_hi = 0.693145751953125f; // 0x1.62e4p-1f +static const float ln2_lo = 1.428606765330187e-06f; // 0x1.7f7d1cp-20f +static const float inv_ln2 = 1.4426950216293335f; // 0x1.715476p+0f + +// Shift +static const float shift = 12583039.0f; // 0x1.8000fep23f // Baseline exponential using standard library void exp_baseline(float *x, float *y, size_t n) { @@ -68,13 +72,7 @@ void exp_sve(float *x, float *y, size_t n) { r = svmls_lane(r, k, lane_consts, 0); /* Compute the scaling factor 2^k */ - svint32_t ki = svcvt_s32_f32_x(pg, k); // float k -> int - svint32_t ei = svadd_n_s32_x(pg, ki, 127); // add exponent bias - svuint32_t ebits = svlsl_n_u32_x( - pg, - svreinterpret_u32_s32(ei), - 23); - svfloat32_t scale = svreinterpret_f32_u32(ebits); + svfloat32_t scale = svreinterpret_f32_u32(svlsl_n_u32_m(pg, svreinterpret_u32_f32(z), 23)); /* Compute poly(r) = exp(r) - 1 */ svfloat32_t p12 = svmla_lane(svdup_f32(c1), r, lane_consts, 2); // c1 + c2 * r @@ -187,6 +185,11 @@ int main() { } ``` +{{% notice Additional optimization %}} +You can reduce constant loads further by packing them into a single SVE register, as done in [Arm Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/1931794/pl/math/sv_expf_2u.c). +{{% /notice %}} + + This implementation includes the core exponential function logic with SVE intrinsics, along with benchmarking code to measure the performance difference. ## Compile and run the benchmark @@ -210,17 +213,17 @@ Benchmarking exponential function with 1000000 elements... 
Running 100 iterations for accuracy
 
 Sample accuracy check (first 5 values):
-  x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006815 (error=7.75e-05)
-  x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006816 (error=7.75e-05)
-  x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006816 (error=7.75e-05)
-  x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006816 (error=7.75e-05)
-  x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006816 (error=7.75e-05)
+  x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006738 (error=4.66e-10)
+  x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006738 (error=4.66e-10)
+  x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006738 (error=4.66e-10)
+  x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006738 (error=9.31e-10)
+  x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006738 (error=9.31e-10)
 
 Performance Results:
 Implementation        Time (sec)    Speedup vs Baseline
 -------------         -----------   -------------------
-Baseline (expf)       0.002496      1.00x
-SVE (degree-4 poly)   0.000633      3.94x
+Baseline (expf)       0.002640      1.00x
+SVE (degree-4 poly)   0.000576      4.58x
 ```
 
-The benchmark demonstrates the performance benefit of using SVE intrinsics for vectorized exponential computation. You should see a noticeable speedup compared to the standard library's scalar `expf()` function, with typical speedups ranging from 1.5x to 4x depending on your system's SVE vector length and memory bandwidth.
+The benchmark demonstrates the performance benefit of using SVE intrinsics for vectorized exponential computation. You should see a noticeable speedup over the standard library's scalar `expf()` function, with typical speedups of around 4x depending on your system's SVE vector length and memory bandwidth.
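The vector loop in `exp_sve` mirrors a scalar algorithm: shift-based rounding of `x/ln2`, Cody-Waite range reduction, an exponent bit trick for `2^k`, and a degree-4 polynomial. For reference, here is a scalar sketch of one lane of that computation — this file is not part of the Learning Path's sources, and the function name is hypothetical; the constants are the ones from the listing above:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Cody-Waite and shift constants, copied from exp_sve.c above. */
static const float ln2_hi  = 0.693145751953125f;      /* 0x1.62e4p-1f    */
static const float ln2_lo  = 1.428606765330187e-06f;  /* 0x1.7f7d1cp-20f */
static const float inv_ln2 = 1.4426950216293335f;     /* 0x1.715476p+0f  */
static const float shift   = 12583039.0f;             /* 0x1.8000fep23f  */

/* Scalar sketch of one lane of exp_sve; valid for x in roughly [-87, 88]. */
static float exp_scalar_sketch(float x) {
    /* Adding the large shift forces round-to-nearest of x/ln2 into the
       low mantissa bits of z; subtracting it back recovers k as a float. */
    float z = x * inv_ln2 + shift;
    float k = z - shift;                      /* k = round(x / ln2) */

    /* Cody-Waite: subtract k*ln2 in two pieces to limit rounding error. */
    float r = x - k * ln2_hi;
    r = r - k * ln2_lo;

    /* 2^k: the low bits of z hold 127 + k, so shifting them into the
       exponent field yields the scale factor directly. */
    uint32_t zbits;
    memcpy(&zbits, &z, sizeof zbits);
    uint32_t sbits = zbits << 23;
    float scale;
    memcpy(&scale, &sbits, sizeof scale);

    /* Degree-4 polynomial for exp(r) - 1 on the reduced range. */
    const float c0 = 0.9999997019767761f,  c1 = 0.4999915063381195f,
                c2 = 0.16667652130126953f, c3 = 0.04189782217144966f,
                c4 = 0.008289290592074394f;
    float poly = r * (c0 + r * (c1 + r * (c2 + r * (c3 + r * c4))));

    return scale + scale * poly;              /* scale * (1 + poly(r)) */
}
```

Each SVE intrinsic in the benchmark corresponds line-for-line to one of these scalar steps, which makes it easier to see what `svmad_x`, `svmsb_x`, and the `svlsl_n_u32_m` reinterpret trick are each doing.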
From 7924003eb560774f73eddd79c165abb41f4bb09d Mon Sep 17 00:00:00 2001 From: Annie Tallund Date: Fri, 19 Dec 2025 13:43:46 -0800 Subject: [PATCH 4/5] Address more comments in FEXPA LP --- .../servers-and-cloud-computing/fexpa/fexpa.md | 18 +++++++++++------- .../fexpa/implementation.md | 12 ++++++------ 2 files changed, 17 insertions(+), 13 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md index 211f1ebd27..7214ef2fb6 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md @@ -60,35 +60,39 @@ void exp_sve_fexpa(float *x, float *y, size_t n) { const float c1_fexpa = 0.5000003576278687f; // 0x1.00000cp-1 const float shift_fexpa = 196735.0f; // 1.5*2^(23-6) + 127 - float constants[1] = { ln2_lo }; size_t i = 0; - const svbool_t p_all = svptrue_b32(); - svfloat32_t lane_consts = svld1rq(p_all, constants); // load once + const svfloat32_t ln2lo_vec = svdup_f32(ln2_lo); while (i < n) { const svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n); svfloat32_t x_vec = svld1(pg, &x[i]); + /* Compute k as round(x/ln2) using shift = 1.5*2^(23-6) + 127 */ svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x_vec, svdup_f32(shift_fexpa)); svfloat32_t k = svsub_x(pg, z, svdup_f32(shift_fexpa)); + /* Compute r as x - k*ln2 with Cody and Waite */ svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x_vec); - r = svmls_lane_f32(r, k, lane_consts, 0); + r = svmls_lane_f32(r, k, ln2lo_vec, 0); + /* Compute the scaling factor 2^k using FEXPA */ svfloat32_t scale = svexpa(svreinterpret_u32_f32(z)); - svfloat32_t p01 = svmla_x(pg, svdup_f32(c0_fexpa), r, svdup_f32(c1_fexpa)); - svfloat32_t poly = svmul_x(pg, r, p01); + /* Compute poly(r) = exp(r) - 1 (degree-2 polynomial) */ + svfloat32_t p01 = svmla_x(pg, svdup_f32(c0_fexpa), r, svdup_f32(c1_fexpa)); // c0 + c1 * r + svfloat32_t poly = 
svmul_x(pg, r, p01); // r * (c0 + c1 * r)
 
+    /* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */
     svfloat32_t result = svmla_f32_x(pg, scale, poly, scale);
+
     svst1(pg, &y[i], result);
 
     i += svcntw();
   }
 }
 ```
 
-{{% notice Note %}}
+{{% notice Arm Optimized Routines %}}
 This implementation can be found in [ARM Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/ba35b32/math/aarch64/sve/sv_expf_inline.h).
 {{% /notice %}}
 
diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md b/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md
index 1f529d5b42..725c98c78b 100644
--- a/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md
+++ b/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md
@@ -55,10 +55,10 @@ void exp_sve(float *x, float *y, size_t n) {
     float constants[4] = {ln2_lo, c0, c2, c4};
 
     size_t i = 0;
-    svbool_t pg = svptrue_b32();
+    const svbool_t p_all = svptrue_b32();
+    const svfloat32_t lane_consts = svld1rq(p_all, constants);
 
     while (i < n) {
-        pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
+        const svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
         svfloat32_t x_vec = svld1(pg, &x[i]);
-        svfloat32_t lane_consts = svld1rq(pg, constants);
 
@@ -185,12 +185,11 @@ int main() {
 }
 ```
 
-{{% notice Additional optimization %}}
-You can reduce constant loads further by packing them into a single SVE register, as done in [Arm Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/1931794/pl/math/sv_expf_2u.c).
+{{% notice Arm Optimized Routines %}}
+You can find this implementation in [Arm Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/1931794/pl/math/sv_expf_2u.c).
 {{% /notice %}}
 
-
-This implementation includes the core exponential function logic with SVE intrinsics, along with benchmarking code to measure the performance difference.
+This implementation includes the core exponential function logic with SVE intrinsics, along with benchmarking code to measure the performance difference. By packing constant loads into a single SVE register, the code reduces memory traffic and enables efficient lane-based access to individual constants throughout the computation. ## Compile and run the benchmark From 1ccdf23e0d48c5500621204f1ebfaab53f4999d4 Mon Sep 17 00:00:00 2001 From: Annie Tallund Date: Mon, 22 Dec 2025 09:27:43 -0800 Subject: [PATCH 5/5] Small variable fix in FEXPA LP --- .../learning-paths/servers-and-cloud-computing/fexpa/fexpa.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md index 7214ef2fb6..52d7531a47 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md @@ -58,7 +58,7 @@ void exp_sve_fexpa(float *x, float *y, size_t n) { // FEXPA-specific coeffs const float c0_fexpa = 1.000003695487976f; // 0x1.00003ep0 const float c1_fexpa = 0.5000003576278687f; // 0x1.00000cp-1 - const float shift_fexpa = 196735.0f; // 1.5*2^(23-6) + 127 + const float shift_fexpa = 196735.0f; // 0x1.803f8p17f size_t i = 0;
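The comment change above only rewrites `1.5*2^(23-6) + 127` in hex-float notation; both describe the same bit pattern. A quick standalone sanity check (hypothetical, not part of the patch) confirms the equivalence and the layout FEXPA reads — with this shift, the low 6 mantissa bits of `z` hold the table index and the next 8 bits hold the biased exponent:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Verify that the two comment forms and the FEXPA bit layout agree. */
static void check_shift_fexpa(void) {
    const float shift_fexpa = 196735.0f;

    /* Same constant, three spellings. */
    assert(shift_fexpa == 0x1.803f8p17f);            /* hex-float comment  */
    assert(shift_fexpa == 1.5f * 0x1p17f + 127.0f);  /* 1.5*2^(23-6) + 127 */

    /* For x = 0, z = shift_fexpa, and FEXPA reads its low mantissa bits:
       bits [5:0] select the table entry, bits [13:6] become the exponent. */
    uint32_t zbits;
    memcpy(&zbits, &shift_fexpa, sizeof zbits);
    assert((zbits & 0x3F) == 0);           /* index j = 0, so T[j] = 1.0 */
    assert(((zbits >> 6) & 0xFF) == 127);  /* biased exponent of 2^0     */
}
```

The `+ 127` term is what pre-biases the exponent field, so `svexpa` can consume the bits of `z` directly without a separate add.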