diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md b/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md index 4f64d1d84..afb2605be 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md @@ -10,11 +10,11 @@ minutes_to_complete: 15 who_is_this_for: This is an introductory topic for developers interested in implementing the exponential function and optimizing it. The Scalable Vector Extension (SVE), introduced with the Armv8-A architecture, includes a dedicated instruction, FEXPA. Although initially not supported in SME, the FEXPA instruction has been made available in Scalable Matrix Extension (SME) version 2.2. learning_objectives: - - Implementing with SVE intrinsics the exponential function - - Optimizing it with FEXPA + - Implement the exponential function using SVE intrinsics + - Optimize the function with FEXPA prerequisites: - - An AArch64 computer running Linux or macOS. You can use cloud instances, refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/) for a list of cloud service providers. + - Access to an [AWS Graviton4, Google Axion, or Azure Cobalt 100 virtual machine from a cloud service provider](/learning-paths/servers-and-cloud-computing/csp/). - Some familiarity with SIMD programming and SVE intrinsics. 
author: @@ -23,6 +23,10 @@ author: - Alexandre Romana further_reading: + - resource: + title: Arm Optimized Routines + link: https://github.com/ARM-software/optimized-routines + type: website - resource: title: Scalable Vector Extensions documentation link: https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/conclusion.md b/content/learning-paths/servers-and-cloud-computing/fexpa/conclusion.md index 626f2ea3d..9f62a198a 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/conclusion.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/conclusion.md @@ -7,12 +7,12 @@ layout: learningpathall --- ## Conclusion -The SVE2 FEXPA instruction can speed-up the computation of the exponential function by implementing Look-Up and bit manipulation. +The SVE FEXPA instruction can speed up the computation of the exponential function by combining table lookup and bit manipulation. The exponential function is the core of the Softmax function that, with the shift toward Generative AI, has become a critical component of modern neural network architectures. -In conclusion, SME-support for FEXPA lets you embed the expensive exponential approximation directly into the matrix computation path. That translates into: +An implementation of the exponential function based on FEXPA can achieve a specified target precision using a polynomial of lower degree than that required by alternative implementations.
Moreover, SME support for FEXPA lets you embed the exponential approximation directly into the matrix computation path, which translates into: - Fewer instructions (no back-and-forth to scalar/SVE code) - Potentially higher aggregate throughput (more exponentials per cycle) -- Lower power & bandwidth (data being kept in SME engine) +- Lower power & bandwidth (data being kept in the SME engine) - Cleaner fusion with GEMM/GEMV workloads All of which makes all exponential heavy workloads significantly faster on ARM CPUs. diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md index 80bb34b75..52d7531a4 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md @@ -10,9 +10,9 @@ layout: learningpathall Arm introduced in SVE an instruction called FEXPA: the Floating Point Exponential Accelerator. -Let’s segment the IEEE754 floating-point representation fraction part into several sub-fields (Index, Exp and Remaining bits) with respective length of Idxb, Expb and Remb bits. +Let’s segment the fraction part of the IEEE 754 floating-point representation into several sub-fields (Index, Exp and Remaining bits) with respective lengths of _Idxb_, _Expb_ and _Remb_ bits. -| IEEE754 precision | Idxb | Expb | Remb | +| IEEE 754 precision | Idxb | Expb | Remb | |-------------------------|------|------|------| | Half precision (FP16) | 5 | 5 | 0 | | Single precision (FP32) | 6 | 8 | 9 | @@ -45,26 +45,112 @@ $$ e^x = 2^m \times T[j] \times (1 + p(r)) $$ With a table of size 2^L, the evaluation interval for the approximation polynomial is narrowed by a factor of 2^L. This reduction leads to improved accuracy for a given polynomial degree due to the narrower approximation range. Alternatively, for a given accuracy target, the degree of the polynomial—and hence its computational complexity—can be reduced.
## Exponential implementation with FEXPA -FEXPA can be used to rapidly perform the table lookup. With this instruction a degree-2 polynomial is sufficient to obtain the same accuracy of the implementation we have seen before: + +FEXPA can be used to rapidly perform the table lookup. With this instruction a degree-2 polynomial is sufficient to obtain the same accuracy as the degree-4 polynomial implementation from the previous section. + +### Add the FEXPA implementation + +Open your `exp_sve.c` file and add the following function after the `exp_sve()` function: + +```C +// SVE exponential implementation with FEXPA (degree-2 polynomial) +void exp_sve_fexpa(float *x, float *y, size_t n) { + // FEXPA-specific coeffs + const float c0_fexpa = 1.000003695487976f; // 0x1.00003ep0 + const float c1_fexpa = 0.5000003576278687f; // 0x1.00000cp-1 + const float shift_fexpa = 196735.0f; // 0x1.803f8p17f + + size_t i = 0; + + const svfloat32_t ln2lo_vec = svdup_f32(ln2_lo); + + while (i < n) { + const svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n); + svfloat32_t x_vec = svld1(pg, &x[i]); + + /* Compute k as round(x/ln2) using shift = 1.5*2^(23-6) + 127 */ + svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x_vec, svdup_f32(shift_fexpa)); + svfloat32_t k = svsub_x(pg, z, svdup_f32(shift_fexpa)); + + /* Compute r as x - k*ln2 with Cody and Waite */ + svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x_vec); + r = svmls_lane_f32(r, k, ln2lo_vec, 0); + + /* Compute the scaling factor 2^k using FEXPA */ + svfloat32_t scale = svexpa(svreinterpret_u32_f32(z)); + + /* Compute poly(r) = exp(r) - 1 (degree-2 polynomial) */ + svfloat32_t p01 = svmla_x(pg, svdup_f32(c0_fexpa), r, svdup_f32(c1_fexpa)); // c0 + c1 * r + svfloat32_t poly = svmul_x(pg, r, p01); // r * (c0 + c1 * r) + + /* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */ + svfloat32_t result = svmla_f32_x(pg, scale, poly, scale); + + svst1(pg, &y[i], result); + i += svcntw(); + } +} +``` + +{{% notice Arm Optimized 
Routines %}} +This implementation can be found in [Arm Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/ba35b32/math/aarch64/sve/sv_expf_inline.h). +{{% /notice %}} + + +Now register this new implementation in the `implementations` array in `main()`. Find this section: + +```C + exp_impl_t implementations[] = { + {"Baseline (expf)", exp_baseline}, + {"SVE (degree-4 poly)", exp_sve}, + // Add more implementations here as you develop them + }; +``` + +Add your FEXPA implementation to the array: ```C -svfloat32_t lane_consts = svld1rq(pg, ln2_lo); // Load only ln2_lo + exp_impl_t implementations[] = { + {"Baseline (expf)", exp_baseline}, + {"SVE (degree-4 poly)", exp_sve}, + {"SVE+FEXPA (degree-2)", exp_sve_fexpa}, + }; +``` + +## Compile and compare -/* Compute k as round(x/ln2) using shift = 1.5*2^(23-6) + 127 */ -svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x, shift); -svfloat32_t k = svsub_x(pg, z, shift); +Recompile the program: -/* Compute r as x - k*ln2 with Cody and Waite */ -svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x); - r = svmls_lane(r, k, lane_consts, 0); +```bash +gcc -O3 -march=armv8-a+sve exp_sve.c -o exp_sve -lm +``` + +Run the benchmark: -/* Compute the scaling factor 2^k */ -svfloat32_t scale = svexpa(svreinterpret_u32(z)); +```bash +./exp_sve +``` -/* Compute poly(r) = exp(r) - 1 (2nd degree polynomial) */ -svfloat32_t p01 = svmla_x (pg, svdup_f32(c0), r, svdup_f32(c1)); // c0 + c1 * r -svfloat32_t poly = svmul_x (pg, r, p01); // r c0 + c1 * r^2 +The output shows the final comparison: -/* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */ -svfloat32_t result = svmla_f32_x(pg, scale, poly, scale); +```output +Performance Results: +Implementation Time (sec) Speedup vs Baseline +------------- ----------- ------------------- +Baseline (expf) 0.002462 1.00x +SVE (degree-4 poly) 0.000578 4.26x +SVE+FEXPA (degree-2) 0.000414 5.95x ``` + +## Results analysis + +The benchmark shows the performance progression: + +1. 
**SVE with degree-4 polynomial**: Provides around a 4x speedup through vectorization +2. **SVE with FEXPA and degree-2 polynomial**: Achieves a further ~1.4x improvement over the degree-4 version + +The FEXPA instruction delivers this improvement by: +- Replacing manual bit manipulation with a single hardware instruction (`svexpa()`) +- Enabling a simpler polynomial (degree-2 instead of degree-4) while maintaining accuracy + +Both SVE implementations maintain comparable accuracy (errors in the 10^-9 to 10^-10 range), demonstrating that specialized hardware instructions can significantly improve performance without sacrificing precision. diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md b/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md index 016a4705b..725c98c78 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md @@ -6,31 +6,228 @@ weight: 3 layout: learningpathall --- -## First implementation -Given what we said in the previous chapters, the exponential function can be implemented with SVE intrinsics in the following way: +## Implement the exponential function + +Based on the theory covered in the previous section, you can implement the exponential function using SVE intrinsics with polynomial approximation. This Learning Path was tested on an AWS Graviton4 `r8g.medium` instance. + +## Set up your environment + +To run the example, you will need `gcc`.
+ +```bash +sudo apt update +sudo apt -y install gcc +``` + +Create a new file named `exp_sve.c` with the following implementation: ```C -svfloat32_t lane_consts = svld1rq(pg, constants); // Load ln2_lo, c0, c2, c4 in register +#include <stdio.h> +#include <stdlib.h> +#include <math.h> +#include <time.h> +#include <arm_sve.h> + +static const float c0 = 0.9999997019767761f; // 0x1.fffff6p-1 +static const float c1 = 0.4999915063381195f; // 0x1.fffdc6p-2 +static const float c2 = 0.16667652130126953f; // 0x1.555a8p-3 +static const float c3 = 0.04189782217144966f; // 0x1.573a1ap-5 +static const float c4 = 0.008289290592074394f; // 0x1.0f9f9cp-7 + +// Range reduction constants +static const float ln2_val = 0.6931471824645996f; // 0x1.62e43p-1 +static const float ln2_hi = 0.693145751953125f; // 0x1.62e4p-1f +static const float ln2_lo = 1.428606765330187e-06f; // 0x1.7f7d1cp-20f +static const float inv_ln2 = 1.4426950216293335f; // 0x1.715476p+0f -/* Compute k as round(x/ln2) using shift = 1.5*2^23 + 127 */ -svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x, shift); -svfloat32_t k = svsub_x(pg, z, shift); +// Shift +static const float shift = 12583039.0f; // 0x1.8000fep23f -/* Compute r as x - k*ln2 with Cody and Waite */ -svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x); - r = svmls_lane(r, k, lane_consts, 0); +// Baseline exponential using standard library +void exp_baseline(float *x, float *y, size_t n) { + for (size_t i = 0; i < n; i++) { + y[i] = expf(x[i]); + } +} -/* Compute the scaling factor 2^k */ -svfloat32_t scale = svreinterpret_f32_u32(svlsl_n_u32_m(pg, svreinterpret_u32_f32(z), 23)); +// SVE exponential implementation +void exp_sve(float *x, float *y, size_t n) { + float constants[4] = {ln2_lo, c0, c2, c4}; + size_t i = 0; + + const svbool_t p_all = svptrue_b32(); + const svfloat32_t lane_consts = svld1rq(p_all, constants); + + while (i < n) { + const svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n); + svfloat32_t x_vec = svld1(pg, &x[i]); + + /* lane_consts, loaded once above, holds {ln2_lo, c0, c2, c4} */ -/* 
Compute poly(r) = exp(r) - 1 */ -svfloat32_t p12 = svmla_lane(svdup_f32(c1), r, lane_consts, 2); // c1 + c2 * r -svfloat32_t p34 = svmla_lane(svdup_f32(c3), r, lane_consts, 3); // c3 + c4 * r -svfloat32_t r2 = svmul_x(pg, r, r); -svfloat32_t p14 = svmla_x(pg, p12, p34, r2); // c1 + c2 * r + c3 * r^2 + c4 * r^3 -svfloat32_t p0 = svmul_lane(r, lane_consts, 1); // c0 * r -svfloat32_t poly = svmla_x(pg, p0, r2, p14); // c0 * r + c1 * r^2 + c2 * r^3 + c3 * r^4 + c4 * r^5 + /* Compute k as round(x/ln2) using shift = 1.5*2^23 + 127 */ + svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x_vec, shift); + svfloat32_t k = svsub_x(pg, z, shift); -/* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */ -svfloat32_t result = svmla_f32_x(pg, scale, poly, scale); + /* Compute r as x - k*ln2 with Cody and Waite */ + svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x_vec); + r = svmls_lane(r, k, lane_consts, 0); + + /* Compute the scaling factor 2^k */ + svfloat32_t scale = svreinterpret_f32_u32(svlsl_n_u32_m(pg, svreinterpret_u32_f32(z), 23)); + + /* Compute poly(r) = exp(r) - 1 */ + svfloat32_t p12 = svmla_lane(svdup_f32(c1), r, lane_consts, 2); // c1 + c2 * r + svfloat32_t p34 = svmla_lane(svdup_f32(c3), r, lane_consts, 3); // c3 + c4 * r + svfloat32_t r2 = svmul_x(pg, r, r); + svfloat32_t p14 = svmla_x(pg, p12, p34, r2); // c1 + c2 * r + c3 * r^2 + c4 * r^3 + svfloat32_t p0 = svmul_lane(r, lane_consts, 1); // c0 * r + svfloat32_t poly = svmla_x(pg, p0, r2, p14); // c0 * r + c1 * r^2 + c2 * r^3 + c3 * r^4 + c4 * r^5 + + /* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */ + svfloat32_t result = svmla_f32_x(pg, scale, poly, scale); + + svst1(pg, &y[i], result); + i += svcntw(); + } +} + +// Benchmark function +double benchmark(void (*func)(float*, float*, size_t), + float *input, float *output, size_t n, int iterations) { + struct timespec start, end; + + clock_gettime(CLOCK_MONOTONIC, &start); + for (int i = 0; i < iterations; i++) { + func(input, output, n); + } + 
clock_gettime(CLOCK_MONOTONIC, &end); + + double elapsed = (end.tv_sec - start.tv_sec) + + (end.tv_nsec - start.tv_nsec) / 1e9; + return elapsed / iterations; +} + +// Structure to hold implementation info +typedef struct { + const char *name; + void (*func)(float*, float*, size_t); +} exp_impl_t; + +int main() { + const size_t n = 1000000; + const int iterations = 100; + + // List all available implementations here + // Add new implementations to this array to automatically benchmark them + exp_impl_t implementations[] = { + {"Baseline (expf)", exp_baseline}, + {"SVE (degree-4 poly)", exp_sve}, + // Add more implementations here as you develop them + }; + int num_impls = sizeof(implementations) / sizeof(implementations[0]); + + // Allocate aligned memory + float *input = aligned_alloc(64, n * sizeof(float)); + float **outputs = malloc(num_impls * sizeof(float*)); + for (int i = 0; i < num_impls; i++) { + outputs[i] = aligned_alloc(64, n * sizeof(float)); + } + + // Initialize input with test values + for (size_t i = 0; i < n; i++) { + input[i] = -5.0f + (10.0f * i) / n; // Range: [-5, 5] + } + + printf("Benchmarking exponential function with %zu elements...\n", n); + printf("Running %d iterations for accuracy\n\n", iterations); + + // Benchmark all implementations + double *times = malloc(num_impls * sizeof(double)); + for (int i = 0; i < num_impls; i++) { + times[i] = benchmark(implementations[i].func, input, outputs[i], n, iterations); + } + + // Verify accuracy + printf("Sample accuracy check (first 5 values):\n"); + for (int i = 0; i < 5; i++) { + printf(" x=%.2f: ", input[i]); + for (int j = 0; j < num_impls; j++) { + if (j > 0) { + float error = fabsf(outputs[j][i] - outputs[0][i]); + printf("%s=%.6f (error=%.2e)", + implementations[j].name, outputs[j][i], error); + } else { + printf("%s=%.6f", implementations[j].name, outputs[j][i]); + } + if (j < num_impls - 1) printf(", "); + } + printf("\n"); + } + + // Display performance results + printf("\nPerformance 
Results:\n"); + printf("%-25s %12s %15s\n", "Implementation", "Time (sec)", "Speedup vs Baseline"); + printf("%-25s %12s %15s\n", "-------------", "-----------", "-------------------"); + for (int i = 0; i < num_impls; i++) { + double speedup = times[0] / times[i]; + printf("%-25s %12.6f %15.2fx\n", + implementations[i].name, times[i], speedup); + } + + // Cleanup + free(input); + for (int i = 0; i < num_impls; i++) { + free(outputs[i]); + } + free(outputs); + free(times); + + return 0; +} ``` + +{{% notice Arm Optimized Routines %}} +You can find this implementation in [Arm Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/1931794/pl/math/sv_expf_2u.c). +{{% /notice %}} + +This implementation includes the core exponential function logic with SVE intrinsics, along with benchmarking code to measure the performance difference. By packing constant loads into a single SVE register, the code reduces memory traffic and enables efficient lane-based access to individual constants throughout the computation. + +## Compile and run the benchmark + +Compile the program with SVE support enabled: + +```bash +gcc -O3 -march=armv8-a+sve exp_sve.c -o exp_sve -lm +``` + +Run the benchmark to see the performance characteristics: + +```bash +./exp_sve +``` + +The output is similar to: + +```output +Benchmarking exponential function with 1000000 elements... 
+Running 100 iterations for accuracy + +Sample accuracy check (first 5 values): + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006738 (error=4.66e-10) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006738 (error=4.66e-10) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006738 (error=4.66e-10) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006738 (error=9.31e-10) + x=-5.00: Baseline (expf)=0.006738, SVE (degree-4 poly)=0.006738 (error=9.31e-10) + +Performance Results: +Implementation Time (sec) Speedup vs Baseline +------------- ----------- ------------------- +Baseline (expf) 0.002640 1.00x +SVE (degree-4 poly) 0.000576 4.58x +``` + +The benchmark demonstrates the performance benefit of using SVE intrinsics for vectorized exponential computation. You should see a noticeable speedup compared to the standard library's scalar `expf()` function, with a speedup of around 4.5x on the tested Graviton4 instance; the exact figure depends on your system's SVE vector length and memory bandwidth. + +The accuracy check confirms that the polynomial approximation maintains high precision, with errors typically in the range of 10^-9 to 10^-10 for single-precision floating-point values. + +Continue to the next section to dive into the FEXPA intrinsic implementation, which provides a further performance uplift. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/theory.md b/content/learning-paths/servers-and-cloud-computing/fexpa/theory.md index 935edc55b..269aab1bd 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/theory.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/theory.md @@ -60,3 +60,4 @@ As previously discussed, this value can be approximated by computing a 32-bit in $$i = 2^{23} \times x⁄ln2 + 2^{23} \times bias = a \times x + b $$ +Continue to the next section to build a C implementation of the exponential function. \ No newline at end of file