diff --git a/content/learning-paths/servers-and-cloud-computing/gcc-lto/_index.md b/content/learning-paths/servers-and-cloud-computing/gcc-lto/_index.md new file mode 100644 index 0000000000..636be8b748 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gcc-lto/_index.md @@ -0,0 +1,44 @@ +--- +title: LTO Optimization With GCC + +minutes_to_complete: 10 + +who_is_this_for: This is an introductory topic for developers wishing to optimize code performance via link-time optimization using the GCC toolchain. + +learning_objectives: + - Understand the key concepts behind LTO + - Understand how to employ the optimization in GCC + - Develop some intuition as to the potential performance gains achievable + +prerequisites: + - A recent release of the GCC toolchain + +author: Victor Do Nascimento, Arm + +### Tags +skilllevels: Introductory +subjects: Compiler Optimization +armips: + - Neoverse + - Cortex-A +tools_software_languages: + - GCC +operatingsystems: + - Linux + +further_reading: + - resource: + title: GCC Wiki + link: https://gcc.gnu.org/wiki/LinkTimeOptimization + type: website + - resource: + title: Gentoo Wiki + link: https://wiki.gentoo.org/wiki/LTO + type: website + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. 
+--- diff --git a/content/learning-paths/servers-and-cloud-computing/gcc-lto/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/gcc-lto/_next-steps.md new file mode 100644 index 0000000000..727b395ddd --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gcc-lto/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # The weight controls the order of the pages. _index.md always has weight 1. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/gcc-lto/background.md b/content/learning-paths/servers-and-cloud-computing/gcc-lto/background.md new file mode 100644 index 0000000000..b23b1b094d --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gcc-lto/background.md @@ -0,0 +1,29 @@ +--- +title: An LTO Primer +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## A Brief Introduction to Link-Time Optimization + +### Optimizations and Their Scope of Operation +Different optimizations carried out by the compiler may be categorized by the scope within which they operate. + +Some optimizations, such as dead code elimination, operate within the limited scope of a single function. If a defined variable is known never to be used within its scope, it can be discarded without any knowledge of what the program does outside that scope. + +Others, however, require knowledge of the rest of the code. A function known to be called with a constant as one of its parameters is likely to benefit from inter-procedural constant propagation, for example. Any such optimization must, however, be conservative.
+ +A function not visible outside the file in which it is defined - its translation unit - gives the compiler enough information by default to make such decisions. For a function exposed in a dynamically-linked library, on the other hand, the compile-time conclusions required for inter-procedural constant propagation are impossible to draw. Consequently, such optimizations cannot be made. + +Between these two extremes lie functions defined for use throughout a program's various components, but whose uses are only fully known once the final executable is generated. + +It is for these cases that link-time optimization (LTO) is able to provide the greatest performance gains. + +### Link-Time Optimization and Intermediate Code Representation +Typically, when a translation unit finishes compiling, GCC emits an object file - a binary object containing a largely complete section of executable code, minus potentially unresolved symbols, together with the data and metadata needed for final linking. Having committed to particular instructions and thrown away the compiler's intermediate representation, GCC loses most of the ability it might have had to further optimize the code when objects are combined into the final executable. + +To avoid the loss of optimization opportunities that comes with the move to a particular assembly sequence, requesting LTO on a compilation unit causes GCC to alter its output format at the end of compilation, retaining an intermediate representation of the code: GCC dumps its internal representation (GIMPLE) to disk as bytecode, so that all the compilation units that make up a single executable can later be optimized as a single module. + +Once all the LTO-enabled object files have been emitted, link-time optimization can be executed.
Link-time optimization is implemented as a GCC front end for a bytecode representation of GIMPLE that is emitted in special sections of `.o` files.
diff --git a/content/learning-paths/servers-and-cloud-computing/gcc-lto/performance-uplift.md b/content/learning-paths/servers-and-cloud-computing/gcc-lto/performance-uplift.md new file mode 100644 index 0000000000..bf9c7a3051 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gcc-lto/performance-uplift.md @@ -0,0 +1,36 @@ +--- +title: Potential Gains +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Comparing Performance + +The potential benefits of LTO can be highlighted by comparing the performance of the SPECint2017 benchmark suite, run on a Neoverse V2 CPU and compiled with and without LTO using GCC 15.1. + +The geometric mean of scores across the individual benchmarks improves by roughly 3.4%, with the biggest winners being `gmcf` (+11%), `deepsjeng` (+9.9%), and `leela` (+6.6%). + +![SPECint LTO performance gains#center](specint_lto_improv.png "Figure 1. Performance uplift to SPECint2017") + +### Code-size Considerations + +As demonstrated above, the overall performance of many executables is greatly improved by the optimization, but this is not the only observable gain. + +As shown in Figure 2, the use of LTO can also have a considerable impact on the final code size of the resulting executable. + +![SPECint LTO code size reduction#center](specint_lto_size.png "Figure 2. Code size reduction to SPECint2017") + +#### Potential Code Size Reduction + +One example where LTO can lead to a decrease in code size is cross-translation-unit dead code elimination, made possible by global visibility of functions and variables and their uses in an executable. Without link-time information, non-`static` functions and variables are treated conservatively and kept around in the binary, in case they are used at link-time. With LTO, a final decision can be made and unused functions and variables eliminated. + +#### Potential Code Size Increase + +While this global visibility of the code can often shrink the resulting binary, other choices deemed profitable by the compiler can increase code size. For example: + +- Knowing a loop will execute `n` times in particular instances may lead to more loop unrolling than otherwise. +- Knowing a function regularly calls another (smaller) function may cause the compiler to inline the callee into the caller's body. + +These decisions inherently increase code size. Note also that, just like the inter-procedural constant propagation mentioned earlier, such transformations may be valid and beneficial in, say, 90% of a function's uses, while compatibility must be retained with the remaining 10%. To optimize a function for its highly-recurrent use cases, the compiler makes clones of the functions it wishes to transform, so that the original form remains available for the less frequent cases. Where this is done, the resulting code duplication can further increase code size.
diff --git a/content/learning-paths/servers-and-cloud-computing/gcc-lto/request-lto.md b/content/learning-paths/servers-and-cloud-computing/gcc-lto/request-lto.md new file mode 100644 index 0000000000..b273c41d32 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gcc-lto/request-lto.md @@ -0,0 +1,38 @@ +--- +title: Deploying LTO +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Deploying LTO + +### A simple use-case +To rely on GCC's default configuration for link-time optimization, using the feature is as simple as passing GCC the `-flto` flag when invoking it from the command line.
+ +For the step-wise build of an executable, we'd have: +```sh +gcc -c -O2 -flto component-1.c +gcc -c -O2 -flto component-2.c +gcc -o myprog -flto -O2 component-1.o component-2.o +``` +This can be simplified to a one-liner, as follows: +```sh +gcc -o myprog -flto -O2 component-1.c component-2.c +``` + +### Modifying LTO behavior +#### Flexible object files + +By requesting `-flto` when compiling individual object files to be linked later, we effectively commit to using LTO every time an object is linked into an executable: the resulting object files contain only GCC's internal intermediate representation of the code. Such objects are referred to as _slim_. + +This constraint can be relaxed by generating _fat_ LTO-enabled objects with the `-ffat-lto-objects` flag. This causes the final object code that would be generated in the absence of LTO to be emitted alongside the intermediate bytecode, which can be useful for compatibility purposes. + +#### Parallelization +Link-time optimization may be sped up by executing it in parallel. This behavior is controlled by augmenting the `-flto` flag with an argument. + +While `-flto=auto` can be used for automatic parallelization, `-flto=` allows us to manually specify the desired number of parallel jobs. Requesting parallelization causes the whole program to be split into multiple partitions of similar size, with the compiler trying to minimize the number of references that cross partition boundaries and would otherwise lead to missed optimizations. + +#### Caching +During code development, the outputs of translation units can be cached inside LTO, significantly shortening edit-compile cycles. This is achieved using the `-flto-incremental=` flag.
diff --git a/content/learning-paths/servers-and-cloud-computing/gcc-lto/specint_lto_improv.png b/content/learning-paths/servers-and-cloud-computing/gcc-lto/specint_lto_improv.png new file mode 100644 index 0000000000..941749db64 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/gcc-lto/specint_lto_improv.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/gcc-lto/specint_lto_size.png b/content/learning-paths/servers-and-cloud-computing/gcc-lto/specint_lto_size.png new file mode 100644 index 0000000000..c2aee7b471 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/gcc-lto/specint_lto_size.png differ