@@ -0,0 +1,44 @@
---
title: LTO Optimization With GCC

minutes_to_complete: 10

who_is_this_for: This is an introductory topic for developers wishing to optimize code performance via link-time optimization using the GCC toolchain.

learning_objectives:
- Understand the key concepts behind LTO
- Understand how to employ the optimization in GCC
- Develop some intuition as to the potential performance gains achievable

prerequisites:
- A recent release of the GCC toolchain

author: Victor Do Nascimento, Arm

### Tags
skilllevels: Introductory
subjects: Compiler Optimization
armips:
- Neoverse
- Cortex-A
tools_software_languages:
- GCC
operatingsystems:
- Linux

further_reading:
- resource:
title: GCC Wiki
link: https://gcc.gnu.org/wiki/LinkTimeOptimization
type: website
- resource:
title: Gentoo Wiki
link: https://wiki.gentoo.org/wiki/LTO
type: website

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,52 @@
---
title: LTO Optimization With GCC

minutes_to_complete: 10

who_is_this_for: This is an introductory topic for developers wishing to optimize code performance via link-time optimization using the GCC toolchain.

learning_objectives:
- PLACEHOLDER OBJECTIVE 1
- PLACEHOLDER OBJECTIVE 2

prerequisites:
- PLACEHOLDER PREREQ 1
- PLACEHOLDER PREREQ 2

author: Victor Do Nascimento, Arm

### Tags
skilllevels: Introductory
subjects: Compiler Optimization
armips:
- PLACEHOLDER IP A
- PLACEHOLDER IP B
tools_software_languages:
- GCC
operatingsystems:
- Linux



further_reading:
- resource:
title: PLACEHOLDER MANUAL
link: PLACEHOLDER MANUAL LINK
type: documentation
- resource:
title: PLACEHOLDER BLOG
link: PLACEHOLDER BLOG LINK
type: blog
- resource:
title: GCC Wiki
link: https://gcc.gnu.org/wiki/LinkTimeOptimization
type: website



### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # The weight controls the order of the pages. _index.md always has weight 1.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
@@ -0,0 +1,29 @@
---
title: An LTO Primer
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## A Brief Introduction to Link-Time Optimization

### Optimizations and Their Scope of Operation
Different optimizations carried out by the compiler may be categorized by the scope within which they operate.

Some optimizations, such as dead code elimination, operate within the limited scope of a single function. If a variable is known not to be used within its scope, it can be discarded without any knowledge of what the program does outside that scope.

Others, however, require knowledge of the rest of the code. For example, a function known to be called with a constant as one of its parameters is likely to benefit from inter-procedural constant propagation. Any such optimization must, however, be conservative.

A function not visible outside of the file in which it is defined (its translation unit) gives the compiler enough information by default to make such decisions. For a function exposed in a dynamically-linked library, on the other hand, it is impossible to draw compile-time conclusions such as those required for inter-procedural constant propagation. Consequently, such optimizations cannot be made.

Between these two extremes lie functions defined for use throughout a program's various components, but whose use will be fully defined once the final executable is generated.

It is for these cases that link-time optimization (LTO) is able to provide executables with maximal performance gains.
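
As a minimal sketch of the visibility cases described above, consider the following C translation unit. The function names `scale_local`, `scale_shared`, and `sum_scaled` are hypothetical and chosen purely for illustration; the comments describe what a compiler is permitted to conclude, not what any particular GCC release is guaranteed to do.

```c
#include <stddef.h>

/* Internal linkage: every caller lives in this translation unit, so the
   compiler can see that `factor` is always 4 without any help from LTO. */
static int scale_local(int value, int factor)
{
    return value * factor;
}

/* External linkage: callers may exist in other translation units. Without
   LTO the compiler must keep a general version of this function, even if
   every caller in the final executable happens to pass the same constant. */
int scale_shared(int value, int factor)
{
    return value * factor;
}

/* Both callees receive the constant 4, which makes them candidates for
   inter-procedural constant propagation. */
int sum_scaled(const int *values, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++)
        total += scale_local(values[i], 4) + scale_shared(values[i], 4);
    return total;
}
```

If `scale_shared` were only ever called with `4` across the whole program, that fact would become visible once all translation units are considered together, which is the situation LTO is designed for.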

### Link-Time Optimization and Intermediate Code Representation
Typically, when a translation unit has finished compiling, GCC emits an object file: a binary object containing a largely complete section of executable code, minus potentially unresolved symbols, together with the data and metadata needed for final linking. Because the compiler has by then committed to particular instructions and discarded its intermediate representation, its ability to further optimize the code when objects are combined into the final executable is greatly reduced.

Given the loss of optimization opportunities that comes with committing to a particular assembly sequence, requesting LTO for a compilation unit causes GCC to change the format of its output so that an intermediate representation of the code is retained. GCC dumps its internal representation (GIMPLE) to disk as bytecode, so that all the compilation units that make up a single executable can later be optimized as a single module.

Once all the LTO-enabled object files have been emitted, link-time optimization can be executed. Link-time optimization is implemented as a GCC front end for a bytecode representation of GIMPLE that is emitted in special sections of `.o` files.
@@ -0,0 +1,19 @@
---
title: An LTO Primer
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## A Brief Introduction to Link-Time Optimization

Different optimizations carried out by the compiler may be categorized by the scope within which they operate.

Some optimizations, such as dead code elimination, operate within the limited scope of a single function. If a variable is known not to be used within its scope, it can be discarded without any knowledge of what the program does outside that scope.

Others, however, require knowledge of the rest of the code. For example, a function known to be called with a constant as one of its parameters is likely to benefit from interprocedural constant propagation. Any such optimization must, however, be conservative.

A function not visible outside of the file in which it is defined (its translation unit) gives the compiler enough information by default to make such decisions. For a function exposed in a library, on the other hand, it is impossible to draw conclusions at compile time, such as how it will be called, and such optimizations cannot be made. Between these two extremes lie functions defined for use throughout a program's various components, but whose use will be fully defined once the final executable is generated.

It is for these cases that link-time optimization (LTO) is able to provide executables with maximal performance gains.
@@ -0,0 +1,36 @@
---
title: Potential Gains
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Comparing Performance

The potential benefits of LTO can be illustrated by comparing the performance of the SPECint2017 benchmark suite on a Neoverse V2 CPU, compiled with and without LTO using GCC 15.1.

The geometric mean of scores across the benchmarks improved by ~3.4%, with the biggest winners being `mcf` (+11%), `deepsjeng` (+9.9%), and `leela` (+6.6%).

![SPECint LTO performance gains#center](specint_lto_improv.png "Figure 1. Performance uplift to SPECint2017")

### Code-size Considerations

As demonstrated above, the performance of many executables is improved by the optimization, but this is not the only observable gain.

As shown in Figure 2, the use of LTO can have a considerable impact on the final code size of the resulting executable.

![SPECint LTO code size reduction#center](specint_lto_size.png "Figure 2. Code size reduction to SPECint2017")

#### Potential Code Size Reduction

One example where LTO can lead to a decrease in code size is cross-translation-unit dead code elimination, made possible by the global visibility of functions and variables and their uses in an executable. Without link-time information, non-`static` functions and variables are treated conservatively and kept in the binary in case they are used at link time. With LTO, a final decision can be made and unused functions and variables can be eliminated.
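
The following C fragment is a small sketch of that situation. Both function names are hypothetical, and the example assumes a program in which no translation unit ever calls `checksum_xor`; whether the function is actually removed depends on the toolchain and the flags in use.

```c
/* Called from elsewhere in this hypothetical program, so it must be kept. */
int checksum(const unsigned char *bytes, int length)
{
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += bytes[i];
    return sum;
}

/* External linkage, but assumed to have no caller in any translation unit.
   Compiling this file on its own, the compiler cannot know that and must
   emit the function. With whole-program visibility at link time, it
   becomes a candidate for removal. */
int checksum_xor(const unsigned char *bytes, int length)
{
    int result = 0;
    for (int i = 0; i < length; i++)
        result ^= bytes[i];
    return result;
}
```

Marking `checksum_xor` as `static` would let the compiler discard it without LTO, but only at the cost of making it unavailable to other translation units.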

#### Potential Code Size Increase

While this global visibility of the code can often shrink the resulting binary, other choices deemed profitable by the compiler can lead to an increase in code size. For example:

- Knowing a loop will execute `n` times in particular instances may lead to more loop unrolling than otherwise.
- Knowing a function regularly calls another (smaller) function may cause the compiler to inline the callee into the caller's body.
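
The inlining case can be sketched by hand as follows. The names `clamp_to_byte`, `brighten`, and `brighten_inlined` are hypothetical, and `brighten_inlined` is a manual illustration of the effect rather than actual compiler output. Imagine that `clamp_to_byte` is defined in a different translation unit from its caller, so that the inlining only becomes possible with LTO.

```c
/* Small callee. Assume it lives in another translation unit; it appears
   here only so that the listing is self-contained. */
int clamp_to_byte(int value)
{
    if (value < 0)
        return 0;
    if (value > 255)
        return 255;
    return value;
}

/* Caller as written in the source: one call instruction, small body. */
int brighten(int pixel, int amount)
{
    return clamp_to_byte(pixel + amount);
}

/* Hand-written equivalent of the caller after inlining: the call overhead
   is gone, but the callee's body is now duplicated inside the caller. */
int brighten_inlined(int pixel, int amount)
{
    int value = pixel + amount;
    if (value < 0)
        return 0;
    if (value > 255)
        return 255;
    return value;
}
```

Each additional call site that is inlined in this way carries its own copy of the callee's body, which is where the growth in code size comes from.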

Both of these transformations inherently increase code size. In addition, just like the inter-procedural constant propagation mentioned earlier, a transformation may be valid and beneficial in 90% of a function's uses while compatibility must be retained with the remaining 10%. To optimize a function for its most frequent use cases, the compiler makes clones of the function it wishes to transform, so that the original form is still available for the less frequent cases. Where this is done, the resulting code duplication can further increase code size.
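
As a rough sketch of what cloning amounts to, the fragment below pairs a general function with a hand-written specialization. Both names are hypothetical, and `power_exponent_2` merely stands in for the kind of clone a compiler might generate when most callers pass `2` as the exponent; it is not taken from GCC's output.

```c
/* General form: kept for the less frequent callers that pass an
   arbitrary exponent. */
unsigned long power(unsigned long base, unsigned int exponent)
{
    unsigned long result = 1;
    for (unsigned int i = 0; i < exponent; i++)
        result *= base;
    return result;
}

/* Stand-in for a clone specialized for exponent == 2: the loop has
   disappeared, but the program now carries two versions of the function. */
unsigned long power_exponent_2(unsigned long base)
{
    return base * base;
}
```

Call sites known to pass `2` can be redirected to the specialized version, while all others continue to use `power`.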
@@ -0,0 +1,28 @@
---
title: Deploying LTO
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Deploying LTO

### A simple use-case
With GCC's default configuration for link-time optimization, using the feature is as simple as passing the `-flto` flag to `gcc` on the command line.

For the step-wise build of an executable, we'd have:
```sh
gcc -c -O2 -flto foo.c
gcc -c -O2 -flto bar.c
gcc -o myprog -flto -O2 foo.o bar.o
```
This could be simplified to a one-liner, as follows:
```sh
gcc -o myprog -flto -O2 foo.c bar.c
```

### Modifying LTO behaviour
Link-time optimization can be sped up by running it in parallel: `-flto=<nparallel>` lets us manually specify the desired number of parallel jobs, while `-flto=auto` selects automatic parallelization.

During code development, it is possible to cache the outputs of translation units inside LTO, thus significantly shortening edit-compile cycles. This can be achieved using the `-flto-incremental=<path>` flag.
@@ -0,0 +1,38 @@
---
title: Deploying LTO
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Deploying LTO

### A simple use-case
With GCC's default configuration for link-time optimization, using the feature is as simple as passing the `-flto` flag to `gcc` on the command line.

For the step-wise build of an executable, we'd have:
```sh
gcc -c -O2 -flto component-1.c
gcc -c -O2 -flto component-2.c
gcc -o myprog -flto -O2 component-1.o component-2.o
```
This could be simplified to a one-liner, as follows:
```sh
gcc -o myprog -flto -O2 component-1.c component-2.c
```

### Modifying LTO behaviour
#### Flexible object files

By default, requesting `-flto` when compiling individual object files to be linked later effectively commits us to using LTO every time the object is linked into an executable, because the resulting object files contain only GCC's internal intermediate representation of the code. Such objects are referred to as being _slim_.

This constraint can be relaxed by generating _fat_ LTO-enabled objects with the `-ffat-lto-objects` flag. This flag causes the object code that would be generated in the absence of LTO to be emitted alongside the intermediate bytecode, which can be useful for compatibility purposes.

#### Parallelization
Link-time optimization may be sped up by execution in parallel. This behavior can be controlled by augmenting the `-flto` flag with an argument.

While `-flto=auto` can be used for automatic parallelization, `-flto=<nthread>` allows us to manually specify the desired number of parallel jobs. Requesting parallelization causes the whole program to be split into multiple partitions of similar size, with the compiler trying to minimize the number of references which cross partition boundaries and which would otherwise lead to missed optimizations.

#### Caching
During code development, it is possible to cache the outputs of translation units inside LTO, thus significantly shortening edit-compile cycles. This can be achieved using the `-flto-incremental=<path>` flag.
@@ -0,0 +1,27 @@
---
title: An LTO Primer
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## A Brief Introduction to Link-Time Optimization

### Optimizations and Their Scope of Operation
Different optimizations carried out by the compiler may be categorized by the scope within which they operate.

Some optimizations, such as dead code elimination, operate within the limited scope of a single function. If a variable is known not to be used within its scope, it can be discarded without any knowledge of what the program does outside that scope.

Others, however, require knowledge of the rest of the code. For example, a function known to be called with a constant as one of its parameters is likely to benefit from inter-procedural constant propagation. Any such optimization must, however, be conservative.

A function not visible outside of the file in which it is defined (its translation unit) gives the compiler enough information by default to make such decisions. For a function exposed in a library, on the other hand, it is impossible to draw conclusions at compile time, such as how it will be called, and such optimizations cannot be made. Between these two extremes lie functions defined for use throughout a program's various components, but whose use will be fully defined once the final executable is generated.

It is for these cases that link-time optimization (LTO) is able to provide executables with maximal performance gains.

### Link-Time Optimization and Intermediate Code Representation
Typically, when a translation unit has finished compiling, GCC emits an object file: a binary object containing a largely complete section of executable code, minus potentially unresolved symbols, together with the data and metadata needed for final linking. Because the compiler has by then committed to particular instructions and discarded its intermediate representation, its ability to further optimize the code when objects are combined into the final executable is greatly reduced.

Given the loss of optimization opportunities that comes with committing to a particular assembly sequence, requesting LTO causes GCC to change the format of its output once the translation unit has been compiled. To quote the GCC Wiki, the use of LTO for the single compilation unit causes GCC to dump its internal representation (GIMPLE) to disk as bytecode, so that all the different compilation units that make up a single executable can be optimized as a single module.

Once all the LTO-enabled object files have been emitted, link-time optimization can be executed. Link-time optimization is implemented as a GCC front end for a bytecode representation of GIMPLE that is emitted in special sections of `.o` files.