
Optimize gcm_gf_mult using PCLMULQDQ and PMULL #714

Merged
sjaeckel merged 2 commits into libtom:develop from sjlombardo:develop
Apr 7, 2026

@sjlombardo
Contributor

The current gcm_gf_mult table optimization is great for large payloads, but it is less well-suited for small ones. For example, with the table-optimized gcm_gf_mult implementation, encrypting the same total volume of data as 4 KB chunks is 3-10x slower than encrypting it as a single 1 MB message. This creates a bottleneck for programs that process many small messages with different keys.

Unfortunately, disabling LTC_GCM_TABLES introduces the reverse problem: it speeds things up considerably for small messages but degrades performance for larger ones.

Hardware carry-less multiplication solves the problem in both cases. This PR implements multiplication for GCM using Intel PCLMULQDQ on x86/x86_64 and PMULL on ARM AArch64. Benchmarking shows it is consistently 3-10x faster for messages between 1 KB and 4 KB. Hardware acceleration scales better under load, performs predictably regardless of payload size, and makes key setup faster. On balance, this eliminates the performance tradeoff. If there is interest, additional benchmark details can be provided.
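For readers unfamiliar with the operation, here is a minimal portable sketch of what a 64x64 carry-less multiply computes: a shift-and-XOR product with no carries between bit positions. The name clmul64_portable is hypothetical and this is not code from the PR. PCLMULQDQ and PMULL produce this kind of product in a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only (not part of LibTomCrypt).
 * Carry-less multiply of two 64-bit polynomials over GF(2).
 * out[0] receives the low 64 bits of the product, out[1] the high 64 bits. */
void clmul64_portable(uint64_t a, uint64_t b, uint64_t out[2])
{
    uint64_t lo = 0, hi = 0, mask;
    int i;

    assert(out != NULL);
    for (i = 0; i < 64; i++) {
        /* all-ones if bit i of b is set, all-zeros otherwise */
        mask = (uint64_t)0 - ((b >> i) & 1u);
        lo ^= (a << i) & mask;
        /* (a >> 1) >> (63 - i) equals a >> (64 - i) but stays defined for i == 0 */
        hi ^= ((a >> 1) >> (63 - i)) & mask;
    }
    out[0] = lo;
    out[1] = hi;
}
```

Because addition is XOR, squaring simply spreads the bits apart: multiplying 0xFF by itself yields 0x5555.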

Hardware support for this approach is extensive. The required instructions for x86/x86_64 have been around for over 10 years. Similarly, AArch64 support in ARMv8+crypto is available in all reasonably modern devices including most phones, desktops, and server hardware.

The core implementation is based on well-understood and referenced papers:

In order to minimize changes, the existing gcm_gf_mult is renamed to gcm_gf_mult_sw internally and a very simple gcm_gf_mult dispatcher sits in front of it. It establishes whether there is hardware support and routes to the appropriate implementation or the software fallback. There are no public API changes to LibTomCrypt.
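The dispatcher arrangement described above might look roughly like the following sketch. All names here (sketch_gf_mult, sketch_gf_mult_sw, sketch_gf_mult_hw, sketch_hw_available) are hypothetical stand-ins, the hardware routine and the detection function are placeholders, and the software path is a plain bitwise GF(2^128) multiply in GCM bit order rather than LibTomCrypt's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not LibTomCrypt's code. */

/* Stand-in for the software path: bitwise GF(2^128) multiply in GCM bit
 * order (bit 0 is the most significant bit of byte 0). Not constant time. */
void sketch_gf_mult_sw(const unsigned char *a, const unsigned char *b,
                       unsigned char *c)
{
    unsigned char Z[16], V[16];
    int i, j, lsb;

    memset(Z, 0, sizeof(Z));
    memcpy(V, a, sizeof(V));
    for (i = 0; i < 128; i++) {
        /* accumulate V when bit i of b is set */
        if (b[i >> 3] & (0x80 >> (i & 7))) {
            for (j = 0; j < 16; j++) Z[j] ^= V[j];
        }
        /* V = V * x: shift right by one bit, then reduce with 0xE1 || 0^120 */
        lsb = V[15] & 1;
        for (j = 15; j > 0; j--) {
            V[j] = (unsigned char)((V[j] >> 1) | (V[j - 1] << 7));
        }
        V[0] >>= 1;
        if (lsb) V[0] ^= 0xE1;
    }
    memcpy(c, Z, sizeof(Z));
}

/* Placeholder: a real build would run the PCLMULQDQ or PMULL routine here. */
void sketch_gf_mult_hw(const unsigned char *a, const unsigned char *b,
                       unsigned char *c)
{
    sketch_gf_mult_sw(a, b, c);
}

/* Placeholder for runtime detection; always reports "no hardware support". */
int sketch_hw_available(void)
{
    return 0;
}

/* The dispatcher: same signature as the routines it fronts, so callers
 * do not change. */
void sketch_gf_mult(const unsigned char *a, const unsigned char *b,
                    unsigned char *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    if (sketch_hw_available()) {
        sketch_gf_mult_hw(a, b, c);
    } else {
        sketch_gf_mult_sw(a, b, c);
    }
}
```

Keeping the dispatcher's signature identical to the routines behind it is what lets the public API stay unchanged. In this sketch detection runs on every call; a real implementation would likely cache the result.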

CPU detection follows the same pattern as the recent AES-NI work on develop using CPUID on x86/x86_64, getauxval on ARM Linux, and sysctlbyname on ARM Apple.
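As a rough illustration of that detection pattern, the sketch below checks for the carry-less multiply feature on each of the three platforms. The function name sketch_has_clmul is hypothetical, and the Apple sysctl key hw.optional.arm.FEAT_PMULL is an assumption about which key would be queried; the PR's actual code may differ.

```c
#include <assert.h>
#include <stddef.h>

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
  #include <cpuid.h>
  #define SKETCH_DETECT_X86 1
#elif defined(__aarch64__) && defined(__linux__)
  #include <sys/auxv.h>
  #include <asm/hwcap.h>
  #define SKETCH_DETECT_ARM_LINUX 1
#elif defined(__aarch64__) && defined(__APPLE__)
  #include <sys/types.h>
  #include <sys/sysctl.h>
  #define SKETCH_DETECT_ARM_APPLE 1
#endif

/* Hypothetical helper: returns 1 if the CPU reports carry-less multiply
 * support (PCLMULQDQ on x86, PMULL on AArch64), 0 otherwise. */
int sketch_has_clmul(void)
{
#if defined(SKETCH_DETECT_X86)
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid returns 0 if the requested leaf is not supported */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return 0;
    }
    return (int)((ecx >> 1) & 1u);   /* CPUID.01H:ECX bit 1 = PCLMULQDQ */
#elif defined(SKETCH_DETECT_ARM_LINUX)
    return (getauxval(AT_HWCAP) & HWCAP_PMULL) != 0;
#elif defined(SKETCH_DETECT_ARM_APPLE)
    int val = 0;
    size_t len = sizeof(val);
    /* assumed key name; treat a failed lookup as "not supported" */
    if (sysctlbyname("hw.optional.arm.FEAT_PMULL", &val, &len, NULL, 0) != 0) {
        return 0;
    }
    return val != 0;
#else
    return 0;   /* unknown platform: stay on the software path */
#endif
}
```

Any failure to query the feature is treated as "not supported", so the software fallback is always the safe default.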

The new functions use __attribute__((target(...))) under GCC and Clang to avoid global -mpclmul or +crypto compiler flag changes to the makefile(s). Compile-time opt-out is possible with LTC_NO_GCM_PCLMUL or LTC_NO_GCM_PMULL. Both are disabled under LTC_NO_ASM. If neither of the previous macros is set then LTC_GCM_TABLES is automatically disabled since the intrinsic path is preferable.
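To illustrate the per-function target attribute, here is a minimal sketch for the x86 case under GCC or Clang. The names clmul64_pclmul, clmul64_try, and the opt-out macro SKETCH_NO_PCLMUL are hypothetical stand-ins (the PR's own macro is LTC_NO_GCM_PCLMUL), and the runtime check uses the compiler builtin __builtin_cpu_supports rather than the PR's detection code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__)) && !defined(SKETCH_NO_PCLMUL)
#include <immintrin.h>
#define SKETCH_PCLMUL 1

/* Only this function is compiled with PCLMUL enabled, so the rest of the
 * translation unit needs no -mpclmul flag. */
__attribute__((target("pclmul,sse2")))
static void clmul64_pclmul(uint64_t a, uint64_t b, uint64_t out[2])
{
    __m128i va = _mm_set_epi64x(0, (long long)a);
    __m128i vb = _mm_set_epi64x(0, (long long)b);
    /* immediate 0x00 selects the low 64-bit half of each operand */
    __m128i p = _mm_clmulepi64_si128(va, vb, 0x00);
    _mm_storeu_si128((__m128i *)out, p);   /* out[0] = low, out[1] = high */
}
#endif

/* Hypothetical wrapper: returns 1 and fills out[] if the hardware path ran,
 * 0 if it is compiled out or the CPU lacks the instruction. */
int clmul64_try(uint64_t a, uint64_t b, uint64_t out[2])
{
    assert(out != NULL);
#ifdef SKETCH_PCLMUL
    if (__builtin_cpu_supports("pclmul")) {
        clmul64_pclmul(a, b, out);
        return 1;
    }
#endif
    (void)a;
    (void)b;
    return 0;
}
```

The function carrying the attribute must only be called after a runtime check succeeds, since the compiler is free to emit PCLMUL instructions anywhere inside it.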

It wasn't necessary to add tests because the existing GCM test suite covers this adequately and there is no API change. The test suite was run and passed on Linux x86_64, Linux AArch64, Windows x86_64, Windows ARM64, macOS x86_64, and macOS AArch64.

Member

@sjaeckel sjaeckel left a comment


Thanks a lot for the contribution! Pretty impressive improvements.

I took the liberty of rebasing the commit and making the modifications necessary to pass ./helper.pl -a, and I added a CI job for CMake.

I ran the timing demo on some variations with the following results on x86_64:

$ make timing CFLAGS="-DLTC_NO_GCM_PCLMUL -DLTC_NO_TABLES"
[...]
$ ./timing encmac |& grep -e GCM -e Timing
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 1KB blocks):
GCM (no-precomp)               19
GCM (precomp)                  19
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 4KB blocks):
GCM (no-precomp)               19
GCM (precomp)                  18
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 32KB blocks):
GCM (no-precomp)               18
GCM (precomp)                  18
$ make timing CFLAGS="-DLTC_NO_GCM_PCLMUL"
[...]
$ ./timing encmac |& grep -e GCM -e Timing
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 1KB blocks):
GCM (no-precomp)               96
GCM (precomp)                   7
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 4KB blocks):
GCM (no-precomp)               27
GCM (precomp)                   6
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 32KB blocks):
GCM (no-precomp)                8
GCM (precomp)                   5
$ make timing
[...]
$ ./timing encmac |& grep -e GCM -e Timing
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 1KB blocks):
GCM (no-precomp)                6
GCM (precomp)                   5
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 4KB blocks):
GCM (no-precomp)                5
GCM (precomp)                   5
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 32KB blocks):
GCM (no-precomp)                5
GCM (precomp)                   5

sjlombardo and others added 2 commits April 7, 2026 16:51
GCM table setup disproportionately hurts LTC performance
with small messages. Disabling LTC_GCM_TABLES helps for
small payloads but hurts large ones.

This implements hardware carry-less multiplication for GCM that
performs well for both cases using PCLMULQDQ on x86/x86_64 and
PMULL on AArch64.

There are no public API changes.

These features can be disabled with LTC_NO_GCM_PCLMUL,
LTC_NO_GCM_PMULL, or LTC_NO_ASM. LTC_GCM_TABLES is disabled
automatically when no opt-out macro is set.

Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>
@sjaeckel sjaeckel merged commit ecc3686 into libtom:develop Apr 7, 2026
470 of 475 checks passed