Optimize gcm_gf_mult using PCLMULQDQ and PMULL by sjlombardo · Pull Request #714 · libtom/libtomcrypt

sjlombardo · 2026-04-02T18:46:21Z

The current gcm_gf_mult table optimization is great for large payloads, but it is less well-suited for small ones. For example, encrypting the same total volume of data in 4 KB chunks vs. a single 1 MB message is 3-10x slower with the table-optimized gcm_gf_mult implementation. This creates a bottleneck for programs that process many small messages with different keys.

Unfortunately, disabling LTC_GCM_TABLES introduces the reverse problem: it speeds things up considerably for small messages but degrades performance for larger ones.

Using hardware carry-less multiply solves the problem in both cases. This PR implements GCM multiplication using Intel PCLMULQDQ on x86/x86_64 and PMULL on ARM AArch64. Benchmarking shows it is consistently 3-10x faster for messages between 1 KB and 4 KB. Hardware acceleration scales better under load, performs predictably regardless of payload size, and makes key setup faster. On balance, this eliminates the performance tradeoff. If there is interest, additional benchmark details can be provided.
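For readers unfamiliar with the operation: a carry-less multiply is a shift-and-add multiply with XOR in place of addition, i.e. a polynomial product over GF(2). The portable sketch below computes the same 128-bit product that a single PCLMULQDQ or PMULL instruction produces from two 64-bit operands. The helper name clmul64_ref is made up for illustration and is not part of this PR or of LibTomCrypt.

```c
#include <stdint.h>

/*
 * Hypothetical reference helper (not LibTomCrypt code).
 * Carry-less 64x64 -> 128 bit multiply: for every set bit i of b,
 * XOR (a << i) into the 128-bit result held in hi:lo.
 */
static void clmul64_ref(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t h = 0, l = 0;
    int i;

    for (i = 0; i < 64; i++) {
        if ((b >> i) & 1u) {
            l ^= a << i;
            /* bits shifted out of the low word land in the high word;
               skip i == 0 because a shift by 64 is undefined in C */
            if (i != 0) {
                h ^= a >> (64 - i);
            }
        }
    }
    *hi = h;
    *lo = l;
}
```

As a sanity check, (x + 1) * (x + 1) = x^2 + 1 over GF(2), so clmul64_ref(3, 3, ...) yields 5 rather than the integer product 9.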

Hardware support for this approach is extensive. The required instructions for x86/x86_64 have been around for over 10 years. Similarly, AArch64 support in ARMv8+crypto is available in all reasonably modern devices including most phones, desktops, and server hardware.

The core implementation is based on well-understood and referenced papers:

From "Intel Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode" (Gueron & Kounavis): algorithms 1 and 5. The implementation follows the paper directly and is attributed in the source code comments.
From "Implementing GCM on ARMv8" (Gouvea & Lopez): algorithms 3 and 5.
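When checking an accelerated multiply, it can help to have a slow reference to compare against. The sketch below is the textbook bit-by-bit GF(2^128) multiplication in GCM bit order, as described in NIST SP 800-38D. It is not one of the algorithms from the two papers and it is not the code in this PR; the name gf128_mul_ref is hypothetical.

```c
#include <string.h>

/*
 * Hypothetical reference helper (not LibTomCrypt code).
 * z = x * y in GF(2^128) using GCM conventions: bit 0 is the most
 * significant bit of byte 0, and the reduction constant is 0xE1 in byte 0.
 * z may alias x or y because the result is built in a local accumulator.
 */
static void gf128_mul_ref(const unsigned char x[16],
                          const unsigned char y[16],
                          unsigned char z[16])
{
    unsigned char v[16], acc[16];
    int i, j, lsb;

    memcpy(v, y, 16);
    memset(acc, 0, 16);

    for (i = 0; i < 128; i++) {
        /* if bit i of x is set, add (XOR) the current multiple of y */
        if ((x[i >> 3] >> (7 - (i & 7))) & 1) {
            for (j = 0; j < 16; j++) {
                acc[j] ^= v[j];
            }
        }
        /* multiply v by the field element "x": shift the block right one bit */
        lsb = v[15] & 1;
        for (j = 15; j > 0; j--) {
            v[j] = (unsigned char)((v[j] >> 1) | ((v[j - 1] & 1) << 7));
        }
        v[0] >>= 1;
        /* a bit shifted out of position 127 is folded back in (reduction) */
        if (lsb) {
            v[0] ^= 0xE1;
        }
    }
    memcpy(z, acc, 16);
}
```

In this representation the multiplicative identity is the block 80 00 ... 00, so multiplying any block by it should return the block unchanged, which makes a convenient first test for any faster implementation.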

In order to minimize changes, the existing gcm_gf_mult is renamed to gcm_gf_mult_sw internally and a very simple gcm_gf_mult dispatcher sits in front of it. It establishes whether there is hardware support and routes to the appropriate implementation or the software fallback. There are no public API changes to LibTomCrypt.
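The dispatch pattern described above can be sketched roughly as follows. Everything here is illustrative: the sketch_ names, the cached flag, and the stub bodies are assumptions made for the example, not the code in this PR. The stubs only record which path ran; real implementations would perform the GF(2^128) multiply.

```c
/* Hypothetical names throughout; this is not the PR's code. */

static int sketch_hw_flag = 0;     /* stands in for a real CPUID/getauxval probe */
static int sketch_hw_cached = -1;  /* -1 = not probed yet, 0 = no, 1 = yes */

static int sketch_hw_available(void)
{
    return sketch_hw_flag;
}

/* Stub for the software fallback: tags the output so the path is visible. */
static void sketch_gf_mult_sw(const unsigned char *a, const unsigned char *b,
                              unsigned char *c)
{
    (void)a; (void)b;
    c[0] = 'S';
}

/* Stub for the PCLMULQDQ/PMULL path. */
static void sketch_gf_mult_hw(const unsigned char *a, const unsigned char *b,
                              unsigned char *c)
{
    (void)a; (void)b;
    c[0] = 'H';
}

/* Dispatcher: probe once, remember the answer, route every call. */
static void sketch_gf_mult(const unsigned char *a, const unsigned char *b,
                           unsigned char *c)
{
    if (sketch_hw_cached < 0) {
        sketch_hw_cached = sketch_hw_available() ? 1 : 0;
    }
    if (sketch_hw_cached) {
        sketch_gf_mult_hw(a, b, c);
    } else {
        sketch_gf_mult_sw(a, b, c);
    }
}
```

Caching the probe result keeps the per-call overhead to a single branch. The plain int cache in this sketch is not synchronized; whether that matters depends on the threading guarantees the caller needs.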

CPU detection follows the same pattern as the recent AES-NI work on develop using CPUID on x86/x86_64, getauxval on ARM Linux, and sysctlbyname on ARM Apple.
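A minimal sketch of what such a probe can look like on the three platforms mentioned, assuming GCC or Clang. The function name sketch_have_clmul is hypothetical, the fallback definition of HWCAP_PMULL and the Apple sysctl key "hw.optional.arm.FEAT_PMULL" are assumptions on my side, and none of this is copied from the PR.

```c
#include <stddef.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* x86: CPUID leaf 1, ECX bit 1 reports PCLMULQDQ. */
static int sketch_have_clmul(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return 0;
    }
    return (int)((ecx >> 1) & 1u);
}

#elif defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#ifndef HWCAP_PMULL
#define HWCAP_PMULL (1 << 4)   /* assumed value if the libc headers lack it */
#endif

/* AArch64 Linux: the kernel exposes CPU features through the aux vector. */
static int sketch_have_clmul(void)
{
    return (getauxval(AT_HWCAP) & HWCAP_PMULL) != 0;
}

#elif defined(__aarch64__) && defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>

/* AArch64 Apple: query a feature flag by name (key name is an assumption). */
static int sketch_have_clmul(void)
{
    int val = 0;
    size_t len = sizeof(val);

    if (sysctlbyname("hw.optional.arm.FEAT_PMULL", &val, &len, NULL, 0) != 0) {
        return 0;
    }
    return val != 0;
}

#else

/* Unknown platform: report no support so callers use the software path. */
static int sketch_have_clmul(void)
{
    return 0;
}

#endif
```

Returning 0 on any failure or unknown platform means the worst case is a missed optimization, never an illegal instruction.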

The new functions use __attribute__((target(...))) under GCC and Clang to avoid global -mpclmul or +crypto compiler flag changes to the makefile(s). Compile-time opt-out is possible with LTC_NO_GCM_PCLMUL or LTC_NO_GCM_PMULL. Both are disabled under LTC_NO_ASM. When none of these opt-out macros is set, LTC_GCM_TABLES is automatically disabled, since the intrinsic path is preferable.
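To illustrate the per-function target attribute, here is a small sketch for the x86 side, assuming GCC or Clang. Only sketch_clmul64_hw is compiled with PCLMULQDQ enabled, so the file builds without -mpclmul, and the wrapper only calls it after a runtime CPUID check. The sketch_ names and the SKETCH_GCM_PCLMUL guard are hypothetical; the two LTC_NO_ macros are the opt-outs named above, but the exact guard logic in the PR may differ.

```c
#include <stdint.h>

#if !defined(LTC_NO_ASM) && !defined(LTC_NO_GCM_PCLMUL) && \
    (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define SKETCH_GCM_PCLMUL 1   /* hypothetical internal guard */
#include <cpuid.h>
#include <immintrin.h>

/* Only this function is built for PCLMULQDQ; the rest of the file is not. */
__attribute__((target("pclmul,sse2")))
static void sketch_clmul64_hw(uint64_t a, uint64_t b, uint64_t out[2])
{
    __m128i va = _mm_set_epi64x(0, (long long)a);
    __m128i vb = _mm_set_epi64x(0, (long long)b);
    /* selector 0x00: multiply the low 64-bit halves of both operands */
    __m128i r  = _mm_clmulepi64_si128(va, vb, 0x00);

    _mm_storeu_si128((__m128i *)out, r);   /* out[0] = low, out[1] = high */
}
#endif

/*
 * Returns 1 and fills out[] if the hardware path was used, 0 otherwise
 * (out[] is left untouched in that case).
 */
static int sketch_clmul64(uint64_t a, uint64_t b, uint64_t out[2])
{
#ifdef SKETCH_GCM_PCLMUL
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1, ECX bit 1 = PCLMULQDQ */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && ((ecx >> 1) & 1u)) {
        sketch_clmul64_hw(a, b, out);
        return 1;
    }
#else
    (void)a; (void)b; (void)out;
#endif
    return 0;
}
```

Building with -DLTC_NO_GCM_PCLMUL or -DLTC_NO_ASM removes the intrinsic function entirely, so sketch_clmul64 always reports 0 and a caller would fall back to software.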

It wasn't necessary to add tests because the existing GCM test suite covers this adequately and there is no API change. The test suite was run and passed on Linux x86_64, Linux AArch64, Windows x86_64, Windows ARM64, macOS x86_64, and macOS AArch64.

sjaeckel

Thanks a lot for the contribution! Pretty impressive improvements.

I took the liberty of rebasing the commit, making the modifications necessary to pass ./helper.pl -a, and adding a CI job for CMake.

I ran the timing demo on some variations with the following results on x86_64:

$ make timing CFLAGS="-DLTC_NO_GCM_PCLMUL -DLTC_NO_TABLES"
[...]
$ ./timing encmac |& grep -e GCM -e Timing
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 1KB blocks):
GCM (no-precomp)               19
GCM (precomp)                  19
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 4KB blocks):
GCM (no-precomp)               19
GCM (precomp)                  18
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 32KB blocks):
GCM (no-precomp)               18
GCM (precomp)                  18
$ make timing CFLAGS="-DLTC_NO_GCM_PCLMUL"
[...]
$ ./timing encmac |& grep -e GCM -e Timing
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 1KB blocks):
GCM (no-precomp)               96
GCM (precomp)                   7
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 4KB blocks):
GCM (no-precomp)               27
GCM (precomp)                   6
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 32KB blocks):
GCM (no-precomp)                8
GCM (precomp)                   5
$ make timing
[...]
$ ./timing encmac |& grep -e GCM -e Timing
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 1KB blocks):
GCM (no-precomp)                6
GCM (precomp)                   5
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 4KB blocks):
GCM (no-precomp)                5
GCM (precomp)                   5
ENC+MAC Timings (zero byte AAD, 16 byte IV, cycles/byte on 32KB blocks):
GCM (no-precomp)                5
GCM (precomp)                   5

GCM table setup disproportionately hurts LTC performance with small messages. Disabling LTC_GCM_TABLES helps for small payloads but hurts large ones. This implements hardware carry-less multiplication for GCM that performs well for both cases using PCLMULQDQ on x86/x86_64 and PMULL on AArch64. There are no public API changes. These features can be disabled with LTC_NO_GCM_PCLMUL, LTC_NO_GCM_PMULL, or LTC_NO_ASM. LTC_GCM_TABLES is disabled automatically when no opt-out macro is set.

Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>

sjaeckel mentioned this pull request Apr 7, 2026

Add CMake & arm64 tests to CI #715

Merged

sjaeckel force-pushed the develop branch from 9faedf3 to 61ac5dc Compare April 7, 2026 14:41

sjaeckel approved these changes Apr 7, 2026

sjlombardo and others added 2 commits April 7, 2026 16:51

Add CMake CI testrun w/o ASM

93eed8d

Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>

sjaeckel force-pushed the develop branch from 61ac5dc to 93eed8d Compare April 7, 2026 14:51

sjaeckel merged commit ecc3686 into libtom:develop Apr 7, 2026
470 of 475 checks passed

sjaeckel merged 2 commits into libtom:develop from sjlombardo:develop
