perf: allocate Normalize buffers dynamically, store in AT #803
jodavies wants to merge 1 commit into form-dev:master
Conversation
The problem with NORMSIZE was that in the past, when it still had a much smaller value, running out of these arrays resulted in extremely hard-to-spot bugs. I wanted these arrays on the stack, because that way the compiler can hang it all on a single address register, the frame pointer. That saves speed. This was before the routine became recursive, but also for recursive use, which is not too frequent, this should not make much of a difference.
But because I wanted to hang it all on the frame pointer, these arrays had to be of fixed size.
If you allocate them dynamically by making it a setup parameter, you will need more address registers during execution, and there is a bigger need for address calculations.
This consideration of address registers is also the reason for the global struct A. There, too, everything goes by offset from the address of the whole structure. You may argue that there are more address registers, but the fewer of them are already ‘fixed’, the better the compiler can optimise. In the past that made a few percent difference in speed.
On 26 Feb 2026, at 17:13, jodavies left a comment (form-dev/form#803):
The reason for the performance improvement is due to "stack clash protection". Compiling with -fno-stack-clash-protection also gives the same performance improvement. (Adding additionally -fno-stack-protector is maybe an additional percent or two).
The way it is implemented is to touch each 4k page in the stack. This leads to a lot of L1 cache misses -- my CPU has an 8-way associative cache. In the VLA case, stack clash protection is also active, however it touches each 4k page of each array in separate loops, compared to the whole 90k stack in a single loop. Since NORMSIZE is 1000, the base addresses of the arrays are not separated by a power of 2, and so it manages much better L1 usage. If, in the VLA case, I make NORMSIZE 1024, the performance gain disappears.
So: should we just compile by default with -fno-stack-clash-protection -fno-stack-protector? I don't know if this provides anything, security wise, when FORM has #write and #system anyway?
I will also investigate further using the WorkSpace to hold this data.
There is a third option. You allocate it in AT as an array that can be expanded, and you keep a counter for how deep you have gone.
Currently we do not go beyond two levels (I believe), but you never know what the future brings.
This option suffers from more address computations, but it would be ironclad with respect to the future.
Hence you have a struct with all those variables, and in AT a pointer to an array with the addresses of those structs. When needed you extend the array, and you never make it smaller (for speed).
On 26 Feb 2026, at 21:33, jodavies left a comment (form-dev/form#803):
When you programmed this, "stack clash protection" did not exist ;) GCC added it in 2018 (8.0).
You can get the same performance improvement by allocating arrays in AT for Normalize to use. You need to allocate, I think, twice the required size. As far as I can tell, the only way to trigger a recursive Normalize call in the current test suite and benchmarks is Normalize -> ExpandRat -> Normalize. I think this is much simpler than messing around with the WorkSpace, since many of the functions which Normalize calls use the WorkSpace.
So, which seems to be the better solution? Both have similar performance:
1. Compile by default with -fno-stack-clash-protection. This is simple and doesn't change FORM's behaviour or code at all.
2. Allocate arrays in AT. This implies a maximum recursion depth for Normalize. The advantage is that buffer overruns in these arrays will be caught by valgrind; currently, if one overruns a buffer, there is silent corruption of the rest of the stack.
Large stack allocations carry a performance penalty due to stack clash protection. Dynamically allocate the buffers needed by Normalize in AT instead. This gives a large performance improvement.
jodavies force-pushed from 3f5cf71 to 070aaaf
Here is the next iteration. Pointers to the Normalize buffers live in a struct, and the space is allocated dynamically, such that valgrind can catch errors in their use. I removed the user control of NORMSIZE again. One set of buffers is allocated at startup (per thread), and another is allocated if necessary. For now, debugging builds Terminate if Normalize is called with more than two levels of recursion, which we don't expect to happen. The performance numbers are more or less the same as in the first comment.
Normalize is always a large contributor to FORM's run time. Profiling reveals that the large stack allocations in this function are costly: since NORMSIZE is 1000, they total ~90KB.
I tried running the usual benchmarks with NORMSIZE set to 100, which is sufficient for these tests, and the performance difference is rather large.
Since we can't just reduce NORMSIZE without (in principle) breaking user scripts which have very complicated terms, I experimented a bit:
I decided that one option is to just make NORMSIZE a user-controlled parameter, with default 1000. Then nothing will break, and users can experiment with making it smaller to speed up their scripts. This PR implements that, by adding a "NormSize" setup parameter and using variable-length arrays (VLAs) in Normalize. VLAs are part of C99 but optional in C11.
What I don't yet understand is that I now get the same performance improvement WITHOUT reducing the value from 1000.
Here are the numbers. Can someone try to reproduce this?
The GitHub runners seem a bit dodgy recently... I think the CI should pass.