perf: allocate Normalize buffers dynamically, store in AT #803
jodavies wants to merge 1 commit into form-dev:master
Conversation
The problem with NORMSIZE was that in the past, when it still had a much smaller value, running out of these arrays resulted in extremely hard-to-spot bugs. I wanted these arrays on the stack, because that way the compiler can hang it all on a single address register, the frame pointer. That saves speed. This was before the routine became recursive, but also for recursive use, which is not too frequent, this should not make much of a difference.
But because I wanted to hang it all on the frame pointer, these arrays had to be of fixed size.
If you allocate them dynamically by making it a setup parameter, you will need more address registers during execution, and there is a bigger need for address calculations.
This consideration of address registers is also the reason for the global struct A. There, too, everything goes by offset from the address of the whole structure. You may argue that there are more address registers, but the fewer of them are already ‘fixed’, the better the compiler can optimise. In the past that made a few percent difference in speed.
On 26 Feb 2026, at 17:13, jodavies left a comment (form-dev/form#803):
The reason for the performance improvement is due to "stack clash protection". Compiling with -fno-stack-clash-protection also gives the same performance improvement. (Adding additionally -fno-stack-protector is maybe an additional percent or two).
The way it is implemented is to touch each 4k page in the stack. This leads to a lot of L1 cache misses -- my CPU has an 8-way associative cache. In the VLA case, stack clash protection is also active, however it touches each 4k page of each array in separate loops, compared to the whole 90k stack in a single loop. Since NORMSIZE is 1000, the base addresses of the arrays are not separated by a power of 2, and so it manages much better L1 usage. If, in the VLA case, I make NORMSIZE 1024, the performance gain disappears.
So: should we just compile by default with -fno-stack-clash-protection -fno-stack-protector? I don't know if this provides anything, security wise, when FORM has #write and #system anyway?
I will also investigate further using the WorkSpace to hold this data.
There is a third option. You allocate it in AT as an array that can be expanded, and you keep a counter for how deep you have gone.
Currently we do not go beyond two levels (I believe), but you never know what the future brings.
This option suffers from more address computations, but it would be ironclad with respect to the future.
Hence you have a struct with all those variables, and in AT a pointer to an array with the addresses of those structs. When needed you extend the array, and you never make it smaller (for speed).
On 26 Feb 2026, at 21:33, jodavies left a comment (form-dev/form#803):
When you programmed this, "stack clash protection" did not exist ;) GCC added it in 2018 (8.0).
You can get the same performance improvement by allocating arrays in AT for Normalize to use. You need to allocate, I think, twice the required size. As far as I can tell, the only way to trigger a recursive Normalize call in the current test suite and benchmarks is Normalize -> ExpandRat -> Normalize. I think this is much simpler than messing around with the WorkSpace, since many of the functions which Normalize calls use the WorkSpace.
So, which seems to be the better solution? Both have similar performance:
1. Compile by default with -fno-stack-clash-protection. This is simple and doesn't change FORM's behaviour or code at all.
2. Allocate arrays in AT. This implies a maximum recursion depth for Normalize. The advantage is that buffer overruns in these arrays will be caught by valgrind; currently, if one overruns a buffer, there is silent corruption of the rest of the stack.
Large stack allocations carry a performance penalty due to stack clash protection. Dynamically allocate the buffers needed by Normalize in AT instead. This gives a large performance improvement.
jodavies force-pushed from 3f5cf71 to 070aaaf
Here is the next iteration. Pointers to the Normalize buffers live in a struct, and the space is allocated dynamically, such that valgrind can catch errors in their use. I removed the user control of NORMSIZE again. One set of buffers is allocated at startup (per thread), and another is allocated if necessary. For now, debugging builds Terminate if Normalize is called with more than two levels of recursion, which we don't expect to happen. The performance numbers are more or less the same as in the first comment.
Normalize is always a large contributor to FORM's run time. Profiling reveals that the large stack allocations in this function are costly: since NORMSIZE is 1000, they total ~90KB.
I tried running the usual benchmarks with NORMSIZE set to 100, which is sufficient for these tests, and the performance difference is rather large.
Since we can't just reduce NORMSIZE without (in principle) breaking user scripts which have very complicated terms, I experimented a bit:
I decided that one option is to just make NORMSIZE a user-controlled parameter, with default 1000. Then nothing will break, and users can experiment with making it smaller to speed up their scripts. This PR implements that, by adding a "NormSize" setup parameter and using variable-length arrays (VLAs) in Normalize. VLAs are part of C99 but optional in C11.
What I don't yet understand is that I now get the same performance improvement WITHOUT reducing the value from 1000.
Here are the numbers. Can someone try to reproduce this?
The GitHub runners seem a bit dodgy recently... I think the CI should pass.