
Added sb instruction support for ARMv9 architecture #66

Open
salvatoredipietro wants to merge 1 commit into facebook:dev from salvatoredipietro:dev-isb

@salvatoredipietro

We would like to add support for using the sb instruction instead of isb on the ARMv9 architecture.
Based on our microbenchmark (patch attached), sb is ~30% faster than isb (ratio=1.710:1), and the change does not seem to introduce any regression in the spin_cpu_spinwait() function (isb_spin=8740725us vs standard_spin=8739722us).

# Jemalloc build
$ make clean all ; ./autogen.sh && ./configure && make -j4

# Test on m8g.2xlarge with patch
$ make tests_stress && ./test/stress/arm_spin_bench
Running on ARM64 architecture
SB instruction is supported
1000000 iterations, isb_spin=8740725us (8740.725 ns/iter), standard_spin=5110839us (5110.839 ns/iter), time consumption ratio=1.710:1

# Test on m8g.2xlarge without patch 
1000000 iterations, isb_spin=8739722us (8739.722 ns/iter), standard_spin=8739722us (8739.722 ns/iter), time consumption ratio=1.000:1

Original post: jemalloc#2843
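
For readers who want to see the idea in code, here is a minimal sketch of a spin hint that picks sb when the CPU advertises it and falls back to isb otherwise. It is not the code from the attached patch: `cpu_has_sb`, `spin_hint`, and `spin_n` are hypothetical names, the check assumes Linux on AArch64 with GCC or Clang, and the `HWCAP_SB` fallback value and the raw `0xd50330ff` encoding reflect my understanding of the Linux arm64 headers and the Arm encoding of sb, so verify them against your toolchain. On other architectures the hint degrades to a compiler barrier.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#ifndef HWCAP_SB
#define HWCAP_SB (1UL << 29) /* assumed Linux arm64 hwcap bit for FEAT_SB */
#endif
#endif

/* Hypothetical helper: true if the kernel reports sb support. */
static bool cpu_has_sb(void) {
#if defined(__aarch64__) && defined(__linux__)
    return (getauxval(AT_HWCAP) & HWCAP_SB) != 0;
#else
    return false;
#endif
}

/* Hypothetical spin hint: issues one sb or one isb per call. */
static inline void spin_hint(bool use_sb) {
#if defined(__aarch64__)
    if (use_sb) {
        /* Raw encoding of sb, so assemblers without +sb still accept it. */
        __asm__ volatile(".inst 0xd50330ff" ::: "memory");
    } else {
        __asm__ volatile("isb" ::: "memory");
    }
#else
    (void)use_sb;
    __asm__ volatile("" ::: "memory"); /* compiler barrier only */
#endif
}

/* Issue n spin hints and return how many were issued. */
static uint64_t spin_n(uint64_t n) {
    bool use_sb = cpu_has_sb(); /* query once, outside the hot loop */
    uint64_t i;
    for (i = 0; i < n; i++) {
        spin_hint(use_sb);
    }
    return i;
}
```

Querying the capability once and passing it in keeps the `getauxval` call out of the loop being timed, which matters when comparing per-iteration costs like those reported above.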

@meta-cla meta-cla bot added the cla signed label Jan 8, 2026
@lexprfuncall lexprfuncall self-assigned this Jan 17, 2026
@lexprfuncall

Thanks for the PR! Do you have any thoughts on the use of a higher-throughput instruction for a delay loop? Intuitively, you need to waste some time, so the difference between an sb and an isb could be either a win or a loss depending on the number of iterations.

As an aside, the approach suggested by ARM involves using a more precise delay loop to account for the differences. That might require a slightly bigger change to the spin_adaptive implementation:

https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
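
To make the iteration-count concern concrete, here is a minimal sketch of a delay loop bounded by elapsed time rather than by a fixed number of hint instructions, so the wait stays the same length whether the hint is fast or slow. This is an assumption about what a "more precise" loop could look like, not the method from the linked ARM post and not jemalloc's spin_adaptive code; `now_ns`, `cpu_relax`, and `spin_for_ns` are hypothetical names, and it assumes a POSIX system with `clock_gettime`.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical per-architecture spin hint. */
static inline void cpu_relax(void) {
#if defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#else
    __asm__ volatile("" ::: "memory"); /* compiler barrier only */
#endif
}

/* Spin until at least target_ns have elapsed; returns the elapsed time. */
static uint64_t spin_for_ns(uint64_t target_ns) {
    uint64_t start = now_ns();
    uint64_t elapsed = 0;
    while (elapsed < target_ns) {
        cpu_relax();
        elapsed = now_ns() - start;
    }
    return elapsed;
}
```

Reading the clock on every iteration has its own cost, so a real implementation would more likely calibrate the hint's latency once and convert a time budget into an iteration count.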

@salvatoredipietro
Author

Reducing the wait time gives the spinlock more opportunity to succeed sooner. Using lockhammer (a locking/contention benchmark), we see an improvement of 11-18%. The Folly project (facebook/folly#2390) also saw a good performance bump (between 20% and 30% with >=16 threads).
