
kqueue: elide mutex and atomic barriers in single-threaded mode #139

Closed

mvandeberg wants to merge 1 commit into cppalliance:develop from mvandeberg:feature/kqueue-elide-mutex

Conversation

Contributor

@mvandeberg mvandeberg commented Feb 13, 2026

When concurrency_hint == 1, the scheduler now bypasses all synchronization overhead: conditional_mutex/conditional_unique_lock skip pthread_mutex calls entirely, and conditional_atomic decomposes RMW ops (fetch_add/fetch_sub/exchange/CAS) into a plain relaxed load+modify+store, eliminating ldaxr/stlxr exclusive pairs on ARM64 and LOCK prefixes on x86. Mutexes spin briefly (40 iterations with yield/pause hints) before falling back to the OS futex.

Benchmark factory updated to accept a concurrency hint so single-threaded benchmarks explicitly opt in with factory(1).
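
A minimal sketch of the fast-path idea behind conditional_atomic, under assumed names and signatures (illustrative only, not the actual corosio code):

#include <atomic>

// Illustrative sketch: when `enabled_` is false, a read-modify-write is
// decomposed into a plain relaxed load, a modification, and a relaxed
// store, so no LOCK-prefixed or ldaxr/stlxr instructions are needed on
// the single-threaded fast path.
template<class T>
class conditional_atomic
{
    std::atomic<T> value_;
    bool enabled_;

public:
    conditional_atomic(T v, bool enabled) noexcept
        : value_(v), enabled_(enabled)
    {
    }

    T fetch_add(T arg, std::memory_order order = std::memory_order_seq_cst) noexcept
    {
        if (enabled_)
            return value_.fetch_add(arg, order);
        // Single-threaded fast path: no atomic RMW required.
        T old = value_.load(std::memory_order_relaxed);
        value_.store(old + arg, std::memory_order_relaxed);
        return old;
    }
};

Since no other thread can observe the value in single-threaded mode, the decomposed relaxed operations produce the same result while letting the compiler emit ordinary loads and stores.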

Summary by CodeRabbit

Release Notes

  • New Features

    • Added concurrency-aware synchronization to optimize performance for single-threaded scenarios.
    • Enhanced context factory to accept concurrency hints for improved resource utilization.
  • Refactor

    • Updated all backends to use lightweight synchronization primitives with optional locking when concurrency is not required.


coderabbitai bot commented Feb 13, 2026

📝 Walkthrough


Introduced conditional synchronization primitives (conditional_atomic, conditional_mutex, conditional_event) to enable optional single-threaded optimization. Modified the context_factory interface to accept a concurrency hint parameter. Updated all benchmark invocations and kqueue scheduler to integrate concurrency-aware locking and factory construction.

Changes

Backend Factory Interface
perf/common/backend_selection.hpp
Replaced the function-pointer typedef with a context_factory struct accepting an unsigned concurrency hint. Updated all backend factory lambdas (epoll, kqueue, select, iocp) to forward the hint to the context constructors (a rough sketch of this interface follows the table).

Benchmark Factory Calls
perf/profile/concurrent_io_bench.cpp, perf/profile/coroutine_post_bench.cpp, perf/profile/queue_depth_bench.cpp, perf/profile/scheduler_contention_bench.cpp, perf/profile/small_io_bench.cpp
Updated factory invocations to pass the concurrency hint: factory(num_threads) in workload setup and factory(1) in warmup blocks.

Conditional Synchronization Primitives
src/corosio/src/detail/conditional_atomic.hpp, src/corosio/src/detail/conditional_mutex.hpp
Added new headers implementing conditional_atomic<T>, conditional_mutex, conditional_unique_lock, conditional_event, and a spin_pause() utility, providing optional single-threaded optimization with relaxed/non-atomic fast paths when locking is disabled.

Kqueue Backend Integration
src/corosio/src/detail/kqueue/op.hpp, src/corosio/src/detail/kqueue/scheduler.hpp, src/corosio/src/detail/kqueue/scheduler.cpp
Replaced std::mutex/std::atomic/std::condition_variable with the conditional variants. Updated the constructor to accept a concurrency_hint and propagate the locking configuration. Changed all lock parameter types from std::unique_lock<std::mutex>& to conditional_unique_lock&.
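
A rough sketch of what a hint-carrying factory could look like; all names and the struct layout here are assumptions for illustration, and the real interface in perf/common/backend_selection.hpp may differ:

#include <memory>

// Hypothetical stand-in for a backend execution context that takes a
// concurrency hint at construction.
struct io_context
{
    explicit io_context(unsigned concurrency_hint) : hint(concurrency_hint) {}
    unsigned hint;
};

// Sketch of a factory object that forwards the hint to the context
// constructor, replacing a plain function-pointer typedef that could
// not carry the extra parameter.
struct context_factory
{
    std::unique_ptr<io_context> (*make)(unsigned concurrency_hint);

    std::unique_ptr<io_context>
    operator()(unsigned concurrency_hint) const
    {
        return make(concurrency_hint);
    }
};

Benchmarks would then call factory(num_threads) for workload setup and factory(1) in warmup blocks, matching the calls listed above.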

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related issues

  • Implement kqueue reactor #13: Directly implements concurrency-aware kqueue scheduler with conditional locking primitives and updates to core scheduler internals for the kqueue reactor implementation.

Possibly related PRs

Poem

🐰 A whisker-twitch of optimization true,
Conditional locks where single threads pass through,
Concurrency hints now flow with grace,
Kqueue runs faster in its rightful place!
✨ No barriers where none are due.

🚥 Pre-merge checks | ✅ 3 passed

Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
Title check: ✅ Passed. The title accurately summarizes the primary change: introducing optimizations to elide mutex and atomic barriers when operating in single-threaded mode (concurrency_hint == 1), which is the main objective of the pull request.
Merge Conflict Detection: ✅ Passed. No merge conflicts detected when merging into develop.




codecov bot commented Feb 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.20%. Comparing base (4777a92) to head (138aba6).
⚠️ Report is 6 commits behind head on develop.

Additional details and impacted files


@@             Coverage Diff             @@
##           develop     #139      +/-   ##
===========================================
+ Coverage    81.03%   81.20%   +0.17%     
===========================================
  Files           64       64              
  Lines         5710     5710              
===========================================
+ Hits          4627     4637      +10     
+ Misses        1083     1073      -10     

see 1 file with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4777a92...138aba6.


@cppalliance-bot

An automated preview of the documentation is available at https://139.corosio.prtest3.cppalliance.org/index.html

If more commits are pushed to the pull request, the docs will rebuild at the same URL.

2026-02-13 00:25:40 UTC


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/corosio/src/detail/conditional_mutex.hpp`:
- Around line 35-42: The spin_pause() implementation uses GCC/Clang-only
intrinsics and will not compile under MSVC; update spin_pause() to add
MSVC-specific branches: include <intrin.h> when _MSC_VER is defined, use
__yield() for ARM64 (_M_ARM64) and _mm_pause() for x86 (_M_X64, _M_IX86) instead
of __asm__ volatile and __builtin_ia32_pause(), and keep the existing GCC/Clang
branches for non-MSVC compilers; ensure all compiler/arch checks are properly
`#ifdef-guarded` so the header remains portable.
🧹 Nitpick comments (4)
src/corosio/src/detail/conditional_mutex.hpp (1)

149-153: conditional_event::wait() is silently a no-op when locking is disabled — document the busy-spin hazard.

When disabled, lock.underlying().owns_lock() is false, so wait() returns immediately without blocking. Any caller that loops on a condition using cond_.wait(lock) (e.g., wait_for_signal) would degrade into an infinite busy-spin consuming 100% CPU.

Currently this is safe because wait_for_signal is structurally unreachable in single-threaded mode (the reactor sentinel ensures the queue is never empty). However, this is a non-obvious invariant — a future refactor that breaks that invariant would silently introduce a livelock. Consider adding an assertion or a brief comment at the call site to document why this is safe.
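
To make the hazard concrete, a small self-contained sketch with hypothetical names (not the scheduler's actual code):

// Hypothetical sketch of the hazard: an event whose wait() is a no-op
// when locking is disabled.
struct fake_conditional_event
{
    bool locking_enabled = false;

    template<class Lock>
    void wait(Lock&) const
    {
        if (!locking_enabled)
            return;                 // returns without blocking
        // the enabled path would block on a condition_variable here
    }
};

// A caller that loops on a predicate busy-spins forever (100% CPU) if
// nothing running on this thread ever makes the predicate true.
template<class Lock>
void drain_until(fake_conditional_event& ev, Lock& lock, bool const& signaled)
{
    while (!signaled)
        ev.wait(lock);
}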

src/corosio/src/detail/kqueue/scheduler.hpp (1)

206-208: Note: kqueue scheduler now diverges from the epoll scheduler's locking interface.

The epoll scheduler still uses std::unique_lock<std::mutex>& (per src/corosio/src/detail/epoll/scheduler.hpp:172). If the single-threaded optimization proves successful, consider applying the same treatment to the epoll backend to keep the implementations aligned.

src/corosio/src/detail/kqueue/scheduler.cpp (2)

691-697: Verify descriptor mutex is never used concurrently before register_descriptor configures it.

The conditional_mutex in descriptor_state is default-constructed with enabled = true. Lines 691–692 then reconfigure it to match the scheduler's settings. Between construction and this point, the mutex is in a fully-locked (enabled) mode regardless of concurrency_hint. This is safe because register_descriptor is called during socket setup before any I/O events can fire, but it would be slightly more robust to construct descriptor_state::mutex in the disabled state or accept the enabled flag at construction time to avoid any window of incorrect configuration.
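
One possible shape of that suggestion, sketched with assumed names; the PR's actual conditional_mutex interface may differ:

#include <mutex>

// Hypothetical: accept the locking mode at construction so the
// descriptor's mutex never passes through a window where it is
// configured differently from the scheduler.
class conditional_mutex
{
    std::mutex m_;
    bool enabled_;

public:
    explicit conditional_mutex(bool enabled = true) noexcept
        : enabled_(enabled)
    {
    }

    void lock()   { if (enabled_) m_.lock(); }
    void unlock() { if (enabled_) m_.unlock(); }
};

struct descriptor_state
{
    conditional_mutex mutex;

    explicit descriptor_state(bool locking_enabled)
        : mutex(locking_enabled)
    {
    }
};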


843-867: wait_for_signal / wait_for_signal_for silently degrade in single-threaded mode.

When locking is disabled, cond_.wait(lock) is a no-op (the underlying std::unique_lock doesn't own the mutex). wait_for_signal becomes an infinite busy-spin, and wait_for_signal_for becomes a single non-blocking check — effectively poll_one() semantics regardless of the requested timeout.

As currently designed, the reactor sentinel in the completed-ops queue prevents these paths from being reached in single-threaded mode. However, this invariant is non-obvious. A defensive assertion would catch any future regression:

Defensive assertion
 void
 kqueue_scheduler::
 wait_for_signal(conditional_unique_lock& lock) const
 {
+    // Unreachable in single-threaded mode: the reactor sentinel
+    // ensures the queue is never empty when outstanding_work > 0.
+    assert(locking_enabled() && "wait_for_signal reached in single-threaded mode");
     while ((state_ & signaled_bit) == 0)
     {

Comment on lines +35 to +42
inline void spin_pause() noexcept
{
#if defined(__aarch64__) || defined(_M_ARM64)
    __asm__ volatile("yield");
#elif defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
    __builtin_ia32_pause();
#endif
}

⚠️ Potential issue | 🟡 Minor

spin_pause() will fail to compile under MSVC due to GCC/Clang-only intrinsics.

The MSVC-specific macros (_M_ARM64, _M_X64, _M_IX86) are detected, but the code paths use __asm__ volatile and __builtin_ia32_pause() which are GCC/Clang intrinsics. MSVC requires __yield() (ARM64) and _mm_pause() (x86) from <intrin.h>. Since this header is in the shared detail/ directory (not under kqueue/), it should be portable in case other backends adopt it.

Suggested portable fix
 inline void spin_pause() noexcept
 {
-#if defined(__aarch64__) || defined(_M_ARM64)
+#if defined(__aarch64__)
     __asm__ volatile("yield");
-#elif defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
+#elif defined(_M_ARM64)
+    __yield();
+#elif defined(__x86_64__) || defined(__i386__)
     __builtin_ia32_pause();
+#elif defined(_M_X64) || defined(_M_IX86)
+    _mm_pause();
 #endif
 }

The MSVC paths would also need #include <intrin.h> guarded by #ifdef _MSC_VER.


@cppalliance-bot

GCOVR code coverage report https://139.corosio.prtest3.cppalliance.org/gcovr/index.html
LCOV code coverage report https://139.corosio.prtest3.cppalliance.org/genhtml/index.html
Coverage Diff Report https://139.corosio.prtest3.cppalliance.org/diff-report/index.html

Build time: 2026-02-13 00:33:13 UTC

@mvandeberg
Contributor Author

We are going to implement this at compile time. Closing.

@mvandeberg mvandeberg closed this Feb 13, 2026
