Skip to content

block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT#802

Open
blktests-ci[bot] wants to merge 1 commit into
linus-master_basefrom
series/1090278=>linus-master
Open

block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT#802
blktests-ci[bot] wants to merge 1 commit into
linus-master_basefrom
series/1090278=>linus-master

Conversation

@blktests-ci
Copy link
Copy Markdown

@blktests-ci blktests-ci Bot commented May 6, 2026

Pull request for series with
subject: block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT
version: 6
url: https://patchwork.kernel.org/project/linux-block/list/?series=1090278

@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented May 6, 2026

Upstream branch: 6d35786
series: https://patchwork.kernel.org/project/linux-block/list/?series=1090278
version: 6

@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented May 10, 2026

Upstream branch: aa54b1d
series: https://patchwork.kernel.org/project/linux-block/list/?series=1090278
version: 6

@blktests-ci blktests-ci Bot force-pushed the series/1090278=>linus-master branch from 16ab228 to fe6d375 Compare May 10, 2026 16:16
@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented May 12, 2026

Upstream branch: aa54b1d
series: https://patchwork.kernel.org/project/linux-block/list/?series=1093294
version: 7

@blktests-ci blktests-ci Bot added V7 and removed V6 labels May 12, 2026
@blktests-ci blktests-ci Bot force-pushed the series/1090278=>linus-master branch from fe6d375 to fb405e9 Compare May 12, 2026 06:35
@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented May 12, 2026

Upstream branch: aa54b1d
series: https://patchwork.kernel.org/project/linux-block/list/?series=1093294
version: 7

@blktests-ci blktests-ci Bot force-pushed the series/1090278=>linus-master branch from fb405e9 to 8ee03f2 Compare May 12, 2026 17:58
@blktests-ci blktests-ci Bot force-pushed the linus-master_base branch from b1870f6 to ca57796 Compare May 15, 2026 07:55
@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented May 15, 2026

Upstream branch: 70eda68
series: https://patchwork.kernel.org/project/linux-block/list/?series=1093294
version: 7

@blktests-ci blktests-ci Bot force-pushed the series/1090278=>linus-master branch from 8ee03f2 to 494b9e2 Compare May 15, 2026 08:05
@blktests-ci blktests-ci Bot force-pushed the linus-master_base branch from ca57796 to c1feb59 Compare May 21, 2026 02:54
@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented May 21, 2026

Upstream branch: 8bc67e4
series: https://patchwork.kernel.org/project/linux-block/list/?series=1093294
version: 7

@blktests-ci blktests-ci Bot force-pushed the series/1090278=>linus-master branch from 494b9e2 to 73e448e Compare May 21, 2026 03:29
@blktests-ci blktests-ci Bot force-pushed the linus-master_base branch from c1feb59 to ea833a1 Compare May 22, 2026 01:53
@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented May 22, 2026

Upstream branch: 6779b50
series: https://patchwork.kernel.org/project/linux-block/list/?series=1093294
version: 7

@blktests-ci blktests-ci Bot force-pushed the series/1090278=>linus-master branch from 73e448e to de77f90 Compare May 22, 2026 02:26
@blktests-ci blktests-ci Bot force-pushed the linus-master_base branch from ea833a1 to 7af85d1 Compare May 23, 2026 06:11
@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented May 23, 2026

Upstream branch: 79bd2dd
series: https://patchwork.kernel.org/project/linux-block/list/?series=1093294
version: 7

@blktests-ci blktests-ci Bot force-pushed the series/1090278=>linus-master branch from de77f90 to f069506 Compare May 23, 2026 07:22
@blktests-ci blktests-ci Bot force-pushed the linus-master_base branch from 7af85d1 to de94ac7 Compare May 23, 2026 17:08
@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented May 23, 2026

Upstream branch: eed108e
series: https://patchwork.kernel.org/project/linux-block/list/?series=1093294
version: 7

… on RT

On PREEMPT_RT kernels, commit 6bda857 ("block: fix ordering
between checking QUEUE_FLAG_QUIESCED request adding") causes a severe
throughput regression on systems with many MSI-X interrupt vectors.

That commit closed a store/load race between blk_mq_run_hw_queue() and
blk_mq_unquiesce_queue() by taking q->queue_lock around the requiesce
re-check in blk_mq_run_hw_queue().  Its changelog noted two ways to fix
the race -- (1) a pair of memory barriers, or (2) the queue_lock -- and
picked (2) because barriers are harder to maintain.

On RT, spinlock_t becomes a sleeping rt_mutex.  blk_mq_run_hw_queue() is
called from every IRQ thread, and the re-check path is hit on the very
common "nothing pending" case, so all IRQ threads end up serialising on
the single q->queue_lock and block in D-state.  On a Broadcom/LSI
MegaRAID 12GSAS/PCIe Secure SAS39xx (megaraid_sas, 128 MSI-X vectors,
120 hw queues) throughput drops from 640 MB/s to 153 MB/s.

Take approach (1) instead, and while at it turn quiesce_depth into the
single source of truth for the quiesce state:

 - quiesce_depth becomes atomic_t and QUEUE_FLAG_QUIESCED is removed;
   blk_queue_quiesced() is now "atomic_read(&q->quiesce_depth) > 0".
   This also makes blk_queue_quiesced(), which is read locklessly from
   the dispatch path, a clean atomic load instead of a plain-int read
   racing with a spin_lock-protected int update.

 - blk_mq_quiesce_queue_nowait() does an atomic_inc() followed by
   smp_mb__after_atomic().  The spin_lock() it used to take only served
   to publish the state change; every caller still follows the quiesce
   with blk_mq_wait_quiesce_done() (synchronize_srcu()/synchronize_rcu()),
   which is what actually drains in-flight dispatchers and makes the new
   state globally visible.  The barrier here just keeps the helper
   self-contained for the few callers that defer that wait.

 - blk_mq_unquiesce_queue() uses atomic_dec_if_positive() (so the
   WARN-on-underflow check and the decrement are one atomic op) followed
   by smp_mb__after_atomic() before blk_mq_run_hw_queues().  This is the
   write side of the race fixed above: a full barrier between the
   quiesce_depth store and the blk_mq_hctx_has_pending() load.

 - blk_mq_run_hw_queue() drops the q->queue_lock around the requiesce
   re-check and uses smp_mb() instead.  This is the read side: a full
   barrier between the just-inserted request (the store that makes
   blk_mq_hctx_has_pending() true) and the quiesce-state load.  A full
   barrier is required on both sides -- this is a classic store-buffer
   pattern -- so smp_mb()/smp_mb__after_atomic() rather than a read
   barrier; with that, at least one of the two racing CPUs observes the
   other's store and the hw queue is not left both un-quiesced and not
   rerun.

No locking remains on the dispatch hot path.

Performance on the RT kernel and the hardware above:
 - Before: 153 MB/s, IRQ threads in D-state on q->queue_lock
 - After:  640 MB/s, no IRQ threads blocked

The non-RT path replaces a queue_lock acquire/release on the re-check
with an smp_mb(), so it should be no worse, and it also stops taking
q->queue_lock from blk_mq_run_hw_queue() entirely.

Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 6bda857 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding")
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
@blktests-ci blktests-ci Bot force-pushed the series/1090278=>linus-master branch from f069506 to 8ea3510 Compare May 23, 2026 17:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant