block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT#802
block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT#802blktests-ci[bot] wants to merge 1 commit into
Conversation
|
Upstream branch: 6d35786 |
1f0d33a to
b1870f6
Compare
|
Upstream branch: aa54b1d |
16ab228 to
fe6d375
Compare
|
Upstream branch: aa54b1d |
fe6d375 to
fb405e9
Compare
|
Upstream branch: aa54b1d |
fb405e9 to
8ee03f2
Compare
b1870f6 to
ca57796
Compare
|
Upstream branch: 70eda68 |
8ee03f2 to
494b9e2
Compare
ca57796 to
c1feb59
Compare
|
Upstream branch: 8bc67e4 |
494b9e2 to
73e448e
Compare
c1feb59 to
ea833a1
Compare
|
Upstream branch: 6779b50 |
73e448e to
de77f90
Compare
ea833a1 to
7af85d1
Compare
|
Upstream branch: 79bd2dd |
de77f90 to
f069506
Compare
7af85d1 to
de94ac7
Compare
|
Upstream branch: eed108e |
… on RT On PREEMPT_RT kernels, commit 6bda857 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding") causes a severe throughput regression on systems with many MSI-X interrupt vectors. That commit closed a store/load race between blk_mq_run_hw_queue() and blk_mq_unquiesce_queue() by taking q->queue_lock around the requiesce re-check in blk_mq_run_hw_queue(). Its changelog noted two ways to fix the race -- (1) a pair of memory barriers, or (2) the queue_lock -- and picked (2) because barriers are harder to maintain. On RT, spinlock_t becomes a sleeping rt_mutex. blk_mq_run_hw_queue() is called from every IRQ thread, and the re-check path is hit on the very common "nothing pending" case, so all IRQ threads end up serialising on the single q->queue_lock and block in D-state. On a Broadcom/LSI MegaRAID 12GSAS/PCIe Secure SAS39xx (megaraid_sas, 128 MSI-X vectors, 120 hw queues) throughput drops from 640 MB/s to 153 MB/s. Take approach (1) instead, and while at it turn quiesce_depth into the single source of truth for the quiesce state: - quiesce_depth becomes atomic_t and QUEUE_FLAG_QUIESCED is removed; blk_queue_quiesced() is now "atomic_read(&q->quiesce_depth) > 0". This also makes blk_queue_quiesced(), which is read locklessly from the dispatch path, a clean atomic load instead of a plain-int read racing with a spin_lock-protected int update. - blk_mq_quiesce_queue_nowait() does an atomic_inc() followed by smp_mb__after_atomic(). The spin_lock() it used to take only served to publish the state change; every caller still follows the quiesce with blk_mq_wait_quiesce_done() (synchronize_srcu()/synchronize_rcu()), which is what actually drains in-flight dispatchers and makes the new state globally visible. The barrier here just keeps the helper self-contained for the few callers that defer that wait. - blk_mq_unquiesce_queue() uses atomic_dec_if_positive() (so the WARN-on-underflow check and the decrement are one atomic op) followed by smp_mb__after_atomic() before blk_mq_run_hw_queues(). This is the write side of the race fixed above: a full barrier between the quiesce_depth store and the blk_mq_hctx_has_pending() load. - blk_mq_run_hw_queue() drops the q->queue_lock around the requiesce re-check and uses smp_mb() instead. This is the read side: a full barrier between the just-inserted request (the store that makes blk_mq_hctx_has_pending() true) and the quiesce-state load. A full barrier is required on both sides -- this is a classic store-buffer pattern -- so smp_mb()/smp_mb__after_atomic() rather than a read barrier; with that, at least one of the two racing CPUs observes the other's store and the hw queue is not left both un-quiesced and not rerun. No locking remains on the dispatch hot path. Performance on the RT kernel and the hardware above: - Before: 153 MB/s, IRQ threads in D-state on q->queue_lock - After: 640 MB/s, no IRQ threads blocked The non-RT path replaces a queue_lock acquire/release on the re-check with an smp_mb(), so it should be no worse, and it also stops taking q->queue_lock from blk_mq_run_hw_queue() entirely. Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Fixes: 6bda857 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding") Cc: stable@vger.kernel.org Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
f069506 to
8ea3510
Compare
Pull request for series with
subject: block/blk-mq: use atomic_t for quiesce_depth to avoid lock contention on RT
version: 6
url: https://patchwork.kernel.org/project/linux-block/list/?series=1090278