
blk-mq: add tracepoint block_rq_tag_wait #876

Open
blktests-ci[bot] wants to merge 1 commit into linus-master_base from series/1099901=>linus-master

blktests-ci Bot commented May 23, 2026

Pull request for series with
subject: blk-mq: add tracepoint block_rq_tag_wait
version: 7
url: https://patchwork.kernel.org/project/linux-block/list/?series=1099901

blktests-ci Bot commented May 23, 2026

Upstream branch: eed108e
series: https://patchwork.kernel.org/project/linux-block/list/?series=1099901
version: 7

blktests-ci Bot commented May 24, 2026

Upstream branch: eed108e
series: https://patchwork.kernel.org/project/linux-block/list/?series=1099913
version: 8

blktests-ci Bot added V8 and removed V7 labels May 24, 2026

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), fast
devices (SSDs) that share a blk_mq_tag_set with other devices can be
starved of hardware tags, causing severe latency spikes.

Currently, diagnosing this specific hardware queue contention is
difficult. When the tag pool is exhausted, blk_mq_get_tag() forces the
calling task to block uninterruptibly in io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait tracepoint in the tag
allocation slow-path. It triggers immediately before the task state
is altered to TASK_UNINTERRUPTIBLE (ensuring safety for PREEMPT_RT
locks). It exposes the exact hardware context (hctx) that is starved,
the specific pool experiencing starvation (driver, software scheduler,
or reserved), and the exact pool depth.

This provides storage engineers with a zero-configuration, low-overhead
mechanism to definitively identify shared-tag bottlenecks. For example,
userspace can count tag-starvation events per CPU with a bpftrace one-liner:

    # bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
    Attaching 1 probe...
    ^C
    @tag_waits[4]: 12
    @tag_waits[12]: 87

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>

blktests-ci Bot force-pushed the series/1099901=>linus-master branch from d489bb9 to 03e4345 May 24, 2026 01:50