From 45fba366173e4814c494dc2659e9f351aea2d2e9 Mon Sep 17 00:00:00 2001 From: Daniel Wagner Date: Thu, 21 May 2026 19:29:49 -0400 Subject: [PATCH 1/8] scsi: aacraid: use block layer helpers to calculate num of queues The calculation of the upper limit for queues does not depend solely on the number of online CPUs; for example, the isolcpus kernel command-line option must also be considered. To account for this, the block layer provides a helper function to retrieve the maximum number of queues. Use it to set an appropriate upper queue number limit. This patch brings aacraid in line with the API migration initiated for other SCSI drivers in commit 94970cfb5f10 ("scsi: use block layer helpers to calculate num of queues"). Signed-off-by: Daniel Wagner Reviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke [atomlin: Drop "Fixes:" tag; indicate alignment with other SCSI drivers] Signed-off-by: Aaron Tomlin --- drivers/scsi/aacraid/comminit.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/scsi/aacraid/comminit.c b/drivers/scsi/aacraid/comminit.c index 9bd3f5b868bcd..ec165b57182d3 100644 --- a/drivers/scsi/aacraid/comminit.c +++ b/drivers/scsi/aacraid/comminit.c @@ -469,8 +469,7 @@ void aac_define_int_mode(struct aac_dev *dev) } /* Don't bother allocating more MSI-X vectors than cpus */ - msi_count = min(dev->max_msix, - (unsigned int)num_online_cpus()); + msi_count = blk_mq_num_online_queues(dev->max_msix); dev->max_msix = msi_count; From 76ff58f45005cfe9b1b1783b59464862201367cc Mon Sep 17 00:00:00 2001 From: Daniel Wagner Date: Thu, 21 May 2026 19:29:50 -0400 Subject: [PATCH 2/8] lib/group_cpus: remove dead !SMP code The core scheduler recently transitioned to compiling SMP data structures unconditionally to reduce code complexity - see commit cac5cefbade9 ("sched/smp: Make SMP unconditional"). 
In alignment with this philosophy of reducing dual-path maintenance, this patch removes the #ifdef CONFIG_SMP guards and the dedicated !SMP fallback logic here. While the !SMP path provided a slightly simpler execution flow for uniprocessor kernels (avoiding SMP-specific overhead), maintaining these separate code paths adds unnecessary complexity and testing burden. Removing these guards simplifies the codebase by standardizing entirely on the SMP logic, which safely resolves to single-CPU operations on UP configurations. Signed-off-by: Daniel Wagner Reviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke [atomlin: Updated commit message to clarify !SMP removal context] Signed-off-by: Aaron Tomlin --- lib/group_cpus.c | 20 -------------------- 1 file changed, 20 deletions(-) diff --git a/lib/group_cpus.c b/lib/group_cpus.c index e6e18d7a49bba..b8d54398f88a1 100644 --- a/lib/group_cpus.c +++ b/lib/group_cpus.c @@ -9,8 +9,6 @@ #include #include -#ifdef CONFIG_SMP - static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk, unsigned int cpus_per_grp) { @@ -564,22 +562,4 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks) *nummasks = min(nr_present + nr_others, numgrps); return masks; } -#else /* CONFIG_SMP */ -struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks) -{ - struct cpumask *masks; - - if (numgrps == 0) - return NULL; - - masks = kzalloc_objs(*masks, numgrps); - if (!masks) - return NULL; - - /* assign all CPUs(cpu 0) to the 1st group only */ - cpumask_copy(&masks[0], cpu_possible_mask); - *nummasks = 1; - return masks; -} -#endif /* CONFIG_SMP */ EXPORT_SYMBOL_GPL(group_cpus_evenly); From 28c136681a88c5684e2289ae81dce959d2c6ec28 Mon Sep 17 00:00:00 2001 From: Daniel Wagner Date: Thu, 21 May 2026 19:29:51 -0400 Subject: [PATCH 3/8] lib/group_cpus: Add group_mask_cpus_evenly() This commit introduces group_mask_cpus_evenly(), which allows callers to distribute a specific CPU mask 
evenly across groups. It serves as a bounded version of group_cpus_evenly(). While group_cpus_evenly() operates on the global cpu_possible_mask, group_mask_cpus_evenly() confines the distribution strictly within the boundaries of the caller-provided mask. It preserves the kernel's native two-stage spreading logic: first prioritising CPUs that are physically present (cpu_present_mask) to prevent I/O starvation, and then distributing any remaining vectors to non-present CPUs to maintain hotplug safety. Signed-off-by: Daniel Wagner Reviewed-by: Hannes Reinecke [atomlin: - Added check for numgrps == 0 - Updated commit message to resolve typo - Removed unused - Fix TOCTOU race by caching the provided mask - Removed ineffective data_race() annotations around cpumask pointers - Implemented two-stage grouping logic to prioritise physically present CPUs, mirroring group_cpus_evenly() - Fix division-by-zero bug by ensuring group_mask_cpus_evenly() returns NULL instead of an empty array when evaluated against an empty mask] Signed-off-by: Aaron Tomlin --- include/linux/group_cpus.h | 3 + lib/group_cpus.c | 110 +++++++++++++++++++++++++++++++++++++ 2 files changed, 113 insertions(+) diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h index 9d4e5ab6c314b..defab4123a82f 100644 --- a/include/linux/group_cpus.h +++ b/include/linux/group_cpus.h @@ -10,5 +10,8 @@ #include struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks); +struct cpumask *group_mask_cpus_evenly(unsigned int numgrps, + const struct cpumask *mask, + unsigned int *nummasks); #endif diff --git a/lib/group_cpus.c b/lib/group_cpus.c index b8d54398f88a1..75bd082e00bf4 100644 --- a/lib/group_cpus.c +++ b/lib/group_cpus.c @@ -563,3 +563,113 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks) return masks; } EXPORT_SYMBOL_GPL(group_cpus_evenly); + +/** + * group_mask_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality + * @numgrps: number of
cpumasks to create + * @mask: CPUs to consider for the grouping + * @nummasks: number of initialized cpumasks + * + * Return: cpumask array if successful, NULL otherwise. Only the CPUs + * marked in the mask will be considered for the grouping. And each + * element includes CPUs assigned to this group. nummasks contains the + * number of initialized masks which can be less than numgrps. + * + * Try to put close CPUs from viewpoint of CPU and NUMA locality into + * the same group. + * + * We guarantee in the resulting grouping that all CPUs specified in the + * provided mask are covered, and no same CPU is assigned to multiple + * groups. + */ +struct cpumask *group_mask_cpus_evenly(unsigned int numgrps, + const struct cpumask *mask, + unsigned int *nummasks) +{ + unsigned int curgrp = 0, nr_present = 0, nr_others = 0; + cpumask_var_t *node_to_cpumask; + cpumask_var_t nmsk, local_mask, npresmsk; + int ret = -ENOMEM; + struct cpumask *masks = NULL; + + if (numgrps == 0) + return NULL; + + if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL)) + return NULL; + + if (!zalloc_cpumask_var(&local_mask, GFP_KERNEL)) + goto fail_nmsk; + + if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL)) + goto fail_local_mask; + + node_to_cpumask = alloc_node_to_cpumask(); + if (!node_to_cpumask) + goto fail_npresmsk; + + masks = kzalloc_objs(*masks, numgrps); + if (!masks) + goto fail_node_to_cpumask; + + build_node_to_cpumask(node_to_cpumask); + + /* + * Create a stable snapshot of the mask. The grouping algorithm + * requires the CPU count to remain constant across its multiple + * passes. This prevents allocation failures if the caller passes a + * dynamic mask (e.g., cpu_online_mask) that changes concurrently. + */ + cpumask_copy(local_mask, mask); + + /* + * Grouping present CPUs first. We intersect the provided mask with + * cpu_present_mask to ensure that we prioritise physically + * available CPUs for the initial distribution. 
+ */ + cpumask_and(npresmsk, local_mask, cpu_present_mask); + ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, + npresmsk, nmsk, masks); + if (ret < 0) + goto fail_node_to_cpumask; + nr_present = ret; + + /* + * Allocate non-present CPUs starting from the next group to be + * handled. If the grouping of present CPUs already exhausted the + * group space, assign the non-present CPUs to the already + * allocated out groups. + */ + if (nr_present >= numgrps) + curgrp = 0; + else + curgrp = nr_present; + cpumask_andnot(npresmsk, local_mask, npresmsk); + ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, + npresmsk, nmsk, masks); + if (ret >= 0) + nr_others = ret; + +fail_node_to_cpumask: + free_node_to_cpumask(node_to_cpumask); + +fail_npresmsk: + free_cpumask_var(npresmsk); + +fail_local_mask: + free_cpumask_var(local_mask); + +fail_nmsk: + free_cpumask_var(nmsk); + if (ret < 0) { + kfree(masks); + return NULL; + } + *nummasks = min(nr_present + nr_others, numgrps); + if (*nummasks == 0) { + kfree(masks); + return NULL; + } + return masks; +} +EXPORT_SYMBOL_GPL(group_mask_cpus_evenly); From 59ffe8f28beadb2071fbb52db7305cd2c2be629b Mon Sep 17 00:00:00 2001 From: Daniel Wagner Date: Thu, 21 May 2026 19:29:52 -0400 Subject: [PATCH 4/8] isolation: Introduce io_queue isolcpus type Multiqueue drivers spread I/O queues across all CPUs for optimal performance. However, these drivers are not aware of CPU isolation requirements and will distribute queues without considering the isolcpus configuration. Introduce a new isolcpus mask that allows users to define which CPUs should have I/O queues assigned. This is similar to managed_irq, but intended for drivers that do not use the managed IRQ infrastructure Signed-off-by: Daniel Wagner Reviewed-by: Martin K. 
Petersen Reviewed-by: Hannes Reinecke Signed-off-by: Aaron Tomlin --- include/linux/sched/isolation.h | 1 + kernel/sched/isolation.c | 7 +++++++ 2 files changed, 8 insertions(+) diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h index cf0fd03dd7a24..30cb9a44365eb 100644 --- a/include/linux/sched/isolation.h +++ b/include/linux/sched/isolation.h @@ -18,6 +18,7 @@ enum hk_type { HK_TYPE_MANAGED_IRQ, /* Inverse of boot-time nohz_full= or isolcpus=nohz arguments */ HK_TYPE_KERNEL_NOISE, + HK_TYPE_IO_QUEUE, HK_TYPE_MAX, /* diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c index ef152d401fe20..3406e3024fd43 100644 --- a/kernel/sched/isolation.c +++ b/kernel/sched/isolation.c @@ -16,6 +16,7 @@ enum hk_flags { HK_FLAG_DOMAIN = BIT(HK_TYPE_DOMAIN), HK_FLAG_MANAGED_IRQ = BIT(HK_TYPE_MANAGED_IRQ), HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE), + HK_FLAG_IO_QUEUE = BIT(HK_TYPE_IO_QUEUE), }; DEFINE_STATIC_KEY_FALSE(housekeeping_overridden); @@ -340,6 +341,12 @@ static int __init housekeeping_isolcpus_setup(char *str) continue; } + if (!strncmp(str, "io_queue,", 9)) { + str += 9; + flags |= HK_FLAG_IO_QUEUE; + continue; + } + /* * Skip unknown sub-parameter and validate that it is not * containing an invalid character. From f81b2c5828bf72e2d9a5e02bfc08294aa11e776c Mon Sep 17 00:00:00 2001 From: Daniel Wagner Date: Thu, 21 May 2026 19:29:53 -0400 Subject: [PATCH 5/8] blk-mq: use hk cpus only when isolcpus=io_queue is enabled Extend the capabilities of the generic CPU to hardware queue (hctx) mapping code, so it maps housekeeping CPUs and isolated CPUs to the hardware queues evenly. A hctx is only operational when there is at least one online housekeeping CPU assigned (aka active_hctx). Thus, check in the final mapping that no hctx has only offline housekeeping CPUs and online isolated CPUs.
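The usability rule above can be sketched in user space as follows. This is only an illustrative model, not the kernel implementation: the cpu_bits type, cpu_in() and mapping_is_usable() are hypothetical names, CPU masks are modelled as 64-bit words, and queue indices are relative to the map (no queue_offset).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, so at most 64 CPUs. */
typedef uint64_t cpu_bits;

static bool cpu_in(cpu_bits mask, unsigned int cpu)
{
	return (mask >> cpu) & 1u;
}

/*
 * Return true if every queue that has an online CPU mapped to it also
 * has at least one online housekeeping CPU mapped to it.
 * mq_map[cpu] holds the queue index assigned to each CPU.
 */
bool mapping_is_usable(const unsigned int *mq_map, unsigned int nr_cpus,
		       unsigned int nr_queues, cpu_bits online, cpu_bits hk)
{
	assert(nr_cpus <= 64);

	for (unsigned int queue = 0; queue < nr_queues; queue++) {
		bool active = false, has_online = false;

		for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
			if (mq_map[cpu] != queue || !cpu_in(online, cpu))
				continue;
			has_online = true;
			if (cpu_in(hk, cpu))
				active = true;
		}

		/* Online isolated CPUs with no online housekeeping CPU. */
		if (has_online && !active)
			return false;
	}
	return true;
}
```

A queue whose CPUs are all offline is accepted, which matches the intent that a mapping is only rejected when an online isolated CPU would be left without a serving housekeeping CPU.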
Example mapping result:

  16 online CPUs

  isolcpus=io_queue,2-3,6-7,12-13

Queue mapping:
        hctx0: default 0 2
        hctx1: default 1 3
        hctx2: default 4 6
        hctx3: default 5 7
        hctx4: default 8 12
        hctx5: default 9 13
        hctx6: default 10
        hctx7: default 11
        hctx8: default 14
        hctx9: default 15

IRQ mapping:
        irq 42 affinity 0 effective 0 nvme0q0
        irq 43 affinity 0 effective 0 nvme0q1
        irq 44 affinity 1 effective 1 nvme0q2
        irq 45 affinity 4 effective 4 nvme0q3
        irq 46 affinity 5 effective 5 nvme0q4
        irq 47 affinity 8 effective 8 nvme0q5
        irq 48 affinity 9 effective 9 nvme0q6
        irq 49 affinity 10 effective 10 nvme0q7
        irq 50 affinity 11 effective 11 nvme0q8
        irq 51 affinity 14 effective 14 nvme0q9
        irq 52 affinity 15 effective 15 nvme0q10

A corner case is when the number of online CPUs and present CPUs
differ and the driver asks for fewer queues than online CPUs, e.g.

  8 online CPUs, 16 possible CPUs

  isolcpus=io_queue,2-3,6-7,12-13
  virtio_blk.num_request_queues=2

Queue mapping:
        hctx0: default 0 1 2 3 4 5 6 7 8 12 13
        hctx1: default 9 10 11 14 15

IRQ mapping:
        irq 27 affinity 0 effective 0 virtio0-config
        irq 28 affinity 0-1,4-5,8 effective 5 virtio0-req.0
        irq 29 affinity 9-11,14-15 effective 0 virtio0-req.1

Noteworthy is that for the normal/default configuration (!isolcpus)
the mapping will change for systems which have non-hyperthreading
CPUs. The main assignment loop relies entirely on
group_mask_cpus_evenly() to do the right thing.
The old code would distribute the CPUs linearly over the hardware
contexts:

queue mapping for /dev/nvme0n1
        hctx0: default 0 8
        hctx1: default 1 9
        hctx2: default 2 10
        hctx3: default 3 11
        hctx4: default 4 12
        hctx5: default 5 13
        hctx6: default 6 14
        hctx7: default 7 15

The new code assigns each hardware context the map generated by
group_mask_cpus_evenly():

queue mapping for /dev/nvme0n1
        hctx0: default 0 1
        hctx1: default 2 3
        hctx2: default 4 5
        hctx3: default 6 7
        hctx4: default 8 9
        hctx5: default 10 11
        hctx6: default 12 13
        hctx7: default 14 15

In case of hyperthreading CPUs, the resulting map stays the same.

Signed-off-by: Daniel Wagner
[atomlin:
  - Updated blk_mq_validate() to use test_bit() for the new bitmap
  - Replaced __free cleanups with traditional goto unwinding to align
    with subsystem styling
  - Updated blk_mq_map_fallback() to use qmap->queue_offset ensuring
    secondary maps do not incorrectly route to the primary default map
  - Added a bitmap_empty() check to prevent out-of-bounds CPU routing
    when all mapped CPUs are offline
  - Migrated active_hctx to a dynamically sized bitmap to fix an
    out-of-bounds write when hardware queues exceed the system CPU count
  - Fixed absolute vs.
relative hardware queue index mix-up in blk_mq_map_queues() and validation checks - Fixed typographical errors - Reduced stack frame size of blk_mq_num_queues() - Resolved a TOCTOU race against CPU hotplug events by snapshotting cpu_online_mask to ensure mapping and validation phases agree - Corrected a loop overwrite bug in blk_mq_map_queues() by iterating directly over masks to prevent orphaned queues from being activated - Restored topology-aware multi-queue fallback in blk_mq_map_hw_queues() by correctly routing missing IRQ affinity masks to the map_software path instead of the naive fallback - Fixed a silent validation bypass in blk_mq_map_hw_queues() caused by overlapping IRQ affinity masks by evaluating the active_hctx bitmap in a secondary pass - Hardened isolation logic in blk_mq_map_hw_queues() to require online housekeeping CPUs before marking a hardware queue as active - Enforce safe fallback of 1 when the intersection evaluates to 0] Signed-off-by: Aaron Tomlin --- block/blk-mq-cpumap.c | 238 ++++++++++++++++++++++++++++++++++++++---- 1 file changed, 220 insertions(+), 18 deletions(-) diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c index 705da074ad6c7..efb02655f59ec 100644 --- a/block/blk-mq-cpumap.c +++ b/block/blk-mq-cpumap.c @@ -22,8 +22,15 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask, { unsigned int num; - num = cpumask_weight(mask); - return min_not_zero(num, max_queues); + if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) + num = cpumask_weight_and(mask, housekeeping_cpumask(HK_TYPE_IO_QUEUE)); + else + num = cpumask_weight(mask); + /* + * Ensure that a count of zero does not inadvertently result in + * allocating the maximum number of queues. + */ + return min_not_zero(num ?: 1U, max_queues); } /** @@ -33,7 +40,8 @@ static unsigned int blk_mq_num_queues(const struct cpumask *mask, * ignored. * * Calculates the number of queues to be used for a multiqueue - * device based on the number of possible CPUs. 
+ * device based on the number of possible CPUs. This helper + * takes isolcpus settings into account. */ unsigned int blk_mq_num_possible_queues(unsigned int max_queues) { @@ -48,7 +56,8 @@ EXPORT_SYMBOL_GPL(blk_mq_num_possible_queues); * ignored. * * Calculates the number of queues to be used for a multiqueue - * device based on the number of online CPUs. + * device based on the number of online CPUs. This helper + * takes isolcpus settings into account. */ unsigned int blk_mq_num_online_queues(unsigned int max_queues) { @@ -56,23 +65,139 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues) } EXPORT_SYMBOL_GPL(blk_mq_num_online_queues); +static bool blk_mq_validate(struct blk_mq_queue_map *qmap, + const unsigned long *active_hctx, + const struct cpumask *online_mask) +{ + /* + * Verify if the mapping is usable when housekeeping + * configuration is enabled + */ + for (int queue = 0; queue < qmap->nr_queues; queue++) { + int cpu; + + if (test_bit(queue, active_hctx)) { + /* + * This hctx has at least one online CPU thus it + * is able to serve any assigned isolated CPU. + */ + continue; + } + + /* + * There is no housekeeping online CPU for this hctx, all + * good as long as all non-housekeeping CPUs are also + * offline. + */ + for_each_cpu(cpu, online_mask) { + if (qmap->mq_map[cpu] != qmap->queue_offset + queue) + continue; + + pr_warn("Unable to create a usable CPU-to-queue mapping with the given constraints\n"); + return false; + } + } + + return true; +} + +static void blk_mq_map_fallback(struct blk_mq_queue_map *qmap) +{ + unsigned int cpu; + + /* + * Map all CPUs to the first hctx of this specific map to ensure + * at least one online CPU is serving it, respecting the map's + * boundaries so secondary maps do not route into the default map. 
+ */ + for_each_possible_cpu(cpu) + qmap->mq_map[cpu] = qmap->queue_offset; +} + void blk_mq_map_queues(struct blk_mq_queue_map *qmap) { - const struct cpumask *masks; + struct cpumask *masks; + const struct cpumask *constraint; unsigned int queue, cpu, nr_masks; + unsigned long *active_hctx; + cpumask_var_t online_mask; - masks = group_cpus_evenly(qmap->nr_queues, &nr_masks); - if (!masks) { - for_each_possible_cpu(cpu) - qmap->mq_map[cpu] = qmap->queue_offset; - return; - } + active_hctx = bitmap_zalloc(qmap->nr_queues, GFP_KERNEL); + if (!active_hctx) + goto fallback; - for (queue = 0; queue < qmap->nr_queues; queue++) { - for_each_cpu(cpu, &masks[queue % nr_masks]) + if (!alloc_cpumask_var(&online_mask, GFP_KERNEL)) + goto free_fallback_hctx; + + /* + * Snapshot online CPUs to prevent TOCTOU races between the + * mapping phase and the validation phase. + */ + cpumask_copy(online_mask, cpu_online_mask); + + if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) + constraint = housekeeping_cpumask(HK_TYPE_IO_QUEUE); + else + constraint = cpu_possible_mask; + + /* Map CPUs to the hardware contexts (hctx) */ + masks = group_mask_cpus_evenly(qmap->nr_queues, constraint, &nr_masks); + if (!masks) + goto free_fallback; + + /* + * Iterate directly over the generated CPU masks. + * Calculate the final, highest hardware queue index that maps to this + * mask. This skips all intermediate overwrites and safely evaluates + * active_hctx only for queues that survive the mapping. + */ + for (unsigned int idx = 0; idx < nr_masks; idx++) { + bool active = false; + queue = qmap->nr_queues - 1 - + ((qmap->nr_queues - 1 - idx) % nr_masks); + + for_each_cpu(cpu, &masks[idx]) { qmap->mq_map[cpu] = qmap->queue_offset + queue; + + if (!active && cpumask_test_cpu(cpu, online_mask)) { + __set_bit(queue, active_hctx); + active = true; + } + } + } + + /* + * If all CPUs in the generated masks are offline, the active_hctx + * bitmap will be empty. 
Attempting to route unassigned CPUs to an + * empty bitmap will map them out-of-bounds. Fall back instead. + */ + if (bitmap_empty(active_hctx, qmap->nr_queues)) + goto free_fallback; + + /* Map any unassigned CPU evenly to the hardware contexts (hctx) */ + queue = find_first_bit(active_hctx, qmap->nr_queues); + for_each_cpu_andnot(cpu, cpu_possible_mask, constraint) { + qmap->mq_map[cpu] = qmap->queue_offset + queue; + queue = find_next_bit_wrap(active_hctx, qmap->nr_queues, queue + 1); } + + if (!blk_mq_validate(qmap, active_hctx, online_mask)) + goto free_fallback; + kfree(masks); + free_cpumask_var(online_mask); + bitmap_free(active_hctx); + + return; + +free_fallback: + kfree(masks); + free_cpumask_var(online_mask); +free_fallback_hctx: + bitmap_free(active_hctx); + +fallback: + blk_mq_map_fallback(qmap); } EXPORT_SYMBOL_GPL(blk_mq_map_queues); @@ -109,24 +234,101 @@ void blk_mq_map_hw_queues(struct blk_mq_queue_map *qmap, struct device *dev, unsigned int offset) { - const struct cpumask *mask; + cpumask_var_t mask, online_mask; + const struct cpumask *constraint; + unsigned long *active_hctx; unsigned int queue, cpu; if (!dev->bus->irq_get_affinity) + goto map_software; + + active_hctx = bitmap_zalloc(qmap->nr_queues, GFP_KERNEL); + if (!active_hctx) + goto fallback; + + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) { + bitmap_free(active_hctx); goto fallback; + } + + if (!alloc_cpumask_var(&online_mask, GFP_KERNEL)) + goto free_fallback_mask; + + if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) + constraint = housekeeping_cpumask(HK_TYPE_IO_QUEUE); + else + constraint = cpu_possible_mask; + /* + * Snapshot online CPUs to prevent TOCTOU races between the + * mapping phase and the validation phase. 
+ */ + cpumask_copy(online_mask, cpu_online_mask); + + /* Map CPUs to the hardware contexts (hctx) */ for (queue = 0; queue < qmap->nr_queues; queue++) { - mask = dev->bus->irq_get_affinity(dev, queue + offset); - if (!mask) - goto fallback; + const struct cpumask *affinity_mask; + + affinity_mask = dev->bus->irq_get_affinity(dev, offset + queue); + if (!affinity_mask) + goto free_map_software; - for_each_cpu(cpu, mask) + for_each_cpu(cpu, affinity_mask) { qmap->mq_map[cpu] = qmap->queue_offset + queue; + cpumask_set_cpu(cpu, mask); + } + } + + /* + * Evaluate active_hctx after mapping to handle overlapping masks. + * This ensures queues that were overwritten do not falsely pass validation. + */ + for_each_cpu(cpu, mask) { + if (cpumask_test_cpu(cpu, online_mask) && + cpumask_test_cpu(cpu, constraint)) { + queue = qmap->mq_map[cpu] - qmap->queue_offset; + __set_bit(queue, active_hctx); + } + } + + /* + * If all CPUs assigned to this map are offline, the bitmap will + * be empty. Fall back instead of routing out of bounds. 
+ */ + if (bitmap_empty(active_hctx, qmap->nr_queues)) + goto free_fallback; + + /* Map any unassigned CPU evenly to the hardware contexts (hctx) */ + queue = find_first_bit(active_hctx, qmap->nr_queues); + for_each_cpu_andnot(cpu, cpu_possible_mask, mask) { + qmap->mq_map[cpu] = qmap->queue_offset + queue; + queue = find_next_bit_wrap(active_hctx, qmap->nr_queues, queue + 1); } + if (!blk_mq_validate(qmap, active_hctx, online_mask)) + goto free_fallback; + + bitmap_free(active_hctx); + free_cpumask_var(mask); + free_cpumask_var(online_mask); + return; +free_fallback: + free_cpumask_var(online_mask); +free_fallback_mask: + bitmap_free(active_hctx); + free_cpumask_var(mask); + fallback: + blk_mq_map_fallback(qmap); + return; + +free_map_software: + free_cpumask_var(online_mask); + free_cpumask_var(mask); + bitmap_free(active_hctx); +map_software: blk_mq_map_queues(qmap); } EXPORT_SYMBOL_GPL(blk_mq_map_hw_queues); From 030aa913e7a83b0fb8e042bf34409ea2a04c9d1d Mon Sep 17 00:00:00 2001 From: Daniel Wagner Date: Thu, 21 May 2026 19:29:54 -0400 Subject: [PATCH 6/8] blk-mq: prevent offlining hk CPUs with associated online isolated CPUs When isolcpus=io_queue is enabled and the last housekeeping CPU for a given hctx goes offline, no CPU would be left to handle I/O. To prevent I/O stalls, disallow offlining housekeeping CPUs that are still serving isolated CPUs. 
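The offline rule described above can be modelled in user space roughly as follows. This is a sketch under stated assumptions, not the kernel code: cpu_bits, cpu_in() and can_offline_cpu() are hypothetical names, CPU masks are 64-bit words, and mq_map[] stands in for the per-CPU lookup that the patch performs with blk_mq_map_queue_type().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, so at most 64 CPUs. */
typedef uint64_t cpu_bits;

static bool cpu_in(cpu_bits mask, unsigned int cpu)
{
	return (mask >> cpu) & 1u;
}

/*
 * Decide whether this_cpu may go offline with respect to queue hctx.
 * mq_map[cpu] holds the queue index assigned to each CPU.
 */
bool can_offline_cpu(const unsigned int *mq_map, unsigned int nr_cpus,
		     unsigned int hctx, unsigned int this_cpu,
		     cpu_bits online, cpu_bits hk)
{
	bool isolated_left = false;

	assert(nr_cpus <= 64 && this_cpu < nr_cpus);

	/* Isolated CPUs never serve others, so they may always leave. */
	if (!cpu_in(hk, this_cpu))
		return true;

	/* A CPU not mapped to this queue does not affect its routing. */
	if (mq_map[this_cpu] != hctx)
		return true;

	for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
		if (cpu == this_cpu || !cpu_in(online, cpu) ||
		    mq_map[cpu] != hctx)
			continue;

		/* Another online housekeeping CPU keeps the queue served. */
		if (cpu_in(hk, cpu))
			return true;

		isolated_left = true;
	}

	/* Refuse only if an online isolated CPU would be stranded. */
	return !isolated_left;
}
```

With the 16-CPU example mapping from patch 5, offlining CPU 0 would be refused while CPU 2 is online, because CPU 2 is isolated and shares hctx0 with no other housekeeping CPU.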
Signed-off-by: Daniel Wagner Reviewed-by: Hannes Reinecke [atomlin: - Removed duplicate paragraph from commit message - Allow offlining of non-housekeeping CPUs - Fix logic flaw that prematurely rejected valid offline requests - Iterated over cpu_online_mask and manually reverse-mapped CPUs to correctly detect isolated CPUs, as blk_mq_map_swqueue() intentionally prunes them from hctx->cpumask - Drop hctx->queue->disk->disk_name from warning to avoid UAF bug - Ensure isolation constraints are only enforced for CPUs actively mapped to the evaluated hardware queue - Correct pr_warn format specifier] Signed-off-by: Aaron Tomlin --- block/blk-mq.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 55 insertions(+) diff --git a/block/blk-mq.c b/block/blk-mq.c index 28c2d931e75ea..09ba99bf99187 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -3723,6 +3723,56 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx) return data.has_rq; } +static bool blk_mq_hctx_can_offline_hk_cpu(struct blk_mq_hw_ctx *hctx, + unsigned int this_cpu) +{ + const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE); + int cpu, fallback_isolated_cpu = -1; + + /* + * If the CPU being offlined is not a housekeeping CPU, + * offlining it will not strand isolated CPUs. Allow it. + */ + if (!cpumask_test_cpu(this_cpu, hk_mask)) + return true; + /* + * If this CPU is not mapped to this specific hardware context, + * offlining it will not affect the context's I/O routing. Allow it. + */ + if (blk_mq_map_queue_type(hctx->queue, hctx->type, this_cpu) != hctx) + return true; + /* + * Iterate over all online CPUs and manually check their mapping. + * We cannot use hctx->cpumask here because blk_mq_map_swqueue() + * intentionally strips isolated CPUs from it to prevent kworker + * routing. 
+ */ + for_each_online_cpu(cpu) { + struct blk_mq_hw_ctx *h; + + if (cpu == this_cpu) + continue; + + h = blk_mq_map_queue_type(hctx->queue, hctx->type, cpu); + if (h != hctx) + continue; + + if (cpumask_test_cpu(cpu, hk_mask)) + return true; + + if (fallback_isolated_cpu == -1) + fallback_isolated_cpu = cpu; + } + + if (fallback_isolated_cpu != -1) { + pr_warn("blk-mq: trying to offline hctx%u but online isolated CPU %d is still mapped to it\n", + hctx->queue_num, fallback_isolated_cpu); + return false; + } + + return true; +} + static bool blk_mq_hctx_has_online_cpu(struct blk_mq_hw_ctx *hctx, unsigned int this_cpu) { @@ -3755,6 +3805,11 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node) struct blk_mq_hw_ctx, cpuhp_online); int ret = 0; + if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) { + if (!blk_mq_hctx_can_offline_hk_cpu(hctx, cpu)) + return -EINVAL; + } + if (!hctx->nr_ctx || blk_mq_hctx_has_online_cpu(hctx, cpu)) return 0; From d790d5499684659709a54c04d1b422b878592787 Mon Sep 17 00:00:00 2001 From: Aaron Tomlin Date: Thu, 21 May 2026 19:29:55 -0400 Subject: [PATCH 7/8] genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs At present, the managed interrupt spreading algorithm distributes vectors across all available CPUs within a given node or system. On systems employing CPU isolation (e.g., "isolcpus=io_queue"), this behaviour defeats the primary purpose of isolation by routing hardware interrupts (such as NVMe completion queues) directly to isolated cores. Update irq_create_affinity_masks() to respect the housekeeping CPU mask. By passing the HK_TYPE_IO_QUEUE mask directly to the topological distribution function (group_mask_cpus_evenly()), we ensure that managed interrupts are kept strictly off isolated CPUs. This patch additionally addresses the architectural constraints of restricted vector distribution: 1. 
Vector Limits and Overrides: Updated irq_calc_affinity_vectors() to strictly bound the maximum number of allocated vectors to the weight of the housekeeping mask. This correctly overrides drivers providing a calc_sets() callback, preventing them from wasting memory on dead hardware queues that cannot be routed to isolated CPUs. 2. Multi-set Alignment and Leak Prevention: When isolation constraints result in fewer available masks than requested vectors for a given set, the remaining vector slots are padded with the housekeeping mask. This replaces the historical irq_default_affinity padding, ensuring excess managed queues do not leak interrupts onto isolated CPUs. 3. Minimum Vector Safety Net: To prevent fatal -ENOSPC device probe aborts on heavily isolated systems (where the housekeeping CPU count might be lower than a device's structural minimum), the final vector calculation is safeguarded to never drop below minvec. Queues will safely share the available housekeeping CPUs instead of failing the probe. 4. Zero Overhead: The housekeeping mask is conditionally assigned via a direct pointer, completely avoiding temporary mask allocations (e.g., alloc_cpumask_var) and bitwise operations when CPU isolation is disabled. This guarantees zero performance or memory overhead for standard configurations. 
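The vector bound from points 1 and 3 can be illustrated with a small user-space sketch. It models only the branch taken when io_queue isolation is enabled; calc_affinity_vectors() and its hk_cpus parameter (the weight of the housekeeping mask) are hypothetical names for illustration, and it assumes resv <= minvec <= maxvec on the non-rejected path.

```c
#include <assert.h>

/*
 * Model of the vector count when isolation is enabled: reserve resv
 * vectors, spread at most hk_cpus vectors over the remaining budget,
 * and never return less than minvec.
 */
unsigned int calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
				   unsigned int resv, unsigned int hk_cpus)
{
	unsigned int avail, set_vecs, total;

	/* Mirrors the early rejection: too many reserved vectors. */
	if (resv > minvec)
		return 0;

	assert(maxvec >= minvec);

	avail = maxvec - resv;
	set_vecs = hk_cpus < avail ? hk_cpus : avail;
	total = resv + set_vecs;

	/* Safety net: queues share housekeeping CPUs rather than fail. */
	return total > minvec ? total : minvec;
}
```

For instance, with minvec = 8, maxvec = 17, resv = 1 and only two housekeeping CPUs, the unclamped result would be 3, so the function returns 8 and several vectors end up sharing the two housekeeping CPUs.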
Signed-off-by: Aaron Tomlin --- kernel/irq/affinity.c | 31 +++++++++++++++++++++++-------- 1 file changed, 23 insertions(+), 8 deletions(-) diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index 78f2418a89252..dade92f8b4b37 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -8,6 +8,7 @@ #include #include #include +#include static void default_calc_sets(struct irq_affinity *affd, unsigned int affvecs) { @@ -25,8 +26,10 @@ static void default_calc_sets(struct irq_affinity *affd, unsigned int affvecs) struct irq_affinity_desc * irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd) { - unsigned int affvecs, curvec, usedvecs, i; + unsigned int affvecs, curvec, usedvecs, i, j; struct irq_affinity_desc *masks = NULL; + const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE); + bool hk_enabled = housekeeping_enabled(HK_TYPE_IO_QUEUE); /* * Determine the number of vectors which need interrupt affinities @@ -70,19 +73,29 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd) */ for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) { unsigned int nr_masks, this_vecs = affd->set_size[i]; - struct cpumask *result = group_cpus_evenly(this_vecs, &nr_masks); + struct cpumask *result; + const struct cpumask *mask; + if (hk_enabled) + mask = hk_mask; + else + mask = cpu_possible_mask; + + result = group_mask_cpus_evenly(this_vecs, mask, + &nr_masks); if (!result) { kfree(masks); return NULL; } - - for (int j = 0; j < nr_masks; j++) + for (j = 0; j < nr_masks; j++) cpumask_copy(&masks[curvec + j].mask, &result[j]); + for (j = nr_masks; j < this_vecs; j++) + cpumask_copy(&masks[curvec + j].mask, mask); + kfree(result); - curvec += nr_masks; - usedvecs += nr_masks; + curvec += this_vecs; + usedvecs += this_vecs; } /* Fill out vectors at the end that don't need affinity */ @@ -115,10 +128,12 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec, if (resv > minvec) return 0; - if 
(affd->calc_sets) + if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) + set_vecs = cpumask_weight(housekeeping_cpumask(HK_TYPE_IO_QUEUE)); + else if (affd->calc_sets) set_vecs = maxvec - resv; else set_vecs = cpumask_weight(cpu_possible_mask); - return resv + min(set_vecs, maxvec - resv); + return max(minvec, resv + min(set_vecs, maxvec - resv)); } From 5c030f481df6524ac571ab969f8fb8fb40d2aff4 Mon Sep 17 00:00:00 2001 From: Daniel Wagner Date: Thu, 21 May 2026 19:29:56 -0400 Subject: [PATCH 8/8] docs: add io_queue flag to isolcpus The io_queue flag informs multiqueue device drivers where to place hardware queues. Document this new flag in the isolcpus command-line argument description. Signed-off-by: Daniel Wagner Reviewed-by: Hannes Reinecke [atomlin: - Refined io_queue kernel parameter documentation - Removed an inaccurate claim in the documentation stating that io_queue takes precedence over managed_irq] Signed-off-by: Aaron Tomlin --- .../admin-guide/kernel-parameters.txt | 26 ++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 4d0f545fb3ec5..fb828bb60b9e8 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2792,7 +2792,6 @@ Kernel parameters "number of CPUs in system - 1". managed_irq - Isolate from being targeted by managed interrupts which have an interrupt mask containing isolated CPUs. The affinity of managed interrupts is @@ -2815,6 +2814,31 @@ Kernel parameters housekeeping CPUs has no influence on those queues. + io_queue + Applicable to managed IRQs only. Restrict + multiqueue hardware queue allocation to online + housekeeping CPUs. This guarantees that all + managed hardware completion interrupts are routed + exclusively to housekeeping cores, shielding + isolated CPUs from I/O interruptions even if they + initiated the request. 
+ + Note: Using io_queue restricts the number of + allocated hardware queues to match the number of + housekeeping CPUs. This prevents MSI-X vector + exhaustion and forces isolated CPUs to share + submission queues. + + Note: Offlining housekeeping CPUs which serve + isolated CPUs will fail. The isolated CPUs must + be offlined before offlining the housekeeping + CPUs. + + Note: When I/O is submitted by an application on + an isolated CPU, the hardware completion + interrupt is handled entirely by a housekeeping + CPU. + The format of is described above. iucv= [HW,NET]