Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 15 additions & 9 deletions src/coreclr/jit/rationalize.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -638,7 +638,8 @@ void Rationalizer::RewriteHWIntrinsicBlendv(GenTree** use, Compiler::GenTreeStac
return;
}

GenTree* op2 = node->Op(2);
GenTree* op2 = node->Op(2);
GenTree*& op3 = node->Op(3);

// We're in the post-order visit and are traversing in execution order, so
// everything between op2 and node will have already been rewritten to LIR
Expand All @@ -655,20 +656,25 @@ void Rationalizer::RewriteHWIntrinsicBlendv(GenTree** use, Compiler::GenTreeStac

if (op2->isEmbeddedMaskingCompatible(m_compiler, tgtMaskSize, tgtSimdBaseType))
{
// We are going to utilize the embedded mask, so we don't need to rewrite. However,
// we want to fixup the simdBaseType here since it simplifies lowering and allows
// both embedded broadcast and the mask to be live simultaneously.
// Make sure we had a mask to begin with. We don't want to create a mask
// solely for the purpose of embedding it.
Comment on lines +659 to +660
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs an elaboration covering why as it's operating under the presumption that vpmov*2m is more expensive than vblendvps, which can depend on the hardware and may change in the future

For example, this is what we have today where it may be slower or may be similar perf (using XMM/YMM/ZMM and vpmov*2m vs vblendvps):

  • AMD Zen4: 3/4/5 vs 1
  • AMD Zen5: 3/4/6 vs 2
  • Intel Skylake-X: 3 vs 1-2
  • Intel Emerald Rapids: 3 vs 2-3

Then there is also the case where we don't have a vpblendvw and vpblendvb has slightly different behavior here, so it might not be valid to switch back. Consider that 0x8000 selects a whole word but needs to be 0x8080 if using vpblendvb.


if (tgtSimdBaseType != TYP_UNDEF)
if (!op3->OperIsHWIntrinsic() ||
(op3->AsHWIntrinsic()->GetHWIntrinsicId() != NI_AVX512_ConvertVectorToMask))
Comment on lines +662 to +663
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ConvertVectorToMask check can be simplified/standardized by using the existing helper OperIsConvertVectorToMask() instead of checking OperIsHWIntrinsic() + GetHWIntrinsicId(). This avoids duplicating intrinsic IDs and reads more clearly.

Suggested change
if (!op3->OperIsHWIntrinsic() ||
(op3->AsHWIntrinsic()->GetHWIntrinsicId() != NI_AVX512_ConvertVectorToMask))
if (!op3->OperIsConvertVectorToMask())

Copilot uses AI. Check for mistakes.
{
Comment on lines +660 to 664
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes when BlendVariableMask is rewritten back to BlendVariable (and can eliminate a ConvertVectorToMask), but I couldn't find an existing JIT codegen test covering this scenario (no BlendVariableMask hits under src/tests). Consider adding an asm pattern test to prevent regressions for Vector128/256 cases.

Copilot uses AI. Check for mistakes.
op2->AsHWIntrinsic()->SetSimdBaseType(tgtSimdBaseType);
// We are going to utilize the embedded mask, so we don't need to rewrite. However,
// we want to fixup the simdBaseType here since it simplifies lowering and allows
// both embedded broadcast and the mask to be live simultaneously.

if (tgtSimdBaseType != TYP_UNDEF)
{
op2->AsHWIntrinsic()->SetSimdBaseType(tgtSimdBaseType);
}
return;
}
return;
}
}

GenTree*& op3 = node->Op(3);

if (!ShouldRewriteToNonMaskHWIntrinsic(op3))
{
return;
Expand Down
Loading