AVX swizzle broadcast and swap optimization #1213
Merged
serge-sans-paille merged 5 commits into xtensor-stack:master (Nov 20, 2025)
Conversation
constexpr batch_bool_constant<uint32_t, A, (V0 >= 4), (V1 >= 4), (V2 >= 4), (V3 >= 4), (V4 >= 4), (V5 >= 4), (V6 >= 4), (V7 >= 4)> lane_mask {};
// select lane by the mask index divided by 4
constexpr auto lane = batch_constant<uint32_t, A, 0, 0, 0, 0, 1, 1, 1, 1> {};
constexpr int lane_idx = ((mask / make_batch_constant<uint32_t, 4, A>()) != lane).mask();
Contributor
I have difficulty seeing how the former `lane_mask = V_i >= 4` is equivalent to `V_i / 4 != lane[i]`.
Why isn't it just `lane_mask >= make_batch_constant<uint32_t, 4, A>()`?
Contributor (Author)
Because `r0` and `r1` do not contain the same values as before:
- before: `r0` contains items from the low half in both lanes and `r1` contains items from the high half in both lanes
- after: each `r0` lane contains items from its own lane while each `r1` lane contains items from the other lane

For instance, before the change an index 0 in the second lane must be selected from `r0` (low values), while after it must be selected from `r1` (other lane).
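For illustration, here is a minimal scalar sketch of that selection rule (an editor's example with assumed indices `V`, not the xsimd implementation itself): output element `i` is taken from `r0` when `V[i] / 4 == i / 4` and from `r1` otherwise.

```cpp
// Editor's sketch: a scalar model of the lane-selection rule (assumed example
// values, not the actual AVX code path). Two 4-element "lanes" stand in for
// the 128-bit halves of a 256-bit register.
#include <array>
#include <cassert>

int main()
{
    std::array<int, 8> src = { 10, 11, 12, 13, 14, 15, 16, 17 };
    std::array<unsigned, 8> V = { 4, 1, 6, 3, 0, 5, 2, 7 }; // assumed swizzle indices

    // "After" layout from the comment above:
    //   r0[i]: element V[i] % 4 taken from i's own lane,
    //   r1[i]: element V[i] % 4 taken from the other lane.
    std::array<int, 8> r0, r1, dst;
    for (unsigned i = 0; i < 8; ++i)
    {
        unsigned lane = i / 4;   // destination lane of element i
        unsigned sub = V[i] % 4; // in-lane position of the wanted element
        r0[i] = src[lane * 4 + sub];
        r1[i] = src[(1 - lane) * 4 + sub];
    }

    for (unsigned i = 0; i < 8; ++i)
    {
        // New test: the source lane (V[i] / 4) differs from the destination
        // lane (i / 4) -> take r1, otherwise take r0. The old "V[i] >= 4" test
        // only works when r0/r1 hold the low/high halves in *both* lanes.
        dst[i] = (V[i] / 4 != i / 4) ? r1[i] : r0[i];
        assert(dst[i] == src[V[i]]);
    }
}
```

The assertion checks that the selection reproduces `src[V[i]]` for every output position.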
Contributor
And this saves a few permutes, perfect!
constexpr batch_bool_constant<uint64_t, A, (V0 >= 2), (V1 >= 2), (V2 >= 2), (V3 >= 2)> blend_mask;
// select lane by the mask index divided by 2
constexpr auto lane = batch_constant<uint64_t, A, 0, 0, 1, 1> {};
constexpr int lane_idx = ((mask / make_batch_constant<uint64_t, 2, A>()) != lane).mask();
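For the 4 × 64-bit variant the same arithmetic applies with a lane width of 2. As a worked example with assumed indices `(V0, V1, V2, V3) = (0, 3, 2, 1)`: `mask / 2 = (0, 1, 1, 0)` and `lane = (0, 0, 1, 1)`, so the comparison `(mask / 2) != lane` gives `(0, 1, 0, 1)`, i.e. `lane_idx = 0b1010` (assuming `.mask()` packs element 0 into bit 0), while `blend_mask = (V_i >= 2) = (0, 1, 1, 0)`.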
Similar optimizations to #1201 but for AVX