
[ET-VK][conv1d] Implement height-packed pointwise conv1d operator #18547

Merged
SS-JIA merged 3 commits into main from gh/SS-JIA/494/orig on Mar 27, 2026
Conversation

@pytorchbot
Collaborator

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #18332 by @SS-JIA
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/494/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/494/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/494/orig
Differential Revision: D97344092
@diff-train-skip-merge

Pull Request resolved: #18332

Implement a new conv1d pointwise (kernel_size=1) operator using height-packed
layout where channels are the packed dimension (WHCN dim 1). This enables
dot-product reduction over input channels: each vec4 load gives 4 consecutive
channel values, yielding 4 MACs per dot() instruction.
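
For intuition, a pointwise conv1d is just a per-position dot product over input channels. The following is a minimal NumPy sketch (not the shader itself) that mirrors the 4-wide channel grouping; the function name and structure are illustrative only:

```python
import numpy as np

# Illustrative reference for pointwise conv1d: each output element is a dot
# product over input channels, accumulated in groups of 4 to mirror vec4 loads.
def conv1d_pw_reference(x, w, bias=None):
    # x: [N, C_in, L], w: [C_out, C_in, 1] with kernel_size == 1
    N, C_in, L = x.shape
    C_out = w.shape[0]
    assert C_in % 4 == 0, "channels assumed 4-aligned, matching the packed layout"
    out = np.zeros((N, C_out, L), dtype=x.dtype)
    for n in range(N):
        for oc in range(C_out):
            for l in range(L):
                acc = 0.0
                # Each iteration models one vec4 load + dot(): 4 MACs at a time.
                for ic4 in range(0, C_in, 4):
                    acc += np.dot(x[n, ic4:ic4 + 4, l], w[oc, ic4:ic4 + 4, 0])
                out[n, oc, l] = acc + (bias[oc] if bias is not None else 0.0)
    return out
```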

Uses tiled computation with the FP tile infrastructure from linear/matmul
(FPInputTile, FPWeightTile, FPOutTile, fp_accumulate_with_fp_weight) and
4OC×4IC blocked weight packing via pack_fp_linear_weight.glsl for
cache-friendly texture2d weight reads. tile_m is selected adaptively (4/2/1 rows)
based on GPU occupancy.
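
A sketch of what the 4OC×4IC blocked layout looks like; the real packing is performed by pack_fp_linear_weight.glsl on the GPU, and the helper name below is hypothetical:

```python
import numpy as np

# Illustrative 4OC x 4IC blocked weight layout: each 4x4 block holds the weights
# for 4 output channels x 4 input channels, so consecutive reads walk one block.
def pack_weight_4oc_4ic(w):
    # w: [C_out, C_in, 1] pointwise weights; both channel dims assumed 4-aligned.
    C_out, C_in, _ = w.shape
    w2d = w[:, :, 0]                                   # [C_out, C_in]
    blocks = w2d.reshape(C_out // 4, 4, C_in // 4, 4)  # [OC4, 4, IC4, 4]
    return np.ascontiguousarray(blocks.transpose(0, 2, 1, 3))  # [OC4, IC4, 4, 4]
```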

Thread mapping: X=OC4 tiles, Y=L tiles, Z=batch. Each thread computes
TILE_M×TILE_N4×4 output elements. Inner loop loads input tiles and packed
weight tiles, then calls fp_accumulate_with_fp_weight for tiled FMA.
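
A rough sketch of how a compute thread's (x, y, z) id maps to its output block under that scheme; TILE_M and TILE_N4 follow the description above, and the helper itself is hypothetical:

```python
# Each thread covers TILE_M output positions x (TILE_N4 * 4) output channels
# at one batch index.
def thread_output_region(x, y, z, TILE_M, TILE_N4):
    oc_start = x * TILE_N4 * 4   # X indexes tiles of 4-channel output groups (OC4)
    l_start = y * TILE_M         # Y indexes tiles of output positions along L
    n = z                        # Z indexes the batch dimension
    return n, range(oc_start, oc_start + TILE_N4 * 4), range(l_start, l_start + TILE_M)
```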

Supports both buffer and texture3d storage for input/output, texture2d or buffer
storage for packed weights, fp32/fp16 dtypes, and an optional bias. Registered as
et_vk.conv1d_pw.default (a standalone custom op for testing/benchmarking).

Performance on Adreno 750 (S24):
- [1,256,1024]x[512,256,1] texture f16: 908 GFLOP/s
- [1,512,2048]x[256,512,1] texture f16: 865 GFLOP/s
- [1,128,4096]x[128,128,1] texture f16: 781 GFLOP/s
- [1,256,1024]x[512,256,1] buffer f16: 491 GFLOP/s
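
For context, and assuming the usual 2-FLOPs-per-MAC convention (this arithmetic is mine, not from the PR), the first shape works out to roughly:

```python
# Back-of-the-envelope FLOP count for [1,256,1024] x [512,256,1].
N, C_in, L = 1, 256, 1024        # input  [1, 256, 1024]
C_out = 512                      # weight [512, 256, 1]
flops = 2 * N * C_out * L * C_in
print(flops / 1e9)               # ~0.268 GFLOP per call
print(flops / 908e9 * 1e3)       # ~0.30 ms per call at 908 GFLOP/s
```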
ghstack-source-id: 358903218
@exported-using-ghexport

Differential Revision: [D97344092](https://our.internmc.facebook.com/intern/diff/D97344092/)
pytorchbot requested a review from SS-JIA as a code owner on March 27, 2026, 22:00
@pytorch-bot

pytorch-bot bot commented Mar 27, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18547

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 14 Pending, 2 Unrelated Failures

As of commit 8d23481 with merge base 7c79395:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label on Mar 27, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

ssjia added 2 commits on March 27, 2026, 18:21
Pull Request resolved: #18333

Implement a depthwise conv1d operator using height-packed layout where channels
are the packed dimension (WHCN dim 1). Depthwise conv applies a separate filter
to each channel independently (groups=C), so 4 channels can be processed in
parallel using element-wise vec4 FMA over kernel positions.
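
A plain NumPy reference of depthwise conv1d (groups == C) for orientation; since each channel is filtered independently, 4 adjacent channels can share one vec4 FMA per kernel position in the shader. The function name is illustrative, not part of the PR:

```python
import numpy as np

# Depthwise conv1d reference: one filter per channel, with stride/padding/dilation.
def conv1d_dw_reference(x, w, bias=None, stride=1, padding=0, dilation=1):
    # x: [N, C, L], w: [C, 1, K]
    N, C, L = x.shape
    K = w.shape[2]
    L_out = (L + 2 * padding - dilation * (K - 1) - 1) // stride + 1
    out = np.zeros((N, C, L_out), dtype=x.dtype)
    for n in range(N):
        for c in range(C):
            for lo in range(L_out):
                acc = 0.0
                for k in range(K):
                    li = lo * stride - padding + k * dilation
                    if 0 <= li < L:   # bounds check stands in for zero padding
                        acc += x[n, c, li] * w[c, 0, k]
                out[n, c, lo] = acc + (bias[c] if bias is not None else 0.0)
    return out
```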

Thread mapping: X=C/4, Y=L_out, Z=N. Each thread computes one output texel
(4 channels at one spatial position). Inner loop iterates over kernel positions
K with bounds-checked input access for padding.
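
A brief sketch of that thread-to-texel mapping (the helper is illustrative, not shader source):

```python
# One thread produces one output texel: 4 channels at one spatial position.
def dw_thread_texel(x, y, z):
    channels = range(4 * x, 4 * x + 4)   # X indexes groups of 4 channels (C/4)
    l_out = y                            # Y indexes one output position (L_out)
    n = z                                # Z indexes batch (N)
    return n, channels, l_out
```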

Weight [C,1,K] is prepacked as channels-packed so each vec4 load gives 4
channels' weights at one kernel position. Supports both buffer and texture3d
storage, fp32/fp16, optional bias, and arbitrary stride/padding/dilation.
Registered as et_vk.conv1d_dw.default (standalone custom op).
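
A sketch of the channels-packed weight layout described above: for each kernel position, the weights of 4 consecutive channels sit in one vec4. The helper below is a hypothetical host-side illustration, not the actual prepacking code:

```python
import numpy as np

# Channels-packed depthwise weights: one vec4 per (channel group, kernel position).
def pack_dw_weight_channels_packed(w):
    # w: [C, 1, K], C assumed 4-aligned
    C, _, K = w.shape
    packed = w[:, 0, :].reshape(C // 4, 4, K)                # [C4, 4, K]
    return np.ascontiguousarray(packed.transpose(0, 2, 1))   # [C4, K, 4]
```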

Performance on Adreno 750 (S24):
- [1,128,4096] K=31 buffer f16: 231 GFLOP/s
- [1,128,4096] K=31 buffer f32: 155 GFLOP/s
- [1,512,2048] K=5 buffer f32: 66 GFLOP/s
ghstack-source-id: 358903219
@exported-using-ghexport

Differential Revision: [D97344091](https://our.internmc.facebook.com/intern/diff/D97344091/)
… export pipeline

Pull Request resolved: #18334

Integrate the new height-packed conv1d_pw and conv1d_dw operators into the
aten.convolution.default dispatch path so they are automatically used during
model export.

In op_registry.py, add a pick_conv_storage function that inspects the
convolution node at partition time. For 1D convolutions where the op is
pointwise (kernel_size=1) or depthwise (groups=C_in) and channels are 4-aligned,
it selects HEIGHT_PACKED_TEXTURE for input/output instead of the default
CHANNELS_PACKED_TEXTURE. All other cases (conv2d, grouped conv1d with K>1,
unaligned channels) retain channels-packed behavior.
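
A rough sketch of that partition-time decision; the real pick_conv_storage lives in op_registry.py, and the enum strings and parameter names here are simplified stand-ins, not the actual API:

```python
# Selects a height-packed layout only for 4-aligned pointwise or depthwise conv1d.
def pick_conv_storage_sketch(kernel_size, groups, in_channels, is_conv1d):
    pointwise = is_conv1d and all(k == 1 for k in kernel_size)
    depthwise = is_conv1d and groups == in_channels
    channels_aligned = in_channels % 4 == 0
    if (pointwise or depthwise) and channels_aligned:
        return "HEIGHT_PACKED_TEXTURE"
    # conv2d, grouped conv1d with K > 1, and unaligned channels keep the default.
    return "CHANNELS_PACKED_TEXTURE"
```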

In Convolution.cpp, add a height-packed routing block at the top of the conv1d
path. When the input tensor is height-packed, it dispatches to
et_vk.conv1d_pw.default or et_vk.conv1d_dw.default via VK_GET_OP_FN. Otherwise, it
falls through to the existing channels-packed add_conv1d_node path.
ghstack-source-id: 358903217
@exported-using-ghexport

Differential Revision: [D97344090](https://our.internmc.facebook.com/intern/diff/D97344090/)
SS-JIA merged commit 24751f1 into main on Mar 27, 2026
157 of 164 checks passed
SS-JIA deleted the gh/SS-JIA/494/orig branch on March 27, 2026, 23:17