[ET-VK][conv1d] Implement height-packed pointwise conv1d operator by SS-JIA · Pull Request #18332 · pytorch/executorch

SS-JIA · 2026-03-19T19:14:59Z

Stack from ghstack (oldest at bottom):

Implement a new conv1d pointwise (kernel_size=1) operator using height-packed
layout where channels are the packed dimension (WHCN dim 1). This enables
dot-product reduction over input channels: each vec4 load gives 4 consecutive
channel values, yielding 4 MACs per dot() instruction.
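
As a reading aid, here is a minimal CPU-side sketch of the reduction described above. It is not the shader from this PR: `dot4` and `conv1d_pw` are hypothetical names, the `[IC][L]` and `[OC][IC]` list layouts are assumptions made for illustration, and zero-padding the last channel group is likewise an assumption.

```python
def dot4(a, b):
    """Model of a GLSL dot(vec4, vec4): four multiply-accumulates."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]


def conv1d_pw(x, w, bias=None):
    """Pointwise (kernel_size=1) conv1d on nested lists.

    x: [IC][L] input, w: [OC][IC] weight, bias: optional [OC].
    Returns an [OC][L] output.
    """
    ic, length, oc = len(x), len(x[0]), len(w)
    ic4 = (ic + 3) // 4  # number of 4-channel groups, rounded up
    out = [[0.0] * length for _ in range(oc)]
    for o in range(oc):
        for pos in range(length):
            acc = bias[o] if bias is not None else 0.0
            for c4 in range(ic4):
                chans = range(4 * c4, 4 * c4 + 4)
                # One "vec4 load": four consecutive channels at this position,
                # zero-padded past IC (padding behaviour is an assumption).
                xin = [x[c][pos] if c < ic else 0.0 for c in chans]
                wv = [w[o][c] if c < ic else 0.0 for c in chans]
                acc += dot4(xin, wv)
            out[o][pos] = acc
    return out
```

Because the channels are the packed dimension, the four values fed to each `dot4` call sit next to each other in memory, which is what lets a single load feed four MACs.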

Uses tiled computation with the FP tile infrastructure from linear/matmul
(FPInputTile, FPWeightTile, FPOutTile, fp_accumulate_with_fp_weight) and
4OC×4IC blocked weight packing via pack_fp_linear_weight.glsl for
cache-friendly texture2d weight reads. Adaptive tile_m selection (4/2/1 rows)
based on GPU occupancy.
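
The 4OC×4IC blocking can be pictured with the sketch below. It shows one plausible block ordering only; the layout that pack_fp_linear_weight.glsl actually produces may differ. `pack_weight_4x4` and `block_matvec` are hypothetical helper names, and zero-filling partial blocks is an assumption.

```python
def pack_weight_4x4(w):
    """Group an [OC][IC] weight into 4x4 blocks indexed as packed[oc4][ic4].

    Each block holds 4 output channels by 4 input channels.
    Entries that fall outside the weight are zero-filled (an assumption).
    """
    oc, ic = len(w), len(w[0])
    oc4, ic4 = (oc + 3) // 4, (ic + 3) // 4
    packed = []
    for bo in range(oc4):
        row = []
        for bi in range(ic4):
            block = [
                [
                    w[4 * bo + i][4 * bi + j]
                    if 4 * bo + i < oc and 4 * bi + j < ic
                    else 0.0
                    for j in range(4)
                ]
                for i in range(4)
            ]
            row.append(block)
        packed.append(row)
    return packed


def block_matvec(block, xin):
    """Apply one 4x4 block to four input channels: four dot products."""
    return [sum(block[i][j] * xin[j] for j in range(4)) for i in range(4)]
```

With this grouping, everything a thread needs to update four output channels from four input channels comes from one contiguous block, which is the property the description credits for cache-friendly weight reads.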

Thread mapping: X=OC4 tiles, Y=L tiles, Z=batch. Each thread computes
TILE_M×TILE_N4×4 output elements. Inner loop loads input tiles and packed
weight tiles, then calls fp_accumulate_with_fp_weight for tiled FMA.
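
The thread mapping can be sketched as follows. This assumes TILE_M counts positions along L and TILE_N4 counts groups of 4 output channels, which matches the X/Y assignment above but is not confirmed by the shader source. `dispatch_size` and `thread_outputs` are hypothetical names.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def dispatch_size(n, oc, length, tile_m, tile_n4):
    """Global work size as (X, Y, Z) = (OC4 tiles, L tiles, batch)."""
    return (ceil_div(ceil_div(oc, 4), tile_n4), ceil_div(length, tile_m), n)


def thread_outputs(gid, oc, length, tile_m, tile_n4):
    """List the (batch, out_channel, position) elements one thread covers.

    A full thread covers tile_m * tile_n4 * 4 elements; threads on the
    edge of the grid cover fewer because of the bounds checks.
    """
    gx, gy, gz = gid
    outs = []
    for m in range(tile_m):
        pos = gy * tile_m + m
        if pos >= length:
            continue
        for n4 in range(tile_n4):
            for k in range(4):
                o = (gx * tile_n4 + n4) * 4 + k
                if o < oc:
                    outs.append((gz, o, pos))
    return outs
```

Under this mapping, a smaller tile_m launches more threads along Y, which is one way an occupancy-driven choice between 4, 2 and 1 rows could take effect; the selection rule itself is not stated here and is not modelled.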

Supports both buffer and texture3d storage for input/output, texture2d or
buffer for packed weights, fp32/fp16, and optional bias. Registered as
et_vk.conv1d_pw.default (standalone custom op for testing/benchmarking).

Performance on Adreno 750 (S24):

- [1,256,1024]x[512,256,1] texture f16: 908 GFLOP/s
- [1,512,2048]x[256,512,1] texture f16: 865 GFLOP/s
- [1,128,4096]x[128,128,1] texture f16: 781 GFLOP/s
- [1,256,1024]x[512,256,1] buffer f16: 491 GFLOP/s
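
To put these figures in context, the sketch below converts a throughput into an implied per-call time. It assumes the shapes read as [N, IC, L] x [OC, IC, 1] and that one MAC counts as 2 FLOPs; the PR does not state how its benchmark counts FLOPs, so treat the derived times as rough estimates. Both function names are hypothetical.

```python
def conv1d_pw_flops(n, ic, length, oc):
    """FLOPs for a pointwise conv1d, counting each MAC as 2 FLOPs."""
    return 2 * n * length * ic * oc


def implied_time_ms(flops, gflops_per_s):
    """Execution time in milliseconds implied by a GFLOP/s figure."""
    return flops / (gflops_per_s * 1e9) * 1e3


# (n, ic, length, oc, reported GFLOP/s) for the four reported runs
cases = [
    (1, 256, 1024, 512, 908),
    (1, 512, 2048, 256, 865),
    (1, 128, 4096, 128, 781),
    (1, 256, 1024, 512, 491),
]
for n, ic, length, oc, rate in cases:
    flops = conv1d_pw_flops(n, ic, length, oc)
    print(f"{flops} FLOPs at {rate} GFLOP/s: {implied_time_ms(flops, rate):.3f} ms")
```

Under these assumptions the first shape is about 0.27 GFLOP of work, so 908 GFLOP/s corresponds to roughly 0.3 ms per call, and the texture path is about 1.85× faster than the buffer path on the same shape (908 / 491).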

Differential Revision: D97344092


pytorch-bot · 2026-03-19T19:15:04Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18332

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 3d5c848 with merge base 7c79395:

NEW FAILURE - The following job has failed:

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh)
RuntimeError: Command docker exec -t a89512dadccea090db30322a5dddf630ab31b81c44c5e7ae941d793721f52d7b /exec failed with exit code 1

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest / windows / windows-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / unittest-editable / windows / windows-job (gh) (trunk failure)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-03-19T19:19:01Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


Pull Request resolved: #18332
ghstack-source-id: 358903218
@exported-using-ghexport
Differential Revision: [D97344092](https://our.internmc.facebook.com/intern/diff/D97344092/)


meta-cla bot added the CLA Signed label Mar 19, 2026

This was referenced Mar 19, 2026

[ET-VK][conv1d] Implement height-packed depthwise conv1d operator #18333

Merged

[ET-VK][conv1d] Route conv1d to height-packed implementations in export pipeline #18334

Merged

[ET-VK][CI] Add test-vulkan-genai job for Parakeet on NVIDIA GPU runner #18335

Open

meta-codesync bot added fb-exported meta-exported labels Mar 19, 2026

ssjia added 2 commits March 19, 2026 15:48

trviv approved these changes Mar 23, 2026


ssjia added 2 commits March 27, 2026 10:40

meta-codesync bot merged commit f86d47a into gh/SS-JIA/494/base Mar 27, 2026
156 of 162 checks passed

meta-codesync bot deleted the gh/SS-JIA/494/head branch March 27, 2026 22:00

meta-codesync bot temporarily deployed to cherry-pick-bot March 27, 2026 22:00 Inactive

pytorchbot mentioned this pull request Mar 27, 2026

[ET-VK][conv1d] Implement height-packed pointwise conv1d operator #18547

Merged

meta-codesync[bot] merged 5 commits into gh/SS-JIA/494/base from gh/SS-JIA/494/head
