[ET-VK][conv1d] Implement height-packed pointwise conv1d operator #18332
meta-codesync[bot] merged 5 commits into gh/SS-JIA/494/base
Conversation
Implement a new conv1d pointwise (kernel_size=1) operator using a height-packed layout where channels are the packed dimension (WHCN dim 1). This enables dot-product reduction over input channels: each vec4 load yields 4 consecutive channel values, i.e. 4 MACs per dot() instruction.

Uses tiled computation with the FP tile infrastructure from linear/matmul (FPInputTile, FPWeightTile, FPOutTile, fp_accumulate_with_fp_weight) and 4OC×4IC blocked weight packing via pack_fp_linear_weight.glsl for cache-friendly texture2d weight reads. Adaptive tile_m selection (4/2/1 rows) based on GPU occupancy.

Thread mapping: X=OC4 tiles, Y=L tiles, Z=batch. Each thread computes TILE_M×TILE_N4×4 output elements. The inner loop loads input tiles and packed weight tiles, then calls fp_accumulate_with_fp_weight for tiled FMA.

Supports both buffer and texture3d storage for input/output, texture2d or buffer for packed weights, fp32/fp16, and optional bias. Registered as et_vk.conv1d_pw.default (standalone custom op for testing/benchmarking).

Performance on Adreno 750 (S24):
- [1,256,1024]x[512,256,1] texture f16: 908 GFLOP/s
- [1,512,2048]x[256,512,1] texture f16: 865 GFLOP/s
- [1,128,4096]x[128,128,1] texture f16: 781 GFLOP/s
- [1,256,1024]x[512,256,1] buffer f16: 491 GFLOP/s

Differential Revision: [D97344092](https://our.internmc.facebook.com/intern/diff/D97344092/)
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18332
Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 3d5c848 with merge base 7c79395:

NEW FAILURE - The following job has failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
merged commit f86d47a into gh/SS-JIA/494/base
Pull Request resolved: #18332

ghstack-source-id: 358903218
@exported-using-ghexport
Differential Revision: [D97344092](https://our.internmc.facebook.com/intern/diff/D97344092/)
Stack from ghstack (oldest at bottom):
Implement a new conv1d pointwise (kernel_size=1) operator using height-packed
layout where channels are the packed dimension (WHCN dim 1). This enables
dot-product reduction over input channels: each vec4 load gives 4 consecutive
channel values, yielding 4 MACs per dot() instruction.
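To make the reduction concrete, here is an illustrative pure-Python reference for what the shader computes (function name and list-based shapes are hypothetical, not the shader code): a pointwise conv1d is, at every length position, a dot product over input channels.

```python
def conv1d_pointwise(x, w, bias=None):
    """Reference pointwise (kernel_size=1) conv1d.

    x: [N][C_in][L] input, w: [C_out][C_in] weight, bias: [C_out] or None.
    Returns out[N][C_out][L] where out[n][oc][l] = sum_ic w[oc][ic] * x[n][ic][l].
    """
    N, C_in, L = len(x), len(x[0]), len(x[0][0])
    C_out = len(w)
    out = [[[0.0] * L for _ in range(C_out)] for _ in range(N)]
    for n in range(N):
        for oc in range(C_out):
            b = bias[oc] if bias is not None else 0.0
            for l in range(L):
                # The height-packed layout lets the shader walk ic in vec4
                # steps, turning 4 of these MACs into one dot() instruction.
                acc = b
                for ic in range(C_in):
                    acc += w[oc][ic] * x[n][ic][l]
                out[n][oc][l] = acc
    return out
```

Because channels are the packed dimension, the inner `ic` loop maps directly onto consecutive vec4 loads in the shader.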
Uses tiled computation with the FP tile infrastructure from linear/matmul
(FPInputTile, FPWeightTile, FPOutTile, fp_accumulate_with_fp_weight) and
4OC×4IC blocked weight packing via pack_fp_linear_weight.glsl for
cache-friendly texture2d weight reads. Adaptive tile_m selection (4/2/1 rows)
based on GPU occupancy.
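A hedged sketch of what 4OC×4IC blocked packing looks like (the authoritative layout lives in pack_fp_linear_weight.glsl; the block-interior ordering below is an assumption for illustration): the weight matrix is reordered so each 4 output-channel × 4 input-channel block is contiguous, letting one texel fetch serve a 4×4 chunk of the reduction.

```python
def pack_weights_4oc_4ic(w, C_out, C_in):
    """Flatten w[C_out][C_in] so each 4x4 (oc, ic) block is contiguous.

    Hypothetical ordering: blocks iterate oc-block-major, then ic-block,
    with one vec4 (4 consecutive input channels) per output channel row.
    """
    assert C_out % 4 == 0 and C_in % 4 == 0
    packed = []
    for oc_block in range(0, C_out, 4):
        for ic_block in range(0, C_in, 4):
            for oc in range(oc_block, oc_block + 4):
                # One vec4 per row of the block: 4 consecutive input channels.
                packed.extend(w[oc][ic_block:ic_block + 4])
    return packed
```

The payoff is cache-friendliness: a thread reducing over ic reads the packed stream sequentially instead of striding across rows of the original [C_out][C_in] matrix.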
Thread mapping: X=OC4 tiles, Y=L tiles, Z=batch. Each thread computes
TILE_M×TILE_N4×4 output elements. Inner loop loads input tiles and packed
weight tiles, then calls fp_accumulate_with_fp_weight for tiled FMA.
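The dispatch arithmetic can be sketched as follows (a minimal model of the mapping described above; parameter names and the TILE_N4 grouping are assumptions, not the shader's variables):

```python
def thread_output_region(gx, gy, gz, TILE_M=4, TILE_N4=1):
    """Map a global thread id (gx, gy, gz) to the output region it computes.

    X indexes OC4 tiles (each OC4 tile spans 4 output channels), Y indexes
    length tiles of TILE_M rows, Z indexes batch. The thread covers
    TILE_M x (TILE_N4 * 4) output elements for one batch element.
    """
    n = gz
    oc_start = gx * TILE_N4 * 4   # 4 output channels per OC4 tile
    l_start = gy * TILE_M         # TILE_M positions along the length dim
    return (n,
            range(oc_start, oc_start + TILE_N4 * 4),
            range(l_start, l_start + TILE_M))
```

With the default tile sizes here, thread (2, 3, 0) would own output channels 8..11 at length positions 12..15 of batch 0, matching the TILE_M×TILE_N4×4 elements-per-thread claim.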
Supports both buffer and texture3d storage for input/output, texture2d or
buffer for packed weights, fp32/fp16, and optional bias. Registered as
et_vk.conv1d_pw.default (standalone custom op for testing/benchmarking).
Performance on Adreno 750 (S24):
- [1,256,1024]x[512,256,1] texture f16: 908 GFLOP/s
- [1,512,2048]x[256,512,1] texture f16: 865 GFLOP/s
- [1,128,4096]x[128,128,1] texture f16: 781 GFLOP/s
- [1,256,1024]x[512,256,1] buffer f16: 491 GFLOP/s
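For context, the GFLOP/s figures reported in this PR follow from the standard pointwise-conv FLOP count (a small sanity-check helper, not part of the change):

```python
def conv1d_pw_gflops(N, C_in, L, C_out, seconds):
    """GFLOP/s for a pointwise conv1d of shape [N, C_in, L] x [C_out, C_in, 1].

    Counts each multiply-accumulate as 2 FLOPs: 2 * N * C_out * C_in * L total.
    """
    flops = 2.0 * N * C_out * C_in * L
    return flops / seconds / 1e9

# e.g. [1,256,1024] x [512,256,1] is 2*512*256*1024 = 268,435,456 FLOPs
# (~0.268 GFLOP), so 908 GFLOP/s corresponds to roughly 0.3 ms per dispatch.
```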
Differential Revision: D97344092