Remove unnecessary cuda sync for better perf #17315
Conversation
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations that happen inside the same stream do not need an explicit sync. Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/) ghstack-source-id: 339552916 Pull Request resolved: #17315
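A minimal sketch of the rule of thumb behind this change (the function and parameter names such as `finish_execute` and `output_needs_host_copy` are illustrative, not the actual ExecuTorch API): synchronize the stream only when an output actually has to leave the GPU.

```
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical sketch, not the real backend code.
void finish_execute(cudaStream_t stream,
                    bool output_needs_host_copy,
                    void* host_dst,
                    const void* device_src,
                    std::size_t nbytes) {
  // ... kernels for this delegate have already been launched on `stream` ...

  if (output_needs_host_copy) {
    // GPU -> CPU copy: the host may read `host_dst` right after this call,
    // so the copy (and all prior work on the stream) must have finished.
    cudaMemcpyAsync(host_dst, device_src, nbytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
  }
  // Otherwise: work enqueued later on the same stream is already ordered
  // after these kernels by the stream itself, so no explicit sync is needed.
}
```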
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17315. Note: links to docs will display an error until the docs builds have completed. ❌ 25 New Failures, 2 Pending as of commit 69be4bb with merge base c09db81.
if (is_using_shared_cuda_stream()) {
  // Shared stream mode: set handle's stream to nullptr.
  // The stream will be retrieved from backend in execute().
  handle->cuda_stream = nullptr;
I think it's better to set handle->cuda_stream to the only cuda stream.
Based on an offline sync, we tried creating a new CudaHandle class that inherits from the current aoti_handle and puts a shared_ptr<cuda_stream> inside cuda_handle, to make sure there's only one CUDA stream in the whole pipeline.
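A hypothetical sketch of that direction (the type names AOTIDelegateHandle, CudaDelegateHandle, and CudaStream below are placeholders, not the actual ExecuTorch classes): a CUDA-specific handle derives from the generic AOTI handle and owns the stream through a shared_ptr, so every consumer in the pipeline refers to the same stream object.

```
#include <cuda_runtime.h>
#include <memory>

// Illustrative base type standing in for the generic AOTI delegate handle.
struct AOTIDelegateHandle {
  virtual ~AOTIDelegateHandle() = default;
  // ... AOTI-generic state (loaded .so handle, container handle, ...) ...
};

// Small owning wrapper so the stream's lifetime is tied to shared ownership.
struct CudaStream {
  cudaStream_t stream = nullptr;
  CudaStream() { cudaStreamCreate(&stream); }
  ~CudaStream() { cudaStreamDestroy(stream); }
  CudaStream(const CudaStream&) = delete;
  CudaStream& operator=(const CudaStream&) = delete;
};

// CUDA-specific handle: one stream shared by everything in the pipeline.
struct CudaDelegateHandle : AOTIDelegateHandle {
  std::shared_ptr<CudaStream> cuda_stream;
};
```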
Pull Request resolved: #17315 Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations that happen inside the same stream do not need an explicit sync. We also introduced a new cuda_delegate_handle to remove CUDA-specific information from aoti_delegate_handle for a better hierarchy. ghstack-source-id: 340388616 @exported-using-ghexport Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
larryliu0820 left a comment
Thank you for splitting aoti_delegate_handle and cuda_delegate_handle, it's much cleaner this way.
Stack from ghstack (oldest at bottom):
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations that happen inside the same stream do not need an explicit sync.
Differential Revision: D92193164