
Conversation

Contributor

@Gasoonjia Gasoonjia commented Feb 9, 2026

Stack from ghstack (oldest at bottom):

Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.
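
A minimal CUDA sketch of the idea (illustrative only, not the ExecuTorch backend code; the helper name is hypothetical): work enqueued on a single stream executes in order on the device, so kernels and async copies on that stream need no sync between them; the host only has to wait before it reads back data copied from device to host.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative sketch: everything below is enqueued on one stream, so only
// the placement of the final synchronization matters.
void run_and_read_back(void* d_out, void* h_out, size_t bytes, cudaStream_t stream) {
  // ... kernels launched on `stream` here execute in stream order, so no
  // explicit synchronization is needed between them.

  // The device-to-host copy is also enqueued on the same stream.
  cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

  // Only now, before the CPU touches h_out, do we need to block.
  cudaStreamSynchronize(stream);
}
```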

Differential Revision: D92193164

Gasoonjia added a commit that referenced this pull request Feb 9, 2026
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)

ghstack-source-id: 339552916
Pull Request resolved: #17315

pytorch-bot bot commented Feb 9, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17315

Note: Links to docs will display an error until the docs builds have been completed.

❌ 25 New Failures, 2 Pending

As of commit 69be4bb with merge base c09db81:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Feb 9, 2026

github-actions bot commented Feb 9, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Gasoonjia added a commit that referenced this pull request Feb 9, 2026
Pull Request resolved: #17315


Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.
ghstack-source-id: 339642357
@exported-using-ghexport

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
Gasoonjia added a commit that referenced this pull request Feb 10, 2026
Pull Request resolved: #17315


Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.
ghstack-source-id: 339728013
@exported-using-ghexport

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
@Gasoonjia Gasoonjia mentioned this pull request Feb 10, 2026
Gasoonjia added a commit that referenced this pull request Feb 10, 2026
Pull Request resolved: #17315


Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.
ghstack-source-id: 339777492
@exported-using-ghexport

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
Gasoonjia added a commit that referenced this pull request Feb 10, 2026
Pull Request resolved: #17315


Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.
ghstack-source-id: 339784126
@exported-using-ghexport

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
Gasoonjia added a commit that referenced this pull request Feb 10, 2026
Pull Request resolved: #17315


Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.
ghstack-source-id: 339788761
@exported-using-ghexport

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
Gasoonjia added a commit that referenced this pull request Feb 10, 2026
Pull Request resolved: #17315


Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.
ghstack-source-id: 339802040
@exported-using-ghexport

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
Gasoonjia added a commit that referenced this pull request Feb 10, 2026
Pull Request resolved: #17315


Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.
ghstack-source-id: 339914649
@exported-using-ghexport

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
Gasoonjia added a commit that referenced this pull request Feb 10, 2026
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #17315
* __->__ #17324

The torchcodec version we are using (0.10.0.dev20251211) no longer exists in
https://download.pytorch.org/whl/nightly/torchcodec/, which caused lots of CIs,
including all whisper CIs, to crash. This diff bumps the torchcodec pin to
bring CI back.

Differential Revision:
[D92797044](https://our.internmc.facebook.com/intern/diff/D92797044/)
if (is_using_shared_cuda_stream()) {
// Shared stream mode: set handle's stream to nullptr.
// The stream will be retrieved from backend in execute().
handle->cuda_stream = nullptr;
Contributor

I think it's better to set handle->cuda_stream to the only cuda stream.

Contributor Author

Based on an offline sync, we tried creating a new CudaHandle class that inherits from the current aoti_handle, putting a shared_ptr<cuda_stream> inside cuda_handle, to make sure there's only one CUDA stream in the whole pipeline.
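
A rough sketch of that design with hypothetical names (the actual class and field names in the backend may differ): a CUDA-specific handle derives from the generic AOTI delegate handle and holds the stream through a shared_ptr, so every consumer in the pipeline shares the same single stream object.

```cpp
#include <memory>

// Hypothetical stand-in for cudaStream_t so the sketch is self-contained.
using CudaStream = void*;

// Generic AOTI delegate handle (illustrative; real fields omitted).
struct AOTIDelegateHandle {
  virtual ~AOTIDelegateHandle() = default;
};

// CUDA-specific handle: shared ownership of the stream guarantees there is
// exactly one stream object even when several handles reference it.
struct CudaDelegateHandle : AOTIDelegateHandle {
  std::shared_ptr<CudaStream> cuda_stream;
};
```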

Gasoonjia added a commit that referenced this pull request Feb 10, 2026
Pull Request resolved: #17315


Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.
ghstack-source-id: 340006830
@exported-using-ghexport

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results February 11, 2026 11:12 — with GitHub Actions Inactive
Gasoonjia and others added 11 commits February 11, 2026 08:35
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)

[ghstack-poisoned]
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)

[ghstack-poisoned]
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)

[ghstack-poisoned]
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)

[ghstack-poisoned]
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)

[ghstack-poisoned]
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)

[ghstack-poisoned]
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)

[ghstack-poisoned]

### Summary
- Currently all tests push the same libs, which is redundant. With this
PR the libs are only pushed once, reducing execution time by two to three
times for operator tests.

### Test plan
```
python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNFloatingPointOperator --device <device> --host <host> --model SM8650 --build_folder build-android --executorch_root . --artifact all_artifact
```
without optimization
<img width="429" height="136" alt="image"
src="https://github.com/user-attachments/assets/f52a167c-b665-47da-97b2-02f836f4858e"
/>

with optimization
<img width="442" height="130" alt="image"
src="https://github.com/user-attachments/assets/46366237-ec7b-4d38-b577-5677b3dafb36"
/>



cc @cccclai @cbilgin
- A clean-up operation is required to create an available device
- CI may complete, but the device is not aware of that, so the clean-up
  should be done by the devices themselves

Signed-off-by: jiseong.oh <jiseong.oh@samsung.com>
### Summary
Added message matchers to death tests in 2 test files to verify
tests fail with the expected error messages, not just that they fail.

evalue_test.cpp (18 matchers):
- Type checks: "EValue is not an int", "EValue is not a"
- Null pointer checks: "Pointer is null", "pointer cannot be null"
- List pointer checks: "string/int/bool/double/tensor list pointer is
null"
- BoxedEvalueList checks: "wrapped_vals/unwrapped_vals cannot be null"

tensor_util_test.cpp (29 matchers):
- Shape/dtype mismatches: "Tensors do not match"
- Dimension validation: "Ending/Starting dimension.*should be in the
range"
- Empty matchers for stride checks (Windows regex limitations)

Note: Matchers use only cross-platform compatible regex features
(no brackets, unions, or grouping which fail on Windows).
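
For illustration, a minimal death test with a message matcher might look like the sketch below (the check function is a hypothetical stand-in, not code from evalue_test.cpp; only the matcher usage reflects the change described here):

```cpp
#include <cstdio>
#include <cstdlib>
#include <gtest/gtest.h>

// Hypothetical stand-in for a runtime check that logs to stderr and aborts.
static void check_is_int(bool is_int) {
  if (!is_int) {
    std::fprintf(stderr, "EValue is not an int\n");
    std::abort();
  }
}

TEST(EValueDeathTest, ToIntOnNonIntDies) {
  // The second argument is a regex matched against stderr. Without it, any
  // crash would pass; a plain substring also sidesteps Windows regex quirks.
  EXPECT_DEATH(check_is_int(/*is_int=*/false), "EValue is not an int");
}
```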


### Test plan
```
./test/run_oss_cpp_tests.sh
```
### Summary
The validate_dim_order function only checked that values were in bounds,
allowing invalid inputs like {0, 0, 0} to pass. This caused
uninitialized memory access in dim_order_to_stride_nocheck.

Fix by using a bitmask to detect duplicates. Also adds test fixture with
runtime_init() for error logging and removes duplicate include.
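
A minimal sketch of the bitmask approach (not the actual ExecuTorch implementation; names are illustrative): each dimension index sets one bit, so an out-of-range value or a repeated value such as {0, 0, 0} is rejected.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true iff dim_order is a permutation of 0..dims-1 (dims <= 64 here).
bool is_valid_dim_order(const uint8_t* dim_order, size_t dims) {
  if (dims > 64) {
    return false;  // this sketch only tracks up to 64 dimensions
  }
  uint64_t seen = 0;
  for (size_t i = 0; i < dims; ++i) {
    const uint8_t d = dim_order[i];
    if (d >= dims) {
      return false;  // out of bounds
    }
    const uint64_t bit = uint64_t{1} << d;
    if (seen & bit) {
      return false;  // duplicate dimension index
    }
    seen |= bit;
  }
  return true;
}
```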

### Test plan
```
./test/run_oss_cpp_tests.sh
```

---------

Co-authored-by: Claude <noreply@anthropic.com>
ssjia and others added 5 commits February 11, 2026 08:35
…nsors

Pull Request resolved: #17309

Tensors sharing physical memory via SharedObject each track their own
`last_access_` independently. When a tensor's first access is a write,
`prev_stage` is `NO_STAGE`, causing `transition()` to use
`TOP_OF_PIPE_BIT` as `srcStageMask` with no `srcAccessMask` — effectively
a no-op barrier. If the same physical memory was previously written
through a different aliased tensor handle, this creates a WAW hazard
where the new write may execute before or concurrently with the prior
write, producing non-deterministic results.

This was observed as non-deterministic q8ta_conv2d output in ResNet50:
running the model twice with the same input produced slightly different
quantized int8 values. Adding a debug print shader after each conv2d
dispatch masked the issue because the print node's read-after-write
barrier serialized GPU work.

The fix: when `prev_stage` is `NO_STAGE` and the current access is a
write, use `COMPUTE_SHADER_BIT` with `SHADER_WRITE_BIT` instead of
`TOP_OF_PIPE_BIT` with no access flags. This ensures all prior compute
shader work completes and its writes are made visible before the new
write begins.
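
A simplified sketch of that stage/access selection (illustrative only; the real transition() logic in the Vulkan compute graph code is structured differently):

```cpp
#include <vulkan/vulkan.h>

// Pick the source stage and access mask for the barrier that precedes a
// compute-shader write. With no recorded previous access, falling back to
// TOP_OF_PIPE with an empty access mask would be a no-op barrier; instead,
// wait on prior compute work and make its writes available.
void pick_src_for_write(
    bool has_prev_access,
    VkPipelineStageFlags prev_stage,
    VkAccessFlags prev_access,
    VkPipelineStageFlags* src_stage,
    VkAccessFlags* src_access) {
  if (!has_prev_access) {
    *src_stage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
    *src_access = VK_ACCESS_SHADER_WRITE_BIT;
  } else {
    *src_stage = prev_stage;
    *src_access = prev_access;
  }
}
```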

Authored with Claude.
ghstack-source-id: 339884030
@exported-using-ghexport

Differential Revision: [D92715369](https://our.internmc.facebook.com/intern/diff/D92715369/)
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)

[ghstack-poisoned]
Gasoonjia added a commit that referenced this pull request Feb 11, 2026
Pull Request resolved: #17315

Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Also, we introduce a new cuda_delegate_handle to remove CUDA-specific information from aoti_delegate_handle for a better hierarchy.

ghstack-source-id: 340388616
@exported-using-ghexport

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results February 11, 2026 20:49 — with GitHub Actions Inactive
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)

[ghstack-poisoned]
Gasoonjia added a commit that referenced this pull request Feb 11, 2026
Pull Request resolved: #17315

Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Also, we introduce a new cuda_delegate_handle to remove CUDA-specific information from aoti_delegate_handle for a better hierarchy.

ghstack-source-id: 340451972
@exported-using-ghexport

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results February 11, 2026 23:29 — with GitHub Actions Inactive
Contributor

@larryliu0820 larryliu0820 left a comment

Thank you for splitting aoti_delegate_handle and cuda_delegate_handle, it's much cleaner this way.

Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)

[ghstack-poisoned]
Gasoonjia added a commit that referenced this pull request Feb 12, 2026
Pull Request resolved: #17315

Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync.

Also, we introduce a new cuda_delegate_handle to remove CUDA-specific information from aoti_delegate_handle for a better hierarchy.

ghstack-source-id: 340657506
@exported-using-ghexport

Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
Labels

CLA Signed, fb-exported, meta-exported
