Remove unnecessary cuda sync for better perf #17315
Conversation
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations that happen inside the same stream do not need an explicit sync. Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/) ghstack-source-id: 339552916 Pull Request resolved: #17315
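A minimal sketch of the rule of thumb behind this change (the function and parameter names such as `finish_execute` and `output_needs_host_copy` are illustrative, not the actual ExecuTorch API): synchronize the stream only when an output actually has to leave the GPU.

```
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical sketch, not the real backend code.
void finish_execute(cudaStream_t stream,
                    bool output_needs_host_copy,
                    void* host_dst,
                    const void* device_src,
                    std::size_t nbytes) {
  // ... kernels for this delegate have already been launched on `stream` ...

  if (output_needs_host_copy) {
    // GPU -> CPU copy: the host may read `host_dst` right after this call,
    // so the copy (and all prior work on the stream) must have finished.
    cudaMemcpyAsync(host_dst, device_src, nbytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
  }
  // Otherwise: work enqueued later on the same stream is already ordered
  // after these kernels by the stream itself, so no explicit sync is needed.
}
```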
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17315. Note: links to docs will display an error until the docs builds have completed. ❌ 25 New Failures, 2 Pending as of commit 69be4bb with merge base c09db81.
if (is_using_shared_cuda_stream()) {
  // Shared stream mode: set handle's stream to nullptr.
  // The stream will be retrieved from backend in execute().
  handle->cuda_stream = nullptr;
I think it's better to set handle->cuda_stream to the only cuda stream.
Based on an offline sync, we tried creating a new CudaHandle class that inherits from the current aoti_handle and puts a shared_ptr<cuda_stream> inside cuda_handle, to make sure there's only one CUDA stream in the whole pipeline.
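A hypothetical sketch of that direction (the type names AOTIDelegateHandle, CudaDelegateHandle, and CudaStream below are placeholders, not the actual ExecuTorch classes): a CUDA-specific handle derives from the generic AOTI handle and owns the stream through a shared_ptr, so every consumer in the pipeline refers to the same stream object.

```
#include <cuda_runtime.h>
#include <memory>

// Illustrative base type standing in for the generic AOTI delegate handle.
struct AOTIDelegateHandle {
  virtual ~AOTIDelegateHandle() = default;
  // ... AOTI-generic state (loaded .so handle, container handle, ...) ...
};

// Small owning wrapper so the stream's lifetime is tied to shared ownership.
struct CudaStream {
  cudaStream_t stream = nullptr;
  CudaStream() { cudaStreamCreate(&stream); }
  ~CudaStream() { cudaStreamDestroy(stream); }
  CudaStream(const CudaStream&) = delete;
  CudaStream& operator=(const CudaStream&) = delete;
};

// CUDA-specific handle: one stream shared by everything in the pipeline.
struct CudaDelegateHandle : AOTIDelegateHandle {
  std::shared_ptr<CudaStream> cuda_stream;
};
```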
Pull Request resolved: #17315 Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations that happen inside the same stream do not need an explicit sync. We also introduced a new cuda_delegate_handle to remove CUDA-specific information from aoti_delegate_handle for a better hierarchy. ghstack-source-id: 340388616 @exported-using-ghexport Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
larryliu0820 left a comment
Thank you for splitting aoti_delegate_handle and cuda_delegate_handle, it's much cleaner this way.
Stack from ghstack (oldest at bottom):
Right now we always do a CUDA sync before exiting the CUDA backend's execute(). However, we only need that when copying data from GPU to CPU; operations that happen inside the same stream do not need an explicit sync.
Differential Revision: D92193164