-
Notifications
You must be signed in to change notification settings - Fork 242
Statically-allocate small arrays of numbers #1545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a performance optimization to reduce Python-to-C overhead in the CUDA bindings by statically allocating small arrays (≤5 elements) instead of using dynamic allocation. This addresses the performance gap reported in issue #659, where CUDA API calls through cuda-bindings were observed to be 3x slower than direct CUDA C++ API calls.
Changes:
- Replaced dynamic memory allocation (calloc/free) with static stack allocation for array parameters with 5 or fewer elements in tensor map encoding functions
- Removed explicit type validation checks that were previously performed before array processing
- Simplified memory management by eliminating the need for explicit free() calls for small arrays
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| cdef cydriver.cuuint32_t* cyelementStrides | ||
| cdef size_t elementStridesLen | ||
| cdef cydriver.cuuint32_t[5] elementStridesStatic | ||
| elementStridesLen = 0 if elementStrides is None else len(elementStrides) | ||
| if elementStridesLen == 0: | ||
| cyelementStrides = NULL | ||
| elif elementStridesLen == 1: | ||
| cyelementStrides = <cydriver.cuuint32_t *>(<cuuint32_t?> elementStrides[0])._pvt_ptr | ||
| elif elementStridesLen <= 5: | ||
| for idx in range(elementStridesLen): | ||
| elementStridesStatic[idx] = <cydriver.cuuint32_t>(<cuuint32_t?> elementStrides[idx])._pvt_ptr[0] | ||
| cyelementStrides = elementStridesStatic | ||
| else: | ||
| raise ValueError("Argument 'elementStrides' too long, must be <= 5") |
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of type validation for the elementStrides parameter changes the error handling behavior. The old code explicitly checked that all elements were cuuint32_t instances and raised a clear TypeError. The new code relies on Cython's type casting, which may produce less clear error messages if invalid types are passed. Consider adding explicit type validation before the casting logic to maintain clear error messages for users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I think that the Cython checking is sufficient -- it will raise an exception in the same cases as before, but is faster and non-duplicative.
|
See #659 for a discussion of this solution.
There weren't as many instances of this pattern as I had hoped, but nonetheless, for functions where it is used, we should see ~25% reduction in Python-to-C overhead.
Keeping this as a draft for testing while the generator changes get reviewed.