Skip to content

Conversation

@mdboom
Copy link
Contributor

@mdboom mdboom commented Jan 30, 2026

See #659 for a discussion of this solution.

There weren't as many instances of this pattern as I had hoped, but nonetheless, for functions where it is used, we should see ~25% reduction in Python-to-C overhead.

Keeping this as a draft for testing while the generator changes get reviewed.

@copy-pr-bot
Copy link
Contributor

copy-pr-bot bot commented Jan 30, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@mdboom
Copy link
Contributor Author

mdboom commented Jan 30, 2026

/ok to test

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a performance optimization to reduce Python-to-C overhead in the CUDA bindings by statically allocating small arrays (≤5 elements) instead of using dynamic allocation. This addresses the performance gap reported in issue #659, where CUDA API calls through cuda-bindings were observed to be 3x slower than direct CUDA C++ API calls.

Changes:

  • Replaced dynamic memory allocation (calloc/free) with static stack allocation for array parameters with 5 or fewer elements in tensor map encoding functions
  • Removed explicit type validation checks that were previously performed before array processing
  • Simplified memory management by eliminating the need for explicit free() calls for small arrays

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +49981 to +49994
cdef cydriver.cuuint32_t* cyelementStrides
cdef size_t elementStridesLen
cdef cydriver.cuuint32_t[5] elementStridesStatic
elementStridesLen = 0 if elementStrides is None else len(elementStrides)
if elementStridesLen == 0:
cyelementStrides = NULL
elif elementStridesLen == 1:
cyelementStrides = <cydriver.cuuint32_t *>(<cuuint32_t?> elementStrides[0])._pvt_ptr
elif elementStridesLen <= 5:
for idx in range(elementStridesLen):
elementStridesStatic[idx] = <cydriver.cuuint32_t>(<cuuint32_t?> elementStrides[idx])._pvt_ptr[0]
cyelementStrides = elementStridesStatic
else:
raise ValueError("Argument 'elementStrides' too long, must be <= 5")
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of type validation for the elementStrides parameter changes the error handling behavior. The old code explicitly checked that all elements were cuuint32_t instances and raised a clear TypeError. The new code relies on Cython's type casting, which may produce less clear error messages if invalid types are passed. Consider adding explicit type validation before the casting logic to maintain clear error messages for users.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I think that the Cython checking is sufficient -- it will raise an exception in the same cases as before, but is faster and non-duplicative.

@github-actions
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant