Add CUDA Backend Support #9
Conversation
- Add AbstractTypedPool{T,V} and AbstractArrayPool abstract types
- Make TypedPool and AdaptiveArrayPool inherit from abstract types
- Add dispatch points: allocate_vector(), wrap_array() for GPU backends
- Add Val-based backend dispatch: _get_pool_for_backend(::Val{:backend})
- Generalize get_view!, get_nd_array!, get_nd_view! to AbstractTypedPool
- Generalize state functions to work with any AbstractTypedPool
- Export abstract types for extension subtyping
This enables GPU extensions (CUDA, Metal) to reuse 95%+ of the pool
logic by only implementing allocation/wrapping dispatch methods.
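A minimal sketch of the extension side, assuming dispatch-point signatures along these lines (the exact signatures are not spelled out in this PR description, and `SketchCuTypedPool` is a hypothetical subtype used only for illustration):

```julia
using CUDA
import AdaptiveArrayPools: AbstractTypedPool, allocate_vector, wrap_array

struct SketchCuTypedPool{T} <: AbstractTypedPool{T,CuVector{T}} end  # illustrative

# Allocate backing storage on the device instead of the host:
allocate_vector(::SketchCuTypedPool{T}, n::Integer) where {T} = CuVector{T}(undef, n)

# Wrap the first prod(dims) elements as an N-D device array; on the GPU,
# both view() and reshape() return a CuArray, so no SubArray wrapper appears:
wrap_array(::SketchCuTypedPool{T}, v::CuVector{T}, dims::Dims) where {T} =
    reshape(view(v, 1:prod(dims)), dims)
```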
Phase 2a+2b implementation:
- Add CuTypedPool{T} (no view caching - GPU views return CuArray)
- Add CuAdaptiveArrayPool with Float16 slot and device_id tracking
- Implement allocate_vector, wrap_array, get_typed_pool! dispatches
- Implement GPU-specific get_view! (fresh views each call, O(1) metadata)
- Add checkpoint auto-init for dynamic types in others fallback
- Configure package extension via weakdeps/extensions in Project.toml
- Add verification scripts for CUDA behavior and extension tests
- Add get_task_local_cuda_pool() with multi-device Dict{Int, Pool} storage
- Add get_task_local_cuda_pools() for diagnostic access
- Implement checkpoint!/rewind!/reset!/empty! for CuAdaptiveArrayPool
- Add foreach_fixed_slot for GPU pool iteration
- Add empty! for CuTypedPool (no views field unlike CPU)
- Support type-specific checkpoint/rewind variants
- Add backend-specific @with_pool macro variants using Val{:backend} dispatch
- Register :cuda backend via _get_pool_for_backend(::Val{:cuda})
- Add explicit @with_cuda_pool macro as alias
- Change all acquire functions to use AbstractArrayPool for extensibility
- _mark_untracked!, _acquire_impl!, _unsafe_acquire_impl!
- acquire!, unsafe_acquire! and all variants
- Add test script for Phase 2d verification
Enables:
- `@with_pool :cuda pool begin ... end`
- `@with_cuda_pool pool begin ... end`
- Nested CPU/GPU pools (see the sketch below)
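A hedged usage sketch of the nested form (the `acquire!` signature is assumed to match the CPU API; requires CUDA.jl to be loaded):

```julia
using AdaptiveArrayPools, CUDA

@with_pool cpu begin                      # CPU task-local pool
    x = acquire!(cpu, Float32, 1024)      # CPU-backed view
    x .= 1f0
    @with_pool :cuda gpu begin            # nested CUDA task-local pool
        y = acquire!(gpu, Float32, 1024)  # CuArray from the GPU pool
        copyto!(y, Vector(x))             # host → device (materialize the view)
        y .= sqrt.(y)                     # GPU broadcast kernel
        total = sum(y)                    # GPU reduction
    end                                   # GPU checkpoint rewound here
end                                       # CPU checkpoint rewound here
```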
- Add CUDA dependency to test/Project.toml for extension loading
- Change CUDA test logic from opt-in (TEST_CUDA=true) to auto-detect
- Use TEST_CUDA=false to explicitly skip CUDA tests when needed
- Downgrade warnings to info messages for non-error skip conditions
Add typed checkpoint/rewind optimization to _generate_pool_code_with_backend, matching the optimization already present in regular @with_pool. This enables @with_pool :cuda to use fast typed operations when all acquire! types are statically known.
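Roughly what the typed fast path generates, as an illustrative sketch (the helper names and generated shape are approximations, not the actual macro output):

```julia
# If the only statically known acquire! type in the block is Float32,
# the macro can emit type-specific state operations:
quote
    local pool = _get_pool_for_backend(Val(:cuda))
    checkpoint!(pool, Float32)   # touch only the Float32 slot
    try
        $(user_body)             # the user's code
    finally
        rewind!(pool, Float32)   # typed rewind, no scan over all slots
    end
end
```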
Add _generate_function_pool_code_with_backend to properly handle function definition syntax for backend-specific pool macros: `@with_pool :cuda pool function f(x) ... end`. Previously, the macro only worked with the block form; now both forms correctly wrap the function body (not the definition) with pool operations (checkpoint/rewind). Also adds a comprehensive test suite (94 tests) for backend macro expansion that verifies correct code generation without requiring an actual CUDA installation.
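Both forms, sketched:

```julia
# Block form (previously the only working form):
@with_pool :cuda pool begin
    v = acquire!(pool, Float32, n)
    # ...
end

# Function-definition form (fixed here): the checkpoint/rewind pair wraps
# the *body* of f on every call, not the definition itself.
@with_pool :cuda pool function f(x)
    v = acquire!(pool, Float32, length(x))
    # ...
end
```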
…h_pool :cuda
- Remove redundant @with_cuda_pool macro alias (users should use @with_pool :cuda)
- Improve backend error message for unavailable backends
- Add coverage tests for CUDA extension state management:
  - Multi-type checkpoint/rewind
  - Type-specific reset
  - Rewind at depth=1 edge cases
  - State operations with rare types (pool.others)
  - get_task_local_cuda_pools before pool creation
- Unify get_view! to handle all dimensions (1D, 2D, 3D, etc.) with single cache
- Achieve 0 bytes CPU allocation on cache hit for acquire!
- get_view!(n::Int) delegates to get_view!((n,)) for API consistency
- Add get_nd_view! override that delegates to unified get_view!
- Cache stores CuArray{T,N} for any N using Vector{Any} with type assertions
- GPU view()/reshape() return CuArray (not SubArray/ReshapedArray like CPU)
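A quick REPL check of the CUDA.jl behavior this relies on:

```julia
julia> using CUDA

julia> v = CUDA.zeros(Float32, 16);

julia> view(v, 1:8) isa CuArray      # contiguous view: CuArray, not SubArray
true

julia> reshape(v, 4, 4) isa CuArray  # reshape: CuArray, not ReshapedArray
true
```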
- Remove get_nd_array! implementation (80 bytes overhead)
- Remove nd_arrays, nd_dims, nd_ptrs, nd_next_way fields from CuTypedPool
- get_view! handles all dimensions with 0 bytes CPU alloc on cache hit
- Simplify CuTypedPool struct: only vectors, views, view_dims needed
- Update empty!() to match the simplified struct
…ire! compatibility
- Add 4-way cache per slot (CUDA_CACHE_WAYS=4) for multiple dimension patterns
- Implement round-robin cache replacement with next_way counter
- Add resize-to-fit: backing vectors grow or shrink to match the requested size
- Add cache invalidation on resize (all ways) to prevent stale view references
- Document CUDA.jl's internal 25% shrink threshold behavior
- Update types.jl with next_way field and N-way cache layout docs
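An illustrative sketch of the 4-way cache under rotating shapes (the shapes and loop are hypothetical):

```julia
@with_pool :cuda pool begin
    for _ in 1:100
        a = acquire!(pool, Float32, (32, 32))   # way 1 after the first pass
        b = acquire!(pool, Float32, (16, 64))   # way 2
        c = acquire!(pool, Float32, (8, 128))   # way 3
        d = acquire!(pool, Float32, (4, 256))   # way 4 — all hits, 0 CPU bytes
        # A fifth distinct shape would evict one way round-robin (next_way)
        # and pay a small CPU allocation to rebuild the cached CuArray wrapper.
    end
end
```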
- Create test/cuda/runtests.jl as entry point with a separate availability check
- Move test_cuda_extension.jl to test/cuda/test_extension.jl
- Update test/runtests.jl to include cuda/runtests.jl
- Fix P1: CUDA test failures are no longer swallowed by try/catch. The availability check is now in try/catch, but test execution is outside, ensuring failures properly propagate.
- Add get_task_local_cuda_pool/get_task_local_cuda_pools stubs to the main module
- Extension now overrides stubs instead of defining new functions
- Update docstrings for acquire!/unsafe_acquire! to be backend-agnostic
- Simplify test/cuda/runtests.jl (functions now resolve via dispatch, not the extension module)

Users can now `using AdaptiveArrayPools` and call CUDA functions directly when CUDA.jl is loaded, without accessing the extension module.
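The stub-override pattern, sketched (bodies are illustrative; the zero-argument `CuAdaptiveArrayPool()` constructor, the storage key, and the device-id lookup are assumptions):

```julia
# src/task_local_pool.jl — stubs with no methods until the extension loads:
function get_task_local_cuda_pool end
function get_task_local_cuda_pools end

# ext/AdaptiveArrayPoolsCUDAExt — methods attached when CUDA.jl is present:
function AdaptiveArrayPools.get_task_local_cuda_pools()
    get!(task_local_storage(), :adaptive_cuda_pools) do
        Dict{Int,CuAdaptiveArrayPool}()          # one pool per device id
    end::Dict{Int,CuAdaptiveArrayPool}
end

function AdaptiveArrayPools.get_task_local_cuda_pool()
    dev = CUDA.deviceid(CUDA.device())           # current device ordinal
    get!(CuAdaptiveArrayPool, get_task_local_cuda_pools(), dev)
end
```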
- Add pool_stats and Base.show methods for CuTypedPool, CuAdaptiveArrayPool
- Add symbol dispatch: pool_stats(:cpu), pool_stats(:cuda)
- pool_stats() now shows all pools (CPU + CUDA if loaded)
- Rename terminology: arrays/vectors → slots for clarity
- Simplify output format (remove unicode box drawing)
- Use Base.format_bytes instead of custom _format_bytes
- Add return nothing to all pool_stats functions
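The resulting call surface, per the list above:

```julia
pool_stats()        # all pools: CPU, plus CUDA when the extension is loaded
pool_stats(:cpu)    # CPU task-local pool only
pool_stats(:cuda)   # CUDA task-local pool(s) only
```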
- Add test_allocation.jl: GPU memory reuse, pointer verification, resize behavior
- Add test_nway_cache.jl: N-way cache verification (4-way hit=0, 5-way miss>0)
- Add test_display.jl: pool_stats and Base.show for CuTypedPool/CuAdaptiveArrayPool
- Update runtests.jl to include the new test modules

Key test principles:
- GPU allocation should ALWAYS be 0 (memory reused from the pool)
- CPU allocation: cache hit (4-way) = 0, cache miss (5-way) > 0
- Separate GPU tests (with fill!) from CPU tests (without fill!, to avoid kernel overhead)
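A sketch of the measurement pattern behind these principles, assuming `CUDA.@allocated` is available in the installed CUDA.jl version (the warm-up call is excluded from the measurement):

```julia
# Warm up so the pool owns the backing memory:
@with_pool :cuda pool begin
    acquire!(pool, Float32, (32, 32))
end

# Steady state: GPU bytes allocated should be exactly zero.
gpu_bytes = CUDA.@allocated @with_pool :cuda pool begin
    acquire!(pool, Float32, (32, 32))
end
@assert gpu_bytes == 0   # memory reused from the pool, not reallocated
```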
- README: Rewritten with a clear "The Problem" → "The Solution" structure
- README: Emphasize CPU and CUDA backend support upfront
- README: Use descriptive function names (compute_naive, compute_pooled)
- README: Consolidate redundant CPU/CUDA examples
- docs/cuda.md: New dedicated CUDA backend documentation
- docs/api.md: Minor consistency fix

docs: emphasize automatic state management, move safety details to a separate guide
- README: Add "How It Works" section explaining automatic checkpoint/rewind
- README: Simplify thread-safety to a positive "safe by design" message
- README: Remove API overview table (details in api.md)
- README: One-line safety rule with a link to the full guide
- docs/safety.md: New comprehensive safety guide with scope rules and examples

docs(readme): add user responsibility note for scope management
… for native arrays
- Remove CUDA_CACHE_WAYS, use shared CACHE_WAYS from the main module
- Fix documentation typo: .complex64 → .complexf64
Pull request overview
This PR adds comprehensive CUDA backend support to AdaptiveArrayPools, enabling zero-allocation GPU memory pooling with the same API as the CPU implementation. The implementation follows a well-structured extension pattern and maintains backward compatibility while introducing GPU capabilities.
Key Changes
- Introduces an abstract type hierarchy (`AbstractTypedPool`, `AbstractArrayPool`) to enable backend extensibility
- Adds backend dispatch via the `Val{:backend}` pattern for compile-time resolution with `@with_pool :cuda` syntax
- Implements the CUDA extension with a unified N-way cache for all dimensions (returns `CuArray` instead of view types)
- Updates terminology from "arrays/vectors" to "slots" throughout for consistency
Reviewed changes
Copilot reviewed 33 out of 34 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/types.jl | Adds abstract type hierarchy for extensibility |
| src/acquire.jl | Refactors allocation to use dispatch points (allocate_vector, wrap_array) |
| src/macros.jl | Implements _get_pool_for_backend dispatch and backend-specific code generation |
| src/task_local_pool.jl | Adds CUDA pool stub functions |
| src/utils.jl | Updates display logic to support both CPU and CUDA pools, changes "arrays" to "slots" |
| ext/AdaptiveArrayPoolsCUDAExt/ | Complete CUDA extension with types, dispatch, acquire logic, state management, and utilities |
| test/cuda/ | Comprehensive test suite for CUDA functionality (extension, allocation, N-way cache, display) |
| test/test_backend_macro_expansion.jl | Tests for backend-specific macro expansion |
| docs/cuda.md, docs/safety.md | New documentation for the CUDA backend and safety guide |
| README.md | Updated to highlight CUDA support and simplified introduction |
Codecov Report
❌ Patch coverage is …

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
- Coverage   96.68%   94.48%   -2.20%
==========================================
  Files           7        7
  Lines         603      671      +68
==========================================
+ Hits          583      634      +51
- Misses         20       37      +17
```

View full report in Codecov by Sentry.
Summary
GPU memory pooling with zero-allocation design, matching the existing CPU pool semantics.
Features
Zero-Allocation GPU Pooling
- `@with_pool :cuda` macro with automatic checkpoint/rewind lifecycle

Unified API
- Same `acquire!`/`unsafe_acquire!` interface as CPU
- Returns `CuArray{T,N}` (not SubArray/ReshapedArray like CPU)
- `Val{:backend}` pattern for extensibility

Usage
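A minimal usage sketch, assuming the `acquire!` call shape shown elsewhere in this PR:

```julia
using AdaptiveArrayPools, CUDA

@with_pool :cuda pool begin
    a = acquire!(pool, Float32, 1024)        # CuVector{Float32}
    b = acquire!(pool, Float32, (32, 32))    # CuMatrix{Float32}
    a .= 1f0
    b .= 2f0
    # ... GPU work ...
end  # checkpoint/rewind handled automatically
```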
Implementation Details
- `CuTypedPool{T}`: per-type GPU slot pool
- `CuAdaptiveArrayPool`: top-level pool with device_id tracking
- `get_task_local_cuda_pool()`: task-local, multi-device pool storage
- Unified N-way view cache (shared `CACHE_WAYS`)

Allocation Behavior
- Backing `CuVector` resized and reused; steady-state GPU allocation is 0