
Conversation

@mgyoo86 commented on Dec 16, 2025

Summary

Adds GPU memory pooling with a zero-allocation design, matching the existing CPU pool semantics.

Features

Zero-Allocation GPU Pooling

  • @with_pool :cuda macro with automatic checkpoint/rewind lifecycle
  • Task-local pools with multi-device awareness (one pool per GPU per Task)
  • N-way view cache eliminates both GPU and CPU allocation after warmup

Unified API

  • Same acquire! / unsafe_acquire! interface as CPU
  • Returns CuArray{T,N} (not SubArray/ReshapedArray like CPU)
  • Backend dispatch via Val{:backend} pattern for extensibility

Usage

using AdaptiveArrayPools, CUDA

@with_pool :cuda pool begin
    A = acquire!(pool, Float64, 100, 100)  # CuArray{Float64,2}
    B = acquire!(pool, Float64, 100, 100)
    fill!(A, 1.0); fill!(B, 2.0)
    sum(A .+ B)
end  # automatic rewind - arrays returned to pool

# Zero GPU allocation in hot loops
for i in 1:1000
    @with_pool :cuda p begin
        tmp = acquire!(p, Float32, 1000)
        # ... compute ...
    end
end

Implementation Details

Component                     Description
CuTypedPool{T}                Type-specific pool with N-way view cache
CuAdaptiveArrayPool           Multi-type pool with 8 fixed slots + fallback
get_task_local_cuda_pool()    Per-Task, per-device pool retrieval
N-way cache                   4-way set-associative cache per slot (configurable via CACHE_WAYS)

Allocation Behavior

  • GPU: Always 0 bytes after warmup (backing CuVector resized and reused)
  • CPU: 0 bytes on cache hit (≤4 dimension patterns), ~80 bytes on miss
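
This can be checked directly; below is a minimal sketch, assuming CUDA.@allocated reports device-side allocation for the wrapped expression (the first call is the warmup):

using AdaptiveArrayPools, CUDA

function pooled_step()
    @with_pool :cuda p begin
        tmp = acquire!(p, Float32, 1000)   # CuArray{Float32,1} backed by the pool
        fill!(tmp, 1.0f0)
        sum(tmp)
    end
end

pooled_step()                              # warmup: backing CuVector allocated once
gpu_bytes = CUDA.@allocated pooled_step()  # expected to be 0 after warmup
@show gpu_bytes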

- Add AbstractTypedPool{T,V} and AbstractArrayPool abstract types
- Make TypedPool and AdaptiveArrayPool inherit from abstract types
- Add dispatch points: allocate_vector(), wrap_array() for GPU backends
- Add Val-based backend dispatch: _get_pool_for_backend(::Val{:backend})
- Generalize get_view!, get_nd_array!, get_nd_view! to AbstractTypedPool
- Generalize state functions to work with any AbstractTypedPool
- Export abstract types for extension subtyping

This enables GPU extensions (CUDA, Metal) to reuse 95%+ of the pool
logic by implementing only the allocation/wrapping dispatch methods.
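
As a rough sketch of what such a backend provides (the method signatures here are assumptions for illustration; the real definitions live in the CUDA extension):

using AdaptiveArrayPools, CUDA
import AdaptiveArrayPools: allocate_vector, wrap_array, _get_pool_for_backend

# Allocate the backing storage used by a typed pool slot (assumed signature).
allocate_vector(::Type{CuVector{T}}, n::Integer) where {T} = CuVector{T}(undef, n)

# Wrap the first prod(dims) elements of the backing vector as an N-d array.
# On the GPU, view() and reshape() return CuArray, so no SubArray/ReshapedArray.
wrap_array(v::CuVector{T}, dims::Dims{N}) where {T,N} =
    reshape(view(v, 1:prod(dims)), dims)

# Register the backend so @with_pool :cuda resolves to the CUDA task-local pool.
_get_pool_for_backend(::Val{:cuda}) = get_task_local_cuda_pool()
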
Phase 2a+2b implementation:
- Add CuTypedPool{T} (no view caching - GPU views return CuArray)
- Add CuAdaptiveArrayPool with Float16 slot and device_id tracking
- Implement allocate_vector, wrap_array, get_typed_pool! dispatches
- Implement GPU-specific get_view! (fresh views each call, O(1) metadata)
- Add checkpoint auto-init for dynamic types in others fallback
- Configure package extension via weakdeps/extensions in Project.toml
- Add verification scripts for CUDA behavior and extension tests
- Add get_task_local_cuda_pool() with multi-device Dict{Int, Pool} storage
- Add get_task_local_cuda_pools() for diagnostic access
- Implement checkpoint!/rewind!/reset!/empty! for CuAdaptiveArrayPool
- Add foreach_fixed_slot for GPU pool iteration
- Add empty! for CuTypedPool (no views field unlike CPU)
- Support type-specific checkpoint/rewind variants
- Add backend-specific @with_pool macro variants using Val{:backend} dispatch
- Register :cuda backend via _get_pool_for_backend(::Val{:cuda})
- Add explicit @with_cuda_pool macro as alias
- Change all acquire functions to use AbstractArrayPool for extensibility
  - _mark_untracked!, _acquire_impl!, _unsafe_acquire_impl!
  - acquire!, unsafe_acquire! and all variants
- Add test script for Phase 2d verification

Enables:
  @with_pool :cuda pool begin ... end
  @with_cuda_pool pool begin ... end
  Nested CPU/GPU pools
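
A small usage sketch of the nested form (variable names are illustrative):

using AdaptiveArrayPools, CUDA

@with_pool :cuda gpu_pool begin
    d = acquire!(gpu_pool, Float32, 1024)       # device buffer from the CUDA pool
    fill!(d, 2.0f0)
    @with_pool cpu_pool begin
        h = acquire!(cpu_pool, Float32, 1024)   # host scratch from the CPU pool
        h .= 1.0f0
        sum(d) + sum(h)                         # GPU and CPU scratch side by side
    end                                         # CPU pool rewound here
end                                             # CUDA pool rewound here
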
- Add CUDA dependency to test/Project.toml for extension loading
- Change CUDA test logic from opt-in (TEST_CUDA=true) to auto-detect
- Use TEST_CUDA=false to explicitly skip CUDA tests when needed
- Downgrade warnings to info messages for non-error skip conditions
Add typed checkpoint/rewind optimization to _generate_pool_code_with_backend,
matching the optimization already present in the regular @with_pool. This enables
@with_pool :cuda to use fast typed operations when all acquire! types are
statically known.
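
A hypothetical expansion sketch of the typed fast path (not the literal macro output; the typed checkpoint!/rewind! signatures are assumed):

# For: @with_pool :cuda pool begin
#          A = acquire!(pool, Float64, 100, 100)
#          ...
#      end
pool = _get_pool_for_backend(Val(:cuda))
checkpoint!(pool, Float64)          # typed checkpoint: touches only the Float64 slot
try
    A = acquire!(pool, Float64, 100, 100)
    # ... user code ...
finally
    rewind!(pool, Float64)          # typed rewind: skips the untouched slots
end
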
Add _generate_function_pool_code_with_backend to properly handle
function definition syntax for backend-specific pool macros:
  @with_pool :cuda pool function f(x) ... end

Previously, the macro only worked with the block form. Now both forms
correctly wrap the function body (not the definition) with pool
operations (checkpoint/rewind).
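
Roughly, the function form now expands along these lines (a sketch, not the literal expansion):

# @with_pool :cuda pool function f(x)
#     y = acquire!(pool, Float64, length(x))
#     copyto!(y, x)
#     sum(y)
# end
#
# becomes something like:
function f(x)
    pool = _get_pool_for_backend(Val(:cuda))
    checkpoint!(pool)
    try
        y = acquire!(pool, Float64, length(x))
        copyto!(y, x)
        sum(y)
    finally
        rewind!(pool)       # runs on every exit path, so the pool is always rewound
    end
end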

Also adds comprehensive test suite (94 tests) for backend macro
expansion that verifies correct code generation without requiring
actual CUDA installation.
…h_pool :cuda

- Remove redundant @with_cuda_pool macro alias (users should use @with_pool :cuda)
- Improve backend error message for unavailable backends
- Add coverage tests for CUDA extension state management:
  - Multi-type checkpoint/rewind
  - Type-specific reset
  - Rewind at depth=1 edge cases
  - State operations with rare types (pool.others)
  - get_task_local_cuda_pools before pool creation
- Unify get_view! to handle all dimensions (1D, 2D, 3D, etc.) with single cache
- Achieve 0 bytes CPU allocation on cache hit for acquire!
- get_view!(n::Int) delegates to get_view!((n,)) for API consistency
- Add get_nd_view! override that delegates to unified get_view!
- Cache stores CuArray{T,N} for any N using Vector{Any} with type assertions
- GPU view()/reshape() return CuArray (not SubArray/ReshapedArray like CPU)
- Remove get_nd_array! implementation (80 bytes overhead)
- Remove nd_arrays, nd_dims, nd_ptrs, nd_next_way fields from CuTypedPool
- get_view! handles all dimensions with 0 bytes CPU alloc on cache hit
- Simplify CuTypedPool struct: only vectors, views, view_dims needed
- Update empty!() to match simplified struct
- Add 4-way cache per slot (CUDA_CACHE_WAYS=4) for multiple dimension patterns (illustrated in the sketch after these commit notes)
- Implement round-robin cache replacement with next_way counter
- Add resize-to-fit: backing vectors grow or shrink to match requested size
- Add cache invalidation on resize (all ways) to prevent stale view references
- Document CUDA.jl's internal 25% shrink threshold behavior
- Update types.jl with next_way field and N-way cache layout docs
- Create test/cuda/runtests.jl as entry point with separated availability check
- Move test_cuda_extension.jl to test/cuda/test_extension.jl
- Update test/runtests.jl to include cuda/runtests.jl
- Fix P1: CUDA test failures no longer swallowed by try/catch

The availability check is now inside the try/catch, but test execution is outside it,
so failures propagate properly.
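
To illustrate the unified N-way cache described in the commit notes above (4-way, round-robin replacement, resize-to-fit with invalidation), here is a self-contained sketch; the struct and field names are simplified stand-ins, not the actual CuTypedPool:

using CUDA

const WAYS = 4                          # plays the role of the shared CACHE_WAYS

# Simplified stand-in for one typed GPU slot.
mutable struct MiniCuSlot{T}
    vector::CuVector{T}                 # backing storage, resized to fit requests
    views::Vector{Any}                  # cached CuArray{T,N} wrappers (one per way)
    view_dims::Vector{Any}              # dims key for each way
    next_way::Int                       # round-robin replacement cursor
end

MiniCuSlot{T}() where {T} =
    MiniCuSlot{T}(CuVector{T}(undef, 0),
                  Vector{Any}(nothing, WAYS), Vector{Any}(nothing, WAYS), 1)

function get_view!(slot::MiniCuSlot{T}, dims::Dims{N}) where {T,N}
    for w in 1:WAYS
        # Cache hit: same dims pattern -> return the cached wrapper, 0 CPU bytes.
        slot.view_dims[w] == dims && return slot.views[w]::CuArray{T,N}
    end
    n = prod(dims)
    if length(slot.vector) != n
        resize!(slot.vector, n)                   # resize-to-fit (grow or shrink)
        fill!(slot.views, nothing)                # invalidate all ways: old wrappers
        fill!(slot.view_dims, nothing)            # may reference stale memory
    end
    arr = reshape(view(slot.vector, 1:n), dims)   # view/reshape return CuArray on GPU
    w = slot.next_way                             # cache miss: install round-robin
    slot.views[w] = arr
    slot.view_dims[w] = dims
    slot.next_way = (w % WAYS) + 1
    return arr
end
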
- Add get_task_local_cuda_pool/get_task_local_cuda_pools stubs to main module
- Extension now overrides stubs instead of defining new functions
- Update docstrings for acquire!/unsafe_acquire! to be backend-agnostic
- Simplify test/cuda/runtests.jl (functions now via dispatch, not extension)

Users can now `using AdaptiveArrayPools` and call the CUDA functions directly
when CUDA.jl is loaded, without accessing the extension module.
- Add pool_stats and Base.show methods for CuTypedPool, CuAdaptiveArrayPool
- Add symbol dispatch: pool_stats(:cpu), pool_stats(:cuda)
- pool_stats() now shows all pools (CPU + CUDA if loaded)
- Rename terminology: arrays/vectors → slots for clarity
- Simplify output format (remove unicode box drawing)
- Use Base.format_bytes instead of custom _format_bytes
- Add return nothing to all pool_stats functions
- Add test_allocation.jl: GPU memory reuse, pointer verification, resize behavior
- Add test_nway_cache.jl: N-way cache verification (4-way hit=0, 5-way miss>0)
- Add test_display.jl: pool_stats and Base.show for CuTypedPool/CuAdaptiveArrayPool
- Update runtests.jl to include new test modules

Key test principles:
- GPU allocation should ALWAYS be 0 (memory reused from pool)
- CPU allocation: cache hit (4-way) = 0, cache miss (5-way) = >0
- Separate GPU tests (with fill!) from CPU tests (without fill! to avoid kernel overhead)
- README: Rewritten with clear "The Problem" → "The Solution" structure
- README: Emphasize CPU and CUDA backend support upfront
- README: Use descriptive function names (compute_naive, compute_pooled)
- README: Consolidated redundant CPU/CUDA examples
- docs/cuda.md: New dedicated CUDA backend documentation
- docs/api.md: Minor consistency fix

docs: emphasize automatic state management, move safety details to separate guide

- README: Add "How It Works" section explaining automatic checkpoint/rewind
- README: Simplify thread-safety to positive "safe by design" message
- README: Remove API overview table (details in api.md)
- README: One-line safety rule with link to full guide
- docs/safety.md: New comprehensive safety guide with scope rules and examples

docs(readme): add user responsibility note for scope management
- Remove CUDA_CACHE_WAYS, use shared CACHE_WAYS from main module
- Fix documentation typo: .complex64 → .complexf64

Copilot AI left a comment

Pull request overview

This PR adds comprehensive CUDA backend support to AdaptiveArrayPools, enabling zero-allocation GPU memory pooling with the same API as the CPU implementation. The implementation follows a well-structured extension pattern and maintains backward compatibility while introducing GPU capabilities.

Key Changes

  • Introduces abstract type hierarchy (AbstractTypedPool, AbstractArrayPool) to enable backend extensibility
  • Adds backend dispatch via Val{:backend} pattern for compile-time resolution with @with_pool :cuda syntax
  • Implements CUDA extension with unified N-way cache for all dimensions (returns CuArray instead of view types)
  • Updates terminology from "arrays/vectors" to "slots" throughout for consistency

Reviewed changes

Copilot reviewed 33 out of 34 changed files in this pull request and generated 3 comments.

Summary per file:

File                                  Description
src/types.jl                          Adds abstract type hierarchy for extensibility
src/acquire.jl                        Refactors allocation to use dispatch points (allocate_vector, wrap_array)
src/macros.jl                         Implements _get_pool_for_backend dispatch and backend-specific code generation
src/task_local_pool.jl                Adds CUDA pool stub functions
src/utils.jl                          Updates display logic to support both CPU and CUDA pools, changes "arrays" to "slots"
ext/AdaptiveArrayPoolsCUDAExt/        Complete CUDA extension with types, dispatch, acquire logic, state management, and utilities
test/cuda/                            Comprehensive test suite for CUDA functionality (extension, allocation, N-way cache, display)
test/test_backend_macro_expansion.jl  Tests for backend-specific macro expansion
docs/cuda.md, docs/safety.md          New documentation for CUDA backend and safety guide
README.md                             Updated to highlight CUDA support and simplified introduction

codecov bot commented Dec 16, 2025

Codecov Report

❌ Patch coverage is 84.74576% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.48%. Comparing base (a1bc284) to head (79299ac).
⚠️ Report is 23 commits behind head on master.

Files with missing lines    Patch %    Lines
src/macros.jl               81.13%     10 Missing ⚠️
src/utils.jl                81.81%      6 Missing ⚠️
src/types.jl                 0.00%      2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
- Coverage   96.68%   94.48%   -2.20%     
==========================================
  Files           7        7              
  Lines         603      671      +68     
==========================================
+ Hits          583      634      +51     
- Misses         20       37      +17     

@mgyoo86 merged commit 65e66b7 into master on Dec 16, 2025 (6 of 8 checks passed).
@mgyoo86 deleted the feat/cuda_backend branch on December 16, 2025 at 07:04.