
Conversation

@mgyoo86 commented on Dec 16, 2025

Summary

Adds GPU memory pooling with a zero-allocation design, matching the existing CPU pool semantics.

Features

Zero-Allocation GPU Pooling

  • @with_pool :cuda macro with automatic checkpoint/rewind lifecycle
  • Task-local pools with multi-device awareness (one pool per GPU per Task)
  • N-way view cache eliminates both GPU and CPU allocation after warmup

Unified API

  • Same acquire! / unsafe_acquire! interface as CPU
  • Returns CuArray{T,N} (not SubArray/ReshapedArray like CPU)
  • Backend dispatch via Val{:backend} pattern for extensibility

Usage

using AdaptiveArrayPools, CUDA

@with_pool :cuda pool begin
    A = acquire!(pool, Float64, 100, 100)  # CuArray{Float64,2}
    B = acquire!(pool, Float64, 100, 100)
    fill!(A, 1.0); fill!(B, 2.0)
    sum(A .+ B)
end  # automatic rewind - arrays returned to pool

# Zero GPU allocation in hot loops
for i in 1:1000
    @with_pool :cuda p begin
        tmp = acquire!(p, Float32, 1000)
        # ... compute ...
    end
end

Implementation Details

Component                     Description
CuTypedPool{T}                Type-specific pool with N-way view cache
CuAdaptiveArrayPool           Multi-type pool with 8 fixed slots + fallback
get_task_local_cuda_pool()    Per-Task, per-device pool retrieval
N-way cache                   4-way set-associative cache per slot (configurable via CACHE_WAYS)

Allocation Behavior

  • GPU: Always 0 bytes after warmup (backing CuVector resized and reused)
  • CPU: 0 bytes on cache hit (≤4 dimension patterns), ~80 bytes on miss
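
This can be checked directly; below is a minimal sketch, assuming CUDA.@allocated reports device-side allocation for the wrapped expression (the first call is the warmup):

using AdaptiveArrayPools, CUDA

function pooled_step()
    @with_pool :cuda p begin
        tmp = acquire!(p, Float32, 1000)   # CuArray{Float32,1} backed by the pool
        fill!(tmp, 1.0f0)
        sum(tmp)
    end
end

pooled_step()                              # warmup: backing CuVector allocated once
gpu_bytes = CUDA.@allocated pooled_step()  # expected to be 0 after warmup
@show gpu_bytes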

- Add AbstractTypedPool{T,V} and AbstractArrayPool abstract types
- Make TypedPool and AdaptiveArrayPool inherit from abstract types
- Add dispatch points: allocate_vector(), wrap_array() for GPU backends
- Add Val-based backend dispatch: _get_pool_for_backend(::Val{:backend})
- Generalize get_view!, get_nd_array!, get_nd_view! to AbstractTypedPool
- Generalize state functions to work with any AbstractTypedPool
- Export abstract types for extension subtyping

This enables GPU extensions (CUDA, Metal) to reuse 95%+ of the pool
logic by implementing only the allocation/wrapping dispatch methods.
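
As a rough sketch of what such a backend provides (the method signatures here are assumptions for illustration; the real definitions live in the CUDA extension):

using AdaptiveArrayPools, CUDA
import AdaptiveArrayPools: allocate_vector, wrap_array, _get_pool_for_backend

# Allocate the backing storage used by a typed pool slot (assumed signature).
allocate_vector(::Type{CuVector{T}}, n::Integer) where {T} = CuVector{T}(undef, n)

# Wrap the first prod(dims) elements of the backing vector as an N-d array.
# On the GPU, view() and reshape() return CuArray, so no SubArray/ReshapedArray.
wrap_array(v::CuVector{T}, dims::Dims{N}) where {T,N} =
    reshape(view(v, 1:prod(dims)), dims)

# Register the backend so @with_pool :cuda resolves to the CUDA task-local pool.
_get_pool_for_backend(::Val{:cuda}) = get_task_local_cuda_pool()
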
Phase 2a+2b implementation:
- Add CuTypedPool{T} (no view caching - GPU views return CuArray)
- Add CuAdaptiveArrayPool with Float16 slot and device_id tracking
- Implement allocate_vector, wrap_array, get_typed_pool! dispatches
- Implement GPU-specific get_view! (fresh views each call, O(1) metadata)
- Add checkpoint auto-init for dynamic types in others fallback
- Configure package extension via weakdeps/extensions in Project.toml
- Add verification scripts for CUDA behavior and extension tests
- Add get_task_local_cuda_pool() with multi-device Dict{Int, Pool} storage
- Add get_task_local_cuda_pools() for diagnostic access
- Implement checkpoint!/rewind!/reset!/empty! for CuAdaptiveArrayPool
- Add foreach_fixed_slot for GPU pool iteration
- Add empty! for CuTypedPool (no views field unlike CPU)
- Support type-specific checkpoint/rewind variants
- Add backend-specific @with_pool macro variants using Val{:backend} dispatch
- Register :cuda backend via _get_pool_for_backend(::Val{:cuda})
- Add explicit @with_cuda_pool macro as alias
- Change all acquire functions to use AbstractArrayPool for extensibility
  - _mark_untracked!, _acquire_impl!, _unsafe_acquire_impl!
  - acquire!, unsafe_acquire! and all variants
- Add test script for Phase 2d verification

Enables:
  @with_pool :cuda pool begin ... end
  @with_cuda_pool pool begin ... end
  Nested CPU/GPU pools
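
A small usage sketch of the nested form (variable names are illustrative):

using AdaptiveArrayPools, CUDA

@with_pool :cuda gpu_pool begin
    d = acquire!(gpu_pool, Float32, 1024)       # device buffer from the CUDA pool
    fill!(d, 2.0f0)
    @with_pool cpu_pool begin
        h = acquire!(cpu_pool, Float32, 1024)   # host scratch from the CPU pool
        h .= 1.0f0
        sum(d) + sum(h)                         # GPU and CPU scratch side by side
    end                                         # CPU pool rewound here
end                                             # CUDA pool rewound here
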
- Add CUDA dependency to test/Project.toml for extension loading
- Change CUDA test logic from opt-in (TEST_CUDA=true) to auto-detect
- Use TEST_CUDA=false to explicitly skip CUDA tests when needed
- Downgrade warnings to info messages for non-error skip conditions
Add typed checkpoint/rewind optimization to _generate_pool_code_with_backend,
matching the optimization already present in the regular @with_pool. This enables
@with_pool :cuda to use fast typed operations when all acquire! types are
statically known.
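
A hypothetical expansion sketch of the typed fast path (not the literal macro output; the typed checkpoint!/rewind! signatures are assumed):

# For: @with_pool :cuda pool begin
#          A = acquire!(pool, Float64, 100, 100)
#          ...
#      end
pool = _get_pool_for_backend(Val(:cuda))
checkpoint!(pool, Float64)          # typed checkpoint: touches only the Float64 slot
try
    A = acquire!(pool, Float64, 100, 100)
    # ... user code ...
finally
    rewind!(pool, Float64)          # typed rewind: skips the untouched slots
end
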
Add _generate_function_pool_code_with_backend to properly handle
function definition syntax for backend-specific pool macros:
  @with_pool :cuda pool function f(x) ... end

Previously, the macro only worked with the block form. Now both forms
correctly wrap the function body (not the definition) with pool
operations (checkpoint/rewind).
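
Roughly, the function form now expands along these lines (a sketch, not the literal expansion):

# @with_pool :cuda pool function f(x)
#     y = acquire!(pool, Float64, length(x))
#     copyto!(y, x)
#     sum(y)
# end
#
# becomes something like:
function f(x)
    pool = _get_pool_for_backend(Val(:cuda))
    checkpoint!(pool)
    try
        y = acquire!(pool, Float64, length(x))
        copyto!(y, x)
        sum(y)
    finally
        rewind!(pool)       # runs on every exit path, so the pool is always rewound
    end
end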

Also adds comprehensive test suite (94 tests) for backend macro
expansion that verifies correct code generation without requiring
actual CUDA installation.
…h_pool :cuda

- Remove redundant @with_cuda_pool macro alias (users should use @with_pool :cuda)
- Improve backend error message for unavailable backends
- Add coverage tests for CUDA extension state management:
  - Multi-type checkpoint/rewind
  - Type-specific reset
  - Rewind at depth=1 edge cases
  - State operations with rare types (pool.others)
  - get_task_local_cuda_pools before pool creation
- Unify get_view! to handle all dimensions (1D, 2D, 3D, etc.) with single cache
- Achieve 0 bytes CPU allocation on cache hit for acquire!
- get_view!(n::Int) delegates to get_view!((n,)) for API consistency
- Add get_nd_view! override that delegates to unified get_view!
- Cache stores CuArray{T,N} for any N using Vector{Any} with type assertions
- GPU view()/reshape() return CuArray (not SubArray/ReshapedArray like CPU)
- Remove get_nd_array! implementation (80 bytes overhead)
- Remove nd_arrays, nd_dims, nd_ptrs, nd_next_way fields from CuTypedPool
- get_view! handles all dimensions with 0 bytes CPU alloc on cache hit
- Simplify CuTypedPool struct: only vectors, views, view_dims needed
- Update empty!() to match simplified struct
- Add 4-way cache per slot (CUDA_CACHE_WAYS=4) for multiple dimension patterns (illustrated in the sketch after these commit notes)
- Implement round-robin cache replacement with next_way counter
- Add resize-to-fit: backing vectors grow or shrink to match requested size
- Add cache invalidation on resize (all ways) to prevent stale view references
- Document CUDA.jl's internal 25% shrink threshold behavior
- Update types.jl with next_way field and N-way cache layout docs
- Create test/cuda/runtests.jl as entry point with separated availability check
- Move test_cuda_extension.jl to test/cuda/test_extension.jl
- Update test/runtests.jl to include cuda/runtests.jl
- Fix P1: CUDA test failures no longer swallowed by try/catch

The availability check is now inside the try/catch, but test execution is outside it,
so failures propagate properly.
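
To illustrate the unified N-way cache described in the commit notes above (4-way, round-robin replacement, resize-to-fit with invalidation), here is a self-contained sketch; the struct and field names are simplified stand-ins, not the actual CuTypedPool:

using CUDA

const WAYS = 4                          # plays the role of the shared CACHE_WAYS

# Simplified stand-in for one typed GPU slot.
mutable struct MiniCuSlot{T}
    vector::CuVector{T}                 # backing storage, resized to fit requests
    views::Vector{Any}                  # cached CuArray{T,N} wrappers (one per way)
    view_dims::Vector{Any}              # dims key for each way
    next_way::Int                       # round-robin replacement cursor
end

MiniCuSlot{T}() where {T} =
    MiniCuSlot{T}(CuVector{T}(undef, 0),
                  Vector{Any}(nothing, WAYS), Vector{Any}(nothing, WAYS), 1)

function get_view!(slot::MiniCuSlot{T}, dims::Dims{N}) where {T,N}
    for w in 1:WAYS
        # Cache hit: same dims pattern -> return the cached wrapper, 0 CPU bytes.
        slot.view_dims[w] == dims && return slot.views[w]::CuArray{T,N}
    end
    n = prod(dims)
    if length(slot.vector) != n
        resize!(slot.vector, n)                   # resize-to-fit (grow or shrink)
        fill!(slot.views, nothing)                # invalidate all ways: old wrappers
        fill!(slot.view_dims, nothing)            # may reference stale memory
    end
    arr = reshape(view(slot.vector, 1:n), dims)   # view/reshape return CuArray on GPU
    w = slot.next_way                             # cache miss: install round-robin
    slot.views[w] = arr
    slot.view_dims[w] = dims
    slot.next_way = (w % WAYS) + 1
    return arr
end
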
- Add get_task_local_cuda_pool/get_task_local_cuda_pools stubs to main module
- Extension now overrides stubs instead of defining new functions
- Update docstrings for acquire!/unsafe_acquire! to be backend-agnostic
- Simplify test/cuda/runtests.jl (functions now via dispatch, not extension)

Users can now `using AdaptiveArrayPools` and call the CUDA functions directly
when CUDA.jl is loaded, without accessing the extension module.
- Add pool_stats and Base.show methods for CuTypedPool, CuAdaptiveArrayPool
- Add symbol dispatch: pool_stats(:cpu), pool_stats(:cuda)
- pool_stats() now shows all pools (CPU + CUDA if loaded)
- Rename terminology: arrays/vectors → slots for clarity
- Simplify output format (remove unicode box drawing)
- Use Base.format_bytes instead of custom _format_bytes
- Add return nothing to all pool_stats functions
- Add test_allocation.jl: GPU memory reuse, pointer verification, resize behavior
- Add test_nway_cache.jl: N-way cache verification (4-way hit=0, 5-way miss>0)
- Add test_display.jl: pool_stats and Base.show for CuTypedPool/CuAdaptiveArrayPool
- Update runtests.jl to include new test modules

Key test principles:
- GPU allocation should ALWAYS be 0 (memory reused from pool)
- CPU allocation: cache hit (4-way) = 0, cache miss (5-way) = >0
- Separate GPU tests (with fill!) from CPU tests (without fill! to avoid kernel overhead)
- README: Rewritten with clear "The Problem" → "The Solution" structure
- README: Emphasize CPU and CUDA backend support upfront
- README: Use descriptive function names (compute_naive, compute_pooled)
- README: Consolidated redundant CPU/CUDA examples
- docs/cuda.md: New dedicated CUDA backend documentation
- docs/api.md: Minor consistency fix

docs: emphasize automatic state management, move safety details to separate guide

- README: Add "How It Works" section explaining automatic checkpoint/rewind
- README: Simplify thread-safety to positive "safe by design" message
- README: Remove API overview table (details in api.md)
- README: One-line safety rule with link to full guide
- docs/safety.md: New comprehensive safety guide with scope rules and examples

docs(readme): add user responsibility note for scope management
- Remove CUDA_CACHE_WAYS, use shared CACHE_WAYS from main module
- Fix documentation typo: .complex64 → .complexf64

Copilot AI left a comment

Pull request overview

This PR adds comprehensive CUDA backend support to AdaptiveArrayPools, enabling zero-allocation GPU memory pooling with the same API as the CPU implementation. The implementation follows a well-structured extension pattern and maintains backward compatibility while introducing GPU capabilities.

Key Changes

  • Introduces abstract type hierarchy (AbstractTypedPool, AbstractArrayPool) to enable backend extensibility
  • Adds backend dispatch via Val{:backend} pattern for compile-time resolution with @with_pool :cuda syntax
  • Implements CUDA extension with unified N-way cache for all dimensions (returns CuArray instead of view types)
  • Updates terminology from "arrays/vectors" to "slots" throughout for consistency

Reviewed changes

Copilot reviewed 33 out of 34 changed files in this pull request and generated 3 comments.

Summary per file:

File                                  Description
src/types.jl                          Adds abstract type hierarchy for extensibility
src/acquire.jl                        Refactors allocation to use dispatch points (allocate_vector, wrap_array)
src/macros.jl                         Implements _get_pool_for_backend dispatch and backend-specific code generation
src/task_local_pool.jl                Adds CUDA pool stub functions
src/utils.jl                          Updates display logic to support both CPU and CUDA pools, changes "arrays" to "slots"
ext/AdaptiveArrayPoolsCUDAExt/        Complete CUDA extension with types, dispatch, acquire logic, state management, and utilities
test/cuda/                            Comprehensive test suite for CUDA functionality (extension, allocation, N-way cache, display)
test/test_backend_macro_expansion.jl  Tests for backend-specific macro expansion
docs/cuda.md, docs/safety.md          New documentation for CUDA backend and safety guide
README.md                             Updated to highlight CUDA support and simplified introduction

codecov bot commented Dec 16, 2025

Codecov Report

❌ Patch coverage is 84.74576% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.48%. Comparing base (a1bc284) to head (79299ac).
⚠️ Report is 23 commits behind head on master.

Files with missing lines    Patch %    Lines
src/macros.jl               81.13%     10 Missing ⚠️
src/utils.jl                81.81%      6 Missing ⚠️
src/types.jl                 0.00%      2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
- Coverage   96.68%   94.48%   -2.20%     
==========================================
  Files           7        7              
  Lines         603      671      +68     
==========================================
+ Hits          583      634      +51     
- Misses         20       37      +17     

@mgyoo86 merged commit 65e66b7 into master on Dec 16, 2025 (6 of 8 checks passed).
@mgyoo86 deleted the feat/cuda_backend branch on December 16, 2025 at 07:04.