Merged · 22 commits
67e0852
refactor: add abstract type hierarchy for GPU backend extensibility
mgyoo86 Dec 15, 2025
d5def82
feat(cuda): add CUDA extension with GPU memory pooling
mgyoo86 Dec 15, 2025
874b0be
feat(cuda): add task-local pool and state management (Phase 2c)
mgyoo86 Dec 15, 2025
8b5b17e
feat(cuda): add macro integration for @with_pool :cuda syntax
mgyoo86 Dec 15, 2025
3d8415c
feat(tests): add conditional CUDA extension tests in runtests.jl
mgyoo86 Dec 15, 2025
29bd414
test: auto-detect CUDA for extension tests
mgyoo86 Dec 15, 2025
a26443e
feat(macros): add type-specific optimization to backend pool macro
mgyoo86 Dec 15, 2025
24016a3
feat(macros): add function form support for backend macros
mgyoo86 Dec 15, 2025
b6f89a0
refactor(cuda): remove @with_cuda_pool macro in favor of unified @wit…
mgyoo86 Dec 15, 2025
4fc9998
feat(cuda): add get_nd_array! implementation for N-dimensional CuArra…
mgyoo86 Dec 15, 2025
950813d
feat(cuda): implement unified 1-way view cache for zero CPU allocation
mgyoo86 Dec 15, 2025
558d1cb
refactor(cuda): remove get_nd_array! and N-way cache, unify to get_view!
mgyoo86 Dec 15, 2025
d32fda9
feat(cuda): add get_nd_array! delegation to get_view! for unsafe_acqu…
mgyoo86 Dec 15, 2025
4038a46
feat(cuda): implement N-way view cache with resize-to-fit strategy
mgyoo86 Dec 15, 2025
7baccef
refactor(test): move CUDA tests to dedicated test/cuda/ directory
mgyoo86 Dec 16, 2025
f973246
refactor: export CUDA pool functions from main module with stub pattern
mgyoo86 Dec 16, 2025
a3a6e9d
feat(utils): add CUDA pool_stats and unified display API
mgyoo86 Dec 16, 2025
e34cab9
test(cuda): add comprehensive GPU allocation and cache tests
mgyoo86 Dec 16, 2025
074358b
docs: restructure README with problem/solution format, add CUDA docs
mgyoo86 Dec 16, 2025
e181660
docs(readme): clarify acquire! returns views, mention unsafe_acquire!…
mgyoo86 Dec 16, 2025
ccdaf75
refactor(cuda): unify CACHE_WAYS constant and fix documentation typo
mgyoo86 Dec 16, 2025
79299ac
ci: add src directory to coverage processing step
mgyoo86 Dec 16, 2025
2 changes: 2 additions & 0 deletions .github/workflows/CI.yml
```diff
@@ -41,6 +41,8 @@ jobs:
       - uses: julia-actions/julia-runtest@v1

       - uses: julia-actions/julia-processcoverage@v1
+        with:
+          directories: src

       - uses: codecov/codecov-action@v4
         with:
```
6 changes: 6 additions & 0 deletions Project.toml
```diff
@@ -6,3 +6,9 @@ authors = ["Min-Gu Yoo <mgyoo86@gmail.com>"]
 [deps]
 Preferences = "21216c6a-2e73-6563-6e65-726566657250"
 Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
+
+[weakdeps]
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
+
+[extensions]
+AdaptiveArrayPoolsCUDAExt = "CUDA"
```
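
For context, a hypothetical skeleton of how such a weak-dependency extension is wired; the module name matches the `[extensions]` entry above, but the body is illustrative rather than the package's actual source:

```julia
# ext/AdaptiveArrayPoolsCUDAExt.jl — sketch of the standard Pkg extension
# pattern; Julia loads this module automatically once both
# AdaptiveArrayPools and CUDA are imported in the same environment.
module AdaptiveArrayPoolsCUDAExt

using AdaptiveArrayPools
using CUDA

# CuArray-backed pool methods would be defined here, extending
# functions owned by AdaptiveArrayPools (hypothetical placement).

end # module
```
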
315 changes: 55 additions & 260 deletions README.md

# AdaptiveArrayPools.jl

**Zero-allocation temporary arrays for Julia.**

A lightweight library that lets you write natural, allocation-style code while automatically reusing memory behind the scenes. Eliminates GC pressure in hot loops without the complexity of manual buffer management.

**Supported backends:**
- **CPU** — `Array`, works out of the box
- **CUDA** — `CuArray`, loads automatically when [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) is available

## The Problem

In performance-critical code, temporary array allocations inside loops create massive GC pressure:

```julia
function compute_naive(n)
    A = rand(n, n)  # allocates
    B = rand(n, n)  # allocates
    C = A * B       # allocates
    return sum(C)
end

for i in 1:10_000
    compute_naive(100)  # 91 MiB total, 17% GC time
end
```

The traditional fix—passing pre-allocated buffers through your call stack—works but requires invasive refactoring and clutters your APIs.
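
For contrast, here is a sketch of that manual approach (illustrative code, not from the package): zero allocations in steady state, but every call site must own and thread the buffers.

```julia
using LinearAlgebra, Random

# Manual pre-allocation: fast, but buffers leak into every signature.
function compute_manual!(C, A, B)
    rand!(A); rand!(B)
    mul!(C, A, B)
    return sum(C)
end

A = Matrix{Float64}(undef, 100, 100)
B = similar(A); C = similar(A)
for i in 1:10_000
    compute_manual!(C, A, B)  # caller manages buffer lifetimes
end
```
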
## The Solution

Wrap your function with `@with_pool` and use `acquire!` instead of allocation:

```julia
using AdaptiveArrayPools, LinearAlgebra, Random

@with_pool pool function compute_pooled(n)
    A = acquire!(pool, Float64, n, n)  # reuses memory from pool
    B = acquire!(pool, Float64, n, n)
    C = acquire!(pool, Float64, n, n)

    rand!(A); rand!(B)
    mul!(C, A, B)
    return sum(C)
end

compute_pooled(100)  # warmup
for i in 1:10_000
    compute_pooled(100)  # 0 bytes, 0% GC
end
```

| Approach | Memory | GC Time | Code Complexity |
|----------|--------|---------|-----------------|
| Naive allocation | 91 MiB | 17% | Simple |
| Manual buffer passing | 0 | 0% | Complex, invasive refactor |
| **AdaptiveArrayPools** | **0** | **0%** | **Minimal change** |

> **CUDA support**: Same API—just use `@with_pool :cuda pool`. See [CUDA Backend](docs/cuda.md).
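
A minimal sketch of that backend switch, assuming CUDA.jl is installed so the extension loads; combining `:cuda` with the function form is assumed to work as in the CPU case, and `Float32` is used purely as a GPU-friendly choice:

```julia
using AdaptiveArrayPools, CUDA, LinearAlgebra

# Same acquire!/mul! pattern as the CPU example; the :cuda backend
# hands out views backed by pooled GPU memory instead.
@with_pool :cuda pool function gpu_step(n)
    A = acquire!(pool, Float32, n, n)
    B = acquire!(pool, Float32, n, n)
    C = acquire!(pool, Float32, n, n)
    A .= 1f0; B .= 2f0   # broadcast runs on the GPU
    mul!(C, A, B)        # dispatches to CUBLAS
    return sum(C)
end
```
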
## How It Works

`@with_pool` automatically manages the memory lifecycle for you:

1. **Checkpoint** — Saves the current pool state when entering the block
2. **Acquire** — `acquire!` returns arrays backed by pooled memory
3. **Rewind** — When the block ends, all acquired arrays are recycled for reuse

This automatic checkpoint/rewind cycle is what enables zero allocation on repeated calls. You just write normal-looking code with `acquire!` instead of constructors.
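
One way to see the cycle in action (illustrative; exact allocation counts can vary by Julia version, but the steady-state claim above is 0 bytes):

```julia
using AdaptiveArrayPools

@with_pool pool function pooled_step(n)
    v = acquire!(pool, Float64, n)
    fill!(v, 1.0)
    return sum(v)
end

pooled_step(1000)              # first call: the pool grows to fit
@allocated pooled_step(1000)   # steady state: 0 bytes, memory is rewound and reused
```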

`acquire!` returns lightweight views (`SubArray`, `ReshapedArray`) that work seamlessly with BLAS/LAPACK. If you need native `Array` types (FFI, type constraints), use `unsafe_acquire!`—see [API Reference](docs/api.md).
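
For example (adapted from the package's earlier docs; `:some_c_function` is a placeholder symbol):

```julia
@with_pool pool begin
    # Recommended: views work directly with BLAS (always 0 bytes)
    A = acquire!(pool, Float64, 100, 100)  # ReshapedArray
    B = acquire!(pool, Float64, 100, 100)
    C = acquire!(pool, Float64, 100, 100)
    mul!(C, A, B)

    # Only when a concrete Array type is required (e.g., FFI):
    M = unsafe_acquire!(pool, Float64, 100, 100)  # Matrix{Float64}
    ccall(:some_c_function, Cvoid, (Ptr{Float64},), M)
end
```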

> **Note**: Keeping acquired arrays inside the scope is your responsibility. Return computed values (scalars, copies), not the arrays themselves. See [Safety Guide](docs/safety.md).

**Thread-safe by design**: Each Julia Task gets its own independent pool, so `@with_pool` inside threaded code is automatically safe:

```julia
# ✅ SAFE: @with_pool inside @threads
Threads.@threads for i in 1:N
    @with_pool pool begin
        a = acquire!(pool, Float64, 100)
        # each thread has its own pool — no race conditions
    end
end

# ❌ UNSAFE: @with_pool outside @threads (race condition!)
@with_pool pool Threads.@threads for i in 1:N
    a = acquire!(pool, Float64, 100)  # all threads share one pool!
end
```

## Installation

`AdaptiveArrayPools` is registered with [FuseRegistry](https://github.com/ProjectTorreyPines/FuseRegistry.jl/):

```julia
using Pkg
Pkg.Registry.add(Pkg.RegistrySpec(url="https://github.com/ProjectTorreyPines/FuseRegistry.jl.git"))
Pkg.add("AdaptiveArrayPools")
```
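
The CUDA backend needs no separate installation step; a sketch, assuming CUDA.jl is present in the same environment:

```julia
using AdaptiveArrayPools  # CPU backend, always available
using CUDA                # loads the AdaptiveArrayPoolsCUDAExt extension automatically
```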

## Documentation

| Guide | Description |
|-------|-------------|
| [API Reference](docs/api.md) | Complete function and macro reference |
| [CUDA Backend](docs/cuda.md) | GPU-specific usage and examples |
| [Safety Guide](docs/safety.md) | Scope rules and best practices |
| [Multi-Threading](docs/multi-threading.md) | Task/thread safety patterns |
| [Configuration](docs/configuration.md) | Preferences and cache tuning |

## License

2 changes: 1 addition & 1 deletion docs/api.md
```diff
@@ -14,7 +14,7 @@
 | `acquire!(pool, T, dims...)` | Returns a view: `SubArray{T,1}` for 1D, `ReshapedArray{T,N}` for N-D. Always 0 bytes. |
 | `acquire!(pool, T, dims::Tuple)` | Tuple overload for `acquire!` (e.g., `acquire!(pool, T, size(x))`). |
 | `acquire!(pool, x::AbstractArray)` | Similar-style: acquires array matching `eltype(x)` and `size(x)`. |
-| `unsafe_acquire!(pool, T, dims...)` | Returns `SubArray{T,1}` for 1D, raw `Array{T,N}` for N-D. Only for FFI/type constraints. |
+| `unsafe_acquire!(pool, T, dims...)` | Returns native `Array`/`CuArray` (CPU: `Vector{T}` for 1D, `Array{T,N}` for N-D). Only for FFI/type constraints. |
 | `unsafe_acquire!(pool, T, dims::Tuple)` | Tuple overload for `unsafe_acquire!`. |
 | `unsafe_acquire!(pool, x::AbstractArray)` | Similar-style: acquires raw array matching `eltype(x)` and `size(x)`. |
 | `acquire_view!(pool, T, dims...)` | Alias for `acquire!`. Returns view types. |
```