Merged · 22 commits
67e0852
refactor: add abstract type hierarchy for GPU backend extensibility
mgyoo86 Dec 15, 2025
d5def82
feat(cuda): add CUDA extension with GPU memory pooling
mgyoo86 Dec 15, 2025
874b0be
feat(cuda): add task-local pool and state management (Phase 2c)
mgyoo86 Dec 15, 2025
8b5b17e
feat(cuda): add macro integration for @with_pool :cuda syntax
mgyoo86 Dec 15, 2025
3d8415c
feat(tests): add conditional CUDA extension tests in runtests.jl
mgyoo86 Dec 15, 2025
29bd414
test: auto-detect CUDA for extension tests
mgyoo86 Dec 15, 2025
a26443e
feat(macros): add type-specific optimization to backend pool macro
mgyoo86 Dec 15, 2025
24016a3
feat(macros): add function form support for backend macros
mgyoo86 Dec 15, 2025
b6f89a0
refactor(cuda): remove @with_cuda_pool macro in favor of unified @wit…
mgyoo86 Dec 15, 2025
4fc9998
feat(cuda): add get_nd_array! implementation for N-dimensional CuArra…
mgyoo86 Dec 15, 2025
950813d
feat(cuda): implement unified 1-way view cache for zero CPU allocation
mgyoo86 Dec 15, 2025
558d1cb
refactor(cuda): remove get_nd_array! and N-way cache, unify to get_view!
mgyoo86 Dec 15, 2025
d32fda9
feat(cuda): add get_nd_array! delegation to get_view! for unsafe_acqu…
mgyoo86 Dec 15, 2025
4038a46
feat(cuda): implement N-way view cache with resize-to-fit strategy
mgyoo86 Dec 15, 2025
7baccef
refactor(test): move CUDA tests to dedicated test/cuda/ directory
mgyoo86 Dec 16, 2025
f973246
refactor: export CUDA pool functions from main module with stub pattern
mgyoo86 Dec 16, 2025
a3a6e9d
feat(utils): add CUDA pool_stats and unified display API
mgyoo86 Dec 16, 2025
e34cab9
test(cuda): add comprehensive GPU allocation and cache tests
mgyoo86 Dec 16, 2025
074358b
docs: restructure README with problem/solution format, add CUDA docs
mgyoo86 Dec 16, 2025
e181660
docs(readme): clarify acquire! returns views, mention unsafe_acquire!…
mgyoo86 Dec 16, 2025
ccdaf75
refactor(cuda): unify CACHE_WAYS constant and fix documentation typo
mgyoo86 Dec 16, 2025
79299ac
ci: add src directory to coverage processing step
mgyoo86 Dec 16, 2025
2 changes: 2 additions & 0 deletions .github/workflows/CI.yml
```diff
@@ -41,6 +41,8 @@ jobs:
       - uses: julia-actions/julia-runtest@v1

       - uses: julia-actions/julia-processcoverage@v1
+        with:
+          directories: src

       - uses: codecov/codecov-action@v4
         with:
```
6 changes: 6 additions & 0 deletions Project.toml
```diff
@@ -6,3 +6,9 @@ authors = ["Min-Gu Yoo <mgyoo86@gmail.com>"]
 [deps]
 Preferences = "21216c6a-2e73-6563-6e65-726566657250"
 Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
+
+[weakdeps]
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
+
+[extensions]
+AdaptiveArrayPoolsCUDAExt = "CUDA"
```
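
For context, a hypothetical skeleton of how such a weak-dependency extension is wired; the module name matches the `[extensions]` entry above, but the body is illustrative rather than the package's actual source:

```julia
# ext/AdaptiveArrayPoolsCUDAExt.jl — sketch of the standard Pkg extension
# pattern; Julia loads this module automatically once both
# AdaptiveArrayPools and CUDA are imported in the same environment.
module AdaptiveArrayPoolsCUDAExt

using AdaptiveArrayPools
using CUDA

# CuArray-backed pool methods would be defined here, extending
# functions owned by AdaptiveArrayPools (hypothetical placement).

end # module
```
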
315 changes: 55 additions & 260 deletions README.md

# AdaptiveArrayPools.jl

**Zero-allocation temporary arrays for Julia.**

A lightweight library that lets you write natural, allocation-style code while automatically reusing memory behind the scenes. Eliminates GC pressure in hot loops without the complexity of manual buffer management.

**Supported backends:**
- **CPU** — `Array`, works out of the box
- **CUDA** — `CuArray`, loads automatically when [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) is available

## The Problem

In performance-critical code, temporary array allocations inside loops create massive GC pressure:

```julia
function compute_naive(n)
    A = rand(n, n)  # allocates
    B = rand(n, n)  # allocates
    C = A * B       # allocates
    return sum(C)
end

for i in 1:10_000
    compute_naive(100)  # 91 MiB total, 17% GC time
end
```

The traditional fix—passing pre-allocated buffers through your call stack—works but requires invasive refactoring and clutters your APIs.
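
For contrast, here is a sketch of that manual approach (illustrative code, not from the package): zero allocations in steady state, but every call site must own and thread the buffers.

```julia
using LinearAlgebra, Random

# Manual pre-allocation: fast, but buffers leak into every signature.
function compute_manual!(C, A, B)
    rand!(A); rand!(B)
    mul!(C, A, B)
    return sum(C)
end

A = Matrix{Float64}(undef, 100, 100)
B = similar(A); C = similar(A)
for i in 1:10_000
    compute_manual!(C, A, B)  # caller manages buffer lifetimes
end
```
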
## The Solution

Wrap your function with `@with_pool` and use `acquire!` instead of allocation:

```julia
using AdaptiveArrayPools, LinearAlgebra, Random

@with_pool pool function compute_pooled(n)
    A = acquire!(pool, Float64, n, n)  # reuses memory from pool
    B = acquire!(pool, Float64, n, n)
    C = acquire!(pool, Float64, n, n)

    rand!(A); rand!(B)
    mul!(C, A, B)
    return sum(C)
end

compute_pooled(100)  # warmup
for i in 1:10_000
    compute_pooled(100)  # 0 bytes, 0% GC
end
```

| Approach | Memory | GC Time | Code Complexity |
|----------|--------|---------|-----------------|
| Naive allocation | 91 MiB | 17% | Simple |
| Manual buffer passing | 0 | 0% | Complex, invasive refactor |
| **AdaptiveArrayPools** | **0** | **0%** | **Minimal change** |

> **CUDA support**: Same API—just use `@with_pool :cuda pool`. See [CUDA Backend](docs/cuda.md).
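
A minimal sketch of that backend switch, assuming CUDA.jl is installed so the extension loads; combining `:cuda` with the function form is assumed to work as in the CPU case, and `Float32` is used purely as a GPU-friendly choice:

```julia
using AdaptiveArrayPools, CUDA, LinearAlgebra

# Same acquire!/mul! pattern as the CPU example; the :cuda backend
# hands out views backed by pooled GPU memory instead.
@with_pool :cuda pool function gpu_step(n)
    A = acquire!(pool, Float32, n, n)
    B = acquire!(pool, Float32, n, n)
    C = acquire!(pool, Float32, n, n)
    A .= 1f0; B .= 2f0   # broadcast runs on the GPU
    mul!(C, A, B)        # dispatches to CUBLAS
    return sum(C)
end
```
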
## How It Works

`@with_pool` automatically manages the memory lifecycle for you:

1. **Checkpoint** — Saves the current pool state when entering the block
2. **Acquire** — `acquire!` returns arrays backed by pooled memory
3. **Rewind** — When the block ends, all acquired arrays are recycled for reuse

This automatic checkpoint/rewind cycle is what enables zero allocation on repeated calls. You just write normal-looking code with `acquire!` instead of constructors.
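
One way to see the cycle in action (illustrative; exact allocation counts can vary by Julia version, but the steady-state claim above is 0 bytes):

```julia
using AdaptiveArrayPools

@with_pool pool function pooled_step(n)
    v = acquire!(pool, Float64, n)
    fill!(v, 1.0)
    return sum(v)
end

pooled_step(1000)              # first call: the pool grows to fit
@allocated pooled_step(1000)   # steady state: 0 bytes, memory is rewound and reused
```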

`acquire!` returns lightweight views (`SubArray`, `ReshapedArray`) that work seamlessly with BLAS/LAPACK. If you need native `Array` types (FFI, type constraints), use `unsafe_acquire!`—see [API Reference](docs/api.md).
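
For example (adapted from the package's earlier docs; `:some_c_function` is a placeholder symbol):

```julia
@with_pool pool begin
    # Recommended: views work directly with BLAS (always 0 bytes)
    A = acquire!(pool, Float64, 100, 100)  # ReshapedArray
    B = acquire!(pool, Float64, 100, 100)
    C = acquire!(pool, Float64, 100, 100)
    mul!(C, A, B)

    # Only when a concrete Array type is required (e.g., FFI):
    M = unsafe_acquire!(pool, Float64, 100, 100)  # Matrix{Float64}
    ccall(:some_c_function, Cvoid, (Ptr{Float64},), M)
end
```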

> **Note**: Keeping acquired arrays inside the scope is your responsibility. Return computed values (scalars, copies), not the arrays themselves. See [Safety Guide](docs/safety.md).

**Thread-safe by design**: Each Julia Task gets its own independent pool, so `@with_pool` inside threaded code is automatically safe:

```julia
# ✅ SAFE: @with_pool inside @threads
Threads.@threads for i in 1:N
    @with_pool pool begin
        a = acquire!(pool, Float64, 100)
        # each thread has its own pool — no race conditions
    end
end

# ❌ UNSAFE: @with_pool outside @threads (race condition!)
@with_pool pool Threads.@threads for i in 1:N
    a = acquire!(pool, Float64, 100)  # all threads share one pool!
end
```

## Installation

`AdaptiveArrayPools` is registered with [FuseRegistry](https://github.com/ProjectTorreyPines/FuseRegistry.jl/):

```julia
using Pkg
Pkg.Registry.add(Pkg.RegistrySpec(url="https://github.com/ProjectTorreyPines/FuseRegistry.jl.git"))
Pkg.add("AdaptiveArrayPools")
```
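
The CUDA backend needs no separate installation step; a sketch, assuming CUDA.jl is present in the same environment:

```julia
using AdaptiveArrayPools  # CPU backend, always available
using CUDA                # loads the AdaptiveArrayPoolsCUDAExt extension automatically
```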

## Documentation

| Guide | Description |
|-------|-------------|
| [API Reference](docs/api.md) | Complete function and macro reference |
| [CUDA Backend](docs/cuda.md) | GPU-specific usage and examples |
| [Safety Guide](docs/safety.md) | Scope rules and best practices |
| [Multi-Threading](docs/multi-threading.md) | Task/thread safety patterns |
| [Configuration](docs/configuration.md) | Preferences and cache tuning |

## License

2 changes: 1 addition & 1 deletion docs/api.md
```diff
@@ -14,7 +14,7 @@
 | `acquire!(pool, T, dims...)` | Returns a view: `SubArray{T,1}` for 1D, `ReshapedArray{T,N}` for N-D. Always 0 bytes. |
 | `acquire!(pool, T, dims::Tuple)` | Tuple overload for `acquire!` (e.g., `acquire!(pool, T, size(x))`). |
 | `acquire!(pool, x::AbstractArray)` | Similar-style: acquires array matching `eltype(x)` and `size(x)`. |
-| `unsafe_acquire!(pool, T, dims...)` | Returns `SubArray{T,1}` for 1D, raw `Array{T,N}` for N-D. Only for FFI/type constraints. |
+| `unsafe_acquire!(pool, T, dims...)` | Returns native `Array`/`CuArray` (CPU: `Vector{T}` for 1D, `Array{T,N}` for N-D). Only for FFI/type constraints. |
 | `unsafe_acquire!(pool, T, dims::Tuple)` | Tuple overload for `unsafe_acquire!`. |
 | `unsafe_acquire!(pool, x::AbstractArray)` | Similar-style: acquires raw array matching `eltype(x)` and `size(x)`. |
 | `acquire_view!(pool, T, dims...)` | Alias for `acquire!`. Returns view types. |
```