
Commit 65e66b7

Merge pull request #9 from ProjectTorreyPines/feat/cuda_backend
Add CUDA Backend Support
2 parents a1bc284 + 79299ac commit 65e66b7

35 files changed: 4,458 additions and 431 deletions

.github/workflows/CI.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -41,6 +41,8 @@ jobs:
       - uses: julia-actions/julia-runtest@v1
 
       - uses: julia-actions/julia-processcoverage@v1
+        with:
+          directories: src
 
       - uses: codecov/codecov-action@v4
         with:
```

Project.toml

Lines changed: 6 additions & 0 deletions
```diff
@@ -6,3 +6,9 @@ authors = ["Min-Gu Yoo <mgyoo86@gmail.com>"]
 [deps]
 Preferences = "21216c6a-2e73-6563-6e65-726566657250"
 Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
+
+[weakdeps]
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
+
+[extensions]
+AdaptiveArrayPoolsCUDAExt = "CUDA"
```
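The `[weakdeps]`/`[extensions]` pair tells Julia (1.9+) to load a CUDA-specific extension module only when CUDA.jl is present in the environment, so CPU-only users pay nothing for the new backend. The extension's source file is not part of this excerpt; the sketch below only illustrates the general shape such a module takes, and everything in it beyond the `AdaptiveArrayPoolsCUDAExt`, `AdaptiveArrayPools`, and `CUDA` names is an assumption.

```julia
# Hypothetical sketch of ext/AdaptiveArrayPoolsCUDAExt.jl (the real file is not
# shown in this excerpt). Julia finds the module via the [extensions] table and
# loads it automatically once both AdaptiveArrayPools and CUDA are loaded.
module AdaptiveArrayPoolsCUDAExt

using AdaptiveArrayPools
using CUDA

# GPU-specific methods would live here, e.g. specializations of the package's
# pool functions on CUDA types so that acquired buffers are backed by CuArray.

end # module
```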

README.md

Lines changed: 55 additions & 260 deletions
````diff
@@ -3,311 +3,106 @@
 
 # AdaptiveArrayPools.jl
 
-**Zero-allocation array pooling for Julia.**
-Reuse temporary arrays to eliminate Garbage Collection (GC) pressure in high-performance hot loops.
+**Zero-allocation temporary arrays for Julia.**
 
-## Installation
+A lightweight library that lets you write natural, allocation-style code while automatically reusing memory behind the scenes. Eliminates GC pressure in hot loops without the complexity of manual buffer management.
 
-`AdaptiveArrayPools` is registered with [FuseRegistry](https://github.com/ProjectTorreyPines/FuseRegistry.jl/):
+**Supported backends:**
+- **CPU** — `Array`, works out of the box
+- **CUDA** — `CuArray`, loads automatically when [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) is available
 
-```julia
-using Pkg
-Pkg.Registry.add(RegistrySpec(url="https://github.com/ProjectTorreyPines/FuseRegistry.jl.git"))
-Pkg.Registry.add("General")
-Pkg.add("AdaptiveArrayPools")
-```
+## The Problem
 
-## Quick Start
+In performance-critical code, temporary array allocations inside loops create massive GC pressure:
 
 ```julia
-using AdaptiveArrayPools, LinearAlgebra
-
-# 1. Define the hot-loop function with automatic pooling for ZERO-ALLOCATION
-@with_pool pool function heavy_computation_step(n)
-    # Safe Default: Returns ReshapedArray for N-D (always 0 bytes, prevents resize!)
-    A = acquire!(pool, Float64, n, n)
-    B = acquire!(pool, Float64, n, n)
-
-    # Power User: Returns raw Matrix{Float64} (only for FFI/type constraints)
-    # ⚠️ Must NOT resize! or escape scope
-    C = unsafe_acquire!(pool, Float64, n, n)
-
-    # Use them like normal arrays
-    fill!(A, 1.0); fill!(B, 2.0)
-
-    # Pass to inner functions as needed
-    complex_inner_logic!(C, A, B)
-
-    return sum(C)
-    # ⚠️ Arrays A, B, C must not escape this scope; they become invalid after this function returns!
+function compute_naive(n)
+    A = rand(n, n) # allocates
+    B = rand(n, n) # allocates
+    C = A * B # allocates
+    return sum(C)
 end
 
-# Standard Julia function (unaware of pooling)
-function complex_inner_logic!(C, A, B)
-    mul!(C, A, B)
-end
-
-# 2. Main application entry point
-function main_simulation_loop()
-    # ... complex setup logic ...
-
-    total = 0.0
-    # This loop would normally generate massive GC pressure
-    for i in 1:1000
-        # ✅ Zero allocation here after the first iteration!
-        total += heavy_computation_step(100)
-    end
-
-    return total
+for i in 1:10_000
+    compute_naive(100) # 91 MiB total, 17% GC time
 end
-
-# Run simulation
-main_simulation_loop()
 ```
 
-## Why Use This?
+The traditional fix—passing pre-allocated buffers through your call stack—works but requires invasive refactoring and clutters your APIs.
 
-In high-performance computing, allocating temporary arrays inside a loop creates significant GC pressure, causing stuttering and performance degradation. Manual in-place operations (passing pre-allocated buffers) avoid this but require tedious buffer management and argument passing, making code complex and error-prone.
+## The Solution
 
-```julia
-using LinearAlgebra, Random
-using BenchmarkTools
-
-# ❌ Naive Approach: Allocates new arrays every single call
-function compute_naive(n::Int)
-    mat1 = rand(n, n) # Allocation!
-    mat2 = rand(n, n) # Allocation!
-
-    mat3 = mat1 * mat2 # Allocation!
-    return sum(mat3)
-end
-
-# ✅ Pooled Approach: Zero allocations in steady state, clean syntax (no manual buffer passing)
-@with_pool pool function compute_pooled(n::Int)
-    # Get ReshapedArray views from auto-managed pool (0 bytes allocation)
-    mat1 = acquire!(pool, Float64, n, n)
-    mat2 = acquire!(pool, Float64, n, n)
-    mat3 = acquire!(pool, Float64, n, n)
-
-    # Use In-place functions without allocations
-    Random.rand!(mat1)
-    Random.rand!(mat2)
-    mul!(mat3, mat1, mat2)
-    return sum(mat3)
-end
-
-# Naive: Large temporary allocations cause GC pressure
-@benchmark compute_naive(2000)
-# Time (mean ± σ): 67.771 ms ± 31.818 ms ⚠️ ┊ GC (mean ± σ): 17.02% ± 18.69% ⚠️
-# Memory estimate: 91.59 MiB ⚠️, allocs estimate: 9.
-
-# Pooled: Zero allocations, no GC pressure
-@benchmark compute_pooled(2000)
-# Time (mean ± σ): 57.647 ms ± 3.960 ms ✅ ┊ GC (mean ± σ): 0.00% ± 0.00% ✅
-# Memory estimate: 0 bytes ✅, allocs estimate: 0.
-```
-
-> **Performance Note:**
-> - **vs Manual Pre-allocation**: This library achieves performance comparable to manually passing pre-allocated buffers (in-place operations), but without the boilerplate of managing buffer lifecycles.
-> - **Low Overhead**: The overhead of `@with_pool` (including checkpoint/rewind) is typically **tens of nanoseconds** (< 100 ns), making it negligible for most workloads compared to the cost of memory allocation.
-
-## Important: User Responsibility
-
-This library prioritizes **zero-overhead performance** over runtime safety checks. Two fundamental rules must be followed:
-
-1. **Scope Rule**: Arrays acquired from a pool are only valid within the `@with_pool` scope.
-2. **Task Rule**: Pool objects must not be shared across Tasks (see [Multi-Threading Usage](#multi-threading-usage)).
-
-When `@with_pool` ends, all acquired arrays are "rewound" and their memory becomes available for reuse. Using them after the scope ends leads to **undefined behavior** (data corruption, crashes).
-
-<details>
-<summary><b>Safe Patterns</b> (click to expand)</summary>
+Wrap your function with `@with_pool` and use `acquire!` instead of allocation:
 
 ```julia
-@with_pool pool function safe_example(n)
-    v = acquire!(pool, Float64, n)
-    v .= 1.0
+using AdaptiveArrayPools, LinearAlgebra, Random
 
-    # ✅ Return computed values (scalars, tuples, etc.)
-    return sum(v), length(v)
-end
-
-@with_pool pool function safe_copy(n)
-    v = acquire!(pool, Float64, n)
-    v .= rand(n)
-
-    # ✅ Return a copy if you need the data outside
-    return copy(v)
-end
-```
-
-</details>
-
-<details>
-<summary><b>Unsafe Patterns (DO NOT DO THIS)</b> (click to expand)</summary>
+@with_pool pool function compute_pooled(n)
+    A = acquire!(pool, Float64, n, n) # reuses memory from pool
+    B = acquire!(pool, Float64, n, n)
+    C = acquire!(pool, Float64, n, n)
 
-```julia
-@with_pool pool function unsafe_return(n)
-    v = acquire!(pool, Float64, n)
-    v .= 1.0
-    return v # ❌ UNSAFE: Returning pool-backed array!
+    rand!(A); rand!(B)
+    mul!(C, A, B)
+    return sum(C)
 end
 
-result = unsafe_return(100)
-# result now points to memory that may be overwritten!
-
-# ❌ Also unsafe: storing in global variables, closures, etc.
-global_storage = nothing
-@with_pool pool begin
-    v = acquire!(pool, Float64, 100)
-    global_storage = v # ❌ UNSAFE: escaping via global
+compute_pooled(100) # warmup
+for i in 1:10_000
+    compute_pooled(100) # 0 bytes, 0% GC
 end
 ```
 
-</details>
+| Approach | Memory | GC Time | Code Complexity |
+|----------|--------|---------|-----------------|
+| Naive allocation | 91 MiB | 17% | Simple |
+| Manual buffer passing | 0 | 0% | Complex, invasive refactor |
+| **AdaptiveArrayPools** | **0** | **0%** | **Minimal change** |
 
-<details>
-<summary><b>Debugging with POOL_DEBUG</b> (click to expand)</summary>
+> **CUDA support**: Same API—just use `@with_pool :cuda pool`. See [CUDA Backend](docs/cuda.md).
 
-Enable `POOL_DEBUG` to catch direct returns of pool-backed arrays:
+## How It Works
 
-```julia
-POOL_DEBUG[] = true # Enable safety checks
-
-@with_pool pool begin
-    v = acquire!(pool, Float64, 10)
-    v # Throws ErrorException: "Returning pool-backed array..."
-end
-```
-
-> **Note:** `POOL_DEBUG` only catches direct returns, not indirect escapes (globals, closures). It's a development aid, not a guarantee.
+`@with_pool` automatically manages memory lifecycle for you:
 
-</details>
+1. **Checkpoint** — Saves current pool state when entering the block
+2. **Acquire** — `acquire!` returns arrays backed by pooled memory
+3. **Rewind** — When the block ends, all acquired arrays are recycled for reuse
 
-## Key Features
+This automatic checkpoint/rewind cycle is what enables zero allocation on repeated calls. You just write normal-looking code with `acquire!` instead of constructors.
 
-- **`acquire!` — True Zero Allocation**: Returns lightweight views (`SubArray` for 1D, `ReshapedArray` for N-D) that are created on the stack. **Always 0 bytes**, regardless of dimension patterns or cache state.
-- **`unsafe_acquire!` — Cached Allocation**: Returns concrete `Array` types (`Vector{T}` for 1D, `Array{T,N}` for N-D) for FFI/type constraints.
-  - All dimensions use N-way set-associative cache (default: 4-way) → **0 bytes on cache hit**, ~100 bytes on cache miss.
-  - Increase `CACHE_WAYS` if you alternate between >4 dimension patterns per slot.
-  - Even on cache miss, this is just the `Array` header (metadata)—**actual data memory is always reused from the pool**.
-- **Low Overhead**: Optimized to have < 100 ns overhead for pool management, suitable for tight inner loops.
-- **Task-Local Isolation**: Each Task gets its own pool via `task_local_storage()`. Thread-safe when `@with_pool` is called within each task's scope (see [Multi-Threading Usage](#multi-threading-usage) below).
-- **Type Stable**: Optimized for `Float64`, `Int`, and other common types using fixed-slot caching.
-- **Non-Intrusive**: If you disable pooling via preferences, `acquire!` compiles down to a standard `Array` allocation.
-- **Flexible API**: Use `acquire!` for safe views (recommended), or `unsafe_acquire!` when concrete `Array` type is required (FFI, type constraints).
+`acquire!` returns lightweight views (`SubArray`, `ReshapedArray`) that work seamlessly with BLAS/LAPACK. If you need native `Array` types (FFI, type constraints), use `unsafe_acquire!`—see [API Reference](docs/api.md).
 
-## Multi-Threading Usage
+> **Note**: Keeping acquired arrays inside the scope is your responsibility. Return computed values (scalars, copies), not the arrays themselves. See [Safety Guide](docs/safety.md).
 
-AdaptiveArrayPools uses `task_local_storage()` for **task-local isolation**: each Julia Task gets its own independent pool.
+**Thread-safe by design**: Each Julia Task gets its own independent pool, so `@with_pool` inside threaded code is automatically safe:
 
 ```julia
-# ✅ SAFE: @with_pool inside @threads
 Threads.@threads for i in 1:N
     @with_pool pool begin
         a = acquire!(pool, Float64, 100)
+        # each thread has its own pool — no race conditions
     end
 end
-
-# ❌ UNSAFE: @with_pool outside @threads (race condition!)
-@with_pool pool Threads.@threads for i in 1:N
-    a = acquire!(pool, Float64, 100) # All threads share one pool!
-end
 ```
 
-| Pattern | Safety |
-|---------|--------|
-| `@with_pool` inside `@threads` | ✅ Safe |
-| `@with_pool` outside `@threads` | ❌ Unsafe |
-| Function with `@with_pool` called from `@threads` | ✅ Safe |
-
-> **Important**: Pool objects must not be shared across Tasks. This library does not add locks—correct usage is the user's responsibility.
-
-For detailed explanation including Julia's Task/Thread model and why thread-local pools don't work, see **[Multi-Threading Guide](docs/multi-threading.md)**.
-
-## `acquire!` vs `unsafe_acquire!`
-
-**In most cases, use `acquire!`**. It returns view types (`SubArray` for 1D, `ReshapedArray` for N-D) that are safe and always zero-allocation.
-
-> **Performance Note**: BLAS/LAPACK functions (`mul!`, `lu!`, etc.) are fully optimized for `StridedArray`—there is **no performance difference** between views and raw arrays. Benchmarks show identical throughput.
-
-Use `unsafe_acquire!` **only** when a concrete `Array{T,N}` type is required:
-- **FFI/C interop**: External libraries expecting `Ptr{T}` from `Array`
-- **Type constraints**: APIs that explicitly require `Matrix{T}` or `Vector{T}`, or type-unstable code where concrete types reduce dispatch overhead
-
-```julia
-@with_pool pool begin
-    # ✅ Recommended: acquire! for general use (always 0 bytes)
-    A = acquire!(pool, Float64, 100, 100) # ReshapedArray
-    B = acquire!(pool, Float64, 100, 100) # ReshapedArray
-    C = acquire!(pool, Float64, 100, 100) # ReshapedArray
-    mul!(C, A, B) # ✅ BLAS works perfectly with views!
-
-    # ⚠️ Only when concrete Array type is required:
-    M = unsafe_acquire!(pool, Float64, 100, 100) # Matrix{Float64}
-    ccall(:some_c_function, Cvoid, (Ptr{Float64},), M) # FFI needs Array
-end
-```
-
-| Function | 1D Return | N-D Return | Allocation |
-|----------|-----------|------------|------------|
-| `acquire!` | `SubArray{T,1}` | `ReshapedArray{T,N}` | Always 0 bytes |
-| `unsafe_acquire!` | `Vector{T}` | `Array{T,N}` | 0 bytes (hit) / ~100 bytes header (miss) |
-
-> **Note**: `unsafe_acquire!` always returns concrete `Array` types (including `Vector` for 1D). The N-way cache applies to all dimensions—up to `CACHE_WAYS` (default: 4) dimension patterns per slot; exceeding this causes header-only allocation per miss.
-
-> **Warning**: Both functions return memory only valid within the `@with_pool` scope. Do NOT call `resize!`, `push!`, or `append!` on acquired arrays.
-
-### API Aliases
-
-For explicit naming, you can use these aliases:
+## Installation
 
 ```julia
-acquire_view!(pool, T, dims...) # Same as acquire! → returns view types
-acquire_array!(pool, T, dims...) # Same as unsafe_acquire! → returns Array
+using Pkg
+Pkg.Registry.add(Pkg.RegistrySpec(url="https://github.com/ProjectTorreyPines/FuseRegistry.jl.git"))
+Pkg.add("AdaptiveArrayPools")
 ```
 
 ## Documentation
 
-- [API Reference](docs/api.md) - Macros, functions, and types
-- [Multi-Threading Guide](docs/multi-threading.md) - Task/Thread model, safe patterns, and design rationale
-- [Runtime Toggle: @maybe_with_pool](docs/maybe_with_pool.md) - Control pooling at runtime
-- [Configuration](docs/configuration.md) - Preferences.jl integration
-
-## Configuration
-
-Configure AdaptiveArrayPools via `LocalPreferences.toml`:
-
-```toml
-[AdaptiveArrayPools]
-use_pooling = false # ⭐ Primary: Disable pooling entirely
-cache_ways = 8 # Secondary: N-way cache size (default: 4)
-```
-
-### Disabling Pooling (Primary Use Case)
-
-The most important configuration is **`use_pooling = false`**, which completely disables all pooling:
-
-```julia
-# With use_pooling = false, acquire! becomes equivalent to:
-acquire!(pool, Float64, n, n) ≈ Matrix{Float64}(undef, n, n)
-```
-
-This is useful for:
-- **Debugging**: Isolate pooling-related issues by comparing behavior
-- **Benchmarking**: Measure pooling overhead vs direct allocation
-- **Gradual adoption**: Add `@with_pool` to code without changing behavior until ready
-
-When disabled, all macros generate `pool = nothing` and `acquire!` falls back to standard allocation with **zero overhead**.
-
-### N-way Cache Tuning (Advanced)
-
-```julia
-using AdaptiveArrayPools
-set_cache_ways!(8) # Requires Julia restart
-```
-
-Increase `cache_ways` if alternating between >4 dimension patterns per slot.
+| Guide | Description |
+|-------|-------------|
+| [API Reference](docs/api.md) | Complete function and macro reference |
+| [CUDA Backend](docs/cuda.md) | GPU-specific usage and examples |
+| [Safety Guide](docs/safety.md) | Scope rules and best practices |
+| [Multi-Threading](docs/multi-threading.md) | Task/thread safety patterns |
+| [Configuration](docs/configuration.md) | Preferences and cache tuning |
 
 ## License
 
````
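The rewritten README's warmup-then-zero-bytes claim is easy to check locally. Below is a minimal sketch using only the API shown in the diff above (`@with_pool`, `acquire!`); the exact numbers depend on your machine, and the first call is expected to allocate while the pool grows.

```julia
using AdaptiveArrayPools, LinearAlgebra, Random

@with_pool pool function compute_pooled(n)
    A = acquire!(pool, Float64, n, n)
    B = acquire!(pool, Float64, n, n)
    C = acquire!(pool, Float64, n, n)
    rand!(A); rand!(B)
    mul!(C, A, B)
    return sum(C)
end

compute_pooled(100)                   # warmup: the pool grows on the first call
@show @allocated compute_pooled(100)  # expected to report 0 bytes in steady state
```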

docs/api.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -14,7 +14,7 @@
 | `acquire!(pool, T, dims...)` | Returns a view: `SubArray{T,1}` for 1D, `ReshapedArray{T,N}` for N-D. Always 0 bytes. |
 | `acquire!(pool, T, dims::Tuple)` | Tuple overload for `acquire!` (e.g., `acquire!(pool, T, size(x))`). |
 | `acquire!(pool, x::AbstractArray)` | Similar-style: acquires array matching `eltype(x)` and `size(x)`. |
-| `unsafe_acquire!(pool, T, dims...)` | Returns `SubArray{T,1}` for 1D, raw `Array{T,N}` for N-D. Only for FFI/type constraints. |
+| `unsafe_acquire!(pool, T, dims...)` | Returns native `Array`/`CuArray` (CPU: `Vector{T}` for 1D, `Array{T,N}` for N-D). Only for FFI/type constraints. |
 | `unsafe_acquire!(pool, T, dims::Tuple)` | Tuple overload for `unsafe_acquire!`. |
 | `unsafe_acquire!(pool, x::AbstractArray)` | Similar-style: acquires raw array matching `eltype(x)` and `size(x)`. |
 | `acquire_view!(pool, T, dims...)` | Alias for `acquire!`. Returns view types. |
```
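The updated row reflects the new CUDA backend: on the GPU, `unsafe_acquire!` hands back a native `CuArray` rather than an `Array`. Below is a hedged sketch of GPU usage, assuming the `@with_pool :cuda pool` form quoted in the new README composes with function definitions the same way as the CPU form, and that CUDA.jl is installed and functional; the sizes and `Float32` element type are only for illustration.

```julia
using AdaptiveArrayPools, CUDA, LinearAlgebra

@with_pool :cuda pool function gpu_step(n)
    A = acquire!(pool, Float32, n, n)         # pooled GPU memory (view-style)
    B = acquire!(pool, Float32, n, n)
    C = unsafe_acquire!(pool, Float32, n, n)  # native CuArray, for APIs that require one
    A .= 1.0f0                                # broadcasting stays on the GPU
    B .= 2.0f0
    mul!(C, A, B)                             # dispatches to CUBLAS
    return sum(C)
end

gpu_step(256)
```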
