# AdaptiveArrayPools.jl

**Zero-allocation temporary arrays for Julia.**

A lightweight library that lets you write natural, allocation-style code while automatically reusing memory behind the scenes. Eliminates GC pressure in hot loops without the complexity of manual buffer management.

**Supported backends:**
- **CPU** — `Array`, works out of the box
- **CUDA** — `CuArray`, loads automatically when [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) is available

## The Problem

In performance-critical code, temporary array allocations inside loops create massive GC pressure:

```julia
function compute_naive(n)
    A = rand(n, n)   # allocates
    B = rand(n, n)   # allocates
    C = A * B        # allocates
    return sum(C)
end

for i in 1:10_000
    compute_naive(100)   # 91 MiB total, 17% GC time
end
```

The traditional fix—passing pre-allocated buffers through your call stack—works but requires invasive refactoring and clutters your APIs.

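For comparison, a minimal sketch of that manual style (function and buffer names are illustrative): every temporary is owned by the caller and must be threaded through each call.

```julia
using LinearAlgebra, Random

# Caller owns the buffers and must pass all of them explicitly.
function compute_manual!(C, A, B)
    rand!(A); rand!(B)
    mul!(C, A, B)       # in-place multiply into the caller's buffer
    return sum(C)
end

A = Matrix{Float64}(undef, 100, 100)
B = Matrix{Float64}(undef, 100, 100)
C = Matrix{Float64}(undef, 100, 100)
for i in 1:10_000
    compute_manual!(C, A, B)   # zero allocation, but extra arguments everywhere
end
```
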
## The Solution

Wrap your function with `@with_pool` and use `acquire!` instead of allocation:

```julia
using AdaptiveArrayPools, LinearAlgebra, Random

@with_pool pool function compute_pooled(n)
    A = acquire!(pool, Float64, n, n)   # reuses memory from pool
    B = acquire!(pool, Float64, n, n)
    C = acquire!(pool, Float64, n, n)

    rand!(A); rand!(B)
    mul!(C, A, B)
    return sum(C)
end

compute_pooled(100)  # warmup
for i in 1:10_000
    compute_pooled(100)  # 0 bytes, 0% GC
end
```

| Approach | Memory | GC Time | Code Complexity |
|----------|--------|---------|-----------------|
| Naive allocation | 91 MiB | 17% | Simple |
| Manual buffer passing | 0 bytes | 0% | Complex, invasive refactor |
| **AdaptiveArrayPools** | **0 bytes** | **0%** | **Minimal change** |

> **CUDA support**: Same API—just use `@with_pool :cuda pool`. See [CUDA Backend](docs/cuda.md).

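A sketch of the GPU path, assuming CUDA.jl is installed and functional; the `:cuda` selector is the only change from the CPU example, and the details are authoritative in [CUDA Backend](docs/cuda.md):

```julia
using AdaptiveArrayPools, CUDA, LinearAlgebra

@with_pool :cuda pool function gpu_step(n)
    A = acquire!(pool, Float32, n, n)   # backed by pooled GPU memory
    B = acquire!(pool, Float32, n, n)
    C = acquire!(pool, Float32, n, n)
    fill!(A, 1f0); fill!(B, 2f0)
    mul!(C, A, B)
    return sum(C)
end
```
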
## How It Works

`@with_pool` automatically manages the memory lifecycle for you:

1. **Checkpoint** — Saves the current pool state when entering the block
2. **Acquire** — `acquire!` returns arrays backed by pooled memory
3. **Rewind** — When the block ends, all acquired arrays are recycled for reuse

This automatic checkpoint/rewind cycle is what enables zero allocation on repeated calls. You just write normal-looking code with `acquire!` instead of constructors.

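The cycle is easiest to see with the block form of the macro: every loop pass checkpoints, acquires, and rewinds, so iterations after the first reuse the same memory (a sketch):

```julia
using AdaptiveArrayPools

for step in 1:3
    @with_pool pool begin                   # 1. checkpoint pool state
        v = acquire!(pool, Float64, 1_000)  # 2. acquire pooled memory
        v .= step
        @assert sum(v) == step * 1_000
    end                                     # 3. rewind: v's memory is recycled
end
```
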
`acquire!` returns lightweight views (`SubArray`, `ReshapedArray`) that work seamlessly with BLAS/LAPACK. If you need native `Array` types (FFI, type constraints), use `unsafe_acquire!`—see [API Reference](docs/api.md).

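As a sketch of when each variant applies (the C function in the last line is hypothetical):

```julia
using AdaptiveArrayPools, LinearAlgebra

@with_pool pool begin
    # Recommended: views work directly with BLAS
    A = acquire!(pool, Float64, 100, 100)
    B = acquire!(pool, Float64, 100, 100)
    C = acquire!(pool, Float64, 100, 100)
    mul!(C, A, B)

    # Only when a concrete Array is required (FFI, strict type constraints):
    M = unsafe_acquire!(pool, Float64, 100, 100)      # Matrix{Float64}
    # ccall(:some_c_function, Cvoid, (Ptr{Float64},), M)  # hypothetical C call
end
```
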
> **Note**: Keeping acquired arrays inside the scope is your responsibility. Return computed values (scalars, copies), not the arrays themselves. See [Safety Guide](docs/safety.md).

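For example, a reduction or a `copy` is safe to return, while the pooled array itself is not (a sketch):

```julia
using AdaptiveArrayPools

@with_pool pool function safe_example(n)
    v = acquire!(pool, Float64, n)
    v .= 1.0
    return sum(v), copy(v)   # ✅ scalar and copy outlive the scope
    # return v               # ❌ pool-backed array is invalid after rewind
end
```
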
**Thread-safe by design**: Each Julia Task gets its own independent pool, so `@with_pool` inside threaded code is automatically safe:

```julia
Threads.@threads for i in 1:N
    @with_pool pool begin
        a = acquire!(pool, Float64, 100)
        # each thread has its own pool — no race conditions
    end
end
```

## Installation

`AdaptiveArrayPools` is registered with [FuseRegistry](https://github.com/ProjectTorreyPines/FuseRegistry.jl/):

```julia
using Pkg
Pkg.Registry.add(Pkg.RegistrySpec(url="https://github.com/ProjectTorreyPines/FuseRegistry.jl.git"))
Pkg.add("AdaptiveArrayPools")
```

## Documentation

| Guide | Description |
|-------|-------------|
| [API Reference](docs/api.md) | Complete function and macro reference |
| [CUDA Backend](docs/cuda.md) | GPU-specific usage and examples |
| [Safety Guide](docs/safety.md) | Scope rules and best practices |
| [Multi-Threading](docs/multi-threading.md) | Task/thread safety patterns |
| [Configuration](docs/configuration.md) | Preferences and cache tuning |

## License