Overview
NDArray's unmanaged buffers are released through the CLR finalizer thread, not deterministically. In hot allocation paths (training loops, batch operations, benchmark sweeps), the finalizer queue cannot keep up with allocation pressure, causing native memory to stay committed long after the managed NDArray wrappers are unreachable. This inflates process working set, triggers Windows page-trimming on actively used buffers, and produces non-linear slowdowns the more the application allocates.
Reproduction
Allocate a large number of NDArrays in sequence, drop the managed references, and observe that a plain GC.Collect() does not return the unmanaged memory. The buffers are released only once the finalizer queue has drained, which GC.WaitForPendingFinalizers() forces.
using System;
using System.Diagnostics;
using NumSharp;
var p = Process.GetCurrentProcess();
GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();
long baseline = p.WorkingSet64;
// Allocate about 763 MiB (800 MB) of unmanaged memory backing 100 NDArrays (8 MB each).
NDArray[] arrs = new NDArray[100];
for (int i = 0; i < 100; i++)
    arrs[i] = new NDArray(NPTypeCode.Double, new Shape(1_000_000), fillZeros: true);
p.Refresh();
Console.WriteLine($"after alloc: {(p.WorkingSet64 - baseline)/1024/1024,5} MiB");
// Drop every managed reference.
for (int i = 0; i < 100; i++) arrs[i] = null;
arrs = null;
// One full collection — should be enough to release everything if NDArray
// owned its native buffer deterministically.
GC.Collect();
p.Refresh();
Console.WriteLine($"after GC.Collect: {(p.WorkingSet64 - baseline)/1024/1024,5} MiB");
// Now force the finalizer thread to drain its queue.
GC.WaitForPendingFinalizers();
GC.Collect();
p.Refresh();
Console.WriteLine($"after WaitForPendingFinal: {(p.WorkingSet64 - baseline)/1024/1024,5} MiB");
Expected behavior
NumPy frees array buffers synchronously when the last reference drops (CPython refcount → array_dealloc → free(fa->data)). A loop like:
import numpy as np
import os, psutil
p = psutil.Process(os.getpid())
print(p.memory_info().rss // 1024 // 1024, "MiB")
for _ in range(100):
    a = np.zeros(1_000_000, dtype=np.float64)
print(p.memory_info().rss // 1024 // 1024, "MiB")
…never grows the resident set, because each a is freed during the next assignment when refcount drops to zero. No GC pass required.
NumSharp should give a comparable release guarantee: once an array is no longer reachable, its unmanaged buffer should be released promptly, without depending on whichever pass of the CLR finalizer thread happens to drain it.
Actual behavior
Output from the reproduction above on Windows 11 / .NET 10:
after alloc: 765 MiB
after GC.Collect: 519 MiB ← roughly 0.5 GiB still held
after WaitForPendingFinal: 0 MiB
After GC.Collect() alone, roughly half a gigabyte of unmanaged memory was still committed even though no managed reference points at it. Only the explicit WaitForPendingFinalizers() call drained the queue and released the buffers.
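The gap between collection and finalization can be reproduced without NumSharp. The following is a minimal sketch, not NumSharp code: FinalizerOnlyBuffer, LiveBytes and FinalizerDemo are hypothetical names that only mimic a finalizer-only owner of native memory, and Marshal.AllocHGlobal stands in for NativeMemory.Alloc so the sketch compiles without unsafe code.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical stand-in for a type whose native memory is freed only by a finalizer.
sealed class FinalizerOnlyBuffer
{
    // Native bytes currently held by all instances (for illustration only).
    public static long LiveBytes;

    private readonly IntPtr _ptr;
    private readonly int _size;

    public FinalizerOnlyBuffer(int size)
    {
        _ptr = Marshal.AllocHGlobal(size);
        _size = size;
        Interlocked.Add(ref LiveBytes, size);
    }

    // The only release path: runs on the finalizer thread, at a time the runtime chooses.
    ~FinalizerOnlyBuffer()
    {
        Marshal.FreeHGlobal(_ptr);
        Interlocked.Add(ref LiveBytes, -_size);
    }
}

static class FinalizerDemo
{
    // Allocate in a separate, non-inlined method so no caller local keeps the objects alive.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void AllocateAndDrop(int count, int size)
    {
        for (int i = 0; i < count; i++)
            _ = new FinalizerOnlyBuffer(size);
    }

    public static (long AfterCollect, long AfterDrain) Run()
    {
        AllocateAndDrop(count: 16, size: 1 << 20);

        // Finds the unreachable objects and queues their finalizers; does not wait for them.
        GC.Collect();
        long afterCollect = Interlocked.Read(ref FinalizerOnlyBuffer.LiveBytes);

        // Blocks the calling thread until the finalizer queue is empty.
        GC.WaitForPendingFinalizers();
        long afterDrain = Interlocked.Read(ref FinalizerOnlyBuffer.LiveBytes);

        return (afterCollect, afterDrain);
    }
}
```

AfterDrain is expected to be zero. AfterCollect can land anywhere between zero and the full 16 MiB, depending on how far the finalizer thread has progressed when the counter is read, which is the same race the working-set numbers above reflect.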
Performance impact
The same effect surfaces as a real wall-time regression under sweep workloads. The np.concatenate benchmark sweep (55 scenarios run in sequence) shows:
| Scenario | NumPy | NumSharp in isolation | NumSharp after the full sweep |
|---|---|---|---|
| out_mixed_to_float64 (1M float32 + 1M int32 → 2M float64, out=) | 0.50 ms | 0.52 ms (≈ NumPy) | 2.54 ms (5.05× NumPy) |
In isolation the cross-dtype copy is competitive with NumPy. Run inside a sweep that has already churned through dozens of large allocations, the same call slows down ~5× — not because the kernel changed, but because cumulative working-set pressure from undrained finalizers causes the OS to trim active pages of the destination buffer, which then fault back in during the copy.
Root cause (mechanism)
The chain that holds an NDArray's native buffer alive looks like:
NDArray → UnmanagedStorage → IArraySlice → UnmanagedMemoryBlock<T> → Disposer (~Disposer())
                                                                     └── NativeMemory.Free
- NDArray is a plain class. It does not implement IDisposable and has no finalizer.
- The native free call lives in ~Disposer(), which runs on the CLR finalizer thread.
- The finalizer thread is single-threaded at THREAD_PRIORITY_NORMAL. Under sustained allocation pressure it falls behind.
- Until a Disposer's finalizer actually runs, its NativeMemory.Alloc-backed buffer stays committed even though it is unreachable from managed code.
- GC.AddMemoryPressure is already wired up (UnmanagedMemoryBlock`1.cs:1012) to nudge the GC scheduler, but it only makes collections happen sooner; it cannot run finalizers synchronously.
The user cannot release a buffer eagerly: there is no public path that drops the underlying allocation when the user knows the array is no longer needed.
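For contrast, an eager release path on a native buffer owner might look like the sketch below. It is shown on a hypothetical stand-alone type, OwnedNativeBuffer, and is not a proposal for NumSharp's actual class layout. It assumes the standard Dispose pattern with a finalizer fallback, pairs GC.AddMemoryPressure with GC.RemoveMemoryPressure, and uses Marshal.AllocHGlobal in place of NativeMemory.Alloc to avoid unsafe code.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical owner of one native allocation; not a NumSharp type.
sealed class OwnedNativeBuffer : IDisposable
{
    private IntPtr _ptr;
    private readonly int _size;

    public OwnedNativeBuffer(int size)
    {
        _ptr = Marshal.AllocHGlobal(size);
        _size = size;
        GC.AddMemoryPressure(size);   // account for memory the GC cannot see
    }

    public bool IsReleased => _ptr == IntPtr.Zero;

    // Eager path: frees on the calling thread, immediately.
    public void Dispose()
    {
        Release();
        GC.SuppressFinalize(this);    // nothing left for the finalizer to do
    }

    // Fallback path: still frees the memory if the caller never disposes.
    ~OwnedNativeBuffer() => Release();

    private void Release()
    {
        // Exchange makes a repeated call, or a Dispose/finalizer race, a no-op.
        IntPtr p = Interlocked.Exchange(ref _ptr, IntPtr.Zero);
        if (p == IntPtr.Zero) return;

        Marshal.FreeHGlobal(p);
        GC.RemoveMemoryPressure(_size);
    }
}
```

With a type of this shape, a caller that knows a buffer is dead can wrap it in a using statement and have the native memory returned at the closing brace, while code that never calls Dispose keeps today's finalizer-based behavior. In production code, deriving from SafeHandle is usually the more robust way to get the same two paths.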
Where this hurts in practice
- Training / inference loops: each iteration creates intermediate arrays (a + b, a * c, …). All of them queue for finalization. Working set climbs across epochs until either GC catches up or the OS pages.
- Batch processing: hundreds of NDArrays allocated and abandoned per batch. The finalizer queue lags behind the allocation rate.
- Benchmarks / test sweeps: sequential scenarios contaminate each other's measurements because the finalizer queue from earlier scenarios is still draining when later ones run.
- Long-running services: working set can grow to hundreds of MiB above the actual live set, increasing the chance of OS-level page trimming and the soft page faults that follow.
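Until buffers can be released eagerly, benchmark sweeps can at least be kept from contaminating each other by draining the finalizer queue between scenarios. The helper below is a minimal sketch of that measurement workaround, not a fix for the underlying problem; SweepIsolation and its members are hypothetical names, and only standard GC and Stopwatch calls are used.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class SweepIsolation
{
    // Collect, wait until every queued finalizer has run, then collect again
    // to reclaim the objects those finalizers have just released.
    public static void DrainFinalizers()
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }

    // Time each scenario after a drain, so garbage left by earlier scenarios
    // is not charged to later ones. Returns elapsed milliseconds per scenario name.
    public static Dictionary<string, double> Run(
        IEnumerable<(string Name, Action Body)> scenarios)
    {
        var results = new Dictionary<string, double>();
        foreach (var (name, body) in scenarios)
        {
            DrainFinalizers();
            var sw = Stopwatch.StartNew();
            body();
            sw.Stop();
            results[name] = sw.Elapsed.TotalMilliseconds;
        }
        return results;
    }
}
```

The drain itself is expensive, so it belongs outside the timed region, as here. It also hides the problem rather than solving it: a production training loop or service cannot afford a blocking full collection between every batch.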
Acceptance criteria
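A fix for this problem should:
- … GC.Collect / GC.WaitForPendingFinalizers.
- … concat benchmark sweep's out_mixed_to_float64 and prom_* scenarios when arrays are explicitly released.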
Repro environment
- OS: Windows 11 (the working-set trimming behavior is OS-dependent; Linux exhibits the symptom differently)
- .NET: 8.0 and 10.0 — both reproduce
- NumSharp: nditer branch as of 2026-05-21
- Verified empirically: script.cs reproduction (see Reproduction section) and test/NumSharp.Benchmark concat sweep.