[Unmanaged Memory] Memory bloating on heavily looped calls due to delayed finalization by GC #613

@Nucs

Overview

NDArray's unmanaged buffers are released through the CLR finalizer thread, not deterministically. In hot allocation paths (training loops, batch operations, benchmark sweeps), the finalizer queue cannot keep up with allocation pressure, causing native memory to stay committed long after the managed NDArray wrappers are unreachable. This inflates process working set, triggers Windows page-trimming on actively used buffers, and produces non-linear slowdowns the more the application allocates.

Reproduction

Allocate a large number of NDArrays in sequence, drop the managed references, and observe that GC.Collect() on its own does not return all of the unmanaged memory. The remaining buffers are released only after GC.WaitForPendingFinalizers() lets the finalizer thread drain its queue.

using System;
using System.Diagnostics;
using NumSharp;

var p = Process.GetCurrentProcess();

GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();
long baseline = p.WorkingSet64;

// Allocate ~763 MiB (800 MB) of unmanaged memory backing 100 NDArrays (8 MB, about 7.6 MiB, each).
NDArray[] arrs = new NDArray[100];
for (int i = 0; i < 100; i++)
    arrs[i] = new NDArray(NPTypeCode.Double, new Shape(1_000_000), fillZeros: true);

p.Refresh();
Console.WriteLine($"after alloc:                {(p.WorkingSet64 - baseline)/1024/1024,5} MiB");

// Drop every managed reference.
for (int i = 0; i < 100; i++) arrs[i] = null;
arrs = null;

// One full collection — should be enough to release everything if NDArray
// owned its native buffer deterministically.
GC.Collect();
p.Refresh();
Console.WriteLine($"after GC.Collect:           {(p.WorkingSet64 - baseline)/1024/1024,5} MiB");

// Now force the finalizer thread to drain its queue.
GC.WaitForPendingFinalizers();
GC.Collect();
p.Refresh();
Console.WriteLine($"after WaitForPendingFinal:  {(p.WorkingSet64 - baseline)/1024/1024,5} MiB");

Expected behavior

NumPy frees array buffers synchronously when the last reference drops (CPython refcount → array_dealloc → free(fa->data)). A loop like:

import numpy as np
import os, psutil
p = psutil.Process(os.getpid())
print(p.memory_info().rss // 1024 // 1024, "MiB")
for _ in range(100):
    a = np.zeros(1_000_000, dtype=np.float64)
print(p.memory_info().rss // 1024 // 1024, "MiB")

…never grows the resident set, because each a is freed during the next assignment when refcount drops to zero. No GC pass required.

NumSharp should give the same memory-safety guarantee: once an array is no longer reachable, its unmanaged buffer should be released without depending on whichever pass of the CLR finalizer thread happens to drain it.
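For comparison, here is a minimal C# sketch of the same loop with deterministic release. ScopedBuffer is a hypothetical type written for this illustration, not a NumSharp API, and Marshal.AllocHGlobal stands in for NativeMemory.Alloc so the snippet compiles without unsafe code. The counters are not thread-safe; they exist only to show that the live set never exceeds one buffer.

```csharp
using System;
using System.Runtime.InteropServices;

const int BufferBytes = 8_000_000; // same size as 1_000_000 doubles

for (int i = 0; i < 100; i++)
{
    // `using` disposes the buffer at the end of each iteration,
    // which mirrors NumPy freeing `a` when its refcount hits zero.
    using var a = new ScopedBuffer(BufferBytes);
}

Console.WriteLine($"live: {ScopedBuffer.LiveBytes} B, peak: {ScopedBuffer.PeakBytes} B");

// Hypothetical owner of one native allocation with an explicit release path.
sealed class ScopedBuffer : IDisposable
{
    public static long LiveBytes { get; private set; }
    public static long PeakBytes { get; private set; }

    private IntPtr _ptr;
    private readonly int _byteCount;

    public ScopedBuffer(int byteCount)
    {
        _ptr = Marshal.AllocHGlobal(byteCount);
        _byteCount = byteCount;
        LiveBytes += byteCount;
        PeakBytes = Math.Max(PeakBytes, LiveBytes);
    }

    public void Dispose()
    {
        if (_ptr == IntPtr.Zero) return; // already released
        Marshal.FreeHGlobal(_ptr);
        _ptr = IntPtr.Zero;
        LiveBytes -= _byteCount;
    }
}
```

With this shape the peak never exceeds a single 8 MB buffer, and no GC pass is involved in the release.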

Actual behavior

Output from the reproduction above on Windows 11 / .NET 10:

after alloc:                  765 MiB
after GC.Collect:             519 MiB    ← roughly 0.5 GiB still held
after WaitForPendingFinal:      0 MiB

After GC.Collect() alone, roughly half a gigabyte of unmanaged memory was still committed even though no managed reference points at it. Only the explicit WaitForPendingFinalizers() call drained the queue and released the buffers.

Performance impact

The same effect surfaces as a real wall-time regression under sweep workloads. The np.concatenate benchmark sweep (55 scenarios run in sequence) shows:

| Scenario | NumPy | NumSharp in isolation | NumSharp after the full sweep |
| --- | --- | --- | --- |
| out_mixed_to_float64 (1M float32 + 1M int32 → 2M float64, out=) | 0.50 ms | 0.52 ms (≈ NumPy) | 2.54 ms (5.05× NumPy) |

In isolation the cross-dtype copy is competitive with NumPy. Run inside a sweep that has already churned through dozens of large allocations, the same call slows down ~5×. The copy kernel itself is unchanged; the slowdown comes from cumulative working-set pressure. Undrained finalizers keep dead buffers committed, the OS trims active pages of the destination buffer, and those pages then fault back in during the copy.

Root cause (mechanism)

The chain that holds an NDArray's native buffer alive looks like:

NDArray  →  UnmanagedStorage  →  IArraySlice  →  UnmanagedMemoryBlock<T>  →  Disposer  (~Disposer())
                                                                                  └── NativeMemory.Free
  • NDArray is a plain class. It does not implement IDisposable and has no finalizer.
  • The native free call lives in ~Disposer(), which runs on the CLR finalizer thread.
  • There is a single finalizer thread, running at THREAD_PRIORITY_NORMAL, and it executes finalizers one at a time. Under sustained allocation pressure it falls behind.
  • Until a Disposer's finalizer actually runs, its NativeMemory.Alloc-backed buffer stays committed even though it is unreachable from managed code.
  • GC.AddMemoryPressure is already wired up (UnmanagedMemoryBlock`1.cs:1012) to nudge the GC scheduler, but it only makes collections happen sooner; it cannot run finalizers synchronously.

The user cannot release a buffer eagerly: there is no public path that drops the underlying allocation when the user knows the array is no longer needed.
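The mechanism can be observed without NumSharp. The sketch below is an illustration under assumptions, not the library's code: FinalizerOwnedBuffer is a hypothetical stand-in for the Disposer at the end of the chain, Marshal.AllocHGlobal replaces NativeMemory.Alloc to avoid unsafe code, and the buffers are kept small (1 MiB) so it runs anywhere. The only free call lives in the finalizer, so native memory is returned when the finalizer thread gets to it, not when the wrapper becomes unreachable.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Threading;

// Allocate and abandon 100 finalizer-owned buffers in a non-inlined method,
// so no stack slot in this scope keeps any wrapper alive.
FinalizerOwnedBuffer.AllocateAndAbandon(count: 100, byteCount: 1 << 20);
Console.WriteLine($"after abandoning wrappers:         {FinalizerOwnedBuffer.LiveBytes} B");

// A collection finds the wrappers unreachable and queues their finalizers.
// The finalizer thread may or may not have run them by the next line.
GC.Collect();
Console.WriteLine($"after GC.Collect:                  {FinalizerOwnedBuffer.LiveBytes} B");

// Blocks until the finalizer thread has emptied its queue.
GC.WaitForPendingFinalizers();
Console.WriteLine($"after WaitForPendingFinalizers:    {FinalizerOwnedBuffer.LiveBytes} B");

// Hypothetical wrapper whose native buffer is freed only by its finalizer.
sealed class FinalizerOwnedBuffer
{
    private static long s_liveBytes;
    public static long LiveBytes => Interlocked.Read(ref s_liveBytes);

    private readonly IntPtr _ptr;
    private readonly int _byteCount;

    public FinalizerOwnedBuffer(int byteCount)
    {
        _ptr = Marshal.AllocHGlobal(byteCount);
        _byteCount = byteCount;
        Interlocked.Add(ref s_liveBytes, byteCount);
    }

    // Runs on the finalizer thread at some point after the object is collected.
    ~FinalizerOwnedBuffer()
    {
        Marshal.FreeHGlobal(_ptr);
        Interlocked.Add(ref s_liveBytes, -_byteCount);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void AllocateAndAbandon(int count, int byteCount)
    {
        for (int i = 0; i < count; i++)
            _ = new FinalizerOwnedBuffer(byteCount);
    }
}
```

The middle line is the interesting one: its value depends on how far the finalizer thread has progressed, which is the non-determinism described in the list above.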

Where this hurts in practice

  • Training / inference loops: each iteration creates intermediate arrays (a + b, a * c, …). All of them queue for finalization. Working set climbs across epochs until either GC catches up or the OS pages.
  • Batch processing: hundreds of NDArrays allocated and abandoned per batch. The finalizer queue lags behind the allocation rate.
  • Benchmarks / test sweeps: sequential scenarios contaminate each other's measurements because the finalizer queue from earlier scenarios is still draining when later ones run.
  • Long-running services: working set can grow to hundreds of MiB above the actual live set, increasing the chance of OS-level page trimming and the soft page faults that follow.
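Until buffers can be released eagerly, a sweep can at least limit cross-contamination by draining the finalizer queue before each timed scenario, using the same Collect / WaitForPendingFinalizers / Collect sequence as the reproduction. The sketch below assumes a hypothetical BenchIsolation helper that is not part of test/NumSharp.Benchmark, and the scenario is a plain managed array fill used as a placeholder.

```csharp
using System;
using System.Diagnostics;

// Placeholder scenario; a real sweep would run one benchmark case here.
double[] sink = Array.Empty<double>();
TimeSpan elapsed = BenchIsolation.Measure(() =>
{
    sink = new double[1_000_000];
    Array.Fill(sink, 1.0);
});
Console.WriteLine($"scenario took {elapsed.TotalMilliseconds:F3} ms");

// Hypothetical helper, written for this illustration only.
static class BenchIsolation
{
    // Collect, let the finalizer thread release native buffers,
    // then collect again to reclaim the finalized wrappers.
    public static void DrainFinalizers()
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }

    // Drains leftovers from earlier scenarios before starting the clock,
    // so the drain itself is not included in the measurement.
    public static TimeSpan Measure(Action scenario)
    {
        DrainFinalizers();
        var sw = Stopwatch.StartNew();
        scenario();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

This is a measurement workaround, not a fix: it costs a full blocking collection per scenario, and pages the OS has already trimmed still have to fault back in.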

Acceptance criteria

A fix for this problem should:

  • Let callers release an NDArray's unmanaged buffer synchronously, without depending on GC.Collect / GC.WaitForPendingFinalizers.
  • Stay safe in the presence of views and aliases (releasing one reference must not leave dangling pointers in another).
  • Keep the finalizer-based path as a safety net for code that doesn't opt in.
  • Not break any of the ~8500 existing unit tests.
  • Show measurable improvement on the concat benchmark sweep's out_mixed_to_float64 and prom_* scenarios when arrays are explicitly released.
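One possible shape that meets the first three criteria is reference counting on the native allocation. The sketch below is a design illustration, not a proposal for NumSharp's actual types: NativeBufferHandle and BufferView are hypothetical names, and it leans on the reference count built into System.Runtime.InteropServices.SafeHandle (DangerousAddRef / DangerousRelease). Each view takes a reference, so disposing the owner while a view is alive defers the free until the last view is released.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

var owner = new NativeBufferHandle(1 << 20);
var view = new BufferView(owner);

owner.Dispose();   // the view still holds a reference, so nothing is freed yet
Console.WriteLine($"released after owner.Dispose: {NativeBufferHandle.ReleasedCount}"); // 0

view.Dispose();    // last reference dropped: ReleaseHandle runs synchronously here
Console.WriteLine($"released after view.Dispose:  {NativeBufferHandle.ReleasedCount}"); // 1

// Hypothetical owner of one native allocation.
sealed class NativeBufferHandle : SafeHandle
{
    public static int ReleasedCount; // demonstration counter only

    public NativeBufferHandle(int byteCount) : base(IntPtr.Zero, ownsHandle: true)
    {
        SetHandle(Marshal.AllocHGlobal(byteCount));
    }

    public override bool IsInvalid => handle == IntPtr.Zero;

    // Called once, when the reference count reaches zero
    // (via Dispose, DangerousRelease, or the SafeHandle finalizer).
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        Interlocked.Increment(ref ReleasedCount);
        return true;
    }
}

// Hypothetical view or alias that keeps the allocation alive while in use.
sealed class BufferView : IDisposable
{
    private readonly NativeBufferHandle _buffer;
    private int _released;

    public BufferView(NativeBufferHandle buffer)
    {
        bool added = false;
        buffer.DangerousAddRef(ref added); // throws if the handle is already closed
        _buffer = buffer;
    }

    public void Dispose()
    {
        ReleaseOnce();
        GC.SuppressFinalize(this);
    }

    // Safety net: an abandoned view still gives its reference back eventually.
    ~BufferView() => ReleaseOnce();

    private void ReleaseOnce()
    {
        if (Interlocked.Exchange(ref _released, 1) == 0)
            _buffer.DangerousRelease();
    }
}
```

Code that never calls Dispose falls back to the finalizers of the handle and its views, which is the existing behavior; code that opts in gets the free on the thread and at the moment it asks for it.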

Repro environment

  • OS: Windows 11 (the working-set trimming behavior is OS-dependent; Linux exhibits the symptom differently)
  • .NET: 8.0 and 10.0 — both reproduce
  • NumSharp: nditer branch as of 2026-05-21
  • Verified empirically: script.cs reproduction (see Reproduction section) and test/NumSharp.Benchmark concat sweep.

Labels: architecture, core, performance