[Unmanaged Memory] Memory bloating on heavily looped calls due to delayed finalization by GC #613

## Overview

NDArray's unmanaged buffers are released by the CLR finalizer thread rather than deterministically. In hot allocation paths (training loops, batch operations, benchmark sweeps), the finalizer thread cannot keep up with allocation pressure, so native memory stays committed long after the managed NDArray wrappers become unreachable. This inflates the process working set, triggers Windows page trimming on actively used buffers, and produces slowdowns that grow non-linearly with the amount the application allocates.

## Reproduction

Allocate a large number of NDArrays in sequence, drop the managed references, and observe that a plain `GC.Collect()` does not return the unmanaged memory. Only `GC.WaitForPendingFinalizers()` actually releases the buffers.

```csharp
using System;
using System.Diagnostics;
using NumSharp;

var p = Process.GetCurrentProcess();

GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();
long baseline = p.WorkingSet64;

// Allocate ~763 MiB (800 MB) of unmanaged memory backing 100 NDArrays (8 MB each).
NDArray[] arrs = new NDArray[100];
for (int i = 0; i < 100; i++)
    arrs[i] = new NDArray(NPTypeCode.Double, new Shape(1_000_000), fillZeros: true);

p.Refresh();
Console.WriteLine($"after alloc:                {(p.WorkingSet64 - baseline)/1024/1024,5} MiB");

// Drop every managed reference.
for (int i = 0; i < 100; i++) arrs[i] = null;
arrs = null;

// One full collection — should be enough to release everything if NDArray
// owned its native buffer deterministically.
GC.Collect();
p.Refresh();
Console.WriteLine($"after GC.Collect:           {(p.WorkingSet64 - baseline)/1024/1024,5} MiB");

// Now force the finalizer thread to drain its queue.
GC.WaitForPendingFinalizers();
GC.Collect();
p.Refresh();
Console.WriteLine($"after WaitForPendingFinal:  {(p.WorkingSet64 - baseline)/1024/1024,5} MiB");
```

## Expected behavior

NumPy frees array buffers synchronously when the last reference drops (CPython refcount → `array_dealloc` → `free(fa->data)`). A loop like:

```python
import numpy as np
import os, psutil
p = psutil.Process(os.getpid())
print(p.memory_info().rss // 1024 // 1024, "MiB")
for _ in range(100):
    a = np.zeros(1_000_000, dtype=np.float64)
print(p.memory_info().rss // 1024 // 1024, "MiB")
```

…never grows the resident set, because each `a` is freed *during* the next assignment when refcount drops to zero. No GC pass required.

NumSharp should give the same deterministic-release guarantee: once an array is no longer reachable, its unmanaged buffer should be released without depending on whichever pass of the CLR finalizer thread happens to drain it.

## Actual behavior

Output from the reproduction above on Windows 11 / .NET 10:

```
after alloc:                  765 MiB
after GC.Collect:             519 MiB    ← roughly 0.5 GiB still held
after WaitForPendingFinal:      0 MiB
```

After `GC.Collect()` alone, roughly half a gigabyte of unmanaged memory was still committed even though no managed reference points at it. Only the explicit `WaitForPendingFinalizers()` call drained the queue and released the buffers.

## Performance impact

The same effect surfaces as a real wall-time regression under sweep workloads. The `np.concatenate` benchmark sweep (55 scenarios run in sequence) shows:

| Scenario | NumPy | NumSharp in isolation | NumSharp after the full sweep |
|---|---:|---:|---:|
| `out_mixed_to_float64` (1M float32 + 1M int32 → 2M float64, `out=`) | 0.50 ms | 0.52 ms (≈ NumPy) | **2.54 ms (5.05× NumPy)** |

In isolation the cross-dtype copy is competitive with NumPy. Run inside a sweep that has already churned through dozens of large allocations, the same call slows down ~5× — not because the kernel changed, but because cumulative working-set pressure from undrained finalizers causes the OS to trim active pages of the destination buffer, which then fault back in during the copy.

## Root cause (mechanism)

The chain that holds an NDArray's native buffer alive looks like:

```
NDArray  →  UnmanagedStorage  →  IArraySlice  →  UnmanagedMemoryBlock<T>  →  Disposer  (~Disposer())
                                                                                  └── NativeMemory.Free
```

- `NDArray` is a plain class. It does not implement `IDisposable` and has no finalizer.
- The native `free` call lives in `~Disposer()`, which runs on the CLR finalizer thread.
- The CLR runs all finalizers on a **single** dedicated thread at THREAD_PRIORITY_NORMAL, so they execute one at a time. Under sustained allocation pressure that thread falls behind.
- Until a Disposer's finalizer actually runs, its `NativeMemory.Alloc`-backed buffer stays committed even though it is unreachable from managed code.
- `GC.AddMemoryPressure` is already wired up (``UnmanagedMemoryBlock`1.cs:1012``) to nudge the GC scheduler, but it only makes collections happen sooner; it cannot run finalizers synchronously.

The user cannot release a buffer eagerly: there is no public path that drops the underlying allocation when the user knows the array is no longer needed.
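The finalizer dependency described above can be reproduced without NumSharp. The following is a minimal model, not NumSharp's code: `FinalizedBuffer` is a hypothetical stand-in for `Disposer`, and it uses `Marshal.AllocHGlobal` rather than `NativeMemory.Alloc` only so the sketch compiles without `unsafe`. A static counter tracks native bytes whose finalizer has not yet run.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Threading;

FinalizedBuffer.AllocateAndAbandon(count: 100, bytesEach: 8_000_000);
Console.WriteLine($"after alloc:      {FinalizedBuffer.LiveBytes / 1_000_000} MB not yet freed");

// A collection only queues unreachable finalizable objects for the finalizer thread.
GC.Collect();
Console.WriteLine($"after GC.Collect: {FinalizedBuffer.LiveBytes / 1_000_000} MB not yet freed");

// Blocking on the finalizer thread is what guarantees the queued frees have run.
GC.WaitForPendingFinalizers();
Console.WriteLine($"after drain:      {FinalizedBuffer.LiveBytes / 1_000_000} MB not yet freed");

// Hypothetical stand-in for a finalizer-only owner of native memory.
sealed class FinalizedBuffer
{
    private static long s_liveBytes;
    private readonly IntPtr _ptr;
    private readonly int _bytes;

    public static long LiveBytes => Interlocked.Read(ref s_liveBytes);

    public FinalizedBuffer(int bytes)
    {
        _ptr = Marshal.AllocHGlobal(bytes);
        _bytes = bytes;
        Interlocked.Add(ref s_liveBytes, bytes);
    }

    // The only release path: runs on the finalizer thread at some later time.
    ~FinalizedBuffer()
    {
        if (_ptr == IntPtr.Zero) return; // constructor failed before allocating
        Marshal.FreeHGlobal(_ptr);
        Interlocked.Add(ref s_liveBytes, -_bytes);
    }

    // NoInlining keeps the abandoned objects from being rooted by the caller's frame.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void AllocateAndAbandon(int count, int bytesEach)
    {
        for (int i = 0; i < count; i++)
            _ = new FinalizedBuffer(bytesEach);
    }
}
```

The middle figure depends on how far the finalizer thread has progressed when the line prints, so it varies between runs. Only the value printed after `GC.WaitForPendingFinalizers()` is deterministic.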

## Where this hurts in practice

- **Training / inference loops**: each iteration creates intermediate arrays (`a + b`, `a * c`, …). All of them queue for finalization. Working set climbs across epochs until either GC catches up or the OS pages.
- **Batch processing**: hundreds of NDArrays allocated and abandoned per batch. The finalizer queue lags behind the allocation rate.
- **Benchmarks / test sweeps**: sequential scenarios contaminate each other's measurements because the finalizer queue from earlier scenarios is still draining when later ones run.
- **Long-running services**: working set can grow to hundreds of MiB above the actual live set, increasing the chance of OS-level page trimming and the soft page faults that follow.
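Until an eager-release path exists, one possible stopgap for the benchmark and batch cases is to drain the finalizer queue at scenario or batch boundaries, using the same `GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();` sequence the reproduction uses for its baseline. The sketch below assumes a hypothetical helper named `FinalizerDrain`; it is not part of NumSharp or its benchmark project. The scenario body is a placeholder.

```csharp
using System;
using System.Diagnostics;

TimeSpan elapsed = FinalizerDrain.Measure(() =>
{
    // Placeholder scenario body; a real sweep would call the operation under test here.
    var scratch = new double[1_000_000];
    Array.Fill(scratch, 1.0);
});
Console.WriteLine($"scenario took {elapsed.TotalMilliseconds:F2} ms");

// Hypothetical helper: isolates a measurement from finalizers queued by earlier work.
static class FinalizerDrain
{
    public static void Drain()
    {
        GC.Collect();                  // queue unreachable finalizable objects
        GC.WaitForPendingFinalizers(); // block until their finalizers have run
        GC.Collect();                  // reclaim the objects that were just finalized
    }

    public static TimeSpan Measure(Action scenario)
    {
        Drain(); // the drain happens outside the timed region
        var sw = Stopwatch.StartNew();
        scenario();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

This only hides the problem at measurement boundaries. A forced full collection is expensive, so it is not a substitute for deterministic release inside a hot loop.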

## Acceptance criteria

A fix for this problem should:

- [ ] Let callers release an NDArray's unmanaged buffer **synchronously**, without depending on `GC.Collect` / `GC.WaitForPendingFinalizers`.
- [ ] Stay safe in the presence of views and aliases (releasing one reference must not leave dangling pointers in another).
- [ ] Keep the finalizer-based path as a safety net for code that doesn't opt in.
- [ ] Not break any of the ~8500 existing unit tests.
- [ ] Show measurable improvement on the `concat` benchmark sweep's `out_mixed_to_float64` and `prom_*` scenarios when arrays are explicitly released.
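One way the first three criteria could fit together is a reference-counted buffer with an opt-in `Dispose` and a finalizer fallback. This is a sketch under stated assumptions, not a proposal for NumSharp's actual types: `SharedNativeBuffer` and `ArrayHandle` are hypothetical names, and `Marshal.AllocHGlobal` stands in for `NativeMemory.Alloc` to avoid `unsafe`.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

var owner = new ArrayHandle(8_000_000);
var view = owner.CreateView();

owner.Dispose(); // the view still holds a reference, so nothing is freed yet
Console.WriteLine($"freed after owner.Dispose(): {view.BufferFreed}"); // False
view.Dispose();  // last reference released: freed synchronously, no GC involved
Console.WriteLine($"freed after view.Dispose():  {view.BufferFreed}"); // True

// Hypothetical shared owner of one native allocation.
sealed class SharedNativeBuffer
{
    private IntPtr _ptr;
    private readonly long _bytes;
    private int _refCount = 1;

    public SharedNativeBuffer(int bytes)
    {
        _ptr = Marshal.AllocHGlobal(bytes);
        _bytes = bytes;
        GC.AddMemoryPressure(bytes);
    }

    public bool IsFreed => Volatile.Read(ref _ptr) == IntPtr.Zero;

    // Simplification: does not guard against AddRef racing with the final Release.
    public void AddRef() => Interlocked.Increment(ref _refCount);

    public void Release()
    {
        if (Interlocked.Decrement(ref _refCount) == 0)
        {
            Free();
            GC.SuppressFinalize(this); // eager path taken, finalizer not needed
        }
    }

    private void Free()
    {
        // Exchange makes the free idempotent across Release and the finalizer.
        IntPtr p = Interlocked.Exchange(ref _ptr, IntPtr.Zero);
        if (p == IntPtr.Zero) return;
        Marshal.FreeHGlobal(p);
        GC.RemoveMemoryPressure(_bytes);
    }

    // Safety net for handles that were dropped without Dispose.
    ~SharedNativeBuffer() => Free();
}

// Hypothetical array or view wrapper; each instance owns exactly one reference.
sealed class ArrayHandle : IDisposable
{
    private readonly SharedNativeBuffer _buffer;
    private int _disposed;

    public ArrayHandle(int bytes) : this(new SharedNativeBuffer(bytes)) { }
    private ArrayHandle(SharedNativeBuffer buffer) => _buffer = buffer;

    public bool BufferFreed => _buffer.IsFreed;

    public ArrayHandle CreateView()
    {
        if (Volatile.Read(ref _disposed) != 0)
            throw new ObjectDisposedException(nameof(ArrayHandle));
        _buffer.AddRef();
        return new ArrayHandle(_buffer);
    }

    public void Dispose()
    {
        // Double Dispose must not release the shared buffer twice.
        if (Interlocked.Exchange(ref _disposed, 1) == 0)
            _buffer.Release();
    }
}
```

Code that never calls `Dispose` behaves as it does today: the buffer is freed by the finalizer once every handle is unreachable. A production version would also need to close the race between `CreateView` and a concurrent final `Dispose`, which is the problem `SafeHandle`'s reference counting addresses.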

## Repro environment

- OS: Windows 11 (the working-set trimming behavior is OS-dependent; Linux exhibits the symptom differently)
- .NET: 8.0 and 10.0 — both reproduce
- NumSharp: `nditer` branch as of 2026-05-21
- Verified empirically: `script.cs` reproduction (see Reproduction section) and `test/NumSharp.Benchmark` `concat` sweep.

