Add managed-memory advise, prefetch, and discard-prefetch free functions #1775
rparolin wants to merge 83 commits into
Conversation
/ok to test
question: Does making these member functions of the
I'm moving this back into draft. We discussed this in our team meeting because I was already hesitant: Buffer is becoming a 'God object' with the functionality it is gaining. We were going to explore alternatives. Free functions sound like a good alternative to explore.
…ns in the cuda.core.managed_memory namespace
…ups, fix docs

- Remove duplicate long-form "cu_mem_advise_*" string aliases from _MANAGED_ADVICE_ALIASES; users pass short strings or the enum directly
- Replace 4 boolean allow_* params in _normalize_managed_location with a single allowed_loctypes frozenset driven by _MANAGED_ADVICE_ALLOWED_LOCTYPES
- Cache immutable runtime checks: CU_DEVICE_CPU, v2 bindings flag, discard_prefetch support, and advice enum-to-alias reverse map
- Collapse hasattr+getattr to single getattr in _managed_location_enum
- Move _require_managed_discard_prefetch_support to top of discard_prefetch for fail-fast behavior
- Fix docs build: reset Sphinx module scope after managed_memory section in api.rst so subsequent sections resolve under cuda.core
- Add discard_prefetch pool-allocation test and comment on _get_mem_range_attr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
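The frozenset-driven normalization described in this commit can be sketched as follows. This is an illustrative stand-in, not cuda.core's actual code: the names `DEVICE`/`HOST`/`HOST_NUMA`, the table contents, and `normalize_location` are all hypothetical; the real `_normalize_managed_location` and `_MANAGED_ADVICE_ALLOWED_LOCTYPES` differ.

```python
# One table maps each advice kind to the location types it accepts, replacing
# a pile of allow_* boolean parameters with a single data-driven check.
DEVICE, HOST, HOST_NUMA = "device", "host", "host_numa"

_ALLOWED_LOCTYPES = {
    "set_preferred_location": frozenset({DEVICE, HOST, HOST_NUMA}),
    "set_accessed_by": frozenset({DEVICE, HOST}),
}

def normalize_location(advice: str, loctype: str) -> str:
    # Look up the allowed set once; adding an advice kind is a one-line change
    # to the table rather than a new boolean parameter at every call site.
    allowed = _ALLOWED_LOCTYPES[advice]
    if loctype not in allowed:
        raise ValueError(f"{advice!r} does not accept location type {loctype!r}")
    return loctype
```

The design choice is the usual one: a frozenset table keeps policy in data, so call sites stay uniform and the validation error message can enumerate what is allowed.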
…e legacy path The _V2_BINDINGS cache in _buffer.pyx persists across tests, so monkeypatching get_binding_version alone is insufficient when earlier tests have already populated the cache with the v2 value. Promote _V2_BINDINGS from cdef int to a Python-level variable so tests can monkeypatch it directly via monkeypatch.setattr, and reset it to -1 in both legacy-signature tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t real hardware These three tests call cuMemAdvise on real CUDA devices and verify memory range attributes. On devices without concurrent_managed_access (e.g. Windows/WDDM), set_read_mostly silently no-ops and set_preferred_location fails with CUDA_ERROR_INVALID_DEVICE. Use the stricter _skip_if_managed_location_ops_unsupported guard, matching the pattern already used by test_managed_memory_functions_accept_raw_pointer_ranges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s support Reorder checks in discard_prefetch so _normalize_managed_target_range runs before _require_managed_discard_prefetch_support. This ensures non-managed buffers raise ValueError before the RuntimeError for missing cuMemDiscardAndPrefetchBatchAsync support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ps module Move advise, prefetch, and discard_prefetch functions and their helpers out of _buffer.pyx into a new _managed_memory_ops Cython module to improve separation of concerns. Expose _init_mem_attrs and _query_memory_attrs as non-inline cdef functions in _buffer.pxd so the new module can reuse them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move _require_managed_buffer to the first statement of _advise_one so a non-managed buffer is rejected before advice/location parsing, matching the order in _do_single_prefetch_py and _do_single_discard_prefetch_py. This prevents surfacing an advice-validation error when the real problem is the buffer kind.
Rephrase the RuntimeError raised from _to_legacy_device when a caller passes Host(numa_id=...) or Host.numa_current() on a CUDA 12 build. The new message names the unsupported APIs and points the user at Host() as the working alternative, instead of leaking the internal location_type discriminator.
The CUDA 12 cuMemPrefetchAsync / cuMemAdvise ABI takes a plain device ordinal and cannot represent a specific host NUMA node. Previously _coerce_location accepted Host(numa_id=...) and Host.numa_current() on a CUDA 12 build and let the operation fail late inside the Cython layer with RuntimeError, which the public APIs surfaced as a confusing error from deep in the stack. Reject NUMA-host kinds at the call boundary in _coerce_location with a TypeError that names the unsupported APIs and points at Host() as the working alternative. Update the ManagedBuffer docstring to match the new contract, and broaden two host_numa-rejection test asserts to accept either the CUDA 13 kind-allowed ValueError or the CUDA 12 boundary TypeError. Addresses rwgk's Medium finding on PR NVIDIA#1775.
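The boundary check this commit adds can be sketched like so. The `Host`/`Device` classes and `coerce_location` here are simplified stand-ins (the real `_coerce_location` produces an internal `_LocSpec` record); only the error placement and shape mirror the commit.

```python
class Host:
    def __init__(self, numa_id=None):
        self.numa_id = numa_id

class Device:
    def __init__(self, ordinal=0):
        self.ordinal = ordinal

CUDA_MAJOR = 12  # pretend this is a CUDA 12 build of the library

def coerce_location(loc):
    # Reject NUMA-host kinds at the call boundary: the CUDA 12 ABI takes a
    # plain device ordinal and cannot represent a specific host NUMA node,
    # so failing here avoids a confusing late RuntimeError from deep inside
    # the Cython layer.
    if isinstance(loc, Host) and loc.numa_id is not None and CUDA_MAJOR < 13:
        raise TypeError(
            "Host(numa_id=...) requires CUDA 13 bindings; "
            "use Host() on a CUDA 12 build instead"
        )
    # Legacy path: host maps to a CPU sentinel, device to its ordinal.
    return -1 if isinstance(loc, Host) else loc.ordinal
```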
The previous setter computed (current - target) and (target - current)
and called _advise_one in two loops. set(locations) raised TypeError
on unhashable elements, but only after the first diff pair had already
been issued, so an invalid RHS could leave accessed_by partially
mutated. Reproduce: starting from {Device(0)}, assigning
{Host(numa_id=0)} on CUDA 12 raises and leaves accessed_by == set().
Validate every target up-front (per-element isinstance(Device|Host))
and only then issue the diff loops, so a bad RHS raises before any
driver state changes.
Addresses rwgk's High finding on PR NVIDIA#1775.
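The validate-before-mutate pattern the commit describes can be reproduced with a toy container. `Device` and `AccessedBy` here are stand-ins, not cuda.core types; the point is only the ordering: every RHS element is checked before any state changes, so a bad element cannot leave the set half-updated.

```python
class Device:
    def __init__(self, ordinal):
        self.ordinal = ordinal
    def __hash__(self):
        return hash(("device", self.ordinal))
    def __eq__(self, other):
        return isinstance(other, Device) and other.ordinal == self.ordinal

class AccessedBy:
    def __init__(self):
        self._current = set()

    def assign(self, targets):
        targets = list(targets)
        # Phase 1: validate every element up-front, before touching state.
        for t in targets:
            if not isinstance(t, Device):
                raise TypeError(f"unsupported accessed_by target: {t!r}")
        # Phase 2: only now issue the diff loops (unset advice, set advice).
        target_set = set(targets)
        for t in self._current - target_set:
            self._current.discard(t)
        for t in target_set - self._current:
            self._current.add(t)
```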
@rwgk re your High finding: Done in 1b66367. The setter now validates every RHS element up-front before issuing the diff loops.
@rwgk re your Medium finding (CUDA 12 incoherence): Done in bcc056b.
@rwgk re your Low finding: Done in d0b6621.
Collapses multi-line string concats and conditions back to single lines under the project's line-length limit. No behavior change.
…m_advise_prefetch # Conflicts: # cuda_core/docs/source/release/1.0.0-notes.rst
Host(numa_id=N) and Host.numa_current() require CUDA 13 bindings; the TestLocationCoerce passthroughs were missing the binding_version guard already used by test_preferred_location_roundtrip_host_numa.
…m_advise_prefetch # Conflicts: # cuda_core/cuda/core/utils.py # cuda_core/docs/source/api.rst # cuda_core/tests/test_memory.py
…m_advise_prefetch # Conflicts: # cuda_core/docs/source/release/1.0.0-notes.rst
- ManagedMemoryResource.allocate: require explicit stream (kw-only), matching the post-NVIDIA#2020 convention across stream-scheduling APIs.
- Batch free functions (discard_batch / prefetch_batch / discard_prefetch_batch): move stream to the first positional argument to mirror launch(stream, ...); add full type annotations.
- Host: drop the redundant __setattr__ guard now that __slots__ alone enforces immutability.
- test_managed_ops.py: extract memory_pool / location_ops / discard_prefetch fixture tiers, eliminating the device + mr + buffer preamble previously copied across most tests.
- test_accessed_by_*: replace the hand-rolled MutableSet pass with helpers.collection_interface_testers.assert_single_member_mutable_set_interface introduced by NVIDIA#2018.
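The two stream-argument shapes described in this commit can be shown side by side. These are illustrative signatures only, not the real cuda.core definitions; the bodies are stubs.

```python
from typing import Any, Sequence

def allocate(size: int, *, stream: Any) -> str:
    # Kw-only stream: allocate(1024, s) is a TypeError; callers must write
    # allocate(1024, stream=s), matching the post-#2020 convention.
    return f"alloc({size}) on {stream}"

def prefetch_batch(stream: Any, buffers: Sequence[Any], locations: Sequence[Any]) -> int:
    # Stream-first positional, mirroring launch(stream, ...).
    if len(buffers) != len(locations):
        raise ValueError("buffers and locations must have the same length")
    return len(buffers)
```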
Captures the stream-argument shape (kw-only required vs launch-style positional), the __slots__-only immutability convention, and the pytest-fixture / helper-reuse expectations established while addressing PR NVIDIA#1775 review feedback. Future agents should hit these guardrails before writing code that recreates the same issues.
Closes the asymmetry leofang flagged: numa_id was a constructor arg but is_numa_current was only reachable via Host.numa_current(). Both state fields are now uniformly settable through the constructor; Host.numa_current() becomes a thin alias. The two are mutually exclusive — passing both raises ValueError.
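The constructor contract described above (both state fields settable, mutually exclusive, interned on the argument pair, with `numa_current()` as a thin alias) can be sketched as a plain-Python stand-in; the real `Host` lives in cuda.core and differs in detail.

```python
class Host:
    __slots__ = ("numa_id", "is_numa_current")
    _cache = {}  # intern cache keyed by (numa_id, is_numa_current)

    def __new__(cls, numa_id=None, is_numa_current=False):
        # The two state fields are mutually exclusive.
        if numa_id is not None and is_numa_current:
            raise ValueError("numa_id and is_numa_current are mutually exclusive")
        key = (numa_id, is_numa_current)
        inst = cls._cache.get(key)
        if inst is None:
            inst = super().__new__(cls)
            inst.numa_id = numa_id
            inst.is_numa_current = is_numa_current
            cls._cache[key] = inst
        return inst  # same-argument constructions return the same instance

    @classmethod
    def numa_current(cls):
        # Thin alias over the constructor, as the commit describes.
        return cls(is_numa_current=True)
```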
…m_advise_prefetch
Summary
Adds managed-memory range operations to `cuda.core`, exposed in `cuda.core.utils`: `advise`, `prefetch`, `discard`, `discard_prefetch`. Each accepts either a single `Buffer` or a sequence; N==1 dispatches to the per-range driver entry point and N>1 dispatches to the corresponding `cuMem*BatchAsync` (CUDA 13+).

- `Host` — new top-level singleton class symmetric to `Device`: `Host()` (any host), `Host(numa_id=N)`, `Host.numa_current()`. Same-argument constructions are interned (`Host() is Host()`). Used together with `Device` to express managed-memory locations.
- `ManagedBuffer` — `Buffer` subclass returned by `ManagedMemoryResource.allocate`. Exposes a Pythonic property-style advice API on top of the same free functions. Wrap an external managed pointer with `Buffer.from_handle(...)` (now a `@classmethod`, so `ManagedBuffer.from_handle(...)` returns a `ManagedBuffer`).

Closes #1332. Addresses the managed-memory portion of #1333 (P1: `cuMemPrefetchBatchAsync`, `cuMemDiscardBatchAsync`, `cuMemDiscardAndPrefetchBatchAsync`). The P0 `cuMemcpyBatchAsync` from #1333 is intentionally out of scope and tracked separately; the holistic batched-API contract this PR commits to is documented in #issuecomment-4355502334 so the upcoming `cuMemcpyBatchAsync` work can mirror it.

Public API
ManagedBuffer — property-style advice on managed allocations

`ManagedMemoryResource.allocate` returns a `ManagedBuffer` (a `Buffer` subclass). All ManagedBuffer-specific behavior is layered on top of the free functions, so the two surfaces stay consistent.

Free functions — `advise` / `prefetch` / `discard` / `discard_prefetch`

Each accepts a `Buffer` (or `ManagedBuffer`) or a sequence of them. Locations are expressed via `Device` or `Host`.

Batched form — same function, sequence of targets

When N>1, dispatch goes to the corresponding `cuMem*BatchAsync`. Sequence locations are paired by index; a scalar location broadcasts to every target. Mismatched sequence lengths raise `ValueError`. On a CUDA 12 build of `cuda.core`, N>1 raises `NotImplementedError` (the `*BatchAsync` entry points are CUDA 13+); N==1 works on every supported toolkit.

Putting it together
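The dispatch contract above can be modeled with a toy `prefetch`. This is not cuda.core's implementation: the `_prefetch_one` / `_prefetch_batch` stubs stand in for the `cuMemPrefetchAsync` / `cuMemPrefetchBatchAsync` entry points, and `CUDA_MAJOR` is a pretend build flag; only the dispatch, broadcast, and error rules mirror the text.

```python
CUDA_MAJOR = 13  # flip to 12 to exercise the NotImplementedError path

def _prefetch_one(buf, loc):
    return [(buf, loc)]          # stand-in for the per-range driver call

def _prefetch_batch(bufs, locs):
    if CUDA_MAJOR < 13:
        raise NotImplementedError("the *BatchAsync entry points are CUDA 13+")
    return list(zip(bufs, locs)) # stand-in for the batch driver call

def prefetch(buffers, location):
    if not isinstance(buffers, (list, tuple)):
        return _prefetch_one(buffers, location)       # single buffer
    if not isinstance(location, (list, tuple)):
        location = [location] * len(buffers)          # scalar location broadcasts
    if len(location) != len(buffers):
        raise ValueError("buffers and locations must have the same length")
    if len(buffers) == 1:
        return _prefetch_one(buffers[0], location[0]) # N==1 uses the per-range path
    return _prefetch_batch(buffers, location)         # N>1 uses the batch path
```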
Implementation notes
- `cuda_core/cuda/core/_memory/_managed_memory_ops.pyx` uses `cimport cydriver` for direct C-level driver calls.
- The CUDA 12 vs CUDA 13 signature difference for `cuMemAdvise` and `cuMemPrefetchAsync` is handled at compile time with `IF CUDA_CORE_BUILD_MAJOR >= 13:` / `ELSE:`.
- The batch entry points (`cuMemPrefetchBatchAsync`, `cuMemDiscardBatchAsync`, `cuMemDiscardAndPrefetchBatchAsync`) are CUDA 13+ only. On CUDA 12 builds, N>1 calls raise `NotImplementedError`; single-buffer calls work everywhere.
- `Host` is a singleton with `__slots__` and a `__new__`-based intern cache keyed by `(numa_id, is_numa_current)`. Same-argument constructions return the same instance on both Python and Cython call paths.
- `ManagedBuffer` is a pure-Python subclass of the Cython `Buffer` cdef class. `Buffer.from_handle` is now a `@classmethod` (was `@staticmethod`) so `MyBufferSubclass.from_handle(...)` returns the typed instance via `cls._init`. `Buffer._from_deviceptr_handle` and `_MP_allocate` thread an optional `cls` parameter so `ManagedMemoryResource.allocate` materializes a `ManagedBuffer`.
- `_LocSpec` (in `_managed_location.py`) carries the `(kind, id)` discriminator that the Cython layer maps to `CUmemLocation` (CUDA 13) or a legacy device ordinal (CUDA 12). Public callers see only `Device`/`Host`; `_coerce_location` produces the internal record.
- `_buffer.pyx` collapses `out.is_managed = (is_managed != 0)` to a single unconditional assignment and adds a TODO noting that HMM/ATS-mapped sysmem is not yet captured by `CU_POINTER_ATTRIBUTE_IS_MANAGED`.
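The `@classmethod` change to `from_handle` noted above boils down to threading `cls` through construction so subclasses get typed instances. A minimal stand-in (the real code routes through `cls._init`; `Buffer`/`ManagedBuffer` here are toys):

```python
class Buffer:
    def __init__(self, handle):
        self.handle = handle

    @classmethod
    def from_handle(cls, handle):
        # As a @classmethod, cls is the class the caller invoked it on, so
        # SubClass.from_handle(...) builds a SubClass. The former @staticmethod
        # version would always have produced a plain Buffer.
        return cls(handle)

class ManagedBuffer(Buffer):
    pass
```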