Add managed-memory advise, prefetch, and discard-prefetch free functions #1775
rparolin wants to merge 83 commits into
Conversation
/ok to test
question: Does making these member functions of the
I'm moving this back into draft. We discussed this in our team meeting because I was already hesitant: Buffer is becoming a 'God object' with the functionality it is gaining. We were going to explore alternatives. Free functions sound like a good alternative to explore.
…ns in the cuda.core.managed_memory namespace
…ups, fix docs

- Remove duplicate long-form "cu_mem_advise_*" string aliases from _MANAGED_ADVICE_ALIASES; users pass short strings or the enum directly
- Replace 4 boolean allow_* params in _normalize_managed_location with a single allowed_loctypes frozenset driven by _MANAGED_ADVICE_ALLOWED_LOCTYPES
- Cache immutable runtime checks: CU_DEVICE_CPU, v2 bindings flag, discard_prefetch support, and advice enum-to-alias reverse map
- Collapse hasattr+getattr to single getattr in _managed_location_enum
- Move _require_managed_discard_prefetch_support to top of discard_prefetch for fail-fast behavior
- Fix docs build: reset Sphinx module scope after managed_memory section in api.rst so subsequent sections resolve under cuda.core
- Add discard_prefetch pool-allocation test and comment on _get_mem_range_attr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
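The frozenset-driven normalization described in this commit can be sketched as follows. This is an illustrative stand-in, not cuda.core's actual code: the names `DEVICE`/`HOST`/`HOST_NUMA`, the table contents, and `normalize_location` are all hypothetical; the real `_normalize_managed_location` and `_MANAGED_ADVICE_ALLOWED_LOCTYPES` differ.

```python
# One table maps each advice kind to the location types it accepts, replacing
# a pile of allow_* boolean parameters with a single data-driven check.
DEVICE, HOST, HOST_NUMA = "device", "host", "host_numa"

_ALLOWED_LOCTYPES = {
    "set_preferred_location": frozenset({DEVICE, HOST, HOST_NUMA}),
    "set_accessed_by": frozenset({DEVICE, HOST}),
}

def normalize_location(advice: str, loctype: str) -> str:
    # Look up the allowed set once; adding an advice kind is a one-line change
    # to the table rather than a new boolean parameter at every call site.
    allowed = _ALLOWED_LOCTYPES[advice]
    if loctype not in allowed:
        raise ValueError(f"{advice!r} does not accept location type {loctype!r}")
    return loctype
```

The design choice is the usual one: a frozenset table keeps policy in data, so call sites stay uniform and the validation error message can enumerate what is allowed.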
…e legacy path The _V2_BINDINGS cache in _buffer.pyx persists across tests, so monkeypatching get_binding_version alone is insufficient when earlier tests have already populated the cache with the v2 value. Promote _V2_BINDINGS from cdef int to a Python-level variable so tests can monkeypatch it directly via monkeypatch.setattr, and reset it to -1 in both legacy-signature tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t real hardware These three tests call cuMemAdvise on real CUDA devices and verify memory range attributes. On devices without concurrent_managed_access (e.g. Windows/WDDM), set_read_mostly silently no-ops and set_preferred_location fails with CUDA_ERROR_INVALID_DEVICE. Use the stricter _skip_if_managed_location_ops_unsupported guard, matching the pattern already used by test_managed_memory_functions_accept_raw_pointer_ranges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s support Reorder checks in discard_prefetch so _normalize_managed_target_range runs before _require_managed_discard_prefetch_support. This ensures non-managed buffers raise ValueError before the RuntimeError for missing cuMemDiscardAndPrefetchBatchAsync support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ps module Move advise, prefetch, and discard_prefetch functions and their helpers out of _buffer.pyx into a new _managed_memory_ops Cython module to improve separation of concerns. Expose _init_mem_attrs and _query_memory_attrs as non-inline cdef functions in _buffer.pxd so the new module can reuse them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move _require_managed_buffer to the first statement of _advise_one so a non-managed buffer is rejected before advice/location parsing, matching the order in _do_single_prefetch_py and _do_single_discard_prefetch_py. This prevents surfacing an advice-validation error when the real problem is the buffer kind.
Rephrase the RuntimeError raised from _to_legacy_device when a caller passes Host(numa_id=...) or Host.numa_current() on a CUDA 12 build. The new message names the unsupported APIs and points the user at Host() as the working alternative, instead of leaking the internal location_type discriminator.
The CUDA 12 cuMemPrefetchAsync / cuMemAdvise ABI takes a plain device ordinal and cannot represent a specific host NUMA node. Previously _coerce_location accepted Host(numa_id=...) and Host.numa_current() on a CUDA 12 build and let the operation fail late inside the Cython layer with RuntimeError, which the public APIs surfaced as a confusing error from deep in the stack. Reject NUMA-host kinds at the call boundary in _coerce_location with a TypeError that names the unsupported APIs and points at Host() as the working alternative. Update the ManagedBuffer docstring to match the new contract, and broaden two host_numa-rejection test asserts to accept either the CUDA 13 kind-allowed ValueError or the CUDA 12 boundary TypeError. Addresses rwgk's Medium finding on PR NVIDIA#1775.
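The boundary check this commit adds can be sketched like so. The `Host`/`Device` classes and `coerce_location` here are simplified stand-ins (the real `_coerce_location` produces an internal `_LocSpec` record); only the error placement and shape mirror the commit.

```python
class Host:
    def __init__(self, numa_id=None):
        self.numa_id = numa_id

class Device:
    def __init__(self, ordinal=0):
        self.ordinal = ordinal

CUDA_MAJOR = 12  # pretend this is a CUDA 12 build of the library

def coerce_location(loc):
    # Reject NUMA-host kinds at the call boundary: the CUDA 12 ABI takes a
    # plain device ordinal and cannot represent a specific host NUMA node,
    # so failing here avoids a confusing late RuntimeError from deep inside
    # the Cython layer.
    if isinstance(loc, Host) and loc.numa_id is not None and CUDA_MAJOR < 13:
        raise TypeError(
            "Host(numa_id=...) requires CUDA 13 bindings; "
            "use Host() on a CUDA 12 build instead"
        )
    # Legacy path: host maps to a CPU sentinel, device to its ordinal.
    return -1 if isinstance(loc, Host) else loc.ordinal
```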
The previous setter computed (current - target) and (target - current)
and called _advise_one in two loops. set(locations) raised TypeError
on unhashable elements, but only after the first diff pair had already
been issued, so an invalid RHS could leave accessed_by partially
mutated. Reproduce: starting from {Device(0)}, assigning
{Host(numa_id=0)} on CUDA 12 raises and leaves accessed_by == set().
Validate every target up-front (per-element isinstance(Device|Host))
and only then issue the diff loops, so a bad RHS raises before any
driver state changes.
Addresses rwgk's High finding on PR NVIDIA#1775.
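The validate-before-mutate pattern the commit describes can be reproduced with a toy container. `Device` and `AccessedBy` here are stand-ins, not cuda.core types; the point is only the ordering: every RHS element is checked before any state changes, so a bad element cannot leave the set half-updated.

```python
class Device:
    def __init__(self, ordinal):
        self.ordinal = ordinal
    def __hash__(self):
        return hash(("device", self.ordinal))
    def __eq__(self, other):
        return isinstance(other, Device) and other.ordinal == self.ordinal

class AccessedBy:
    def __init__(self):
        self._current = set()

    def assign(self, targets):
        targets = list(targets)
        # Phase 1: validate every element up-front, before touching state.
        for t in targets:
            if not isinstance(t, Device):
                raise TypeError(f"unsupported accessed_by target: {t!r}")
        # Phase 2: only now issue the diff loops (unset advice, set advice).
        target_set = set(targets)
        for t in self._current - target_set:
            self._current.discard(t)
        for t in target_set - self._current:
            self._current.add(t)
```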
@rwgk re your High finding: Done in 1b66367. The setter now validates every RHS element up-front before issuing the diff loops.
@rwgk re your Medium finding (CUDA 12 incoherence): Done in bcc056b.
@rwgk re your Low finding: Done in d0b6621.
Collapses multi-line string concats and conditions back to single lines under the project's line-length limit. No behavior change.
…m_advise_prefetch # Conflicts: # cuda_core/docs/source/release/1.0.0-notes.rst
Host(numa_id=N) and Host.numa_current() require CUDA 13 bindings; the TestLocationCoerce passthroughs were missing the binding_version guard already used by test_preferred_location_roundtrip_host_numa.
…m_advise_prefetch # Conflicts: # cuda_core/cuda/core/utils.py # cuda_core/docs/source/api.rst # cuda_core/tests/test_memory.py
…m_advise_prefetch # Conflicts: # cuda_core/docs/source/release/1.0.0-notes.rst
- ManagedMemoryResource.allocate: require explicit stream (kw-only), matching the post-NVIDIA#2020 convention across stream-scheduling APIs.
- Batch free functions (discard_batch / prefetch_batch / discard_prefetch_batch): move stream to the first positional argument to mirror launch(stream, ...); add full type annotations.
- Host: drop the redundant __setattr__ guard now that __slots__ alone enforces immutability.
- test_managed_ops.py: extract memory_pool / location_ops / discard_prefetch fixture tiers, eliminating the device + mr + buffer preamble previously copied across most tests.
- test_accessed_by_*: replace the hand-rolled MutableSet pass with helpers.collection_interface_testers.assert_single_member_mutable_set_interface introduced by NVIDIA#2018.
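The two stream-argument shapes described in this commit can be shown side by side. These are illustrative signatures only, not the real cuda.core definitions; the bodies are stubs.

```python
from typing import Any, Sequence

def allocate(size: int, *, stream: Any) -> str:
    # Kw-only stream: allocate(1024, s) is a TypeError; callers must write
    # allocate(1024, stream=s), matching the post-#2020 convention.
    return f"alloc({size}) on {stream}"

def prefetch_batch(stream: Any, buffers: Sequence[Any], locations: Sequence[Any]) -> int:
    # Stream-first positional, mirroring launch(stream, ...).
    if len(buffers) != len(locations):
        raise ValueError("buffers and locations must have the same length")
    return len(buffers)
```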
Captures the stream-argument shape (kw-only required vs launch-style positional), the __slots__-only immutability convention, and the pytest-fixture / helper-reuse expectations established while addressing PR NVIDIA#1775 review feedback. Future agents should hit these guardrails before writing code that recreates the same issues.
Closes the asymmetry leofang flagged: numa_id was a constructor arg but is_numa_current was only reachable via Host.numa_current(). Both state fields are now uniformly settable through the constructor; Host.numa_current() becomes a thin alias. The two are mutually exclusive — passing both raises ValueError.
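The constructor contract described above (both state fields settable, mutually exclusive, interned on the argument pair, with `numa_current()` as a thin alias) can be sketched as a plain-Python stand-in; the real `Host` lives in cuda.core and differs in detail.

```python
class Host:
    __slots__ = ("numa_id", "is_numa_current")
    _cache = {}  # intern cache keyed by (numa_id, is_numa_current)

    def __new__(cls, numa_id=None, is_numa_current=False):
        # The two state fields are mutually exclusive.
        if numa_id is not None and is_numa_current:
            raise ValueError("numa_id and is_numa_current are mutually exclusive")
        key = (numa_id, is_numa_current)
        inst = cls._cache.get(key)
        if inst is None:
            inst = super().__new__(cls)
            inst.numa_id = numa_id
            inst.is_numa_current = is_numa_current
            cls._cache[key] = inst
        return inst  # same-argument constructions return the same instance

    @classmethod
    def numa_current(cls):
        # Thin alias over the constructor, as the commit describes.
        return cls(is_numa_current=True)
```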
…m_advise_prefetch
Summary
Adds managed-memory range operations to `cuda.core`, exposed in `cuda.core.utils`: `advise`, `prefetch`, `discard`, `discard_prefetch`. Each accepts either a single `Buffer` or a sequence; N==1 dispatches to the per-range driver entry point and N>1 dispatches to the corresponding `cuMem*BatchAsync` (CUDA 13+).

- `Host` — new top-level singleton class symmetric to `Device`: `Host()` (any host), `Host(numa_id=N)`, `Host.numa_current()`. Same-argument constructions are interned (`Host() is Host()`). Used together with `Device` to express managed-memory locations.
- `ManagedBuffer` — `Buffer` subclass returned by `ManagedMemoryResource.allocate`. Exposes a Pythonic property-style advice API on top of the same free functions. Wrap an external managed pointer with `Buffer.from_handle(...)` (now a `@classmethod`, so `ManagedBuffer.from_handle(...)` returns a `ManagedBuffer`).

Closes #1332. Addresses the managed-memory portion of #1333 (P1: `cuMemPrefetchBatchAsync`, `cuMemDiscardBatchAsync`, `cuMemDiscardAndPrefetchBatchAsync`). The P0 `cuMemcpyBatchAsync` from #1333 is intentionally out of scope and tracked separately; the holistic batched-API contract this PR commits to is documented in #issuecomment-4355502334 so the upcoming `cuMemcpyBatchAsync` work can mirror it.

Public API
ManagedBuffer — property-style advice on managed allocations

`ManagedMemoryResource.allocate` returns a `ManagedBuffer` (a `Buffer` subclass). All ManagedBuffer-specific behavior is layered on top of the free functions, so the two surfaces stay consistent.

Free functions — `advise` / `prefetch` / `discard` / `discard_prefetch`

Each accepts a `Buffer` (or `ManagedBuffer`) or a sequence of them. Locations are expressed via `Device` or `Host`.

Batched form — same function, sequence of targets

When N>1, dispatch goes to the corresponding `cuMem*BatchAsync`. Sequence locations are paired by index; a scalar location broadcasts to every target. Mismatched sequence lengths raise `ValueError`. On a CUDA 12 build of `cuda.core`, N>1 raises `NotImplementedError` (the `*BatchAsync` entry points are CUDA 13+); N==1 works on every supported toolkit.

Putting it together
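The dispatch contract above can be modeled with a toy `prefetch`. This is not cuda.core's implementation: the `_prefetch_one` / `_prefetch_batch` stubs stand in for the `cuMemPrefetchAsync` / `cuMemPrefetchBatchAsync` entry points, and `CUDA_MAJOR` is a pretend build flag; only the dispatch, broadcast, and error rules mirror the text.

```python
CUDA_MAJOR = 13  # flip to 12 to exercise the NotImplementedError path

def _prefetch_one(buf, loc):
    return [(buf, loc)]          # stand-in for the per-range driver call

def _prefetch_batch(bufs, locs):
    if CUDA_MAJOR < 13:
        raise NotImplementedError("the *BatchAsync entry points are CUDA 13+")
    return list(zip(bufs, locs)) # stand-in for the batch driver call

def prefetch(buffers, location):
    if not isinstance(buffers, (list, tuple)):
        return _prefetch_one(buffers, location)       # single buffer
    if not isinstance(location, (list, tuple)):
        location = [location] * len(buffers)          # scalar location broadcasts
    if len(location) != len(buffers):
        raise ValueError("buffers and locations must have the same length")
    if len(buffers) == 1:
        return _prefetch_one(buffers[0], location[0]) # N==1 uses the per-range path
    return _prefetch_batch(buffers, location)         # N>1 uses the batch path
```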
Implementation notes
- `cuda_core/cuda/core/_memory/_managed_memory_ops.pyx` uses `cimport cydriver` for direct C-level driver calls.
- The CUDA 12 vs CUDA 13 signature difference for `cuMemAdvise` and `cuMemPrefetchAsync` is handled at compile time with `IF CUDA_CORE_BUILD_MAJOR >= 13:` / `ELSE:`.
- The batch entry points (`cuMemPrefetchBatchAsync`, `cuMemDiscardBatchAsync`, `cuMemDiscardAndPrefetchBatchAsync`) are CUDA 13+ only. On CUDA 12 builds, N>1 calls raise `NotImplementedError`; single-buffer calls work everywhere.
- `Host` is a singleton with `__slots__` and a `__new__`-based intern cache keyed by `(numa_id, is_numa_current)`. Same-argument constructions return the same instance on both Python and Cython call paths.
- `ManagedBuffer` is a pure-Python subclass of the Cython `Buffer` cdef class. `Buffer.from_handle` is now a `@classmethod` (was `@staticmethod`) so `MyBufferSubclass.from_handle(...)` returns the typed instance via `cls._init`. `Buffer._from_deviceptr_handle` and `_MP_allocate` thread an optional `cls` parameter so `ManagedMemoryResource.allocate` materializes a `ManagedBuffer`.
- `_LocSpec` (in `_managed_location.py`) carries the `(kind, id)` discriminator that the Cython layer maps to `CUmemLocation` (CUDA 13) or a legacy device ordinal (CUDA 12). Public callers see only `Device`/`Host`; `_coerce_location` produces the internal record.
- `_buffer.pyx` collapses `out.is_managed = (is_managed != 0)` to a single unconditional assignment and adds a TODO noting that HMM/ATS-mapped sysmem is not yet captured by `CU_POINTER_ATTRIBUTE_IS_MANAGED`.
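The `@classmethod` change to `from_handle` noted above boils down to threading `cls` through construction so subclasses get typed instances. A minimal stand-in (the real code routes through `cls._init`; `Buffer`/`ManagedBuffer` here are toys):

```python
class Buffer:
    def __init__(self, handle):
        self.handle = handle

    @classmethod
    def from_handle(cls, handle):
        # As a @classmethod, cls is the class the caller invoked it on, so
        # SubClass.from_handle(...) builds a SubClass. The former @staticmethod
        # version would always have produced a plain Buffer.
        return cls(handle)

class ManagedBuffer(Buffer):
    pass
```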