Coll allocator op#27
Open
devreal wants to merge 13 commits into
Thread an mca_allocator_base_module_t *allocator parameter through all internal coll/base algorithm functions that allocate Pattern A scratch buffers (sized via opal_datatype_span, gap-adjusted, passed to PML or ompi_op_reduce). COLL_BASE_ALLOC/COLL_BASE_FREE macros dispatch to the allocator when non-NULL, or fall back to malloc/free when NULL. All existing callers (coll/tuned, coll/basic, coll/acoll) pass NULL, preserving current host-malloc behavior with no functional change.

opal/mca/accelerator/base: add opal_accelerator_base_get_device_allocator() helper that returns a cached, per-device bucket allocator backed by opal_accelerator.mem_alloc/mem_release. Created lazily on first use.

coll/tuned decision functions: for gather (scratch is data-movement only, no ompi_op_reduce), detect device buffers via opal_accelerator.check_addr and pass the device allocator to gather_intra_do_this. For all reduction operations (allreduce, reduce, reduce_scatter, reduce_scatter_block, scan, exscan) always pass NULL, because device-side reduction via ompi_op_reduce is not yet supported.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
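The allocator-or-malloc dispatch described in this commit could look roughly like the sketch below. It is not the PR's code: sketch_allocator_t is a hypothetical, cut-down stand-in for mca_allocator_base_module_t, the SKETCH_ macro names are invented for illustration, and the counting allocator exists only so the behavior can be observed on a host-only machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for mca_allocator_base_module_t: only the two
 * entry points this sketch needs, plus a counter used by the demo. */
typedef struct sketch_allocator {
    void *(*alc_alloc)(struct sketch_allocator *self, size_t size, size_t align);
    void (*alc_free)(struct sketch_allocator *self, void *ptr);
    size_t live; /* outstanding allocations (demo only) */
} sketch_allocator_t;

/* Use the allocator when one is supplied, otherwise plain malloc. */
#define SKETCH_COLL_BASE_ALLOC(allocator, size)                         \
    ((NULL != (allocator))                                              \
         ? (allocator)->alc_alloc((allocator), (size), 0)               \
         : malloc((size)))

/* Mirror of the allocation macro for the release path. */
#define SKETCH_COLL_BASE_FREE(allocator, ptr)                           \
    do {                                                                \
        if (NULL != (allocator)) {                                      \
            (allocator)->alc_free((allocator), (ptr));                  \
        } else {                                                        \
            free((ptr));                                                \
        }                                                               \
    } while (0)

/* Demo allocator: malloc-backed, counts live blocks. A device allocator
 * would call the accelerator's mem_alloc/mem_release instead. */
static void *counting_alloc(sketch_allocator_t *self, size_t size, size_t align)
{
    (void) align;
    void *p = malloc(size);
    if (NULL != p) {
        self->live++;
    }
    return p;
}

static void counting_free(sketch_allocator_t *self, void *ptr)
{
    if (NULL != ptr) {
        self->live--;
    }
    free(ptr);
}

/* Stand-in for a collective algorithm that needs `span` bytes of scratch.
 * Passing allocator == NULL keeps the host-malloc behavior. */
static int sketch_algorithm(size_t span, sketch_allocator_t *allocator)
{
    assert(span > 0);
    char *scratch = SKETCH_COLL_BASE_ALLOC(allocator, span);
    if (NULL == scratch) {
        return -1;
    }
    /* ... the real algorithm would hand scratch to the PML here ... */
    SKETCH_COLL_BASE_FREE(allocator, scratch);
    return 0;
}
```

Callers that pass NULL, as coll/tuned, coll/basic and coll/acoll do in this commit, take the malloc/free branch, which is why no functional change is expected on the host path.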
…tracking

Replace the bucket allocator with the basic (first-fit + coalescing) allocator for per-device GPU scratch buffer pools. The basic allocator splits large free blocks to serve smaller requests and merges adjacent free blocks on release, giving good reuse across the varying scratch buffer sizes produced by collective algorithms.

The per-device allocator array is now heap-allocated lazily on the first call to opal_accelerator_base_get_device_allocator, sized to the actual device count from opal_accelerator.num_devices().

The basic allocator's finalize does not call seg_free, so GPU segments would otherwise leak. Each GPU segment allocated via seg_alloc is now recorded in a per-context opal_list_t. On framework close, the list is drained first, calling opal_accelerator.mem_release on every segment, before alc_finalize cleans up the allocator's internal structures.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
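The lazy per-device array and the segment tracking described in this commit can be illustrated with the following sketch. It assumes simplified stand-ins throughout: a hand-rolled singly linked list replaces opal_list_t, and sketch_num_devices, sketch_mem_alloc and sketch_mem_release are hypothetical host-side substitutes for the opal_accelerator entry points. The allocator itself is omitted; only the bookkeeping around it is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One tracked segment; stands in for an item on the per-context list. */
typedef struct seg_node {
    struct seg_node *next;
    void *addr;
} seg_node_t;

/* Per-device context: device id plus the segments handed to the allocator. */
typedef struct dev_ctx {
    int dev_id;
    seg_node_t *segments;
} dev_ctx_t;

static dev_ctx_t *contexts = NULL; /* allocated lazily on first use */
static int num_contexts = 0;
static int live_segments = 0;      /* demo counter for leak checking */

/* Hypothetical stand-ins for the accelerator framework calls. */
static int sketch_num_devices(void) { return 2; }

static void *sketch_mem_alloc(int dev_id, size_t size)
{
    (void) dev_id;
    void *p = malloc(size);
    if (NULL != p) {
        live_segments++;
    }
    return p;
}

static void sketch_mem_release(int dev_id, void *ptr)
{
    (void) dev_id;
    live_segments--;
    free(ptr);
}

/* Create the context array on the first call, sized to the device count. */
static dev_ctx_t *sketch_get_device_ctx(int dev_id)
{
    if (NULL == contexts) {
        int n = sketch_num_devices();
        contexts = calloc((size_t) n, sizeof(*contexts));
        if (NULL == contexts) {
            return NULL;
        }
        num_contexts = n;
        for (int i = 0; i < n; ++i) {
            contexts[i].dev_id = i;
        }
    }
    if (dev_id < 0 || dev_id >= num_contexts) {
        return NULL;
    }
    return &contexts[dev_id];
}

/* Segment-allocation callback: allocate device memory and record it. */
static void *sketch_seg_alloc(dev_ctx_t *ctx, size_t size)
{
    assert(NULL != ctx);
    seg_node_t *node = malloc(sizeof(*node));
    if (NULL == node) {
        return NULL;
    }
    node->addr = sketch_mem_alloc(ctx->dev_id, size);
    if (NULL == node->addr) {
        free(node);
        return NULL;
    }
    node->next = ctx->segments;
    ctx->segments = node;
    return node->addr;
}

/* Framework close: release every recorded segment first; the real code
 * would then finalize the allocator for that device. */
static void sketch_close(void)
{
    for (int i = 0; i < num_contexts; ++i) {
        seg_node_t *node = contexts[i].segments;
        while (NULL != node) {
            seg_node_t *next = node->next;
            sketch_mem_release(contexts[i].dev_id, node->addr);
            free(node);
            node = next;
        }
        contexts[i].segments = NULL;
    }
    free(contexts);
    contexts = NULL;
    num_contexts = 0;
}
```

Draining the list before finalizing matters because, per the commit message, the basic allocator's finalize does not return segments itself.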
Introduce ompi_op_gpu_session_t as the plumbing for future persistent GPU reduction kernels.

Replace the mca_allocator_base_module_t *allocator parameter with ompi_op_gpu_session_t *session in all six reduction algorithm families (allreduce, reduce, reduce_scatter, reduce_scatter_block, scan, exscan) and their coll/tuned dispatch functions.

Add COLL_BASE_REDUCE / COLL_SESSION_ALLOC / COLL_SESSION_FREE macros that dispatch to the GPU session when non-NULL, and fall back to ompi_op_reduce / malloc / free otherwise.

Add optional opc_session_begin/reduce/end function pointers to ompi_op_base_component_t for future GPU op components.

Phase 1 only: ompi_op_gpu_session_begin() is a stub that always returns NULL, so all code paths remain identical to before on host-only systems.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
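As an illustration of the session dispatch in this commit, the sketch below shows a reduce macro that uses the session when one exists and a host reduction otherwise. Everything here is an assumption made for the example: sketch_session_t is a hypothetical stand-in for ompi_op_gpu_session_t, host_reduce_sum replaces ompi_op_reduce with a fixed integer sum, and mock_session_reduce only counts calls so the dispatch can be observed without a GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for ompi_op_gpu_session_t. */
typedef struct sketch_session {
    void (*reduce)(struct sketch_session *s, const int *src, int *dst, size_t count);
    int calls; /* demo counter */
} sketch_session_t;

/* Host fallback; plays the role of ompi_op_reduce for an integer sum. */
static void host_reduce_sum(const int *src, int *dst, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        dst[i] += src[i];
    }
}

/* Dispatch to the session when non-NULL, else reduce on the host. */
#define SKETCH_COLL_BASE_REDUCE(session, src, dst, count)               \
    do {                                                                \
        if (NULL != (session)) {                                        \
            (session)->reduce((session), (src), (dst), (count));        \
        } else {                                                        \
            host_reduce_sum((src), (dst), (count));                     \
        }                                                               \
    } while (0)

/* Phase 1 behavior from the commit message: the stub always returns
 * NULL, so callers take the host path. */
static sketch_session_t *sketch_session_begin(void)
{
    return NULL;
}

/* Mock device path for the demo: same arithmetic, but records the call. */
static void mock_session_reduce(sketch_session_t *s, const int *src, int *dst, size_t count)
{
    s->calls++;
    host_reduce_sum(src, dst, count);
}

/* Reduce {1,2,3} into {10,20,30} and return the sum of the result. */
static int sketch_reduce_demo(sketch_session_t *session)
{
    const int src[3] = {1, 2, 3};
    int dst[3] = {10, 20, 30};
    SKETCH_COLL_BASE_REDUCE(session, src, dst, 3);
    assert(dst[0] == 11);
    return dst[0] + dst[1] + dst[2];
}
```

Because the begin stub returns NULL, every algorithm that takes a session parameter behaves as it did before the change until a GPU op component supplies a real session.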
Replace the single opc_session_end hook with three separate hooks (stop, restart, free) enabling a session freelist that avoids GPU resource reallocation between collective invocations.

Pool lifecycle:
- session_end() signals the persistent kernel to exit, synchronizes the stream, then pushes the session onto a flat dev_id-keyed freelist. GPU stream and managed memory remain allocated.
- session_begin() pops a matching dev_id entry from the pool and calls restart_fn(session, op, dtype) to reset state and relaunch the appropriate persistent kernel, with no cudaMalloc/hipMalloc overhead on the reuse path. If restart_fn returns false (no kernel for this op/dtype combination), the session is freed and NULL is returned.
- pool_finalize() drains the freelist at MPI_Finalize, calling free_fn (releases stream, managed memory, priv) then free(session) for each entry.

Update ompi/mca/op/cuda and ompi/mca/op/rocm components to implement the split session_stop / session_restart / session_free functions.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
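The pool lifecycle listed in this commit can be sketched as a small freelist. This is a host-only illustration under stated assumptions: sketch_session_t and all sketch_ functions are hypothetical, the op/dtype pair is reduced to a single op_supported flag, and the creation of a fresh session when the pool has no match is a guess, since the commit message only describes the reuse path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical session: freelist link, owning device, kernel state. */
typedef struct sketch_session {
    struct sketch_session *next;
    int dev_id;
    bool running;
} sketch_session_t;

static sketch_session_t *pool = NULL; /* flat freelist keyed by dev_id */
static int creates = 0;               /* demo counters */
static int frees = 0;

/* Stand-ins for the component's stop / restart / free hooks. */
static void sketch_stop(sketch_session_t *s) { s->running = false; }

static bool sketch_restart(sketch_session_t *s, bool op_supported)
{
    if (!op_supported) {
        return false; /* no kernel for this op/dtype combination */
    }
    s->running = true;
    return true;
}

static void sketch_free_resources(sketch_session_t *s)
{
    (void) s; /* the real hook would release stream, managed memory, priv */
    frees++;
}

/* Pop a session for dev_id if one is pooled, then (re)start it. */
static sketch_session_t *sketch_session_begin(int dev_id, bool op_supported)
{
    sketch_session_t *s = NULL;
    for (sketch_session_t **pp = &pool; NULL != *pp; pp = &(*pp)->next) {
        if ((*pp)->dev_id == dev_id) {
            s = *pp;
            *pp = s->next; /* unlink from the freelist */
            break;
        }
    }
    if (NULL == s) {
        /* Assumption: create a fresh session when nothing matches. */
        s = calloc(1, sizeof(*s));
        if (NULL == s) {
            return NULL;
        }
        s->dev_id = dev_id;
        creates++;
    }
    s->next = NULL;
    if (!sketch_restart(s, op_supported)) {
        sketch_free_resources(s);
        free(s);
        return NULL;
    }
    return s;
}

/* Stop the kernel and keep the session (and its resources) for reuse. */
static void sketch_session_end(sketch_session_t *s)
{
    assert(NULL != s);
    sketch_stop(s);
    s->next = pool;
    pool = s;
}

/* Drain the freelist: release resources, then the session itself. */
static void sketch_pool_finalize(void)
{
    while (NULL != pool) {
        sketch_session_t *s = pool;
        pool = s->next;
        sketch_free_resources(s);
        free(s);
    }
}
```

Keying the freelist on dev_id alone, rather than on the op/dtype pair, is what lets one pooled session be restarted for a different reduction on the same device.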
Order of popping local variables does matter, I guess. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
The CUDA/rocm queues inherit from the generic queue and provide callbacks. The generic code will release the queue once done. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Hello! The Git Commit Checker CI bot found a few problems with this PR:

b912eb4: coll/base: replace allocator with GPU op session i...

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Our component_open is never called. Bummer! Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>