Coll allocator op by devreal · Pull Request #27 · devreal/ompi

devreal · 2026-06-08T15:59:09Z

No description provided.

Thread an mca_allocator_base_module_t *allocator parameter through all internal coll/base algorithm functions that allocate Pattern A scratch buffers (sized via opal_datatype_span, gap-adjusted, passed to PML or ompi_op_reduce).

COLL_BASE_ALLOC/COLL_BASE_FREE macros dispatch to the allocator when non-NULL, or fall back to malloc/free when NULL. All existing callers (coll/tuned, coll/basic, coll/acoll) pass NULL, preserving current host-malloc behavior with no functional change.

opal/mca/accelerator/base: add opal_accelerator_base_get_device_allocator() helper that returns a cached, per-device bucket allocator backed by opal_accelerator.mem_alloc/mem_release. Created lazily on first use.

coll/tuned decision functions: for gather (scratch is data-movement only, no ompi_op_reduce), detect device buffers via opal_accelerator.check_addr and pass the device allocator to gather_intra_do_this. For all reduction operations (allreduce, reduce, reduce_scatter, reduce_scatter_block, scan, exscan) always pass NULL, because device-side reduction via ompi_op_reduce is not yet supported.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
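The NULL-or-allocator dispatch described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: `sketch_allocator_t`, `SKETCH_ALLOC`, `SKETCH_FREE`, and the counting allocator are hypothetical stand-ins for `mca_allocator_base_module_t` and the `COLL_BASE_ALLOC`/`COLL_BASE_FREE` macros, and host `malloc` plays the role of device memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an allocator module: only the two hooks
 * this sketch needs, plus a counter for illustration. */
typedef struct sketch_allocator {
    void *(*alloc)(struct sketch_allocator *self, size_t size);
    void (*release)(struct sketch_allocator *self, void *ptr);
    size_t live; /* outstanding allocations */
} sketch_allocator_t;

/* Dispatch to the allocator when non-NULL, else fall back to malloc/free.
 * The first argument must be a typed pointer expression, not a bare NULL. */
#define SKETCH_ALLOC(allocator, size)                         \
    ((NULL != (allocator))                                    \
         ? (allocator)->alloc((allocator), (size))            \
         : malloc(size))

#define SKETCH_FREE(allocator, ptr)                           \
    do {                                                      \
        if (NULL != (allocator)) {                            \
            (allocator)->release((allocator), (ptr));         \
        } else {                                              \
            free(ptr);                                        \
        }                                                     \
    } while (0)

/* Counting allocator backed by malloc, standing in for a device allocator. */
static void *counting_alloc(sketch_allocator_t *self, size_t size)
{
    self->live++;
    return malloc(size);
}

static void counting_release(sketch_allocator_t *self, void *ptr)
{
    assert(self->live > 0);
    self->live--;
    free(ptr);
}

/* An algorithm function with the allocator threaded through as a parameter.
 * Callers that pass NULL keep plain host-malloc behavior. */
static int sketch_algorithm(sketch_allocator_t *allocator, size_t scratch_bytes)
{
    char *scratch = SKETCH_ALLOC(allocator, scratch_bytes);
    if (NULL == scratch) {
        return -1;
    }
    scratch[0] = 0; /* placeholder for the real data movement */
    SKETCH_FREE(allocator, scratch);
    return 0;
}
```

Existing callers would correspond to `sketch_algorithm(NULL, n)`, while a caller that detected a device buffer would pass a pointer to its allocator instead.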

…tracking

Replace the bucket allocator with the basic (first-fit + coalescing) allocator for per-device GPU scratch buffer pools. The basic allocator splits large free blocks to serve smaller requests and merges adjacent free blocks on release, giving good reuse across the varying scratch buffer sizes produced by collective algorithms.

The per-device allocator array is now heap-allocated lazily on the first call to opal_accelerator_base_get_device_allocator, sized to the actual device count from opal_accelerator.num_devices().

The basic allocator's finalize does not call seg_free, so GPU segments would otherwise leak. Each GPU segment allocated via seg_alloc is now recorded in a per-context opal_list_t. On framework close, the list is drained first, calling opal_accelerator.mem_release on every segment, before alc_finalize cleans up the allocator's internal structures.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
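The segment-tracking idea in this commit can be sketched as below. This is an illustration under stated assumptions, not the PR's code: a hand-rolled singly linked list stands in for `opal_list_t`, host `malloc`/`free` stand in for `opal_accelerator.mem_alloc`/`mem_release`, and all `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One record per segment handed to the allocator. */
typedef struct sketch_segment {
    struct sketch_segment *next;
    void *base;
} sketch_segment_t;

/* Per-device context: the list of segments plus a counter for illustration. */
typedef struct sketch_context {
    sketch_segment_t *segments;
    size_t released;
} sketch_context_t;

/* Segment-allocation callback: obtain memory and remember it, because the
 * allocator's own finalize will not hand it back. */
static void *sketch_seg_alloc(sketch_context_t *ctx, size_t *size)
{
    assert(NULL != ctx && NULL != size && *size > 0);
    sketch_segment_t *seg = malloc(sizeof(*seg));
    if (NULL == seg) {
        return NULL;
    }
    seg->base = malloc(*size); /* stand-in for the device allocation */
    if (NULL == seg->base) {
        free(seg);
        return NULL;
    }
    seg->next = ctx->segments;
    ctx->segments = seg;
    return seg->base;
}

/* On close: drain the list first, releasing every segment, and only then
 * let the allocator tear down its internal structures. */
static void sketch_context_close(sketch_context_t *ctx)
{
    while (NULL != ctx->segments) {
        sketch_segment_t *seg = ctx->segments;
        ctx->segments = seg->next;
        free(seg->base); /* stand-in for the device release */
        free(seg);
        ctx->released++;
    }
    /* the allocator's finalize hook would be called here */
}
```

Recording segments outside the allocator keeps the leak fix independent of the allocator's internals, at the cost of one small host-side record per segment.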

Introduce ompi_op_gpu_session_t as the plumbing for future persistent GPU reduction kernels.

Replace the mca_allocator_base_module_t *allocator parameter with ompi_op_gpu_session_t *session in all six reduction algorithm families (allreduce, reduce, reduce_scatter, reduce_scatter_block, scan, exscan) and their coll/tuned dispatch functions.

Add COLL_BASE_REDUCE / COLL_SESSION_ALLOC / COLL_SESSION_FREE macros that dispatch to the GPU session when non-NULL, and fall back to ompi_op_reduce / malloc / free otherwise.

Add optional opc_session_begin/reduce/end function pointers to ompi_op_base_component_t for future GPU op components.

Phase 1 only: ompi_op_gpu_session_begin() is a stub that always returns NULL, so all code paths remain identical to before on host-only systems.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
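A rough sketch of the session dispatch with a Phase 1 style stub follows. It assumes a much simpler interface than the real one: `sketch_session_t`, `SKETCH_REDUCE`, and `host_sum` are hypothetical, and an integer sum stands in for `ompi_op_reduce`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical session type carrying a device-side reduce hook. */
typedef struct sketch_session {
    void (*reduce)(struct sketch_session *self, const int *src, int *dst,
                   size_t count);
} sketch_session_t;

/* Phase 1 style stub: no session is ever handed out, so every caller
 * ends up on the host path. */
static sketch_session_t *sketch_session_begin(void)
{
    return NULL;
}

/* Host fallback, standing in for the host reduction: dst[i] += src[i]. */
static void host_sum(const int *src, int *dst, size_t count)
{
    assert(0 == count || (NULL != src && NULL != dst));
    for (size_t i = 0; i < count; i++) {
        dst[i] += src[i];
    }
}

/* Dispatch to the session when non-NULL, else reduce on the host. */
#define SKETCH_REDUCE(session, src, dst, count)                     \
    do {                                                            \
        if (NULL != (session)) {                                    \
            (session)->reduce((session), (src), (dst), (count));    \
        } else {                                                    \
            host_sum((src), (dst), (count));                        \
        }                                                           \
    } while (0)

/* One reduction step of an algorithm with the session threaded through. */
static void sketch_reduce_step(sketch_session_t *session, const int *src,
                               int *dst, size_t count)
{
    SKETCH_REDUCE(session, src, dst, count);
}
```

Because the stub returns NULL, `sketch_reduce_step(sketch_session_begin(), src, dst, n)` behaves exactly like a direct host reduction, which mirrors the "no functional change" claim of the commit.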

Replace the single opc_session_end hook with three separate hooks (stop, restart, free) enabling a session freelist that avoids GPU resource reallocation between collective invocations.

Pool lifecycle:
- session_end() signals the persistent kernel to exit, synchronizes the stream, then pushes the session onto a flat dev_id-keyed freelist. GPU stream and managed memory remain allocated.
- session_begin() pops a matching dev_id entry from the pool and calls restart_fn(session, op, dtype) to reset state and relaunch the appropriate persistent kernel, with no cudaMalloc/hipMalloc overhead on the reuse path. If restart_fn returns false (no kernel for this op/dtype combination), the session is freed and NULL is returned.
- pool_finalize() drains the freelist at MPI_Finalize, calling free_fn (releases stream, managed memory, priv) then free(session) for each entry.

Update ompi/mca/op/cuda and ompi/mca/op/rocm components to implement the split session_stop / session_restart / session_free functions.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
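The pool lifecycle above can be sketched with a plain linked freelist. Everything here is hypothetical (`sketch_pool_t`, `sketch_restart`, the integer `op` code, and the way a fresh session is created on a pool miss are assumptions), and no GPU resources are involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct sketch_session {
    struct sketch_session *next;
    int dev_id;
    int restarts; /* how often this session was reused from the pool */
} sketch_session_t;

typedef struct sketch_pool {
    sketch_session_t *head; /* flat freelist, searched by dev_id */
} sketch_pool_t;

/* Stand-in for restart_fn: a negative op means "no kernel for this op". */
static bool sketch_restart(sketch_session_t *s, int op)
{
    if (op < 0) {
        return false;
    }
    s->restarts++;
    return true;
}

/* Pop a pooled session for dev_id and restart it; on a miss, create one. */
static sketch_session_t *sketch_session_begin(sketch_pool_t *pool, int dev_id,
                                              int op)
{
    sketch_session_t **link = &pool->head;
    while (NULL != *link && (*link)->dev_id != dev_id) {
        link = &(*link)->next;
    }
    sketch_session_t *s = *link;
    if (NULL != s) {
        *link = s->next; /* unlink from the freelist */
        s->next = NULL;
        if (!sketch_restart(s, op)) {
            free(s); /* unusable for this op: release it, return NULL */
            return NULL;
        }
        return s;
    }
    if (op < 0) {
        return NULL;
    }
    s = calloc(1, sizeof(*s)); /* assumed slow path: fresh allocation */
    if (NULL != s) {
        s->dev_id = dev_id;
    }
    return s;
}

/* Stop the session and keep its resources for the next collective. */
static void sketch_session_end(sketch_pool_t *pool, sketch_session_t *s)
{
    assert(NULL != s);
    s->next = pool->head;
    pool->head = s;
}

/* Drain the freelist at shutdown; returns the number of sessions freed. */
static size_t sketch_pool_finalize(sketch_pool_t *pool)
{
    size_t freed = 0;
    while (NULL != pool->head) {
        sketch_session_t *s = pool->head;
        pool->head = s->next;
        free(s);
        freed++;
    }
    return freed;
}
```

A begin/end/begin sequence on the same device returns the same session object the second time, which is the reuse path the commit is optimizing.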

Order of popping local variables does matter, I guess. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

The CUDA/rocm queues inherit from the generic queue and provide callbacks. The generic code will release the queue once done. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

github-actions · 2026-06-08T15:59:44Z

Hello! The Git Commit Checker CI bot found a few problems with this PR:

b912eb4: coll/base: replace allocator with GPU op session i...

check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

Our component_open is never called. Bummer! Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

github-actions · 2026-06-08T17:52:06Z

Hello! The Git Commit Checker CI bot found a few problems with this PR:

b912eb4: coll/base: replace allocator with GPU op session i...

check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

devreal and others added 12 commits May 25, 2026 15:13

op/rocm: fix build integration

968b60d

Order of popping local variables does matter, I guess. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

coll/tuned: Allocate sessions and allocators for coll/base calls

e59d22e

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

Cache expensive allocation parts of a GPU session

be44a1e

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

Fix compile issues

d8ccda7

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

Introduce inheritance for op queues

2f34ecf

The CUDA/rocm queues inherit from the generic queue and provide callbacks. The generic code will release the queue once done. Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

Make op/cuda and op/rocm dso-by-default

65e65ff

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

Fix NVCC include paths

5733226

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

Fix CUDA build, static initializers not supported

24af1d2

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

Register launcher callbacks in init_query

ce49af1

Our component_open is never called. Bummer! Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>

devreal wants to merge 13 commits into main from coll-allocator-op
