Coll allocator op#27

Open
devreal wants to merge 13 commits into main from coll-allocator-op

devreal (Owner) commented Jun 8, 2026

No description provided.

devreal and others added 12 commits May 25, 2026 15:13
Thread an mca_allocator_base_module_t *allocator parameter through all
internal coll/base algorithm functions that allocate Pattern A scratch
buffers (sized via opal_datatype_span, gap-adjusted, passed to PML or
ompi_op_reduce). COLL_BASE_ALLOC/COLL_BASE_FREE macros dispatch to the
allocator when non-NULL, or fall back to malloc/free when NULL.
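
The NULL-or-allocator dispatch described above could look roughly like the sketch below. It is not the PR's actual macro text: `sketch_allocator_t` is a hypothetical stand-in for `mca_allocator_base_module_t`, reduced to the two entry points the sketch needs, and the `SKETCH_*` names and the counting allocator are invented for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for mca_allocator_base_module_t: only the two
 * entry points this sketch needs. */
typedef struct sketch_allocator {
    void *(*alc_alloc)(struct sketch_allocator *self, size_t size, size_t align);
    void (*alc_free)(struct sketch_allocator *self, void *ptr);
} sketch_allocator_t;

/* Use the allocator when one is supplied, otherwise plain malloc. */
#define SKETCH_ALLOC(allocator, size)                                   \
    ((NULL != (allocator))                                              \
         ? (allocator)->alc_alloc((allocator), (size), 0)               \
         : malloc(size))

/* Release through the same path that produced the buffer. */
#define SKETCH_FREE(allocator, ptr)                                     \
    do {                                                                \
        if (NULL != (allocator)) {                                      \
            (allocator)->alc_free((allocator), (ptr));                  \
        } else {                                                        \
            free(ptr);                                                  \
        }                                                               \
    } while (0)

/* Toy allocator that counts calls so the dispatch is observable. */
static int sketch_alloc_calls = 0;
static int sketch_free_calls = 0;

static void *sketch_counting_alloc(sketch_allocator_t *self, size_t size,
                                   size_t align)
{
    (void)self;
    (void)align;
    sketch_alloc_calls++;
    return malloc(size);
}

static void sketch_counting_free(sketch_allocator_t *self, void *ptr)
{
    (void)self;
    sketch_free_calls++;
    free(ptr);
}
```

A caller that passes NULL keeps the host malloc/free behavior; a caller that passes an allocator routes both the allocation and the release through it.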

All existing callers (coll/tuned, coll/basic, coll/acoll) pass NULL,
preserving current host-malloc behavior with no functional change.

opal/mca/accelerator/base: add opal_accelerator_base_get_device_allocator()
helper that returns a cached, per-device bucket allocator backed by
opal_accelerator.mem_alloc/mem_release. Created lazily on first use.

coll/tuned decision functions: for gather (scratch is data-movement only,
no ompi_op_reduce), detect device buffers via opal_accelerator.check_addr
and pass the device allocator to gather_intra_do_this. For all reduction
operations (allreduce, reduce, reduce_scatter, reduce_scatter_block, scan,
exscan) always pass NULL — device-side reduction via ompi_op_reduce is
not yet supported.
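
As a rough illustration of the decision rule above (not the coll/tuned code itself), the selection can be reduced to a small pure function. All names here are hypothetical, and the sketch assumes that a positive `check_addr` result means the buffer lives in device memory.

```c
#include <stddef.h>

/* Hypothetical placeholder for a per-device allocator handle. */
typedef struct pick_allocator {
    int dev_id;
} pick_allocator_t;

typedef enum {
    PICK_COLL_GATHER,
    PICK_COLL_ALLREDUCE,
    PICK_COLL_REDUCE,
    PICK_COLL_REDUCE_SCATTER,
    PICK_COLL_REDUCE_SCATTER_BLOCK,
    PICK_COLL_SCAN,
    PICK_COLL_EXSCAN
} pick_coll_t;

/* Return the allocator to hand to the algorithm, or NULL for host malloc.
 * check_addr_result stands in for the value returned by
 * opal_accelerator.check_addr (assumed: > 0 means device memory). */
pick_allocator_t *pick_scratch_allocator(pick_coll_t coll,
                                         int check_addr_result,
                                         pick_allocator_t *device_allocator)
{
    /* Reductions would run ompi_op_reduce on the scratch buffer, so they
     * stay on host memory for now. */
    if (PICK_COLL_GATHER != coll) {
        return NULL;
    }
    /* Host buffer or lookup error: keep the host path. */
    if (check_addr_result <= 0) {
        return NULL;
    }
    return device_allocator;
}
```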

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tracking

Replace the bucket allocator with the basic (first-fit + coalescing)
allocator for per-device GPU scratch buffer pools.  The basic allocator
splits large free blocks to serve smaller requests and merges adjacent
free blocks on release, giving good reuse across the varying scratch
buffer sizes produced by collective algorithms.
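
To make the split-and-merge behavior concrete, here is a minimal first-fit pool with coalescing over a single segment. It is an illustrative sketch only, not the code of Open MPI's basic allocator: it tracks offsets rather than pointers, uses a fixed-size table, and all `ff_*` names are hypothetical.

```c
#include <stddef.h>

#define FF_MAX_BLOCKS 16
#define FF_FAIL ((size_t)-1)

/* One free range [off, off + len) inside a single segment. */
typedef struct {
    size_t off;
    size_t len;
} ff_block_t;

typedef struct {
    ff_block_t free_list[FF_MAX_BLOCKS]; /* kept sorted by offset */
    int nfree;
} ff_pool_t;

static void ff_remove(ff_pool_t *p, int i)
{
    for (int j = i; j < p->nfree - 1; j++) {
        p->free_list[j] = p->free_list[j + 1];
    }
    p->nfree--;
}

void ff_init(ff_pool_t *p, size_t segment_len)
{
    p->free_list[0].off = 0;
    p->free_list[0].len = segment_len;
    p->nfree = 1;
}

/* First fit: take the first free block that is large enough and keep the
 * remainder on the free list. Returns an offset, or FF_FAIL. */
size_t ff_alloc(ff_pool_t *p, size_t len)
{
    for (int i = 0; i < p->nfree; i++) {
        ff_block_t *b = &p->free_list[i];
        if (b->len < len) {
            continue;
        }
        size_t off = b->off;
        b->off += len;
        b->len -= len;
        if (0 == b->len) {
            ff_remove(p, i);
        }
        return off;
    }
    return FF_FAIL;
}

/* Release [off, off + len) and merge it with adjacent free blocks.
 * Returns 0 on success, -1 if the table is full. */
int ff_free(ff_pool_t *p, size_t off, size_t len)
{
    int i = 0;
    while (i < p->nfree && p->free_list[i].off < off) {
        i++;
    }
    int prev = (i > 0 &&
                p->free_list[i - 1].off + p->free_list[i - 1].len == off);
    int next = (i < p->nfree && off + len == p->free_list[i].off);
    if (prev && next) {
        p->free_list[i - 1].len += len + p->free_list[i].len;
        ff_remove(p, i);
    } else if (prev) {
        p->free_list[i - 1].len += len;
    } else if (next) {
        p->free_list[i].off = off;
        p->free_list[i].len += len;
    } else {
        if (FF_MAX_BLOCKS == p->nfree) {
            return -1;
        }
        for (int j = p->nfree; j > i; j--) {
            p->free_list[j] = p->free_list[j - 1];
        }
        p->free_list[i].off = off;
        p->free_list[i].len = len;
        p->nfree++;
    }
    return 0;
}
```

Freeing the middle block of three adjacent allocations after the outer two have been freed collapses the free list back into one block, which is the reuse property the commit message relies on.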

The per-device allocator array is now heap-allocated lazily on the first
call to opal_accelerator_base_get_device_allocator, sized to the actual
device count from opal_accelerator.num_devices().
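
A lazily created, per-device cache along these lines might be sketched as follows. The `lazy_*` names are hypothetical, `lazy_num_devices()` merely stands in for the device-count query, and the sketch ignores thread safety, which a real implementation would have to address.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-device allocator handle. */
typedef struct lazy_allocator {
    int dev_id;
} lazy_allocator_t;

static lazy_allocator_t **lazy_cache = NULL; /* one slot per device */
static int lazy_cache_size = 0;
static int lazy_create_calls = 0;

/* Stand-in for the accelerator's device-count query (assumed value). */
static int lazy_num_devices(void)
{
    return 4;
}

/* Return the cached allocator for dev_id, creating the array and the
 * allocator on first use. Not thread-safe. */
lazy_allocator_t *lazy_get_device_allocator(int dev_id)
{
    if (NULL == lazy_cache) {
        int n = lazy_num_devices();
        if (n <= 0) {
            return NULL;
        }
        lazy_cache = calloc((size_t)n, sizeof(*lazy_cache));
        if (NULL == lazy_cache) {
            return NULL;
        }
        lazy_cache_size = n;
    }
    if (dev_id < 0 || dev_id >= lazy_cache_size) {
        return NULL;
    }
    if (NULL == lazy_cache[dev_id]) {
        lazy_allocator_t *a = malloc(sizeof(*a));
        if (NULL == a) {
            return NULL;
        }
        a->dev_id = dev_id;
        lazy_create_calls++;
        lazy_cache[dev_id] = a;
    }
    return lazy_cache[dev_id];
}
```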

The basic allocator's finalize does not call seg_free, so GPU segments
would otherwise leak.  Each GPU segment allocated via seg_alloc is now
recorded in a per-context opal_list_t.  On framework close, the list is
drained first — calling opal_accelerator.mem_release on every segment —
before alc_finalize cleans up the allocator's internal structures.
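
The segment tracking described above can be illustrated with a plain linked list. This is a sketch under stated assumptions: the PR uses an `opal_list_t`, whereas the code below uses a hand-rolled list, and `malloc`/`free` stand in for `opal_accelerator.mem_alloc`/`mem_release`. All `seg_*` names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* One recorded segment. */
typedef struct seg_node {
    void *segment;
    struct seg_node *next;
} seg_node_t;

/* Per-allocator context that remembers every segment handed out. */
typedef struct {
    seg_node_t *segments;
} seg_ctx_t;

/* Allocate a segment and record it so it can be released at shutdown. */
void *seg_ctx_alloc(seg_ctx_t *ctx, size_t size)
{
    seg_node_t *node = malloc(sizeof(*node));
    if (NULL == node) {
        return NULL;
    }
    void *seg = malloc(size); /* stand-in for the device allocation */
    if (NULL == seg) {
        free(node);
        return NULL;
    }
    node->segment = seg;
    node->next = ctx->segments;
    ctx->segments = node;
    return seg;
}

/* Release every recorded segment; call this before finalizing the
 * allocator itself. Returns the number of segments released. */
int seg_ctx_drain(seg_ctx_t *ctx)
{
    int released = 0;
    while (NULL != ctx->segments) {
        seg_node_t *node = ctx->segments;
        ctx->segments = node->next;
        free(node->segment); /* stand-in for the device release */
        free(node);
        released++;
    }
    return released;
}
```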

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduce ompi_op_gpu_session_t as the plumbing for future persistent
GPU reduction kernels. Replace the mca_allocator_base_module_t *allocator
parameter with ompi_op_gpu_session_t *session in all six reduction
algorithm families (allreduce, reduce, reduce_scatter, reduce_scatter_block,
scan, exscan) and their coll/tuned dispatch functions.

Add COLL_BASE_REDUCE / COLL_SESSION_ALLOC / COLL_SESSION_FREE macros that
dispatch to the GPU session when non-NULL, and fall back to ompi_op_reduce /
malloc / free otherwise. Add optional opc_session_begin/reduce/end function
pointers to ompi_op_base_component_t for future GPU op components.

Phase 1 only: ompi_op_gpu_session_begin() is a stub that always returns NULL,
so all code paths remain identical to before on host-only systems.
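
The session-or-host dispatch can be sketched in the same style as the allocator macros. The code below is illustrative only: `red_session_t` is a hypothetical stand-in for `ompi_op_gpu_session_t`, an integer sum stands in for `ompi_op_reduce`, and the fake device reduce exists only to make the dispatch observable.

```c
#include <stddef.h>

/* Hypothetical stand-in for ompi_op_gpu_session_t. */
typedef struct red_session {
    void (*reduce)(struct red_session *s, const int *src, int *dst,
                   size_t count);
    int reduce_calls;
} red_session_t;

/* Host fallback: dst[i] = src[i] + dst[i]. */
static void red_host_sum(const int *src, int *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] += src[i];
    }
}

/* Fake "device" reduce that counts calls and reuses the host loop. */
static void red_fake_device_sum(red_session_t *s, const int *src, int *dst,
                                size_t count)
{
    s->reduce_calls++;
    red_host_sum(src, dst, count);
}

/* Dispatch to the session when one exists, else reduce on the host. */
#define RED_REDUCE(session, src, dst, count)                            \
    do {                                                                \
        if (NULL != (session)) {                                        \
            (session)->reduce((session), (src), (dst), (count));        \
        } else {                                                        \
            red_host_sum((src), (dst), (count));                        \
        }                                                               \
    } while (0)

/* Phase 1 behavior: no session is ever created. */
red_session_t *red_session_begin(void)
{
    return NULL;
}
```

With the stub returning NULL, every call through the macro takes the host branch, which matches the claim that host-only behavior is unchanged.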

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the single opc_session_end hook with three separate hooks
(stop, restart, free) enabling a session freelist that avoids GPU
resource reallocation between collective invocations.

Pool lifecycle:
- session_end() signals the persistent kernel to exit, synchronizes
  the stream, then pushes the session onto a flat dev_id-keyed freelist.
  GPU stream and managed memory remain allocated.
- session_begin() pops a matching dev_id entry from the pool and calls
  restart_fn(session, op, dtype) to reset state and relaunch the
  appropriate persistent kernel — no cudaMalloc/hipMalloc overhead on
  the reuse path.  If restart_fn returns false (no kernel for this
  op/dtype combination), the session is freed and NULL is returned.
- pool_finalize() drains the freelist at MPI_Finalize, calling free_fn
  (releases stream, managed memory, priv) then free(session) for each
  entry.
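
The pool lifecycle above can be sketched as a singly linked freelist keyed by device id. This is a simplified illustration, not the PR's implementation: the `pool_*` names are hypothetical, the restart hook takes only an op code (no datatype), a negative op code is used to simulate "no kernel available", and creating a fresh session on a pool miss is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct pool_session {
    int dev_id;
    int op;
    int restarts;
    struct pool_session *next;
} pool_session_t;

static pool_session_t *pool_head = NULL;
static int pool_created = 0;
static int pool_freed = 0;

/* Hypothetical restart hook: negative op codes have no kernel. */
static bool pool_restart(pool_session_t *s, int op)
{
    if (op < 0) {
        return false;
    }
    s->op = op;
    s->restarts++;
    return true;
}

/* Hypothetical free hook: would release stream and device memory. */
static void pool_free_session(pool_session_t *s)
{
    pool_freed++;
    free(s);
}

/* Pop a pooled session for dev_id (or create one) and restart it. */
pool_session_t *pool_session_begin(int dev_id, int op)
{
    pool_session_t **link = &pool_head;
    while (NULL != *link && (*link)->dev_id != dev_id) {
        link = &(*link)->next;
    }
    pool_session_t *s = *link;
    if (NULL != s) {
        *link = s->next; /* unlink from the freelist */
        s->next = NULL;
    } else {
        s = calloc(1, sizeof(*s));
        if (NULL == s) {
            return NULL;
        }
        s->dev_id = dev_id;
        pool_created++;
    }
    if (!pool_restart(s, op)) {
        pool_free_session(s);
        return NULL;
    }
    return s;
}

/* Return the session to the pool; its resources stay allocated. */
void pool_session_end(pool_session_t *s)
{
    s->next = pool_head;
    pool_head = s;
}

/* Drain the freelist at shutdown. */
void pool_finalize(void)
{
    while (NULL != pool_head) {
        pool_session_t *s = pool_head;
        pool_head = s->next;
        pool_free_session(s);
    }
}
```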

Update ompi/mca/op/cuda and ompi/mca/op/rocm components to implement
the split session_stop / session_restart / session_free functions.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Order of popping local variables does matter, I guess.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
The CUDA and ROCm queues inherit from the generic queue and provide callbacks.
The generic code releases the queue once it is done.
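
The commit message gives no detail on the queue types, so the following is only a generic sketch of the C "inheritance plus callbacks" pattern it alludes to. Every name (`gq_*`) is hypothetical, and the device queue is a plain host-side stand-in rather than anything CUDA- or ROCm-specific.

```c
#include <stddef.h>
#include <stdlib.h>

/* Generic queue: the base "class" holds only the callbacks. */
typedef struct gq_queue {
    void (*submit)(struct gq_queue *q, int value);
    void (*destroy)(struct gq_queue *q);
} gq_queue_t;

/* Device-specific queue: embeds the base as its first member so a
 * gq_queue_t pointer can be cast back to the derived type. */
typedef struct {
    gq_queue_t super;
    int submitted;
} gq_dev_queue_t;

static int gq_destroyed = 0;
static int gq_last_total = 0;

static void gq_dev_submit(gq_queue_t *q, int value)
{
    ((gq_dev_queue_t *)q)->submitted += value;
}

static void gq_dev_destroy(gq_queue_t *q)
{
    gq_last_total = ((gq_dev_queue_t *)q)->submitted;
    gq_destroyed++;
    free(q);
}

gq_queue_t *gq_dev_queue_create(void)
{
    gq_dev_queue_t *q = calloc(1, sizeof(*q));
    if (NULL == q) {
        return NULL;
    }
    q->super.submit = gq_dev_submit;
    q->super.destroy = gq_dev_destroy;
    return &q->super;
}

/* Generic code: drives the queue through the callbacks only and
 * releases it once done, so callers must not touch q afterwards. */
void gq_run_and_release(gq_queue_t *q, const int *values, int n)
{
    for (int i = 0; i < n; i++) {
        q->submit(q, values[i]);
    }
    q->destroy(q);
}
```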

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
github-actions Bot commented Jun 8, 2026

Hello! The Git Commit Checker CI bot found a few problems with this PR:

b912eb4: coll/base: replace allocator with GPU op session i...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

Our component_open is never called. Bummer!

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
github-actions Bot commented Jun 8, 2026

Hello! The Git Commit Checker CI bot found a few problems with this PR:

b912eb4: coll/base: replace allocator with GPU op session i...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
