Add IReduce! and IAllreduce! #827

Open

Keluaa wants to merge 10 commits into JuliaParallel:master from Keluaa:nonblocking_reduction
Conversation

Keluaa (Contributor) commented Mar 13, 2024

Added basic nonblocking reductions, along with some tests.
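A minimal usage sketch of the new functions, assuming their signatures mirror the existing blocking `MPI.Allreduce!` (the buffer names and array sizes below are illustrative, not taken from the PR):

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

send = fill(Float64(MPI.Comm_rank(comm)), 4)
recv = similar(send)

# Start the non-blocking reduction; IAllreduce! is assumed to wrap
# MPI_Iallreduce and return an MPI.Request, like other I* collectives.
req = MPI.IAllreduce!(send, recv, +, comm)

# ... overlap other computation here ...

MPI.Wait(req)  # recv now holds the element-wise sum across all ranks

MPI.Finalize()
```

The point of the non-blocking variant is the overlap: work placed between the call and `MPI.Wait` can proceed while the reduction is in flight.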

vchuravy force-pushed the nonblocking_reduction branch from 36530c4 to fafe00e on June 23, 2024
Keluaa (Contributor, Author) commented Jun 24, 2024

Mmm... only the CUDA tests fail. I feared this was a Julia-side issue, but no, it is simply that MPI_Iallreduce is not CUDA-aware in OpenMPI.

This is a known issue (open-mpi/ompi#9845), and it also happens with MPI_Ireduce (open-mpi/ompi#12045).
I was also able to reproduce it in C.

It is very surprising to me that ROCm support apparently covers all non-blocking ops, but CUDA does not.

What would be the best course of action? Merge anyway and let users stumble into an unhelpful segfault? Or would a warning (when OpenMPI is used and CUDA is loaded) be enough?

vchuravy (Member) commented

Yeah, we don't currently have a good mechanism to declare which operations can and cannot take GPU memory.
It seems even worse for OpenMPI, since the set of supported operations depends on whether or not UCX is used.

We certainly need to branch in the tests, but I don't think we have prior art for this.

@simonbyrne any ideas?

simonbyrne (Member) commented Jun 25, 2024

Unfortunately it is probably implementation (and configuration) dependent, so I don't think we can provide a complete solution. My best suggestion would be to make it so the test suite can soft fail and report which operations are supported.

If you want something easy that does work, the simplest option is to use the regular blocking operation spawned on a separate thread:

task = Threads.@spawn MPI.Allreduce(...)
# other work
wait(task)

If your other work involves MPI ops, you will also need to call MPI.Init(threadlevel=:multiple).

PetrKryslUCSD mentioned this pull request on Nov 18, 2024
Keluaa force-pushed the nonblocking_reduction branch from fafe00e to 311629f on November 18, 2024
Keluaa (Contributor, Author) commented Nov 18, 2024

Thank you @PetrKryslUCSD for reminding me about this PR.

My original solution to detect support for IAllreduce was unsatisfactory, so I put it aside and then forgot about it.

The main problem is that MPI raises a SIGSEGV when IAllreduce/Ireduce is called with a CUDA array, so we cannot use threads or other Julia tricks to detect the issue.
I also don't know if it is possible to disable the tests when OpenMPI + UCX + CUDA is used, as I do not know how to detect this combination.

My proposed solution is to call those problematic functions at the beginning of the test suite in a separate process, and set env vars appropriately when they work.
The other tests can then check those env vars anywhere else in the test suite.

Since some RMA functions and other non-blocking collectives suffer from the same problem, it would be easy to extend this mechanism to them if we were to add them to MPI.jl.
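The subprocess probe described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the script body, the `mpiexec` invocation, and the env var name `JULIA_MPI_TEST_IALLREDUCE_CUDA` are all assumptions. The key property is that a SIGSEGV from a non-CUDA-aware MPI_Iallreduce only kills the child process, so the parent can record the failure instead of crashing:

```julia
using MPI

# Minimal child program that exercises the problematic operation on a GPU
# buffer; it exits cleanly only if IAllreduce! works with CUDA arrays.
probe_script = """
using MPI, CUDA
MPI.Init()
buf = CUDA.ones(Float64, 4)
req = MPI.IAllreduce!(buf, +, MPI.COMM_WORLD)
MPI.Wait(req)
MPI.Finalize()
"""

# Run the probe in a separate MPI job; ignorestatus prevents a crash in the
# child from throwing in the parent test process.
cmd = `$(mpiexec()) -n 2 $(Base.julia_cmd()) -e $probe_script`
proc = run(ignorestatus(cmd))

# Record the result for the rest of the test suite to consult.
ENV["JULIA_MPI_TEST_IALLREDUCE_CUDA"] = success(proc) ? "true" : "false"
```

Individual test files can then skip the non-blocking GPU reduction tests whenever the recorded env var is `"false"`, which is the automatic-exclusion behaviour discussed below.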

Keluaa (Contributor, Author) commented Nov 18, 2024

The CUDA and AMDGPU tests on Julia 1.10 are all passing; however, it detects IAllreduce and IReduce as non-functional for both, which seems correct judging from the segfault message, even though it was previously working with AMDGPU a few months ago.
On my local cluster, with ROCm 6.0.2, OpenMPI 4.1.2 and a Mi250 GPU, those two functions are working correctly.
I don't know where the issue could come from...

The tests failing with Julia nightly on CPU and for Julia 1.6-1.8 with CUDA seem unrelated to this PR.

Keluaa (Contributor, Author) commented Nov 21, 2024

I just realized that tests on AMDGPU have JULIA_MPI_TEST_EXCLUDE set to "test_allreduce.jl,test_reduce.jl,test_scan.jl", therefore those tests simply never ran from the start...

Is it related to the fact that those operations are unsupported on ROCm? If so, we could detect this with my method and exclude those tests automatically.

johnomotani commented Feb 10, 2026

These non-blocking operations (also #882) would still be nice to have. Is there any chance of getting this merged? Thanks!

PS I'm happy to try and help if there's still work to do, but unfortunately I'm not familiar with CUDA which seems to have been the problem...

johnomotani commented
For consistency with the MPI naming, should these functions be capitalised as Ireduce! and Iallreduce!?

Keluaa force-pushed the nonblocking_reduction branch from 2c74cdf to 275d21d on February 11, 2026
Keluaa (Contributor, Author) commented Feb 11, 2026

I went ahead and enabled testing reductions on AMDGPU. I'm not sure whether it will work, as I currently don't have access to my local cluster.

Keluaa (Contributor, Author) commented Feb 11, 2026

Support checks do work, but MPI_Reduce and MPI_Allreduce do not work on AMDGPU either.
Now support for both blocking and non-blocking reductions is properly checked before testing.

Keluaa force-pushed the nonblocking_reduction branch from 0825141 to 0b4d0ec on February 11, 2026
Keluaa force-pushed the nonblocking_reduction branch from 0b4d0ec to c594d25 on February 11, 2026
Keluaa (Contributor, Author) commented Mar 6, 2026

I merged the other contributions for Ibcast! of #882 as it also tried to implement Iallreduce!.

Now all CPU test failures are related to MPI.Gather!, which is unrelated to this PR.
The AMDGPU tests for Julia 1.10 timed out, although a previous CI run passed without failures.
All GPU tests passed for Julia 1.11.

@vchuravy is it ok to merge, or does this PR need more work?
I couldn't reproduce the AMDGPU failure on the cluster I have access to, therefore I doubt I can debug the problem.
If this is merged then #882 can be closed as well.
