
Add a specific device_id when calling torch.distributed.init_process_group to avoid NCCL communication errors. #76

Open

gongweibao wants to merge 1 commit into FoundationVision:main from gongweibao:fixncclinit

Conversation

@gongweibao commented Mar 5, 2025

Without a device_id, the first collective call can emit a warning such as:

using GPU  to perform barrier as devices used by this process are currently unknown. 
This can potentially cause a hang if this rank to GPU mapping is incorrect. 
Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id
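
The fix amounts to binding the process group to this rank's GPU at initialization. Below is a minimal sketch of that pattern (assumptions: a torchrun-style launcher that sets LOCAL_RANK, and a PyTorch 2.x init_process_group that accepts device_id; this is illustrative, not the PR's exact diff):

```python
import os

import torch
import torch.distributed as dist

# LOCAL_RANK is set by launchers such as torchrun (assumed here).
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")
torch.cuda.set_device(device)

# Passing device_id tells NCCL the rank-to-GPU mapping up front,
# instead of letting it guess at the first collective call; this
# avoids the warning above and the potential barrier() hang.
dist.init_process_group(backend="nccl", device_id=device)

dist.barrier()  # now runs on the bound device
dist.destroy_process_group()
```

Launched with, e.g., `torchrun --nproc_per_node=2 script.py`.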

@ZeroRF commented Jun 11, 2025

It works! I had been confused by this NCCL error for a long time.

