Add Ray distributed training support and update requirements #42
Open
bbkx226 wants to merge 1 commit into codefuse-ai:main
Conversation
Author
Resolves #11
This pull request introduces Ray-based distributed training support to the project, enabling scalable multi-node and multi-GPU training with improved fault tolerance and checkpointing. The core addition is the new ray_run.py script, which implements a Ray Train-compatible training loop and integrates with Ray Data for efficient data loading. The documentation and requirements have been updated accordingly, and a new argument for gradient accumulation has been added.

**Ray Distributed Training Integration:**
- Added ray_run.py, implementing distributed training using Ray Train and Torch DDP, with support for multi-node and multi-GPU setups, checkpointing, and integration with Ray Data for efficient batch loading. The training loop mirrors the existing accelerate-based approach but is adapted for Ray's distributed environment (see the sketch after this list).
- Updated requirements.txt to include Ray Train and related dependencies, with version constraints and platform-specific installation for flash-attn (an illustrative snippet also follows).
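As a rough orientation, the following is a minimal, self-contained sketch of what a Ray Train + Torch DDP loop with Ray Data sharding and gradient accumulation generally looks like. The tiny linear model, toy dataset, and hyperparameters here are placeholders, not the actual contents of ray_run.py:

```python
import ray
import ray.data
import ray.train
import torch
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model


def train_func(config):
    # Ray Train wraps the model for DDP and moves it to the assigned device.
    model = prepare_model(torch.nn.Linear(1, 1))  # stand-in for the real model
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    accum = config["gradient_accumulation_steps"]

    # Each worker iterates over its own shard of the Ray Dataset.
    shard = ray.train.get_dataset_shard("train")
    for epoch in range(config["epochs"]):
        for step, batch in enumerate(shard.iter_torch_batches(batch_size=4)):
            pred = model(batch["x"].float().unsqueeze(-1)).squeeze(-1)
            loss = torch.nn.functional.mse_loss(pred, batch["y"].float())
            # Scale the loss so accumulated micro-batches average correctly.
            (loss / accum).backward()
            if (step + 1) % accum == 0:
                optimizer.step()
                optimizer.zero_grad()
        # Report per-epoch metrics back to Ray Train (checkpoints can be attached here too).
        ray.train.report({"epoch": epoch, "loss": loss.item()})


if __name__ == "__main__":
    trainer = TorchTrainer(
        train_func,
        train_loop_config={"lr": 1e-3, "epochs": 1, "gradient_accumulation_steps": 4},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
        run_config=RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=2)),
        datasets={"train": ray.data.from_items([{"x": i, "y": 2 * i} for i in range(64)])},
    )
    print(trainer.fit().metrics)
```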
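For the dependency side, a PEP 508 environment marker is the usual way to restrict flash-attn to Linux; the entries below are illustrative only, not the exact pins introduced in this PR:

```text
ray[train,data]>=2.9
# flash-attn only builds on Linux with CUDA, so gate it with a platform marker
flash-attn>=2.3; platform_system == "Linux"
```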
**Documentation Updates:**

- Updated README.md with a new section on Ray distributed training, including installation steps, usage instructions, and notes on checkpointing and data compatibility (a generic usage sketch follows).
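The commands below are a generic sketch of how a Ray cluster is started and the training script launched from the head node; the exact steps and flags live in the updated README.md and may differ:

```bash
# On the head node
ray start --head --port=6379

# On each worker node, joining the head node
ray start --address=<head_node_ip>:6379

# Launch training from the head node (the flag shown is illustrative)
python ray_run.py --gradient_accumulation_steps 8
```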
**Training Configuration Improvements:**

- Added a gradient_accumulation_steps argument to the Args class in arguments.py to support gradient accumulation in both the Ray and accelerate pipelines (see the sketch below).
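Illustrative only: how such a field is typically added to a dataclass-style Args container. The real Args class in arguments.py will have different surrounding fields and defaults:

```python
from dataclasses import dataclass


@dataclass
class Args:
    # Existing-style hyperparameters (illustrative placeholders)
    learning_rate: float = 1e-5
    per_device_batch_size: int = 4
    # New in this PR: number of micro-batches to accumulate before each
    # optimizer step, shared by the Ray and accelerate training paths.
    gradient_accumulation_steps: int = 1


args = Args(gradient_accumulation_steps=8)
# Effective global batch size =
#   per_device_batch_size * num_workers * gradient_accumulation_steps
```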