feat: Add Ray distributed training framework integration (#11) #19
Open
SeasonPilot wants to merge 1 commit into codefuse-ai:main
Implements Ray Train support for scalable, fault-tolerant distributed training while maintaining 100% backward compatibility with existing Accelerate workflow.

Core Changes:
- Add RayTrainConfig configuration class with YAML loading support
- Implement DistributedContext abstraction layer for dual backend support
- Create ray_train.py main training script with RayF2LLMTrainer wrapper
- Add run_ray_train.py launcher script with CLI interface
- Update model.py device handling for framework compatibility
- Add unit tests for DistributedContext with both backends

Features:
- Seamless scaling from single-node to multi-node clusters
- Automatic fault tolerance and recovery mechanisms
- DeepSpeed ZeRO-2 integration for memory optimization
- HuggingFace checkpoint format compatibility
- Unified distributed operations API

Technical Details:
- Backend auto-detection (Ray Train vs Accelerate)
- PyTorch DDP with NCCL backend for GPU communication
- Flash Attention 2 support
- Multi-dataset weighted sampling (50+ datasets)
- TensorBoard logging integration

Testing:
- Unit tests for DistributedContext dual backend functionality
- Test coverage for gather, sync, prepare, unwrap operations
- Mock-based testing for both Ray and Accelerate backends

Dependencies:
- ray[train]>=2.30.0
- pyyaml>=6.0
- transformers>=4.51.0
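For reviewers who want a mental model of the DistributedContext layer and the backend auto-detection listed above, here is a minimal sketch of the general pattern. It is not the code in this PR: the `LocalBackend` class, the `auto` classmethod, and the probe-callable mechanism are all hypothetical names chosen for illustration, and the real Ray Train and Accelerate backends are deliberately left out so the snippet runs with no third-party dependencies.

```python
from typing import Any, Callable, List, Optional, Sequence


class LocalBackend:
    """Hypothetical stand-in backend: one process, so collectives are trivial."""

    name = "local"

    def world_size(self) -> int:
        return 1

    def gather(self, value: Any) -> List[Any]:
        # With a single process, gathering yields just the local value.
        return [value]

    def unwrap(self, model: Any) -> Any:
        # DDP-style wrappers expose the original model as `.module`.
        return getattr(model, "module", model)


class DistributedContext:
    """Illustrative facade: training code calls this, never a backend directly."""

    def __init__(self, backend: Any) -> None:
        self.backend = backend

    @classmethod
    def auto(
        cls, probes: Sequence[Callable[[], Optional[Any]]]
    ) -> "DistributedContext":
        """Use the first backend whose probe returns an instance.

        Each probe is an assumption of this sketch: a callable that returns
        a backend object if its framework is active, or None otherwise.
        """
        for probe in probes:
            backend = probe()
            if backend is not None:
                return cls(backend)
        # No framework detected: fall back to single-process behaviour.
        return cls(LocalBackend())

    def gather(self, value: Any) -> List[Any]:
        return self.backend.gather(value)

    def unwrap(self, model: Any) -> Any:
        return self.backend.unwrap(model)
```

In this sketch a Ray-backed or Accelerate-backed class would implement the same three methods, which is also what makes mock-based unit tests straightforward: a test can pass a probe that returns a mock backend and assert on the calls it receives.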