97 changes: 97 additions & 0 deletions examples/nlp_and_llms/nvidia-deepspeed/README.md
@@ -0,0 +1,97 @@
# DeepSpeed ZeRO-3 Training

![Memory Sharding Icon](./memory-sharding.png)

This template provides a robust environment for training large-scale Transformer models (like GPT-2 Large) using **DeepSpeed ZeRO-Stage 3**. By partitioning model parameters, gradients, and optimizer states across multiple GPUs, this setup overcomes the memory limitations of a single device.
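To get a feel for the savings, here is a back-of-envelope sketch (assumptions: fp16 training with Adam and the per-parameter costs from the ZeRO paper — 2 bytes for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states — all partitioned across the GPUs):

```python
# Rough ZeRO-3 model-state memory per GPU (illustrative numbers only)
params = 774_000_000          # approximate GPT-2 Large parameter count
bytes_per_param = 2 + 2 + 12  # fp16 weights + fp16 grads + fp32 Adam states

for n_gpus in (1, 2, 4, 8):
    gib = params * bytes_per_param / n_gpus / 2**30
    print(f"{n_gpus} GPU(s): ~{gib:.1f} GiB of model state per device")
```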

For more information on the underlying platform, visit the [Saturn Cloud Documentation](https://saturncloud.io/docs/).

## 📂 Project Structure

* **`setup_saturn.sh`**: Environment initialization script to install DeepSpeed and dependencies.
* **`src/train_transformers.py`**: Main training script using Hugging Face `Trainer` and DeepSpeed.
* **`ds_config_zero3.json`**: Configuration file for ZeRO-3 sharding and CPU offloading.
* **`run_job.sh`**: Distributed training launcher script.
* **`test_inference.py`**: Generation script to verify the trained checkpoint with a Hugging Face `pipeline`.

---

## 🚀 Complete Procedure

### 1. Environment Setup

Before running any code, initialize the virtual environment to install the required DeepSpeed and PyTorch libraries. Refer to the [Saturn Cloud documentation](https://saturncloud.io/docs/) for advanced configuration.

```bash
chmod +x setup_saturn.sh
./setup_saturn.sh
```

### 2. Hardware Preparation

To prevent filesystem errors during kernel compilation on Saturn Cloud's distributed architecture, create the Triton autotune directory:

```bash
mkdir -p /root/.triton/autotune
```

### 3. Training Execution

Launch the training process across your GPUs using the provided job script:

```bash
./run_job.sh
```

* **The "Silent Phase"**: Note that ZeRO-3 requires a period of "silence" (usually 2-5 minutes for GPT-2) while it shards the model parameters before the first step appears.
* **Automatic Consolidation**: The script is configured to automatically gather sharded 16-bit weights into a single `model.safetensors` file upon saving.
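If a checkpoint was ever saved without that option, the sharded weights can still be consolidated manually with DeepSpeed's conversion helper. A minimal sketch (the checkpoint path is illustrative; adjust it to your run):

```python
# Consolidate ZeRO-3 shards into a single fp32 state dict (sketch)
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("./checkpoints/checkpoint-65")
torch.save(state_dict, "./checkpoints/consolidated.pt")
```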

### 4. Inference Testing

After training completes and a checkpoint folder (e.g., `checkpoint-65`) is created, run the inference test.

**Update `test_inference.py`:**
Ensure the `model_path` variable matches your checkpoint folder:

```python
model_path = "./checkpoints/checkpoint-65"
```

**Launch Inference:**
Ensure the virtual environment is activated, then run the Python test script:

```bash
python test_inference.py
```
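The test script uses a plain Hugging Face `pipeline`. If you want kernel-optimized generation via DeepSpeed Inference, a sketch along these lines should work on a single GPU (the checkpoint path is illustrative, and the exact `init_inference` arguments may vary across DeepSpeed versions):

```python
# Sketch: wrapping the trained model with DeepSpeed Inference kernels
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("./checkpoints/checkpoint-65")

# Inject fused CUDA kernels and cast to fp16
engine = deepspeed.init_inference(model, dtype=torch.half, replace_with_kernel_inject=True)

inputs = tokenizer("The phenomenon of distributed computing allows", return_tensors="pt").to("cuda")
with torch.no_grad():
    output_ids = engine.module.generate(**inputs, max_length=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```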

---

## 🛠️ Key Configurations

### ZeRO-3 Optimization (`ds_config_zero3.json`)

* **`stage3_gather_16bit_weights_on_model_save`**: Set to `true` to ensure your checkpoints are saved in a standard format for easy testing.
* **`overlap_comm`**: Set to `false` in this template to maximize stability and prevent deadlocks on virtualized interconnects.

### Training Stability (`src/train_transformers.py`)

* **NCCL Flags**: The script forces `NCCL_P2P_DISABLE=1` to ensure reliable communication on cloud-based GPU clusters.
* **Data Collator**: Uses `DataCollatorForLanguageModeling` to pad every batch to a uniform tensor shape, preventing `ValueError` crashes from ragged batches (see the sketch below).
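As a standalone sketch of what the collator produces (only the `gpt2` tokenizer is assumed):

```python
# Sketch: DataCollatorForLanguageModeling with mlm=False builds causal-LM batches
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no dedicated pad token

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
features = [tokenizer("a short line"), tokenizer("a noticeably longer line of text")]
batch = collator(features)

# input_ids are padded to one shape; labels mirror input_ids with padded
# positions set to -100 so the loss ignores them
print(batch["input_ids"].shape, batch["labels"].shape)
```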

---

## 📈 Scaling Guide

To scale from the verified test to a production-level run, make the following changes (sketched after this list):

1. **Model**: Change `model_id` to `"gpt2-large"` in `src/train_transformers.py`.
2. **Dataset**: Remove the `[:1%]` slice to train on the full dataset.
3. **Sequence Length**: Increase `max_length` to `512` or `1024` in the `tokenize_function`.
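A sketch of how those three edits might look in `src/train_transformers.py` (placement assumed from the script in this template):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "gpt2-large"  # 1. larger model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# 2. train on the full split instead of the [:1%] test slice
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# 3. longer sequences
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,  # or 1024
        padding="max_length"
    )
```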

For more community support, visit the [Saturn Cloud Community Slack](https://saturncloud.io/community/).

---
20 changes: 20 additions & 0 deletions examples/nlp_and_llms/nvidia-deepspeed/ds_config_zero3.json
@@ -0,0 +1,20 @@
{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "sub_group_size": 1e6,
    "reduce_bucket_size": 1e6,
    "stage3_prefetch_bucket_size": 1e6,
    "stage3_param_persistence_threshold": 1e5,
    "stage3_max_live_parameters": 1e8,
    "stage3_max_reuse_distance": 1e8,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "steps_per_print": 10
}
4 changes: 4 additions & 0 deletions examples/nlp_and_llms/nvidia-deepspeed/run_job.sh
@@ -0,0 +1,4 @@
#!/bin/bash
source virt-env/bin/activate
# Automatically uses all detected GPUs for ZeRO-3 sharding
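# (to pin a specific device count instead, the launcher accepts e.g. --num_gpus=2)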
deepspeed src/train_transformers.py
23 changes: 23 additions & 0 deletions examples/nlp_and_llms/nvidia-deepspeed/setup_saturn.sh
@@ -0,0 +1,23 @@
#!/bin/bash
# 1. Update system and install virtual environment tools
apt-get update && apt-get install -y python3-venv python3-pip ninja-build

# 2. Create and activate the virtual environment
python3 -m venv virt-env
source virt-env/bin/activate

# 3. Install core dependencies
# (DeepSpeed itself is rebuilt in step 4 so its compiled ops match this hardware)
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "transformers[deepspeed]>=4.31.0" datasets accelerate tqdm

# 4. Rebuild DeepSpeed with optimized ops
# (transformers[deepspeed] already pulls in DeepSpeed; force the reinstall so the
# C++/CUDA extensions are compiled for this machine)
DS_BUILD_OPS=1 pip install --force-reinstall --no-cache-dir deepspeed

# 5. Pre-cache dataset to prevent network timeouts during training
# (use the default Hugging Face cache location so the training script finds it)
echo "📦 Pre-caching WikiText-2 dataset..."
python3 -c "from datasets import load_dataset; load_dataset('wikitext', 'wikitext-2-raw-v1')"

echo "✅ Saturn Cloud Environment Setup Complete."
71 changes: 71 additions & 0 deletions examples/nlp_and_llms/nvidia-deepspeed/src/train_transformers.py
@@ -0,0 +1,71 @@
import os
import datetime
import torch
import torch.distributed as dist
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset

# Force NCCL stability on cloud instances
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"

def main():
    # The deepspeed launcher sets RANK/WORLD_SIZE, so this init succeeds on each worker
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl", timeout=datetime.timedelta(minutes=10))

    model_id = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no dedicated pad token

    # 1. Load tiny dataset
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

    # 2. Tokenize function with padding and truncation
    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=128,  # Keep small for fast test
            padding="max_length"
        )

    # 3. Prepare data (filter out empty rows to avoid errors)
    dataset = dataset.filter(lambda x: len(x["text"]) > 5)
    tokenized_ds = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)

    # 4. Data collator pads batches and builds causal-LM labels
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    training_args = TrainingArguments(
        output_dir="./checkpoints",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        deepspeed="ds_config_zero3.json",
        fp16=True,
        logging_steps=1,
        report_to="none"
    )

    # 5. Load model
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.config.use_cache = False  # KV cache is inference-only; disable it for training

    # 6. Initialize Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_ds,
        data_collator=data_collator  # ensures uniform tensor shapes in each batch
    )

    print("🚀 Launching ZeRO-3 training...")
    trainer.train()

if __name__ == "__main__":
    main()
27 changes: 27 additions & 0 deletions examples/nlp_and_llms/nvidia-deepspeed/test_inference.py
@@ -0,0 +1,27 @@
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

def test_generation():
    model_path = "./checkpoints/checkpoint-65"

    print(f"📦 Loading model from {model_path}...")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained(model_path)

    # Move to GPU if available
    device = 0 if torch.cuda.is_available() else -1
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

    # Test prompt from the WikiText domain
    prompt = "The phenomenon of distributed computing allows"

    print("🔮 Generating...")
    output = generator(prompt, max_length=50, num_return_sequences=1, truncation=True)

    print("\n--- GENERATED TEXT ---")
    print(output[0]['generated_text'])
    print("----------------------")

if __name__ == "__main__":
    test_generation()