97 changes: 97 additions & 0 deletions examples/nlp_and_llms/nvidia-deepspeed/README.md
@@ -0,0 +1,97 @@
# DeepSpeed ZeRO-3 Training

![Memory Sharding Icon](./memory-sharding.png)

This template provides a robust environment for training large-scale Transformer models (like GPT-2 Large) using **DeepSpeed ZeRO-Stage 3**. By partitioning model parameters, gradients, and optimizer states across multiple GPUs, this setup overcomes the memory limitations of a single device.
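To get a feel for the savings, here is a back-of-envelope sketch (assumptions: fp16 training with Adam and the per-parameter costs from the ZeRO paper — 2 bytes for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states — all partitioned across the GPUs):

```python
# Rough ZeRO-3 model-state memory per GPU (illustrative numbers only)
params = 774_000_000          # approximate GPT-2 Large parameter count
bytes_per_param = 2 + 2 + 12  # fp16 weights + fp16 grads + fp32 Adam states

for n_gpus in (1, 2, 4, 8):
    gib = params * bytes_per_param / n_gpus / 2**30
    print(f"{n_gpus} GPU(s): ~{gib:.1f} GiB of model state per device")
```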

For more information on the underlying platform, visit the [Saturn Cloud Documentation](https://saturncloud.io/docs/).

## 📂 Project Structure

* **`setup_saturn.sh`**: Environment initialization script to install DeepSpeed and dependencies.
* **`src/train_transformers.py`**: Main training script using Hugging Face `Trainer` and DeepSpeed.
* **`ds_config_zero3.json`**: Configuration file for ZeRO-3 sharding and CPU offloading.
* **`run_job.sh`**: Distributed training launcher script.
* **`test_inference.py`**: Generation script to verify the trained checkpoint with a Hugging Face `pipeline`.

---

## 🚀 Complete Procedure

### 1. Environment Setup

Before running any code, initialize the virtual environment to install the required DeepSpeed and PyTorch libraries. Refer to the [Saturn Cloud documentation](https://saturncloud.io/docs/) for advanced configuration.

```bash
chmod +x setup_saturn.sh
./setup_saturn.sh
```

### 2. Hardware Preparation

To prevent filesystem errors during kernel compilation on Saturn Cloud's distributed architecture, create the Triton autotune directory:

```bash
mkdir -p /root/.triton/autotune
```

### 3. Training Execution

Launch the training process across your GPUs using the provided job script:

```bash
./run_job.sh
```

* **The "Silent Phase"**: Note that ZeRO-3 requires a period of "silence" (usually 2-5 minutes for GPT-2) while it shards the model parameters before the first step appears.
* **Automatic Consolidation**: The script is configured to automatically gather sharded 16-bit weights into a single `model.safetensors` file upon saving.
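If a checkpoint was ever saved without that option, the sharded weights can still be consolidated manually with DeepSpeed's conversion helper. A minimal sketch (the checkpoint path is illustrative; adjust it to your run):

```python
# Consolidate ZeRO-3 shards into a single fp32 state dict (sketch)
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("./checkpoints/checkpoint-65")
torch.save(state_dict, "./checkpoints/consolidated.pt")
```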

### 4. Inference Testing

After training completes and a checkpoint folder (e.g., `checkpoint-65`) is created, run the inference test.

**Update `test_inference.py`:**
Ensure the `model_path` variable matches your checkpoint folder:

```python
model_path = "./checkpoints/checkpoint-65"
```

**Launch Inference:**
Ensure the virtual environment is activated, then run the Python test script:

```bash
python test_inference.py
```
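The test script uses a plain Hugging Face `pipeline`. If you want kernel-optimized generation via DeepSpeed Inference, a sketch along these lines should work on a single GPU (the checkpoint path is illustrative, and the exact `init_inference` arguments may vary across DeepSpeed versions):

```python
# Sketch: wrapping the trained model with DeepSpeed Inference kernels
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("./checkpoints/checkpoint-65")

# Inject fused CUDA kernels and cast to fp16
engine = deepspeed.init_inference(model, dtype=torch.half, replace_with_kernel_inject=True)

inputs = tokenizer("The phenomenon of distributed computing allows", return_tensors="pt").to("cuda")
with torch.no_grad():
    output_ids = engine.module.generate(**inputs, max_length=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```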

---

## 🛠️ Key Configurations

### ZeRO-3 Optimization (`ds_config_zero3.json`)

* **`stage3_gather_16bit_weights_on_model_save`**: Set to `true` to ensure your checkpoints are saved in a standard format for easy testing.
* **`overlap_comm`**: Set to `false` in this template to maximize stability and prevent deadlocks on virtualized interconnects.

### Training Stability (`src/train_transformers.py`)

* **NCCL Flags**: The script forces `NCCL_P2P_DISABLE=1` to ensure reliable communication on cloud-based GPU clusters.
* **Data Collator**: Uses `DataCollatorForLanguageModeling` to pad every batch to a uniform tensor shape, preventing `ValueError` crashes from ragged batches (see the sketch below).
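As a standalone sketch of what the collator produces (only the `gpt2` tokenizer is assumed):

```python
# Sketch: DataCollatorForLanguageModeling with mlm=False builds causal-LM batches
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no dedicated pad token

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
features = [tokenizer("a short line"), tokenizer("a noticeably longer line of text")]
batch = collator(features)

# input_ids are padded to one shape; labels mirror input_ids with padded
# positions set to -100 so the loss ignores them
print(batch["input_ids"].shape, batch["labels"].shape)
```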

---

## 📈 Scaling Guide

To scale from the verified test to a production-level run, make the following changes (sketched after this list):

1. **Model**: Change `model_id` to `"gpt2-large"` in `src/train_transformers.py`.
2. **Dataset**: Remove the `[:1%]` slice to train on the full dataset.
3. **Sequence Length**: Increase `max_length` to `512` or `1024` in the `tokenize_function`.
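A sketch of how those three edits might look in `src/train_transformers.py` (placement assumed from the script in this template):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "gpt2-large"  # 1. larger model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# 2. train on the full split instead of the [:1%] test slice
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# 3. longer sequences
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,  # or 1024
        padding="max_length"
    )
```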

For more community support, visit the [Saturn Cloud Community Slack](https://saturncloud.io/community/).

---
20 changes: 20 additions & 0 deletions examples/nlp_and_llms/nvidia-deepspeed/ds_config_zero3.json
@@ -0,0 +1,20 @@
{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "sub_group_size": 1e6,
    "reduce_bucket_size": 1e6,
    "stage3_prefetch_bucket_size": 1e6,
    "stage3_param_persistence_threshold": 1e5,
    "stage3_max_live_parameters": 1e8,
    "stage3_max_reuse_distance": 1e8,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "steps_per_print": 10
}
4 changes: 4 additions & 0 deletions examples/nlp_and_llms/nvidia-deepspeed/run_job.sh
@@ -0,0 +1,4 @@
#!/bin/bash
source virt-env/bin/activate
# Automatically uses all detected GPUs for ZeRO-3 sharding
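# (to pin a specific device count instead, the launcher accepts e.g. --num_gpus=2)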
deepspeed src/train_transformers.py
23 changes: 23 additions & 0 deletions examples/nlp_and_llms/nvidia-deepspeed/setup_saturn.sh
@@ -0,0 +1,23 @@
#!/bin/bash
# 1. Update system and install virtual environment tools
apt-get update && apt-get install -y python3-venv python3-pip ninja-build

# 2. Create and activate the virtual environment
python3 -m venv virt-env
source virt-env/bin/activate

# 3. Install core dependencies
# (DeepSpeed itself is rebuilt in step 4 so its compiled ops match this hardware)
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "transformers[deepspeed]>=4.31.0" datasets accelerate tqdm

# 4. Rebuild DeepSpeed with optimized ops
# (transformers[deepspeed] already pulls in DeepSpeed; force the reinstall so the
# C++/CUDA extensions are compiled for this machine)
DS_BUILD_OPS=1 pip install --force-reinstall --no-cache-dir deepspeed

# 5. Pre-cache dataset to prevent network timeouts during training
# (use the default Hugging Face cache location so the training script finds it)
echo "📦 Pre-caching WikiText-2 dataset..."
python3 -c "from datasets import load_dataset; load_dataset('wikitext', 'wikitext-2-raw-v1')"

echo "✅ Saturn Cloud Environment Setup Complete."
71 changes: 71 additions & 0 deletions examples/nlp_and_llms/nvidia-deepspeed/src/train_transformers.py
@@ -0,0 +1,71 @@
import os
import datetime
import torch
import torch.distributed as dist
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset

# Force NCCL stability on cloud instances
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"

def main():
    # The deepspeed launcher sets RANK/WORLD_SIZE, so this init succeeds on each worker
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl", timeout=datetime.timedelta(minutes=10))

    model_id = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no dedicated pad token

    # 1. Load tiny dataset
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

    # 2. Tokenize function with padding and truncation
    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=128,  # Keep small for fast test
            padding="max_length"
        )

    # 3. Prepare data (filter out empty rows to avoid errors)
    dataset = dataset.filter(lambda x: len(x["text"]) > 5)
    tokenized_ds = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)

    # 4. Data collator pads batches and builds causal-LM labels
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    training_args = TrainingArguments(
        output_dir="./checkpoints",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        deepspeed="ds_config_zero3.json",
        fp16=True,
        logging_steps=1,
        report_to="none"
    )

    # 5. Load model
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.config.use_cache = False  # KV cache is inference-only; disable it for training

    # 6. Initialize Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_ds,
        data_collator=data_collator  # ensures uniform tensor shapes in each batch
    )

    print("🚀 Launching ZeRO-3 training...")
    trainer.train()

if __name__ == "__main__":
    main()
27 changes: 27 additions & 0 deletions examples/nlp_and_llms/nvidia-deepspeed/test_inference.py
@@ -0,0 +1,27 @@
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

def test_generation():
    model_path = "./checkpoints/checkpoint-65"

    print(f"📦 Loading model from {model_path}...")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained(model_path)

    # Move to GPU if available
    device = 0 if torch.cuda.is_available() else -1
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

    # Test prompt from the WikiText domain
    prompt = "The phenomenon of distributed computing allows"

    print("🔮 Generating...")
    output = generator(prompt, max_length=50, num_return_sequences=1, truncation=True)

    print("\n--- GENERATED TEXT ---")
    print(output[0]['generated_text'])
    print("----------------------")

if __name__ == "__main__":
    test_generation()