47 commits
7ff51dc
add workflow to run full tests
tohtana Jan 17, 2026
9b556a5
Test with -n 1 to debug parallel execution issues
tohtana Jan 17, 2026
96d3674
Merge: set TORCH_CUDA_ARCH_LIST=8.9 and use -n 1 for debugging
tohtana Jan 17, 2026
e827532
Fix bf16 checkpoint optimizer state and muon test
tohtana Jan 17, 2026
fa213d0
Skip aio tests that hang in CI environment
tohtana Jan 17, 2026
9195771
Skip more hanging ops tests in CI
tohtana Jan 17, 2026
0e52d3a
Fix ulysses PEFT test to use mpu object instead of global groups
tohtana Jan 17, 2026
b77f9c8
Skip pipeline parallelism tests that timeout in CI
tohtana Jan 17, 2026
29eff21
Skip CPU adam tests that timeout in CI
tohtana Jan 17, 2026
e0e1cab
Skip zenflow tests that timeout in CI
tohtana Jan 17, 2026
70d90c5
Skip pipeline checkpoint tests that timeout in CI
tohtana Jan 17, 2026
5af1f37
Skip test_multiple_models.py that timeouts
tohtana Jan 17, 2026
a3a9f8e
Run tests in parallel with -n 4 instead of sequential
tohtana Jan 17, 2026
02d5ed7
Skip onebit tests that timeout with pipeline config
tohtana Jan 17, 2026
0fd0a10
Skip test_ds_initialize.py tests that timeout
tohtana Jan 17, 2026
a6fb0cb
Skip test_zero_leaf_module.py tests that timeout
tohtana Jan 17, 2026
35ec1e2
Skip test_zero_tensor_fragment.py tests that timeout
tohtana Jan 17, 2026
78d0580
Skip test_mup_optimizers.py tests that timeout
tohtana Jan 17, 2026
6077494
Skip test_user_args.py shell quoting edge cases
tohtana Jan 17, 2026
ad2a74d
Skip nvme checkpointing tests (no nvme device in CI)
tohtana Jan 17, 2026
bcdfe3d
Enable async I/O tests with DS_DISABLE_REUSE_DIST_ENV
tohtana Jan 17, 2026
ee61faa
Remove test ignores to validate DS_DISABLE_REUSE_DIST_ENV fix
tohtana Jan 17, 2026
bfa4831
Fix: Use /mnt/aio/pytest subdirectory for basetemp
tohtana Jan 17, 2026
6b8290a
fix(pipeline): set _running_engine_backward for non-last stage backward
tohtana Jan 17, 2026
3d137c6
Skip GDS tests in CI (no GPUDirect Storage hardware)
tohtana Jan 18, 2026
30a80d0
Install pdsh for launcher tests
tohtana Jan 18, 2026
c6e6008
Add pdsh, skip zenflow tests (timeout)
tohtana Jan 18, 2026
dfc7834
fix: BF16_Optimizer selection and compatibility issues
tohtana Jan 18, 2026
505ffa6
fix: skip empty parameters in gradient reduction
tohtana Jan 18, 2026
a02bc6e
fix(test): add bf16 model with fp32 grad_accum to supported configs
tohtana Jan 18, 2026
12a6e95
ci: increase parallel test workers to 8
tohtana Jan 18, 2026
121c7e0
ci: enable zenflow tests
tohtana Jan 18, 2026
5daced1
ci: skip launcher tests requiring SSH
tohtana Jan 18, 2026
274d361
Skip zenflow tests due to pre-existing Stage 3 bugs
tohtana Jan 18, 2026
ba296eb
Skip ZenFlow torch adam test (CUDA/fork incompatibility)
tohtana Jan 18, 2026
c993e84
Mark manual dist init tests as sequential to avoid port conflicts
tohtana Jan 18, 2026
ded0436
Add debug test for RowParallel numerical differences
tohtana Jan 18, 2026
ee5e166
Update debug workflow to run testRowParallel with multiple seeds and …
tohtana Jan 18, 2026
248cfd1
Debug: Run testRowParallel and sequential tests with multiple seeds
tohtana Jan 18, 2026
667157f
Debug: Fix testRowParallel selection and use assert_close for diagnos…
tohtana Jan 18, 2026
7a5eebd
Debug: Also update testColumnParallel to use assert_close
tohtana Jan 18, 2026
b9a5e99
Fix autoTP test numerical tolerance with assert_close
tohtana Jan 18, 2026
0675504
Fix Evoformer compilation (#7760)
sdvillal Jan 18, 2026
a4500e6
fix format
tohtana Jan 18, 2026
e668921
Temp: Add CUTLASS and run only Evoformer tests
tohtana Jan 18, 2026
f61ed51
Fix: Remove --forked from Evoformer test to avoid CUDA fork issue
tohtana Jan 18, 2026
dca165b
Add CUTLASS support and mark Evoformer test as sequential
tohtana Jan 18, 2026
106 changes: 106 additions & 0 deletions .github/workflows/aws-torch-latest-full.yml
@@ -0,0 +1,106 @@
################################################################################
# DeepSpeed CI - AWS L40S GPU Tests (PyTorch Latest - Full Unit Tests)
#
# Migrated from nv-torch-latest-v100.yml which ran on deprecated V100 cluster.
# Runs the full unit test suite (tests/unit/) on AWS self-hosted runners.
# Manual trigger only (workflow_dispatch).
################################################################################

name: aws-torch-latest-full

on:
workflow_dispatch:

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

jobs:
unit-tests:
name: Unit Tests (Full)
runs-on: [self-hosted, gpu-ci, gpu-l40s, l40s-4gpu, aws]
timeout-minutes: 60

container:
image: nvidia/cuda:12.6.3-devel-ubuntu22.04
# Mount /mnt/aio for async I/O tests (O_DIRECT requires native filesystem, not overlayfs)
options: --gpus all --shm-size "32G" -v /mnt/aio:/mnt/aio

env:
TORCH_VER: "2.7"
CUDA_VER: "12.6"
# Disable reuse_dist_env to prevent pool worker cleanup hangs in full test runs
DS_DISABLE_REUSE_DIST_ENV: "1"

steps:
- name: Install system dependencies
run: |
apt-get update && apt-get install -y git git-lfs libaio-dev python3 python3-pip
git lfs install
ln -sf /usr/bin/python3 /usr/bin/python

- name: Checkout repository
uses: actions/checkout@v4
with:
lfs: true

- name: Install PyTorch
run: |
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu126

- name: Install transformers
run: |
git clone https://github.com/huggingface/transformers
cd transformers
# if needed switch to the last known good SHA until transformers@master is fixed
git checkout 981c276
git rev-parse --short HEAD
pip install .

- name: Install Python dependencies
run: |
pip install --upgrade pip
pip install -r requirements/requirements.txt
pip install -r requirements/requirements-dev.txt
pip install -r requirements/requirements-deepcompile.txt
pip install pytest-timeout pytest-instafail

- name: Check environment
run: |
echo "=== GPU Information ==="
nvidia-smi
echo ""
echo "=== CUDA Version ==="
nvcc --version
echo ""
echo "=== Python/PyTorch Info ==="
python --version
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')"
python -c "import torch; print(f'BF16 support: {torch.cuda.is_bf16_supported()}')"

- name: Install DeepSpeed
run: |
# Initialize CUDA before install so setup.py can detect NCCL version
python -c "import torch; torch.cuda.init(); print(f'NCCL version: {torch.cuda.nccl.version()}')"
# Use --no-build-isolation so setup.py can access pre-installed PyTorch
pip install --no-build-isolation .[dev,1bit,autotuning,deepcompile]
ds_report
# Debug: Check captured torch_info values
python -c "from deepspeed.git_version_info import torch_info; print(f'torch_info: {torch_info}')"

- name: Python environment
run: |
pip list

- name: Unit tests
run: |
unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
cd tests
# Use -n 1 to run tests sequentially to avoid parallel execution issues
# Use /mnt/aio/pytest as basetemp for O_DIRECT support in aio tests
rm -rf /mnt/aio/pytest
pytest -x --instafail --timeout 600 --forked -n 1 --basetemp=/mnt/aio/pytest unit/ --torch_ver=${{ env.TORCH_VER }} --cuda_ver=${{ env.CUDA_VER }}
rm -rf /mnt/aio/pytest
pytest --instafail --timeout 600 --forked -m 'sequential' --basetemp=/mnt/aio/pytest unit/ --torch_ver=${{ env.TORCH_VER }} --cuda_ver=${{ env.CUDA_VER }}
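For reference, the two-pass strategy above (one main run, then a separate pass for tests marked `sequential`) can be approximated locally. This is a minimal sketch, assuming the same `/mnt/aio` mount for O_DIRECT support and the pytest-xdist/pytest-forked plugins from requirements-dev:

```shell
# Sketch of a local reproduction of the CI two-pass test strategy
# (assumes a native filesystem mounted at /mnt/aio, as in the workflow).
export DS_DISABLE_REUSE_DIST_ENV=1   # avoid pool-worker cleanup hangs
cd tests
rm -rf /mnt/aio/pytest
pytest --forked -n 1 --basetemp=/mnt/aio/pytest unit/
rm -rf /mnt/aio/pytest
pytest --forked -m 'sequential' --basetemp=/mnt/aio/pytest unit/
```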
64 changes: 55 additions & 9 deletions .github/workflows/aws-torch-latest.yml
@@ -42,24 +42,28 @@ jobs:
- '!tests/unit/inference/v2/**'

unit-tests:
name: Unit Tests (V1)
name: Unit Tests (Full)
needs: check-paths
if: needs.check-paths.outputs.should_run == 'true'
runs-on: [self-hosted, gpu-ci, gpu-l40s, l40s-4gpu, aws]
timeout-minutes: 60
timeout-minutes: 180

container:
image: nvidia/cuda:12.6.3-devel-ubuntu22.04
options: --gpus all --shm-size "32G"
# Mount /mnt/aio for async I/O tests (O_DIRECT requires native filesystem, not overlayfs)
options: --gpus all --shm-size "32G" -v /mnt/aio:/mnt/aio

env:
TORCH_VER: "2.7"
CUDA_VER: "12.6"
CUTLASS_PATH: /opt/cutlass
# Disable reuse_dist_env to prevent pool worker cleanup hangs in full test runs
DS_DISABLE_REUSE_DIST_ENV: "1"

steps:
- name: Install system dependencies
run: |
apt-get update && apt-get install -y git git-lfs libaio-dev python3 python3-pip
apt-get update && apt-get install -y git git-lfs libaio-dev pdsh python3 python3-pip
git lfs install
ln -sf /usr/bin/python3 /usr/bin/python

@@ -68,16 +72,30 @@ jobs:
with:
lfs: true

- name: Install CUTLASS
run: |
git clone --depth 1 --branch v3.5.1 https://github.com/NVIDIA/cutlass.git /opt/cutlass
echo "CUTLASS installed at /opt/cutlass"
ls -la /opt/cutlass/include/ | head -10

- name: Install PyTorch
run: |
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu126

- name: Install transformers
run: |
git clone https://github.com/huggingface/transformers
cd transformers
git checkout 981c276
pip install .

- name: Install Python dependencies
run: |
pip install --upgrade pip
pip install -r requirements/requirements.txt
pip install -r requirements/requirements-dev.txt
pip install -r requirements/requirements-deepcompile.txt
pip install pytest-timeout pytest-instafail

- name: Check environment
run: |
@@ -93,17 +111,45 @@
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')"
python -c "import torch; print(f'BF16 support: {torch.cuda.is_bf16_supported()}')"
echo ""
echo "=== CUTLASS ==="
echo "CUTLASS_PATH: $CUTLASS_PATH"
ls -la $CUTLASS_PATH/include/ | head -5

- name: Install DeepSpeed
run: |
# Initialize CUDA before install so setup.py can detect NCCL version
python -c "import torch; torch.cuda.init(); print(f'NCCL version: {torch.cuda.nccl.version()}')"
# Use --no-build-isolation so setup.py can access pre-installed PyTorch
pip install --no-build-isolation .
pip install --no-build-isolation .[dev,1bit,autotuning,deepcompile]
ds_report
# Debug: Check captured torch_info values
python -c "from deepspeed.git_version_info import torch_info; print(f'torch_info: {torch_info}')"

- name: Run unit tests
- name: Python environment
run: |
pip list

- name: Unit tests
run: |
pytest -n 4 --forked --verbose tests/unit/v1/ --torch_ver=${{ env.TORCH_VER }} --cuda_ver=${{ env.CUDA_VER }}
export TORCH_CUDA_ARCH_LIST="8.9"
cd tests
# Skip tests requiring unavailable hardware or known issues:
# - nvme checkpointing: no nvme device
# - GDS tests: no GPUDirect Storage support
# - launcher user_args: pdsh requires SSH server
# - zenflow: Stage 3 tests have pre-existing bugs + CUDA/fork issues
rm -rf /mnt/aio/pytest
pytest --instafail --timeout 600 --forked -n 8 --basetemp=/mnt/aio/pytest unit/ \
--ignore=unit/runtime/zero/test_nvme_checkpointing.py \
--ignore=unit/ops/aio/test_gds.py \
--ignore=unit/launcher/test_user_args.py \
--ignore=unit/runtime/zenflow \
--ignore=unit/ops/adam/test_zf_torch_adam.py \
--torch_ver=${{ env.TORCH_VER }} --cuda_ver=${{ env.CUDA_VER }}
rm -rf /mnt/aio/pytest
pytest --instafail --timeout 600 --forked -m 'sequential' --basetemp=/mnt/aio/pytest unit/ \
--ignore=unit/runtime/zero/test_nvme_checkpointing.py \
--ignore=unit/ops/aio/test_gds.py \
--ignore=unit/launcher/test_user_args.py \
--ignore=unit/runtime/zenflow \
--ignore=unit/ops/adam/test_zf_torch_adam.py \
--torch_ver=${{ env.TORCH_VER }} --cuda_ver=${{ env.CUDA_VER }}
4 changes: 3 additions & 1 deletion deepspeed/runtime/base_optimizer.py
@@ -110,7 +110,9 @@ def scale_if_loss(self, value: Any) -> Any:
return self.external_loss_scale * value
if self.torch_autocast_gradscaler:
return self.torch_autocast_gradscaler.scale(value)
return self.loss_scaler.scale_loss(value)
# Only call loss_scaler if it exists (not present in BF16_Optimizer)
if hasattr(self, 'loss_scaler') and self.loss_scaler is not None:
return self.loss_scaler.scale_loss(value)

return value

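The effect of the new guard is that optimizers without a loss scaler (such as `BF16_Optimizer`) now return the loss unchanged instead of raising `AttributeError`. A standalone sketch of the fall-through; the class below is illustrative, not DeepSpeed code:

```python
# Illustrative stand-in for an optimizer with no loss scaling (e.g. bf16),
# mirroring the guarded fall-through added to scale_if_loss above.
class _NoScalerOptimizer:
    custom_loss_scaler = False
    torch_autocast_gradscaler = None
    loss_scaler = None  # bf16 training does no loss scaling

    def scale_if_loss(self, value):
        if self.torch_autocast_gradscaler:
            return self.torch_autocast_gradscaler.scale(value)
        if getattr(self, 'loss_scaler', None) is not None:
            return self.loss_scaler.scale_loss(value)
        return value  # no scaler present: pass the loss through unchanged

assert _NoScalerOptimizer().scale_if_loss(1.25) == 1.25
```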
5 changes: 5 additions & 0 deletions deepspeed/runtime/bf16_optimizer.py
@@ -59,6 +59,11 @@ def __init__(self,
], f"BF16Optimizer: Unsupported gradient accumulation data type: {grad_acc_dtype}"
self.grad_acc_dtype = grad_acc_dtype

# BF16 doesn't use loss scaling, but these attributes are needed for API compatibility
self.custom_loss_scaler = False
self.external_loss_scale = None
self.torch_autocast_gradscaler = None

self.immediate_grad_update = bfloat16_config.immediate_grad_update

self.clip_grad = clip_grad
4 changes: 3 additions & 1 deletion deepspeed/runtime/constants.py
@@ -144,7 +144,9 @@
BFLOAT16_OPTIMIZER_STATES_DEFAULT = False

# DDP variant of BFLOAT16
DDP_BFLOAT16 = "bf16"
# DDP variant: bf16 model with bf16 grad accumulation (uses FP16_Optimizer in bf16 mode)
# Must be different from BFLOAT16 to allow proper optimizer selection
DDP_BFLOAT16 = "ddp_bf16"

#########################################
# FP16 support
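For context, the two bf16 modes the comment distinguishes map to different optimizer paths. A hypothetical config fragment for the `BFLOAT16` path is sketched below; the `data_types.grad_accum_dtype` key is assumed from DeepSpeed's config schema:

```python
# Hypothetical DeepSpeed config selecting the BF16_Optimizer path:
# bf16 weights with fp32 gradient accumulation. With grad_accum_dtype
# set to "bf16" instead, the engine takes the DDP_BFLOAT16 path
# (FP16_Optimizer running in bf16 mode), which the renamed constant
# now keeps distinct from BFLOAT16.
ds_config = {
    "train_batch_size": 8,
    "bf16": {"enabled": True},
    "data_types": {"grad_accum_dtype": "fp32"},
}
```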
9 changes: 7 additions & 2 deletions deepspeed/runtime/engine.py
@@ -2934,6 +2934,11 @@ def _get_gradients_for_reduction(self):
if not param.requires_grad:
continue

# Skip empty parameters (numel=0) as they contribute nothing to gradient reduction
# and cause issues with flatten/unflatten operations
if param.numel() == 0:
continue

if param.grad is None:
# In cases where there is an imbalance of empty grads across
# ranks we must create empty grads, this will ensure that every
@@ -3414,7 +3419,7 @@ def _load_checkpoint(self,
if self.optimizer is not None and hasattr(self.optimizer, 'refresh_fp32_params'):
self.optimizer.refresh_fp32_params()
else:
has_zero_optimizer_state = self.zero_optimization() or self.bfloat16_enabled()
has_zero_optimizer_state = self.zero_optimization()
if load_optimizer_states and self.optimizer is not None and not has_zero_optimizer_state:
if self.has_moe_layers:
largest_group_name = groups._get_max_expert_size_name()
@@ -3883,7 +3888,7 @@ def _save_checkpoint(self, save_dir, tag, client_state={}, exclude_frozen_parame

save_path = self._get_ckpt_name(save_dir, tag)

zero_optimizer_state = self.zero_optimization() or self.bfloat16_enabled()
zero_optimizer_state = self.zero_optimization()

save_frozen_param = self.zero_optimization_partition_gradients() and not exclude_frozen_parameters

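To see why zero-element parameters are worth filtering, note that they contribute nothing to the flat buffer used for reduction, so skipping them leaves the result unchanged while avoiding flatten/unflatten edge cases. A small sketch using torch's internal flatten helper for illustration (the tensors are made up):

```python
import torch
from torch._utils import _flatten_dense_tensors

# Empty (numel == 0) tensors add nothing to the flattened buffer, so they
# can be dropped before flattening, as the gradient-reduction fix does.
grads = [torch.ones(3), torch.empty(0), torch.ones(2)]
non_empty = [g for g in grads if g.numel() > 0]
flat = _flatten_dense_tensors(non_empty)
assert flat.numel() == 5  # only the non-empty tensors are represented
```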
23 changes: 16 additions & 7 deletions deepspeed/runtime/pipe/engine.py
@@ -852,13 +852,22 @@ def _exec_backward_pass(self, buffer_id):
# manually call because we don't call optimizer.backward()
self.optimizer.clear_lp_grads()

# This handles either a single tensor or tuple of tensors.
if isinstance(outputs, tuple):
out_tensors = [t for t in outputs if t.is_floating_point()]
assert len(out_tensors) == len(grad_tensors)
torch.autograd.backward(tensors=out_tensors, grad_tensors=grad_tensors)
else:
torch.autograd.backward(tensors=(outputs, ), grad_tensors=(grad_tensors, ))
# Set _running_engine_backward to avoid RuntimeError in post-backward hook
# when needs_scaler=True (the hook checks this flag to skip error checking)
self._running_engine_backward = True
try:
# Use tensor.backward(gradient) style which is now supported by DeepSpeed.
# This properly integrates with DeepSpeed's hooks and loss scaling.
if isinstance(outputs, tuple):
out_tensors = [t for t in outputs if t.is_floating_point()]
assert len(out_tensors) == len(grad_tensors)
# For multiple tensors, use retain_graph for all but the last
for i, (out, grad) in enumerate(zip(out_tensors, grad_tensors)):
out.backward(gradient=grad, retain_graph=(i < len(out_tensors) - 1))
else:
outputs.backward(gradient=grad_tensors)
finally:
self._running_engine_backward = False

if self.using_bf16_optimizer and not self.is_last_stage():
# manually call because we don't call optimizer.backward()
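The per-tensor loop above relies on `retain_graph` because successive backward calls share parts of the same autograd graph, and the graph is freed after the first call unless retained. A standalone illustration (tensors and shapes are made up):

```python
import torch

# Two outputs share the intermediate node h; every backward call except
# the last must retain the graph, mirroring the loop in _exec_backward_pass.
x = torch.randn(4, requires_grad=True)
h = x * 2                      # shared intermediate node
outputs = [h + 1, h * 3]
grad_tensors = [torch.ones(4), torch.ones(4)]
for i, (out, grad) in enumerate(zip(outputs, grad_tensors)):
    # Without retain_graph on the first call, the second backward would
    # raise: "Trying to backward through the graph a second time".
    out.backward(gradient=grad, retain_graph=(i < len(outputs) - 1))
# Gradients accumulate: d(h+1)/dx + d(3h)/dx = 2 + 6 = 8 per element.
assert torch.allclose(x.grad, torch.full((4,), 8.0))
```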
9 changes: 8 additions & 1 deletion docs/_tutorials/ds4sci_evoformerattention.md
@@ -13,15 +13,22 @@ tags: training inference

### 3.1 Installation

`DS4Sci_EvoformerAttention` is released as part of DeepSpeed >= 0.10.3. `DS4Sci_EvoformerAttention` is implemented based on [CUTLASS](https://github.com/NVIDIA/cutlass). You need to clone the CUTLASS repository and specify the path to it in the environment variable `CUTLASS_PATH`.
`DS4Sci_EvoformerAttention` is released as part of DeepSpeed >= 0.10.3.

`DS4Sci_EvoformerAttention` is implemented based on [CUTLASS](https://github.com/NVIDIA/cutlass). You need to clone the CUTLASS repository and specify the path to it in the environment variable `CUTLASS_PATH`.
CUTLASS setup detection can be skipped by setting ```CUTLASS_PATH="DS_IGNORE_CUTLASS_DETECTION"```, which is useful if your compiler environment is already properly set up (e.g., building in a conda environment with CUTLASS and the CUDA compilers installed).
Alternatively, the CUTLASS location can be inferred automatically from PyPI's [nvidia-cutlass](https://pypi.org/project/nvidia-cutlass/) package by setting ```CUTLASS_PATH="DS_USE_CUTLASS_PYTHON_BINDINGS"```. Note that this is discouraged, as ```nvidia-cutlass``` is no longer maintained and is outdated.

You can always simply clone CUTLASS and set ```CUTLASS_PATH``` yourself:
```shell
git clone https://github.com/NVIDIA/cutlass
export CUTLASS_PATH=/path/to/cutlass
```
The kernels will be compiled when `DS4Sci_EvoformerAttention` is called for the first time.

`DS4Sci_EvoformerAttention` requires GPUs with compute capability 7.0 or higher (NVIDIA V100 or later) and a minimum CUDA version of 11.3; CUDA 11.7 or later is recommended for better performance. Note that the backward kernel currently performs worse on V100 than on A100.
The extension checks both requirements and fails if either is not met. To disable the check, for example when cross-compiling on a system without GPUs, set the environment variable ```DS_IGNORE_CUDA_DETECTION=TRUE```
together with ```DS_EVOFORMER_GPU_ARCH={70|75|80}```, which selects the target GPU architecture (80 being the highest supported, covering NVIDIA Ampere and later).
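For instance, a cross-compiling setup on a GPU-less build machine might export (values illustrative):

```shell
# Cross-compiling without GPUs present (illustrative values):
export DS_IGNORE_CUDA_DETECTION=TRUE
export DS_EVOFORMER_GPU_ARCH=80   # target NVIDIA Ampere and later
```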

### 3.2 Unit test and benchmark
