docs/examples/attention/attention.ipynb
+2 -1 (2 additions, 1 deletion)
@@ -151,6 +151,7 @@
 "- flash-attention does not support `post_scale_bias`, and cuDNN attention does.\n",
 "- flash-attention supports KV-caching and paged attention, and cuDNN attention does not.\n",
 "- flash-attention uses bottom right diagonal for `causal` mask in cross attention (see [change log](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#21-change-behavior-of-causal-flag)), and cuDNN attention supports both top left and bottom right.\n",
+"- **Sliding window attention (SWA):** flash-attention has SWA(left, right) support for all mask types except top-left causal masks, with or without dropout, and without bias. cuDNN attention supports SWA(left, 0) starting from 9.2 and SWA(left, right) starting from 9.6, without dropout, and with `bias_type=\"no_bias\"`.\n",
 "- flash-attention outperforms cuDNN attention on Ampere architectures, and cuDNN attention has 20-50% advantages on Hopper architectures, based on our benchmarks for a number of commonly-used model configurations.\n",
 "\n",
"To compare cuDNN attention and flash-attention, users can modify the `model_configs` dictionary in [benchmarks/attention/benchmark_attention.py](https://github.com/NVIDIA/TransformerEngine/blob/main/benchmarks/attention/benchmark_attention.py) to collect performance numbers. The script runs each entry in `model_configs` for `num_iters` times, each time with one forward pass and one backward pass. Both backends are tried, and if one backend does not have support for the specific user input, the runtimes and speedups in the final table would be 0."
docs/examples/attention/cp_ag_thd_dpa_jax_deep_dive.ipynb
+1 -0 (1 addition, 0 deletions)
@@ -28,6 +28,7 @@
 "source": [
 "### Question 1: Why choose Striped > 1?\n",
 "\n",
+"\n",
 "Prior to the addition of this feature, Transformer Engine JAX attention already supported load balancing via a striping pattern, i.e., `stripe_size=1` for `CP + THD + P2P(Ring) + Striped + SWA`. However, this reordering technique does not lend itself well to an all-gathered (post-AG) pattern. The following example illustrates this distinction. For this example, `cp_size=4`, `num_segments=4`, `window_size=(8,0)`, and the pattern is for a single rank after striped reordering has been performed: \n",