Ring + Ulysses 2D context parallelism #404
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
d6638b6 to 2582cf1
48a900e to d10bf07
🤖 Hi @csgoogle, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
d10bf07 to 6521956
This Pull Request introduces a hybrid Ulysses + Ring 2D context parallelism for attention, which is a significant addition for handling long sequences in large-scale diffusion models. The implementation correctly sets up the internal mesh and handles the necessary all_to_all communications. However, there are some concerns regarding the masking logic and the handling of the attention_mask parameter in the new kernel that should be addressed to ensure correctness across all use cases.
🔍 General Feedback
- Positive Highlights: The refactoring of `Attention` classes to use a more flexible `attention_config` dictionary is a great improvement for maintainability and reduces the complexity of method signatures.
- Robustness: The validation checks for `ulysses_shards` and mesh axes are well-implemented and will help users avoid misconfiguration.
- Testing: New tests for the Ulysses-Ring layout are included, although more comprehensive tests covering the actual attention computation with padding would further improve confidence.
    "block_kv_dkv_compute" : 512,
    "block_q_dq" : 512,
    "block_kv_dq" : 512,
    "block_q" : 2048,
🟡 block_q and block_kv were increased from 512 to 2048 in this config. Is this change required for the new ulysses_ring attention kernel or is it a general performance optimization? It might be worth documenting why this specific model size received this update, or applying it to others if it's a general improvement.
    block_sizes = _select_flash_block_sizes(query, key, flash_block_sizes, dtype, "tokamax_ring")

    q_axis_names = nn.logical_to_mesh_axes(axis_names_q)
🟠 The use of np.tile(kv_shard_valid, num_ring_shards) assumes that every shard in the ring has an identical valid/padding pattern. In sequence parallelism, padding is typically only at the end of the global sequence (affecting only the last shard). Tiling a local mask will incorrectly mask valid tokens in earlier shards or fail to mask padding tokens in the last shard correctly.
Consider calculating the mask based on global sequence positions:
    # Example logic for global masking
    ring_idx = jax.lax.axis_index(ring_axis)
    q_global_start = ring_idx * query_seq_len
    # ... logic to create a mask that correctly accounts for orig_q_seq_len across the ring
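To make the concern concrete, here is a minimal NumPy sketch, not the PR's code, comparing a tiled local mask with one derived from global positions. The helper names `tiled_kv_mask` and `global_kv_mask` are hypothetical, and the sketch assumes padding sits only at the end of the global sequence.

```python
import numpy as np

def tiled_kv_mask(local_valid: np.ndarray, num_ring_shards: int) -> np.ndarray:
    """Repeat one shard's validity pattern for every ring shard (the flagged approach)."""
    return np.tile(local_valid, num_ring_shards)

def global_kv_mask(shard_len: int, num_ring_shards: int, orig_seq_len: int) -> np.ndarray:
    """Mark a position valid when its global index lies inside the unpadded sequence."""
    global_positions = np.arange(shard_len * num_ring_shards)
    return global_positions < orig_seq_len

# Assumed toy setup: 10 real tokens padded to 12, split across 3 ring shards of 4.
shard_len, num_ring_shards, orig_seq_len = 4, 3, 10

# Validity as seen by the last shard: global positions 8..11, of which 8 and 9 are real.
last_shard_valid = np.arange(8, 12) < orig_seq_len

tiled = tiled_kv_mask(last_shard_valid, num_ring_shards)
expected = global_kv_mask(shard_len, num_ring_shards, orig_seq_len)

print(tiled.astype(int))     # [1 1 0 0 1 1 0 0 1 1 0 0]
print(expected.astype(int))  # [1 1 1 1 1 1 1 1 1 1 0 0]
```

In this toy case the tiled mask drops four valid tokens from the first two shards, which is the failure mode described above.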
thanks, fixed it.
6521956 to 5c4d053
This PR successfully implements the ulysses_ring attention mode, a sophisticated 2D parallelism strategy that combines Ulysses all-to-all head sharding with Ring KV rotation. The implementation is clean, well-integrated into the existing WAN model infrastructure, and includes a comprehensive suite of unit tests verifying both functional correctness and validation logic.
🔍 General Feedback
- Elegant Mesh Reshaping: The internal reshaping of the `context` mesh axis into private `ring` and `ulysses` axes within `shard_map` is a great architectural choice, keeping the public configuration simple.
- Robust Masking: The global sequence masking logic in the hybrid kernel correctly handles padding, which is critical for accuracy in sequence-parallel settings.
- Thorough Testing: The addition of detailed round-trip and validation tests in `attention_test.py` is excellent and ensures the new feature is robust against edge cases.
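For readers unfamiliar with the Ulysses half of the scheme, the following is a rough single-process NumPy simulation of the sequence-to-head all-to-all. It is a sketch only: the function `ulysses_all_to_all` and the toy shapes are assumptions for illustration, and the PR itself performs this exchange with collective communication inside `shard_map`.

```python
import numpy as np

def ulysses_all_to_all(seq_sharded):
    """Simulate the sequence-to-head exchange over a list of per-device arrays.

    Input: one [seq_len / U, num_heads, head_dim] array per device.
    Output: one [seq_len, num_heads / U, head_dim] array per device.
    """
    num_devices = len(seq_sharded)
    num_heads = seq_sharded[0].shape[1]
    assert num_heads % num_devices == 0, "heads must divide evenly across Ulysses shards"
    heads_per_device = num_heads // num_devices
    out = []
    for dst in range(num_devices):
        head_slice = slice(dst * heads_per_device, (dst + 1) * heads_per_device)
        # Device `dst` gathers its head group from every source device, in sequence order.
        out.append(np.concatenate([src[:, head_slice, :] for src in seq_sharded], axis=0))
    return out

# Assumed toy shapes: 8 tokens, 4 heads, head_dim 2, 2 Ulysses shards.
seq_len, num_heads, head_dim, num_devices = 8, 4, 2, 2
full = np.arange(seq_len * num_heads * head_dim).reshape(seq_len, num_heads, head_dim)
seq_sharded = np.split(full, num_devices, axis=0)

head_sharded = ulysses_all_to_all(seq_sharded)
print(head_sharded[0].shape)  # (8, 2, 2)
```

After the exchange each device holds the whole sequence for a subset of heads, so attention over those heads needs no further communication along the Ulysses axis.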
5c4d053 to 14b997a
14b997a to a219866
This PR introduces a robust implementation of the ulysses_ring attention mode, effectively combining Ulysses all-to-all head sharding with Ring-based KV rotation for WAN models. The implementation is well-architected, leveraging internal mesh reshaping and shard_map to provide a clean public interface while handling complex 2D parallelism internally.
🔍 General Feedback
- High Quality Implementation: The use of internal axes (`ring`, `ulysses`) within `shard_map` is a clever and effective way to manage 2D parallelism.
- Comprehensive Testing: The added unit tests are thorough and cover important edge cases, including global sequence padding and validation logic.
- Plumbing Consistency: The configuration plumbing is consistently applied across all WAN model variants and pipelines.
- Safety: The fallback mechanism for cross-attention and the rigorous validation of shard divisibility ensure the stability of the new feature.
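The ring half can be illustrated in the same spirit. The sketch below simulates ring KV rotation in plain NumPy for a single head with non-causal attention, merging per-block results with a running softmax. It is an assumption-laden illustration (the function names and toy sizes are hypothetical), not the tokamax kernel used in the PR.

```python
import numpy as np

def reference_attention(q, k, v):
    """Plain softmax attention over the full sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (probs / probs.sum(axis=-1, keepdims=True)) @ v

def ring_attention(q_shards, k_shards, v_shards):
    """Simulate ring KV rotation: each rank sees one KV block per step."""
    num_ranks = len(q_shards)
    outputs = []
    for rank, q in enumerate(q_shards):
        running_max = np.full(q.shape[0], -np.inf)
        running_sum = np.zeros(q.shape[0])
        acc = np.zeros((q.shape[0], v_shards[0].shape[-1]))
        for step in range(num_ranks):
            src = (rank + step) % num_ranks  # KV block visible to this rank at this step
            scores = q @ k_shards[src].T / np.sqrt(q.shape[-1])
            new_max = np.maximum(running_max, scores.max(axis=-1))
            rescale = np.exp(running_max - new_max)  # evaluates to 0 on the first step
            probs = np.exp(scores - new_max[:, None])
            running_sum = running_sum * rescale + probs.sum(axis=-1)
            acc = acc * rescale[:, None] + probs @ v_shards[src]
            running_max = new_max
        outputs.append(acc / running_sum[:, None])
    return np.concatenate(outputs, axis=0)

# Assumed toy sizes: 8 tokens, head_dim 4, 4 ring shards.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
ring_out = ring_attention(np.split(q, 4), np.split(k, 4), np.split(v, 4))
print(np.allclose(ring_out, reference_attention(q, k, v)))
```

Because the running maximum and sum are rescaled at every step, the order in which KV blocks arrive does not change the result, which is what lets each rank start from its own block.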
    for ulysses_ring_attention_axis_rule in ULYSSES_RING_ATTENTION_AXIS_RULES:
      if ulysses_ring_attention_axis_rule not in logical_axis_rules:
        max_logging.log(f"Adding ulysses ring attention axis rule {ulysses_ring_attention_axis_rule}")
        new_rules.append(ulysses_ring_attention_axis_rule)
    if attention_kernel == "tokamax_ring" and not is_self_attention:
      attention_kernel = "tokamax_flash" # do not use ring attention for cross attention
    if attention_kernel in ("tokamax_ring", "ulysses_ring") and not is_self_attention:
      attention_kernel = "tokamax_flash"
    kv_ring_indices = kv_indices // kv_padded_len
    kv_global_indices = kv_ring_indices * key_seq_len + kv_ring_offsets
    kv_valid = (kv_ring_offsets < key_seq_len) & (kv_global_indices < orig_kv_seq_len)
    mask = tokamax_splash_attention_mask.NumpyMask(q_valid[:, None] & kv_valid[None, :])
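The index arithmetic in this snippet can be traced with small numbers. The sketch below is a plain NumPy walk-through under two assumptions that the diff does not show: that `kv_ring_offsets` is `kv_indices % kv_padded_len`, and that each ring shard's `key_seq_len` tokens are padded up to `kv_padded_len` before the shards are laid end to end. It stops at the boolean array and does not construct a `NumpyMask`.

```python
import numpy as np

# Assumed toy sizes: 2 ring shards, 3 tokens per shard padded to 4, 5 real tokens overall.
num_ring_shards, key_seq_len, kv_padded_len, orig_kv_seq_len = 2, 3, 4, 5

kv_indices = np.arange(num_ring_shards * kv_padded_len)
kv_ring_indices = kv_indices // kv_padded_len   # which ring shard a column belongs to
kv_ring_offsets = kv_indices % kv_padded_len    # assumption: offset within that shard
kv_global_indices = kv_ring_indices * key_seq_len + kv_ring_offsets

# A column is valid only if it is not per-shard block padding
# and its global position lies inside the original sequence.
kv_valid = (kv_ring_offsets < key_seq_len) & (kv_global_indices < orig_kv_seq_len)

print(kv_global_indices)     # [0 1 2 3 3 4 5 6]
print(kv_valid.astype(int))  # [1 1 1 0 1 1 0 0]

# Combining with a query-side validity vector gives the 2D mask shape used in the diff.
q_valid = np.array([True, True, False])
mask = q_valid[:, None] & kv_valid[None, :]
print(mask.shape)  # (3, 8)
```

In this example exactly `orig_kv_seq_len` columns remain valid, with the per-shard padding slot and the global tail padding both masked out.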
Description
This PR adds support for a new `ulysses_ring` attention mode for WAN models. The implementation keeps the public sequence sharding on the existing `context` mesh axis, then internally reshapes that axis into private `ring` and `ulysses` axes so the attention path can combine Ulysses all-to-all head sharding with ring-based KV rotation.

Changes
- `ulysses_ring` attention kernel registration and routing.
- Internal reshaping of `context` into hidden `ring` and `ulysses` axes.
- `ulysses_shards` config plumbing through WAN pipeline, WAN transformer blocks, and attention ops.
- … `ulysses_ring` support and add `ulysses_shards`.

Testing
- Tests in `src/maxdiffusion/tests/attention_test.py` for `ulysses_ring` behavior and validation.
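As a rough illustration of the axis split described above, the sketch below factors a flat list of `context` devices into a 2D `ring` × `ulysses` grid and rejects shard counts that do not divide evenly. The helper `split_context_axis` is hypothetical, and the choice of `ring` as the major axis is an assumption; the PR's actual ordering and validation messages may differ.

```python
import numpy as np

def split_context_axis(context_devices, ulysses_shards):
    """Reshape a flat context axis into a (ring, ulysses) grid.

    Hypothetical helper: assumes ring is the major axis and ulysses the minor one.
    """
    context_size = len(context_devices)
    if ulysses_shards < 1 or context_size % ulysses_shards != 0:
        raise ValueError(
            f"ulysses_shards={ulysses_shards} must be positive and divide "
            f"the context axis size {context_size}"
        )
    ring_shards = context_size // ulysses_shards
    return np.asarray(context_devices).reshape(ring_shards, ulysses_shards)

# Integers stand in for devices: 8 context devices with ulysses_shards=2.
mesh = split_context_axis(list(range(8)), ulysses_shards=2)
print(mesh.shape)  # (4, 2)
```

With this layout the public configuration still names only `context`, while the attention path can address the two private axes separately.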