Speed test 0519#1809

Open
HAOCHENYE wants to merge 26 commits into main from speed_test_0519

Conversation

@HAOCHENYE
Collaborator

No description provided.

CyCle1024 and others added 26 commits May 19, 2026 06:52
The H2D copy of tokens_per_expert can trigger a CPU sync and delay the second micro batch dispatch launch.

In the intended overlap timeline, batch 2 dispatch should be ready once batch 1 dispatch finishes and batch 2 attention forward completes. Instead it is held until after the first permute finishes, and compile activity can slip in between and push the actual launch even later.

Add split Qwen3.5 MoE text/model configs and export the new compose config so XTuner can work with per-expert checkpoints directly.

Also add a converter that rewrites HuggingFace fused expert safetensors into split tensors while preserving shard grouping, which improves checkpoint save/load efficiency for the Qwen3.5 MoE path.

Align the MTPLayer/MTPBlock forward signatures with MoEDecoderLayer so that MTP
can be invoked with a list of hidden states / seq_ctx / position_embeddings.
The multi-micro-batch path issues a single underlying decoder forward per MTP
depth, allowing the MoE EP dispatch/combine to overlap across micro-batches.
Update the moe.py domino-EP path to call the MTP block once across all
micro-batches instead of looping per micro-batch.

Wrap the Qwen3VLVisionModel block loop with async_save_on_cpu under the
XTUNER_ACTIVATION_OFFLOAD switch, mirroring InternS1VisionEncoder.

…E forward

torch.cat(..., dim=1) was a text-only assumption (2D position_ids
[batch, seq]); Qwen3-VL M-RoPE uses 3D position_ids [axes, batch, seq],
where dim=1 is batch and the seq dim is dim=-1. The hard-coded dim=1
produced wrong-length cos/sin under intra_layer_micro_batch>1, surfacing
as a shape mismatch in apply_rotary_emb.
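
The dimension mix-up described above can be illustrated with plain shape arithmetic. This is a minimal sketch, not the patched code: `cat_shape` is a hypothetical helper that mimics how a concatenation along `dim` changes a shape, and the sizes (one batch row, 128 tokens, 3 M-RoPE axes) are made up for illustration.

```python
def cat_shape(shapes, dim):
    """Return the shape obtained by concatenating arrays of `shapes` along `dim`."""
    rank = len(shapes[0])
    axis = dim % rank  # normalize negative dims the way tensor libraries do
    out = list(shapes[0])
    for shape in shapes[1:]:
        if len(shape) != rank:
            raise ValueError("all inputs must have the same rank")
        for i in range(rank):
            # every dimension except the concat axis has to match
            if i != axis and shape[i] != out[i]:
                raise ValueError(f"size mismatch on dim {i}")
        out[axis] += shape[axis]
    return tuple(out)


# Text-only position_ids are [batch, seq]: dim=1 and dim=-1 are the same axis.
text_ids = [(1, 128), (1, 128)]
text_dim1 = cat_shape(text_ids, 1)
text_last = cat_shape(text_ids, -1)

# M-RoPE position_ids are [axes, batch, seq]: dim=1 is now the batch axis.
mrope_ids = [(3, 1, 128), (3, 1, 128)]
mrope_dim1 = cat_shape(mrope_ids, 1)   # grows batch, seq length stays 128
mrope_last = cat_shape(mrope_ids, -1)  # grows seq, which is what RoPE needs
```

With two micro-batches, concatenating along `dim=1` leaves the sequence length at 128 for the 3D layout, while `dim=-1` yields the expected 256 in both layouts.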
1. Only trace the Python stack
2. Set MEMORY_SNAPSHOT_MAX_ENTRIES to 10000000
3. Profile data preparation and the optimizer step
…hard_after_forward for MoE embed

Add a per-model `fully_shard` flag to `Qwen3VLVisionConfig` and `Qwen3VLProjectorConfig`
that controls whether the corresponding submodule is wrapped with FSDP at all
(previously the flag controlled `reshard_after_forward`). When the flag is False
the module stays unsharded and the related layer/projector prefetch wiring is
skipped, since FSDP prefetch APIs only apply to FSDP-wrapped modules.

Also keep the MoE embed_tokens `reshard_after_forward` configurable via
`MoEConfig.embed_reshard_after_forward`: compose models call `embed_tokens`
multiple times per step, so the default avoids repeated all-gathers.

`SequenceContext.to(device)` used to move the full `pixel_values` tensor
to GPU before any sequence-parallel split. In image/video-heavy batches
this caused a single 4-5 GB H2D per step, hurting both peak memory and
step time.

Keep `pixel_values` on CPU in `SequenceContext.to`. The Qwen3-VL vision
model now splits the full tensor along the SP dimension first and only
moves the local rank's slice to device. `split_for_sequence_parallel`
gains an optional `split_size` parameter so callers can drive the split
with a model-aware target length and handle padding themselves.
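
The "split first, then move only the local slice" idea can be sketched in a few lines. This is an assumption-laden illustration: `local_slice_bounds` is a hypothetical helper, not the signature of `split_for_sequence_parallel`, and a Python list stands in for the CPU-resident `pixel_values` tensor.

```python
def local_slice_bounds(total_len, sp_size, sp_rank, split_size=None):
    """Return (start, end) of one rank's slice along the split dimension.

    split_size is an optional per-rank target length; when omitted, the
    length is divided evenly (rounded up) across the sp_size ranks.
    """
    if split_size is None:
        split_size = -(-total_len // sp_size)  # ceiling division
    start = min(sp_rank * split_size, total_len)
    end = min(start + split_size, total_len)
    return start, end


# Stand-in for the full tensor that stays on the CPU.
pixel_values = list(range(10))

# Only this slice would be copied to the device on SP rank 3 of 4.
start, end = local_slice_bounds(len(pixel_values), sp_size=4, sp_rank=3)
local_rows = pixel_values[start:end]
```

The last rank can end up with a short slice, which is where caller-side padding would come in.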
Cache pinned CPU buffers in OffloadManager keyed by (group, block, tensor_idx)
so repeated iterations reuse the same buffer and avoid pin_memory reallocation.
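
A minimal sketch of the keyed buffer cache, assuming nothing about the real `OffloadManager` internals: `PinnedBufferCache` and its `allocate` callback are hypothetical names, and `bytearray` stands in for a pinned allocation (in a real setup the callback might be something like `torch.empty(n, dtype=torch.uint8, pin_memory=True)`).

```python
class PinnedBufferCache:
    """Reuse one buffer per (group, block, tensor_idx) key across iterations."""

    def __init__(self, allocate):
        self._allocate = allocate  # callable: nbytes -> buffer
        self._buffers = {}
        self.allocations = 0       # counts how often we really allocated

    def get(self, group, block, tensor_idx, nbytes):
        key = (group, block, tensor_idx)
        buf = self._buffers.get(key)
        if buf is None or len(buf) != nbytes:
            # first use of this key, or the size changed: allocate once
            buf = self._allocate(nbytes)
            self._buffers[key] = buf
            self.allocations += 1
        return buf


cache = PinnedBufferCache(allocate=bytearray)
first = cache.get("group0", 0, 0, 1024)   # iteration 1: allocates
second = cache.get("group0", 0, 0, 1024)  # iteration 2: cache hit
other = cache.get("group0", 1, 0, 1024)   # different block: new buffer
```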
…_grad

Bucket replicated DTensor gradients by their flattened process group and issue
one coalesced all_reduce per group via dist._coalescing_manager instead of one
NCCL launch per parameter. Reduces collective launch overhead for models with
many small replicated tensors.
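
The bucketing step can be sketched without any distributed runtime. In this illustration `reduce_coalesced` and the `all_reduce_many` callback are hypothetical stand-ins; the commit itself relies on `dist._coalescing_manager`, which is not called here.

```python
from collections import defaultdict


def reduce_coalesced(grads, all_reduce_many):
    """Group (group_key, grad) pairs by process group and reduce each bucket once.

    Returns the number of collective launches that were issued.
    """
    buckets = defaultdict(list)
    for group_key, grad in grads:
        buckets[group_key].append(grad)
    for group_key, bucket in buckets.items():
        all_reduce_many(group_key, bucket)  # one launch for the whole bucket
    return len(buckets)


calls = []
grads = [("dp", [1.0]), ("dp", [2.0]), ("ep", [3.0]), ("dp", [4.0])]
launches = reduce_coalesced(grads, lambda key, bucket: calls.append((key, len(bucket))))
```

Four gradients spread over two groups result in two launches instead of four.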
Older torch versions hit a fullgraph compile failure on the projector forward,
so previously only the vision layer was compiled. Gate the projector entry on
Version(torch.__version__) >= 2.9.1 and fall back to DEFAULT_FLOAT8_CFG only
for older runtimes.

`packaging` is imported by sphinx before `autodoc_mock_imports` takes effect,
so a module-level gate like `Version(torch.__version__) >= Version("2.9.1")`
was handing a `_MockObject` to the real `Version()` and raising `TypeError`.
Patch `_MockModule.__getattr__` and `_MockObject.__getattr__` to return a
parseable "0.0.0" placeholder for `__version__`. Also mock `causal_conv1d_cuda`
which has the same import-order problem.
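
The effect of the `__getattr__` patch can be reproduced with a toy mock. This sketch does not touch Sphinx: `MockObject` is a simplified stand-in for Sphinx's mock classes, and `parse_version` is a hypothetical tuple-based parser used in place of `packaging.version.Version` to keep the example dependency-free.

```python
class MockObject:
    """Toy mock: every attribute resolves to another mock, except __version__."""

    def __getattr__(self, name):
        if name == "__version__":
            return "0.0.0"  # parseable placeholder instead of a mock
        return MockObject()


def parse_version(text):
    """Parse a plain dotted version such as '2.9.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


mocked_torch = MockObject()

# A module-level gate like the one in the commit message now evaluates
# cleanly under the mock and simply takes the "older runtime" branch.
projector_compile_enabled = parse_version(mocked_torch.__version__) >= (2, 9, 1)
```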

Add MTP loss context and SP support design notes under docs/design/.

Move per-dataset loading/sampling messages from info to debug so the
stderr stream stays focused on training progress, while the rank log
file still captures the full detail by always enabling DEBUG. Also
include the dataset path in the damaged/over-length sample warnings to
make filtering issues easier to trace.

`Trainer._print_training_config` dumps the full training config to the
rank-0 log; with large dataset_cfg/dataloader_cfg payloads the printed
JSON grows huge and drowns out the model-related fields. Skip those two
sections and keep the model configuration readable.
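
A minimal sketch of the filtering, assuming the config is available as a plain dict. `render_training_config` is a hypothetical helper, not the body of `Trainer._print_training_config`; the two skipped keys are the ones named above.

```python
import json

SKIPPED_SECTIONS = ("dataset_cfg", "dataloader_cfg")


def render_training_config(config):
    """Serialize the config to JSON without the bulky data-related sections."""
    visible = {key: value for key, value in config.items() if key not in SKIPPED_SECTIONS}
    return json.dumps(visible, indent=2, sort_keys=True)


rendered = render_training_config(
    {
        "model_cfg": {"hidden_size": 8},
        "dataset_cfg": [{"path": "a.jsonl"}, {"path": "b.jsonl"}],
        "dataloader_cfg": {"num_workers": 4},
    }
)
```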
Many config-time warnings, env detection notices, deprecation messages
and one-shot startup info logs were emitted by every distributed rank,
duplicating identical content across stderr and per-rank log files.
Introduce a small `log_rank0` proxy in `xtuner/v1/utils/logger.py` so
callers can opt in to "emit only on rank 0" with a single call, instead
of scattering `if get_rank() == 0:` guards at every site. Wraps loguru's
logger and uses `opt(depth=1)` to keep caller info pointing at the real
site. Migrate ~90 existing rank-symmetric `logger.info`/`logger.warning`
calls across datasets, engine, trainer, model, profiler and utils to use
`log_rank0`. Per-sample, per-rank or per-process logs (NaN detection,
per-sample tokenize errors, per-rank profiler stats, JsonlWriter errors,
the per-step training metrics line) are left untouched.
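
The proxy pattern can be sketched with the standard library. Note the swap: this uses `logging` and its `stacklevel` argument rather than loguru's `opt(depth=1)`, and `Rank0Logger` is a hypothetical class, not the `log_rank0` object from `xtuner/v1/utils/logger.py`. The rank is passed in explicitly here; a real setup would query the distributed runtime.

```python
import logging


class Rank0Logger:
    """Forward log calls to the wrapped logger only when rank == 0."""

    def __init__(self, logger, rank):
        self._logger = logger
        self._rank = rank

    def _emit(self, level, msg, *args):
        if self._rank == 0:
            # stacklevel=3 skips _emit and info()/warning(), so the record
            # carries the file and line of the real call site
            self._logger.log(level, msg, *args, stacklevel=3)

    def info(self, msg, *args):
        self._emit(logging.INFO, msg, *args)

    def warning(self, msg, *args):
        self._emit(logging.WARNING, msg, *args)


class ListHandler(logging.Handler):
    """Collect formatted messages so the behavior is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append((record.funcName, record.getMessage()))


handler = ListHandler()
base = logging.getLogger("rank0_demo")
base.setLevel(logging.INFO)
base.propagate = False
base.addHandler(handler)


def startup():
    Rank0Logger(base, rank=0).info("loaded %d datasets", 3)      # emitted
    Rank0Logger(base, rank=1).warning("loaded %d datasets", 3)   # dropped


startup()
```

Call sites stay a single line, and the guard lives in one place instead of being repeated as `if get_rank() == 0:`.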
3 participants