Speed test 0519 by HAOCHENYE · Pull Request #1809 · InternLM/xtuner

HAOCHENYE · 2026-05-19T10:05:11Z

No description provided.

The H2D copy of tokens_per_expert can trigger a CPU sync and delay the second micro batch dispatch launch. In the intended overlap timeline, batch 2 dispatch should be ready once batch 1 dispatch finishes and batch 2 attention forward completes. Instead it is held until after the first permute finishes, and compile activity can slip in between and push the actual launch even later.

Add split Qwen3.5 MoE text/model configs and export the new compose config so XTuner can work with per-expert checkpoints directly. Also add a converter that rewrites HuggingFace fused expert safetensors into split tensors while preserving shard grouping, which improves checkpoint save/load efficiency for the Qwen3.5 MoE path.
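A minimal sketch of the fused-to-split rewrite described above, using nested lists in place of real safetensors tensors. The key naming (`...experts.<name>` becoming `...experts.<idx>.<name>`), the shard file names, and the helper `split_fused_experts` are hypothetical and chosen only for illustration; the actual converter and checkpoint layout may differ.

```python
def split_fused_experts(shards):
    """Rewrite each shard so fused expert tensors become per-expert entries.

    `shards` maps a shard file name to a {tensor_name: value} dict. A fused
    tensor is modelled as a list whose first axis indexes the experts.
    Every split tensor stays in the shard its fused parent came from, so the
    original shard grouping is preserved.
    """
    fused_marker = ".experts."  # hypothetical naming convention
    out = {}
    for shard_name, tensors in shards.items():
        new_tensors = {}
        for name, value in tensors.items():
            if fused_marker in name:
                prefix, suffix = name.split(fused_marker, 1)
                for idx, expert_weight in enumerate(value):
                    new_tensors[f"{prefix}.experts.{idx}.{suffix}"] = expert_weight
            else:
                # Non-expert tensors are copied through unchanged.
                new_tensors[name] = value
        out[shard_name] = new_tensors
    return out


shards = {
    "model-00001.safetensors": {
        "layers.0.mlp.experts.down_proj": [[1.0, 2.0], [3.0, 4.0]],
        "layers.0.norm.weight": [1.0],
    },
    "model-00002.safetensors": {
        "layers.1.mlp.experts.down_proj": [[5.0, 6.0], [7.0, 8.0]],
    },
}
split = split_fused_experts(shards)
```

Because each output shard is derived only from its own input shard, the conversion can run shard by shard without holding the whole checkpoint in memory.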

Align MTPLayer/MTPBlock forward signatures with MoEDecoderLayer so MTP can be invoked with a list of hidden states / seq_ctx / position_embeddings. The multi-micro-batch path issues a single underlying decoder forward per MTP depth, allowing the MoE EP dispatch/combine to overlap across micro-batches. Update the moe.py domino-EP path to call the MTP block once across all micro-batches instead of looping per micro-batch.

Wrap the Qwen3VLVisionModel block loop with async_save_on_cpu under the XTUNER_ACTIVATION_OFFLOAD switch, mirroring InternS1VisionEncoder.

…E forward

torch.cat(..., dim=1) was a text-only assumption (2D position_ids [batch, seq]); Qwen3-VL M-RoPE uses 3D position_ids [axes, batch, seq], where dim=1 is batch and the seq dim is dim=-1. The hard-coded dim=1 produced wrong-length cos/sin under intra_layer_micro_batch>1, surfacing as a shape mismatch in apply_rotary_emb.
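The dimension mix-up can be illustrated with plain shape arithmetic. This is a sketch only: `cat_shape` is a hypothetical helper that mimics how concatenation changes a shape, and the sequence length of 128 and the axes size of 3 are illustrative values, not taken from the PR.

```python
def cat_shape(shapes, dim):
    """Return the shape produced by concatenating tensors of `shapes` along `dim`."""
    rank = len(shapes[0])
    axis = dim % rank  # normalise negative dims, as tensor libraries do
    for shape in shapes:
        # All non-concatenated dimensions must match.
        assert len(shape) == rank
        assert all(a == b for i, (a, b) in enumerate(zip(shape, shapes[0])) if i != axis)
    result = list(shapes[0])
    result[axis] = sum(shape[axis] for shape in shapes)
    return tuple(result)


seq = 128
text_ids = [(1, seq), (1, seq)]         # 2D position_ids: [batch, seq]
mrope_ids = [(3, 1, seq), (3, 1, seq)]  # 3D position_ids: [axes, batch, seq]

text_dim1 = cat_shape(text_ids, dim=1)     # seq is dim 1 here, so seq grows
mrope_dim1 = cat_shape(mrope_ids, dim=1)   # batch grows, seq does not
mrope_last = cat_shape(mrope_ids, dim=-1)  # seq grows as intended
text_last = cat_shape(text_ids, dim=-1)    # dim=-1 is also correct for the 2D layout
```

Concatenating along `dim=-1` gives the intended sequence length for both layouts, which is why it is the safer choice when the same code path serves text-only and M-RoPE inputs.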

1. Only trace the Python stack
2. Set MEMORY_SNAPSHOT_MAX_ENTRIES to 10000000
3. Profile data preparation and optimizer step progress

…hard_after_forward for MoE embed

Add a per-model `fully_shard` flag to `Qwen3VLVisionConfig` and `Qwen3VLProjectorConfig` that controls whether the corresponding submodule is wrapped with FSDP at all (previously the flag controlled `reshard_after_forward`). When the flag is False the module stays unsharded and the related layer/projector prefetch wiring is skipped, since FSDP prefetch APIs only apply to FSDP-wrapped modules. Also keep the MoE embed_tokens `reshard_after_forward` configurable via `MoEConfig.embed_reshard_after_forward`: compose models call `embed_tokens` multiple times per step, so the default avoids repeated all-gathers.

`SequenceContext.to(device)` used to move the full `pixel_values` tensor to GPU before any sequence-parallel split. In image/video-heavy batches this caused a single 4-5 GB H2D per step, hurting both peak memory and step time. Keep `pixel_values` on CPU in `SequenceContext.to`. The Qwen3-VL vision model now splits the full tensor along the SP dimension first and only moves the local rank's slice to device. `split_for_sequence_parallel` gains an optional `split_size` parameter so callers can drive the split with a model-aware target length and handle padding themselves.
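The "split first, then move only the local slice" idea can be sketched with plain lists. The helpers `split_local_slice` and `pad_to` are hypothetical and do not reproduce the signature of `split_for_sequence_parallel`; the sizes are illustrative, and the host-to-device copy itself is left out.

```python
def split_local_slice(rows, sp_rank, sp_size, split_size=None):
    """Return the rows owned by `sp_rank`.

    Without `split_size`, fall back to an even split. With `split_size`, the
    caller is expected to have padded `rows` to split_size * sp_size already.
    """
    if split_size is None:
        if len(rows) % sp_size:
            raise ValueError("rows must divide evenly across sp_size")
        split_size = len(rows) // sp_size
    start = sp_rank * split_size
    return rows[start:start + split_size]


def pad_to(rows, target_len, pad_row):
    """Append copies of `pad_row` until `rows` has `target_len` entries."""
    return rows + [pad_row] * (target_len - len(rows))


sp_size = 4
pixel_rows = [[float(i)] for i in range(10)]  # full "pixel_values", kept on the host
split_size = 3                                # per-rank length chosen by the caller
padded = pad_to(pixel_rows, split_size * sp_size, pad_row=[0.0])

# Each rank would copy only its own slice to the device.
local_slices = [split_local_slice(padded, rank, sp_size, split_size) for rank in range(sp_size)]
```

With this ordering, each rank transfers roughly 1/sp_size of the tensor instead of the full batch.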

Cache pinned CPU buffers in OffloadManager keyed by (group, block, tensor_idx) so repeated iterations reuse the same buffer and avoid pin_memory reallocation.
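A minimal sketch of such a keyed buffer cache, with a `bytearray` standing in for a pinned host tensor. `BufferCache` is a hypothetical class, not the OffloadManager implementation, and reallocating when the requested size changes is an assumption made here for safety.

```python
class BufferCache:
    """Reuse one host buffer per (group, block, tensor_idx) key."""

    def __init__(self):
        self._buffers = {}
        self.allocations = 0  # counts how often a new buffer had to be created

    def get(self, group, block, tensor_idx, nbytes):
        key = (group, block, tensor_idx)
        buf = self._buffers.get(key)
        if buf is None or len(buf) != nbytes:
            # Stand-in for an expensive pinned-memory allocation.
            buf = bytearray(nbytes)
            self._buffers[key] = buf
            self.allocations += 1
        return buf


cache = BufferCache()
for _iteration in range(3):            # three training iterations
    for block in range(2):             # two offloaded blocks
        for tensor_idx in range(2):    # two saved tensors per block
            cache.get(group=0, block=block, tensor_idx=tensor_idx, nbytes=1024)
```

Only the first iteration allocates; later iterations hit the cache as long as the tensor sizes stay the same.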

…_grad

Bucket replicated DTensor gradients by their flattened process group and issue one coalesced all_reduce per group via dist._coalescing_manager instead of one NCCL launch per parameter. Reduces collective launch overhead for models with many small replicated tensors.
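The bucketing step can be sketched without any distributed backend. Here `launch` is a hypothetical callback standing in for one coalesced collective, process groups are represented by plain strings, and the parameter names are made up for illustration.

```python
from collections import defaultdict


def bucket_by_group(grads):
    """Group (param_name, process_group) pairs by process group."""
    buckets = defaultdict(list)
    for name, group in grads:
        buckets[group].append(name)
    return dict(buckets)


def reduce_coalesced(grads, launch):
    """Issue one launch per process group and return the launch count."""
    buckets = bucket_by_group(grads)
    for group, names in buckets.items():
        launch(group, names)  # one collective covers every gradient in the bucket
    return len(buckets)


grads = [
    ("norm.weight", "ep_group"),
    ("gate.weight", "ep_group"),
    ("gate.bias", "ep_group"),
    ("embed.scale", "tp_group"),
    ("head.scale", "tp_group"),
]
launches = []
num_launches = reduce_coalesced(grads, lambda group, names: launches.append((group, names)))
```

Five replicated gradients result in two launches instead of five, which is where the saving in launch overhead comes from.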

Older torch versions hit a fullgraph compile failure on the projector forward, so previously only the vision layer was compiled. Gate the projector entry on Version(torch.__version__) >= 2.9.1 and fall back to DEFAULT_FLOAT8_CFG only for older runtimes.

`packaging` is imported by sphinx before `autodoc_mock_imports` takes effect, so a module-level gate like `Version(torch.__version__) >= Version("2.9.1")` was handing a `_MockObject` to the real `Version()` and raising `TypeError`. Patch `_MockModule.__getattr__` and `_MockObject.__getattr__` to return a parseable "0.0.0" placeholder for `__version__`. Also mock `causal_conv1d_cuda` which has the same import-order problem. Add MTP loss context and SP support design notes under docs/design/.
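The failure mode and the placeholder fix can be sketched with stand-ins. `UnpatchedMock`, `PatchedMock`, and `parse_version` are hypothetical: the mocks imitate an object that returns another mock for any attribute, and `parse_version` is a tiny substitute for `packaging.version.Version` that, like the real class, rejects non-string input with `TypeError`.

```python
class UnpatchedMock:
    """Any attribute access yields another mock, including __version__."""

    def __getattr__(self, name):
        return UnpatchedMock()


class PatchedMock:
    """Same as above, but __version__ resolves to a parseable placeholder."""

    def __getattr__(self, name):
        if name == "__version__":
            return "0.0.0"
        return PatchedMock()


def parse_version(version):
    """Parse a dotted numeric version string into a comparable tuple."""
    if not isinstance(version, str):
        raise TypeError(f"expected str, got {type(version).__name__}")
    return tuple(int(part) for part in version.split("."))


def projector_compile_enabled(torch_module):
    """Module-level style gate: compile the projector only on new enough torch."""
    return parse_version(torch_module.__version__) >= parse_version("2.9.1")


try:
    projector_compile_enabled(UnpatchedMock())
    unpatched_error = None
except TypeError as exc:
    unpatched_error = exc

patched_result = projector_compile_enabled(PatchedMock())
```

With the placeholder in place the gate evaluates to `False` during the docs build instead of raising at import time.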

Move per-dataset loading/sampling messages from info to debug so the stderr stream stays focused on training progress, while the rank log file still captures the full detail by always enabling DEBUG. Also include the dataset path in the damaged/over-length sample warnings to make filtering issues easier to trace.

`Trainer._print_training_config` dumps the full training config to the rank-0 log; with large dataset_cfg/dataloader_cfg payloads the printed JSON grows huge and drowns out the model-related fields. Skip those two sections and keep the model configuration readable.
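A minimal sketch of the filtering, assuming the config is available as a plain dict. `format_training_config` and the sample values are hypothetical; only the two skipped section names come from the description above.

```python
import json

SKIPPED_SECTIONS = ("dataset_cfg", "dataloader_cfg")


def format_training_config(config):
    """Serialize the config for logging, leaving out the bulky data sections."""
    visible = {key: value for key, value in config.items() if key not in SKIPPED_SECTIONS}
    return json.dumps(visible, indent=2, sort_keys=True)


config = {
    "model_cfg": {"num_layers": 2},                # illustrative values
    "dataset_cfg": [{"path": "a.jsonl"}] * 1000,   # large payload that would flood the log
    "dataloader_cfg": {"pack_max_length": 4096},
}
printed = format_training_config(config)
```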

Many config-time warnings, env detection notices, deprecation messages and one-shot startup info logs were emitted by every distributed rank, duplicating identical content across stderr and per-rank log files. Introduce a small `log_rank0` proxy in `xtuner/v1/utils/logger.py` so callers can opt in to "emit only on rank 0" with a single call, instead of scattering `if get_rank() == 0:` guards at every site. Wraps loguru's logger and uses `opt(depth=1)` to keep caller info pointing at the real site. Migrate ~90 existing rank-symmetric `logger.info`/`logger.warning` calls across datasets, engine, trainer, model, profiler and utils to use `log_rank0`. Per-sample, per-rank or per-process logs (NaN detection, per-sample tokenize errors, per-rank profiler stats, JsonlWriter errors, the per-step training metrics line) are left untouched.
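The proxy pattern can be sketched with the standard library. This is not the `log_rank0` implementation: it uses stdlib `logging` with `stacklevel=2` as an analogue of loguru's `opt(depth=1)`, reads the rank from the `RANK` environment variable (as exported by torchrun-style launchers), and the class name `Rank0Logger` is hypothetical.

```python
import logging
import os


class Rank0Logger:
    """Forward log calls to the wrapped logger only on rank 0."""

    def __init__(self, logger, rank=None):
        self._logger = logger
        # Default to rank 0 when RANK is unset, e.g. in single-process runs.
        self._rank = int(os.environ.get("RANK", "0")) if rank is None else rank

    def info(self, msg, *args):
        if self._rank == 0:
            # stacklevel=2 attributes the record to the caller, not this proxy.
            self._logger.info(msg, *args, stacklevel=2)

    def warning(self, msg, *args):
        if self._rank == 0:
            self._logger.warning(msg, *args, stacklevel=2)


class ListHandler(logging.Handler):
    """Collect records in memory so the behaviour can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


demo_logger = logging.getLogger("rank0_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
handler = ListHandler()
demo_logger.addHandler(handler)

Rank0Logger(demo_logger, rank=0).info("config loaded from %s", "train.py")
Rank0Logger(demo_logger, rank=1).info("config loaded from %s", "train.py")
```

The same call site emits once across all ranks, and no `if get_rank() == 0:` guard is needed at the caller.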

CyCle1024 and others added 26 commits May 19, 2026 06:52

tmp success in full graph

f73c7c5

[Fix] Cache SP rank in SequenceContext for torch.compile compatibility

45e50e2

domino ep intralayer

de738a9

use xtuner library name

62326fe

[Feature] Add activation offload to Qwen3-VL vision encoder

6af9737

Wrap the Qwen3VLVisionModel block loop with async_save_on_cpu under the XTUNER_ACTIVATION_OFFLOAD switch, mirroring InternS1VisionEncoder.

[Enhance] Update the profile default behavior

8fd50d1

1. Only trace the Python stack
2. Set MEMORY_SNAPSHOT_MAX_ENTRIES to 10000000
3. Profile data preparation and optimizer step progress

fix: handle 0 token dispatched in MoEBlock when ep_size > 1

0a97be4

[Refactor] Migrate rope_scaling_cfg to rope_parameters_cfg

3c3e159

support fla v0.4.2 torch compile

9eebdd1

opt: use torch.searchsorted in gen_seq_idx

b532626

[Feature] Add reserve_pin_memory option to async_save_on_cpu

1acb085

Cache pinned CPU buffers in OffloadManager keyed by (group, block, tensor_idx) so repeated iterations reuse the same buffer and avoid pin_memory reallocation.

[Config] Disable FSDP wrap for Qwen3.5 397B vision/projector

30671a5

[Fix] fix deepep_op backward when rcv zero tokens (#1799)

975c017

[Fix] Fix rope theta bug

a997364

HAOCHENYE wants to merge 26 commits into main from speed_test_0519


3 participants
