
feat: delta compression for weight sync #1806

Open
nanjiangwill wants to merge 3 commits into main from delta-compression-feature

Conversation


@nanjiangwill nanjiangwill commented Apr 5, 2026

Inspiration / prior art

Summary

Add optional delta-compression for trainer → rollout-engine weight sync. When enabled, the trainer keeps a pinned-CPU baseline of each parameter, broadcasts only the sparse delta (bf16 packed indices + values) over NCCL, and the SGLang receiver decodes + applies the delta in large amortized batches.

At GLM-4.7-355B-A32B, non-colocated (4 actor + 4 rollout nodes, 64 rollout H100s), this cuts weight-sync time from ~50s (non-delta baseline, full 170 GB broadcast) to ~22.6s, a ~2.2× speedup.

For the colocated case we do not expect measurable gains: that path uses CUDA IPC (direct GPU memory sharing on the same node), not NCCL broadcast, so there is no wire to compress, and the delta compute would be pure overhead. This PR leaves the colocated CUDA-IPC path unchanged in behavior; the tracker integration in update_weight_from_tensor.py is present for correctness but is expected to be a no-op for colocated runs.

New CLI flags

  • --enable-delta-compression — master switch
  • --delta-compression-dtype {fp16, bf16, fp32} — delta dtype on the wire (default fp32)
  • --delta-compression-transport {dense, sparse_indices, sparse_bitmask} — on-wire encoding (default dense; recommended sparse_indices)
  • --delta-compression-full-sync-interval N — run a full sync every N successful delta syncs; step 0 is always full
  • --delta-compression-artifact-dir PATH — optional async zstd artifact writer (off by default)
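Taken together, a non-colocated run might enable the feature like this. The launch command below is illustrative only (the script name is a placeholder); the --delta-compression-* flags are the ones this PR adds, set to the values used in the 355B runs described later:

```shell
# Hypothetical launch command; "train.py" is a placeholder.
# Only the delta-compression flags come from this PR.
python train.py \
  --enable-delta-compression \
  --delta-compression-dtype bf16 \
  --delta-compression-transport sparse_indices \
  --delta-compression-full-sync-interval 10000
```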

Design choices

1. Sparse-indices transport

Deltas between consecutive RL steps are extremely sparse (~2-3% density at 355B). Sending them dense would waste ~97% of the wire. We pack non-zero (index, value) pairs into two flat tensors and a small metadata list describing per-param slice ranges. Compression ratio observed: ~30× on the wire (170 GB dense → ~5.9 GB sparse).
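As a rough sketch of the sparse_indices idea (function names are illustrative, not the PR's actual encoder, and the per-param slice-range metadata is omitted), the pack/apply pair looks like:

```python
import torch

def sparse_encode(delta: torch.Tensor, wire_dtype=torch.bfloat16):
    """Pack the non-zero entries of a flattened delta into (indices, values).

    Sketch only: the PR additionally carries a small metadata list of
    per-param slice ranges so many params share the two flat tensors.
    """
    flat = delta.reshape(-1)
    idx = flat.nonzero(as_tuple=True)[0]      # positions that changed this step
    return idx, flat[idx].to(wire_dtype)      # two flat tensors on the wire

def sparse_apply(idx: torch.Tensor, vals: torch.Tensor, param: torch.Tensor):
    """Receiver side: scatter the decoded delta back onto the parameter."""
    flat = param.view(-1)
    flat[idx] += vals.to(flat.dtype)
    return param
```

At ~2-3% density, the index tensor (int64) plus bf16 values is far smaller than the dense bf16 payload, which is where the ~30× wire reduction comes from.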

2. Secondary CUDA stream for baseline D2H

The trainer must keep a pinned-CPU baseline up to date. If D2H runs on the default stream, it serializes with delta compute and the next chunk's gather. We therefore route all baseline D2H through a dedicated _d2h_stream, recording a single event that the subsequent gather/convert waits on. Both full and delta paths share the same _snapshot_baseline_async helper (see commit "delta tracker: unify baseline D2H across full and delta paths" in history).
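The pattern is roughly the following sketch (the stream/helper names mirror the description above, but the code itself is illustrative, not the PR's implementation):

```python
import torch

_d2h_stream = None  # lazily created, dedicated D2H copy stream

def snapshot_baseline_async(gpu_param, pinned_cpu_baseline):
    """Queue a D2H copy of gpu_param on a dedicated stream.

    Returns a CUDA event the caller waits on before reusing/freeing the
    GPU buffer. Pinned CPU memory is what makes the copy truly async,
    letting it overlap delta compute and NCCL broadcast on other streams.
    """
    global _d2h_stream
    if _d2h_stream is None:
        _d2h_stream = torch.cuda.Stream()
    # Order the copy after whatever produced gpu_param on the current stream.
    _d2h_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(_d2h_stream):
        pinned_cpu_baseline.copy_(gpu_param, non_blocking=True)
        ev = torch.cuda.Event()
        ev.record(_d2h_stream)
    return ev
```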

3. Inline commit — no GPU refs held

We set baseline_updates=[] in DeltaCompressionCommitState: gathered GPU tensors are freed as soon as prepare_chunk returns, so the next bucket's gather has headroom. Prior iterations that held refs in commit state caused allocator stalls at 355B scale.

4. Receiver batch cap = 512 MiB

SGLang's model.load_weights(batch) call costs ~2 ms/call in name resolution + MoE expert remapping, so the receiver batches decoded deltas until they exceed 512 MiB before flushing to load_weights. Swept systematically at GLM-4.7-355B (H100, 64 rollout GPUs):

| Batch cap | load_calls per sync | Avg delta sync |
|---|---|---|
| 96 MiB | 1110 | 37.8s |
| 256 MiB | ~400 | ~32s |
| 512 MiB | 128 | 30.3s ← chosen |
| 1024 MiB | n/a | OOM on some engines |

512 MiB won decisively and is hardcoded with the sweep rationale inlined in the patch comment.

5. bf16 deltas with fp32 baseline

The baseline stays fp32 on CPU (accumulator), but the on-wire delta is bf16. Rounding drift is flushed periodically via --delta-compression-full-sync-interval — set to 10000 in practice (basically never) because the baseline is exact-refreshed each step anyway.
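A minimal sketch of the full-vs-delta decision and the per-step baseline refresh (names are hypothetical, and the real tracker works per-chunk, not on one flat tensor):

```python
import torch

def sync_step(step, new_weights_fp32, baseline_fp32,
              full_sync_interval=10000):
    """Return ("full", payload) or ("delta", payload) for one sync step.

    The baseline accumulator stays fp32 on CPU; only the payload is cast
    to bf16 for the wire. A forced full sync every N steps flushes any
    receiver-side bf16 rounding drift."""
    if step == 0 or step % full_sync_interval == 0:
        payload = ("full", new_weights_fp32.to(torch.bfloat16))
    else:
        delta = new_weights_fp32 - baseline_fp32
        payload = ("delta", delta.to(torch.bfloat16))  # bf16 on the wire
    baseline_fp32.copy_(new_weights_fp32)  # exact refresh every step
    return payload
```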

Experimental results

Primary benchmark — GLM-4.7-355B-A32B non-colocated (4+4 nodes, 64 H100 rollout GPUs)

| Phase | Time | Wire | Notes |
|---|---|---|---|
| Step 0 full sync | 66–83s | 182 GB dense | includes one-time ~15s pinned-CPU baseline allocation |
| Step 1+ delta sync | 22.6s | ~5.9 GB sparse | full_chunks=0, delta_chunks=598, transport=sparse_indices, batch_mib=512 |
| Periodic full sync (pre-refactor, commit 1c4935bf) | step 0 = 66.3s, step 2 = 69.7s (+3.4s over step 0) | 182 GB dense | D2H on default stream, held GPU refs in commit state; step 2 paid full D2H + broadcast serially |
| Periodic full sync (post-refactor, commit 7ed11a92, run #2) | step 0 = 63.3s, step 2 = 41.4s (−21.9s under step 0) | 182 GB dense | _snapshot_baseline_async moves D2H to the secondary stream; baseline_updates=[] frees gathered buffers. D2H now overlaps the NCCL broadcast, so step 2 runs at pure non-delta speed: ~28s faster than pre-refactor on comparable step-0 runtimes (69.7s → 41.4s) |
| Non-delta baseline (reference) | ~50s | 182 GB dense | for comparison |

Correctness validation

Ran a validation config (check_weight_update_equal=True, full_sync_interval=2) to exercise the full → delta → forced-full transitions against a snapshotted reference on each SGLang engine:

| Step | Path | Result |
|---|---|---|
| 0 | Full sync | weights_checker 200 OK on every engine; [check_tensors] lists equal tensors for every layer |
| 1 | Delta sync | weights_checker 200 OK on every engine (bitwise/tolerance match) |
| 2 | Forced full (interval=2) | weights_checker 200 OK on every engine |

Experiments conducted during development

| # | Experiment | Conclusion |
|---|---|---|
| 1 | Transport encoding: dense vs sparse_bitmask vs sparse_indices | sparse_indices wins; bitmask incurs unpack cost on the receiver |
| 2 | Baseline storage: GPU vs pinned CPU | Pinned CPU required at 355B (GPU can't hold the 170 GB baseline) |
| 3 | D2H placement: default stream vs secondary stream | Secondary stream hides ~7-10s of PCIe cost under NCCL |
| 4 | Commit state: hold GPU refs vs inline commit | Inline commit required to avoid memory pressure at 355B |
| 5 | Bucket size sweep (sender sparse): 500 MB / 1 GB / 5 GB / 10 GB / 20 GB | 5 GB sparse bucket chosen; performance plateaus beyond, and larger buckets bump peak memory |
| 6 | Receiver batch cap sweep: 96 / 256 / 512 / 1024 MiB | 512 MiB chosen (table above) |
| 7 | Full-sync-interval: 2, 100, 10000 | 10000 (effectively never); bf16 drift negligible with per-step inline baseline refresh |
| 8 | Correctness: check_weight_update_equal full→delta→forced-full | All paths produce matching weights on rollout engines |

Colocated case

The colocated weight-sync path (update_weight_from_tensor.py) uses CUDA IPC: the sender registers GPU memory and shares an IPC handle; the colocated SGLang engine maps that memory directly into its own process. There is no broadcast over the network — the "wire" is a pointer, not bytes.

Consequences:

  • Delta compression's main benefit (30× wire reduction) buys nothing here.
  • Delta compute itself (H2D baseline load + subtract + sparse encode + D2H) is pure overhead for colocated runs.
  • This PR still wires the tracker into update_weight_from_tensor.py for API consistency and so that future mixed colocated+distributed setups get consistent baseline semantics, but we do not claim any colocated perf gain, and users running fully colocated should leave --enable-delta-compression off.

Test plan

  • Non-colocated GLM-4.7-355B end-to-end: step 0 full, step 1+ delta, periodic full — measured and validated
  • Correctness via check_weight_update_equal / weights_checker: all three paths match reference
  • Receiver patch applies cleanly and all patched SGLang files compile (py_compile)
  • Smaller models (Qwen3-30B, Qwen3-VL) — sparse_indices expected to work the same; not benchmarked in this PR
  • Colocated runs — not expected to improve; leave flag off

🤖 Generated with Claude Code

@nanjiangwill nanjiangwill marked this pull request as draft April 5, 2026 04:33
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch 2 times, most recently from f220c86 to fb890b3 Compare April 18, 2026 01:00
Add delta-compression support for weight sync from trainer to rollout
engines. The trainer holds a pinned-CPU baseline, computes deltas on
GPU, snapshots the new baseline back to CPU on a secondary stream, and
sparse-encodes deltas for broadcast.

  - CLI flags in slime/utils/arguments.py:
      --enable-delta-compression
      --delta-compression-dtype {fp16, bf16, fp32}
      --delta-compression-transport {dense, sparse_indices, sparse_bitmask}
      --delta-compression-full-sync-interval N
      --delta-compression-artifact-dir PATH
  - DeltaCompressionTracker in delta_weight_update.py — owns baseline,
    issues async D2H on a secondary CUDA stream, computes deltas.
  - Shared helpers in common.py (ChunkUpdate, materialize_delta_transport,
    dtype map).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch 2 times, most recently from a3c070f to d2aa1c0 Compare April 18, 2026 01:19
nanjiangwill and others added 2 commits April 18, 2026 01:26
Integrate DeltaCompressionTracker into both weight-update paths:
  - update_weight_from_distributed.py (non-colocated): per-param gather
    + HF convert feeds into prepare_chunk, then sparse-encoded deltas
    are bucketed into NCCL broadcasts. A sparse bucket flush threshold
    (5 GB) keeps per-broadcast cost bounded.
  - update_weight_from_tensor.py (colocated): same tracker, simpler
    flush via load_format="flattened_bucket_delta".
  - sglang_engine.py: update_weights_from_distributed accepts the new
    sparse_metadata kwargs so the engine side can decode sparse payload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds _apply_sparse_delta_weights_from_distributed to the SGLang
ModelRunner. Receiver decodes packed (indices, values) on-wire into
persistent scratch buffers and calls load_weights in batches capped
at 512 MiB to amortize the ~2ms/call name-resolution + MoE expert
remapping overhead.

Validated at GLM-4.7-355B (bf16, sparse_indices, 64 rollout GPUs):
  - Step 0 full sync:  66-83s — weights_checker 200 OK on all engines
  - Step 1 delta sync: 22.6s — weights_checker 200 OK (all tensors match)
  - Step 2 forced full: 69.7s — weights_checker 200 OK

Batch cap sweep justifying 512 MiB inlined in the patch itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch from d2aa1c0 to 5fb928d Compare April 18, 2026 01:26
@nanjiangwill nanjiangwill marked this pull request as ready for review April 18, 2026 01:29