Add delta-compression support for weight sync from trainer to rollout
engines. The trainer holds a pinned-CPU baseline, computes deltas on
GPU, snapshots the new baseline back to CPU on a secondary stream, and
sparse-encodes deltas for broadcast.
- CLI flags in slime/utils/arguments.py:
--enable-delta-compression
--delta-compression-dtype {fp16, bf16, fp32}
--delta-compression-transport {dense, sparse_indices, sparse_bitmask}
--delta-compression-full-sync-interval N
--delta-compression-artifact-dir PATH
- DeltaCompressionTracker in delta_weight_update.py — owns baseline,
issues async D2H on a secondary CUDA stream, computes deltas.
- Shared helpers in common.py (ChunkUpdate, materialize_delta_transport,
dtype map).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
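The core encode step the tracker performs can be sketched in a few lines. This is a minimal numpy stand-in for the GPU tensors, not the PR's actual API; the function name and shapes are illustrative only:

```python
# Illustrative numpy sketch of the tracker's core idea: diff new weights
# against a retained baseline, then keep only the non-zero entries for
# broadcast. All names here are hypothetical, not the PR's real API.
import numpy as np

def sparse_encode_delta(baseline: np.ndarray, new_weights: np.ndarray):
    """Return (indices, values) for entries that changed since the baseline."""
    delta = (new_weights - baseline).ravel()
    indices = np.flatnonzero(delta)   # positions of non-zero deltas
    values = delta[indices]           # the delta values themselves
    return indices, values

baseline = np.zeros(8, dtype=np.float32)
new_weights = baseline.copy()
new_weights[[1, 5]] += np.float32(0.25)   # only 2 of 8 entries change

indices, values = sparse_encode_delta(baseline, new_weights)
print(indices.tolist(), values.tolist())  # → [1, 5] [0.25, 0.25]
```

In the real tracker the delta is computed on GPU and the baseline lives in pinned CPU memory; this sketch only shows the sparsification itself.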
Integrate DeltaCompressionTracker into both weight-update paths:
- update_weight_from_distributed.py (non-colocated): per-param gather
+ HF convert feeds into prepare_chunk, then sparse-encoded deltas
are bucketed into NCCL broadcasts. A sparse bucket flush threshold
(5 GB) keeps per-broadcast cost bounded.
- update_weight_from_tensor.py (colocated): same tracker, simpler
flush via load_format="flattened_bucket_delta".
- sglang_engine.py: update_weights_from_distributed accepts the new
sparse_metadata kwargs so the engine side can decode sparse payload.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
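The flush-threshold bucketing described above (5 GB in the PR) can be sketched as a simple accumulate-then-flush loop. Names and the broadcast callback are hypothetical; sizes are scaled down for illustration:

```python
# Illustrative sketch of the sparse-bucket flush policy: encoded chunks
# accumulate until adding the next one would exceed the threshold (5 GB in
# the PR; tiny numbers here), then the bucket goes out as one broadcast.
def bucket_chunks(chunks, flush_threshold_bytes, broadcast):
    bucket, bucket_bytes = [], 0
    for name, nbytes in chunks:
        if bucket and bucket_bytes + nbytes > flush_threshold_bytes:
            broadcast(bucket)             # one NCCL broadcast per bucket
            bucket, bucket_bytes = [], 0
        bucket.append(name)
        bucket_bytes += nbytes
    if bucket:                            # flush the tail
        broadcast(bucket)

sent = []
bucket_chunks([("a", 3), ("b", 3), ("c", 5), ("d", 2)], 8, sent.append)
print(sent)                               # → [['a', 'b'], ['c', 'd']]
```

Bounding each broadcast this way keeps per-call NCCL cost and peak staging memory predictable regardless of how sparse a given step's deltas are.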
Adds _apply_sparse_delta_weights_from_distributed to the SGLang ModelRunner.
Receiver decodes packed (indices, values) on-wire into persistent scratch
buffers and calls load_weights in batches capped at 512 MiB to amortize the
~2 ms/call name-resolution + MoE expert-remapping overhead.

Validated at GLM-4.7-355B (bf16, sparse_indices, 64 rollout GPUs):
- Step 0 full sync: 66-83s — weights_checker 200 OK on all engines
- Step 1 delta sync: 22.6s — weights_checker 200 OK (all tensors match)
- Step 2 forced full: 69.7s — weights_checker 200 OK

Batch-cap sweep justifying 512 MiB is inlined in the patch itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
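The receiver-side decode amounts to a scatter of the packed (indices, values) pairs onto the current weights. A minimal numpy sketch, with illustrative names (the real code operates on GPU scratch buffers and per-param slice ranges from the metadata list):

```python
# Illustrative sketch of the receiver-side decode: scatter the packed
# (indices, values) pairs onto the flat view of a weight tensor, in place.
import numpy as np

def apply_sparse_delta(weights: np.ndarray, indices: np.ndarray,
                       values: np.ndarray) -> None:
    """In place: weights.flat[indices] += values."""
    flat = weights.reshape(-1)   # view on a contiguous array, so writes stick
    flat[indices] += values      # scatter-add the delta

w = np.zeros((2, 3), dtype=np.float32)
apply_sparse_delta(w, np.array([1, 4]),
                   np.array([0.5, -0.5], dtype=np.float32))
print(w.tolist())                # → [[0.0, 0.5, 0.0], [0.0, -0.5, 0.0]]
```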
Inspiration / prior art
Summary
Add optional delta-compression for trainer → rollout-engine weight sync. When enabled, the trainer keeps a pinned-CPU baseline of each parameter, broadcasts only the sparse delta (bf16 packed indices + values) over NCCL, and the SGLang receiver decodes + applies the delta in large amortized batches.
At GLM-4.7-355B-A32B non-colocated (4 actor + 4 rollout nodes, 64 rollout H100s), this drops delta-sync time from ~50s (non-delta baseline, full 170 GB broadcast) to ~22.6s, a ~2.2× speedup.
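A quick self-consistency check on the headline numbers (the 170 GB dense and ~5.9 GB sparse figures come from the design section below):

```python
# Back-of-envelope check that the reported wire compression and sync speedup
# are self-consistent. Numbers are taken from the benchmark in this PR.
dense_gb, sparse_gb = 170, 5.9
baseline_s, delta_s = 50, 22.6

print(round(dense_gb / sparse_gb, 1))   # → 28.8  (~30x less data on the wire)
print(round(baseline_s / delta_s, 1))   # → 2.2   (~2.2x faster sync)
```

The speedup is well below the compression ratio because gather, convert, delta compute, and receiver-side apply are not compressed away; only the broadcast shrinks.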
For the colocated case we do not expect measurable gains: that path uses CUDA IPC (GPU-to-GPU memory sharing within the same process group), not NCCL broadcast, so there is no wire to compress, and the delta compute would be pure overhead. This PR leaves the colocated CUDA-IPC path unchanged in behavior: the tracker integration in update_weight_from_tensor.py is present for correctness, but no speedup is expected for colocated runs.

New CLI flags
- --enable-delta-compression — master switch
- --delta-compression-dtype {fp16, bf16, fp32} — delta dtype on the wire (default fp32)
- --delta-compression-transport {dense, sparse_indices, sparse_bitmask} — on-wire encoding (default dense; recommended sparse_indices)
- --delta-compression-full-sync-interval N — run a full sync every N successful delta syncs; step 0 is always full
- --delta-compression-artifact-dir PATH — optional async zstd artifact writer (off by default)

Design choices
1. Sparse-indices transport

Deltas between consecutive RL steps are extremely sparse (~2-3% density at 355B), so sending them dense would waste ~97% of the wire. We pack non-zero (index, value) pairs into two flat tensors and a small metadata list describing per-param slice ranges. Observed compression ratio: ~30× on the wire (170 GB dense → ~5.9 GB sparse).

2. Secondary CUDA stream for baseline D2H
The trainer must keep a pinned-CPU baseline up to date. If D2H runs on the default stream, it serializes with delta compute and the next chunk's gather. We route all baseline D2H through a dedicated _d2h_stream, recording a single event to gate the gather/convert on. Both full and delta paths share the same _snapshot_baseline_async helper (see commit "delta tracker: unify baseline D2H across full and delta paths" in history).

3. Inline commit — no GPU refs held
baseline_updates=[] in DeltaCompressionCommitState — gathered GPU tensors are freed as soon as prepare_chunk returns, so the next bucket's gather has headroom. Prior iterations that held refs in commit state caused allocator stalls at 355B scale.

4. Receiver batch cap = 512 MiB
SGLang's model.load_weights(batch) call costs ~2 ms/call in name resolution + MoE expert remapping, so the receiver batches decoded deltas until they exceed 512 MiB before flushing to load_weights. We swept the cap systematically at GLM-4.7-355B (H100, 64 rollout GPUs), measuring load_weights calls per sync: 512 MiB won decisively and is hardcoded, with the sweep rationale inlined in the patch comment.
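The amortization argument behind the cap can be made concrete with a rough cost model, using the ~2 ms/call figure quoted above and the ~5.9 GB sparse payload per sync (illustrative arithmetic only, not code from the PR):

```python
# Rough cost model for the receiver batch cap: with ~5.9 GB of sparse payload
# per delta sync, a 512 MiB cap means only ~12 load_weights calls (~24 ms of
# per-call overhead), versus thousands of calls if every tensor is loaded
# individually. Figures are the ones quoted in this PR description.
import math

payload_mib = 5.9 * 1024      # sparse payload per delta sync, in MiB
call_overhead_ms = 2.0        # name resolution + MoE expert remapping

calls_batched = math.ceil(payload_mib / 512)
print(calls_batched, calls_batched * call_overhead_ms)   # → 12 24.0
```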
5. bf16 deltas with fp32 baseline
The baseline stays fp32 on CPU (an accumulator), but the on-wire delta is bf16. Rounding drift is flushed periodically via --delta-compression-full-sync-interval — set to 10000 in practice (basically never), because the baseline is exact-refreshed each step anyway.

Experimental results
Primary benchmark — GLM-4.7-355B-A32B non-colocated (4+4 nodes, 64 H100 rollout GPUs)

Delta syncs ran with full_chunks=0, delta_chunks=598, transport=sparse_indices, batch_mib=512 (commits 1c4935bf and 7ed11a92, run #2). _snapshot_baseline_async moves D2H to the secondary stream and baseline_updates=[] frees gathered buffers, so D2H now overlaps with the NCCL broadcast and step 2 runs at pure-non-delta speed. Measured ~28s faster than pre-refactor on comparable step-0 runtimes (69.7s → 41.4s).

Correctness validation
Ran a validate config (check_weight_update_equal=True, full_sync_interval=2) to exercise full → delta → forced-full transitions against a snapshotted reference on each SGLang engine:
- Full sync: weights_checker 200 OK on every engine; [check_tensors] equal tensors listed every layer
- Delta sync: weights_checker 200 OK on every engine — bitwise/tolerance match
- Forced full sync: weights_checker 200 OK on every engine

Experiments conducted during development
- Transport comparison: sparse_indices wins — bitmask incurs unpack cost on the receiver
- check_weight_update_equal across full → delta → forced-full transitions

Colocated case
The colocated weight-sync path (update_weight_from_tensor.py) uses CUDA IPC: the sender registers GPU memory and shares an IPC handle; the colocated SGLang engine maps that memory directly into its own process. There is no broadcast over the network — the "wire" is a pointer, not bytes.

Consequences:
The tracker stays integrated in update_weight_from_tensor.py for API consistency and so that future mixed colocated+distributed setups get consistent baseline semantics, but we do not claim any colocated perf gain, and users running fully colocated should leave --enable-delta-compression off.

Test plan
- check_weight_update_equal / weights_checker: all three paths match the reference
- Syntax checks (py_compile)

🤖 Generated with Claude Code